k-Nearest Neighbors (k-NN) classification is a non-parametric, instance-based machine learning algorithm that classifies a data point according to the classes of its k closest neighbors in the feature space. Its purpose is to predict the class of new data points by leveraging their similarity to existing labeled data.
1. Distance Metric: The algorithm uses a distance metric (commonly Euclidean distance) to determine the "closeness" of data points.
2. Choosing k: The parameter k specifies the number of nearest neighbors to consider for making the classification decision.
3. Majority Voting: The predicted class for a new data point is the class that is most common among its k nearest neighbors.
4. Weighted Voting: In some cases, neighbors are weighted according to their distance, with closer neighbors having more influence on the classification (see the sketch after this list).
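To make these four ideas concrete, here is a minimal NumPy sketch of a k-NN classifier supporting both plain majority voting and inverse-distance weighting. The function name knn_predict and the toy data are illustrative, not part of the worked example later in this article:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3, weighted=False):
    """Classify a single point x by majority (or distance-weighted) vote."""
    # Euclidean distances from x to every training point
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]  # indices of the k closest points
    if not weighted:
        # Plain majority vote among the k nearest labels
        return Counter(y_train[nearest]).most_common(1)[0][0]
    # Weight each neighbor by inverse distance, so closer points count more
    weights = 1.0 / (dists[nearest] + 1e-9)  # small epsilon avoids division by zero
    totals = {}
    for label, w in zip(y_train[nearest], weights):
        totals[label] = totals.get(label, 0.0) + w
    return max(totals, key=totals.get)

# Tiny usage example with two labeled clusters
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.5, 0.5]), k=3))                  # 0
print(knn_predict(X_train, y_train, np.array([5.5, 5.5]), k=3, weighted=True))   # 1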
Non-Parametric: k-NN is a non-parametric method, meaning it makes no assumptions about the underlying distribution of the data. This makes it flexible in handling various types of data.
Instance-Based Learning: The algorithm stores the entire training dataset and makes predictions based on the local patterns in the data. It is also known as a "lazy" learning algorithm because it delays processing until a query is made.
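A quick way to see this "lazy" behavior is to time fit against predict. The following is a rough sketch (the exact numbers depend on your machine), using algorithm="brute" so that fit does nothing beyond storing the data:

import time
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(50_000, 10))
y = rng.integers(0, 3, size=50_000)

knn = KNeighborsClassifier(n_neighbors=5, algorithm="brute")

t0 = time.perf_counter()
knn.fit(X, y)            # "training" just stores the dataset
t1 = time.perf_counter()
knn.predict(X[:1_000])   # the distance computations happen here, at query time
t2 = time.perf_counter()

print(f"fit: {t1 - t0:.4f}s  predict (1,000 queries): {t2 - t1:.4f}s")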
Distance Calculation: The choice of distance metric can significantly affect the model's performance. Common metrics include Euclidean, Manhattan, and Minkowski distances.
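As a small illustration, here is how the three metrics compare on a single pair of points, using SciPy's distance helpers:

import numpy as np
from scipy.spatial.distance import euclidean, cityblock, minkowski

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

print(euclidean(a, b))       # sqrt(3^2 + 4^2) = 5.0
print(cityblock(a, b))       # |3| + |4| = 7.0 (Manhattan distance)
print(minkowski(a, b, p=3))  # Minkowski generalizes both: p=1 is Manhattan, p=2 is Euclidean

In scikit-learn, the metric parameter of KNeighborsClassifier selects the metric; it defaults to "minkowski" with p=2, which is the Euclidean distance.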
Choice of k: The value of k is a critical hyperparameter. Cross-validation is often used to determine the optimal value of k for a given dataset.
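A typical way to tune k is a grid search with cross-validation. Below is a minimal sketch, shown on scikit-learn's built-in Iris data rather than the synthetic data used later in this article:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Candidate values of k to evaluate with 5-fold cross-validation
param_grid = {"n_neighbors": range(1, 31)}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)   # the k with the highest mean cross-validated accuracy
print(search.best_score_)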
As described above, k-NN classifies data points based on the classes of their nearest neighbors. This example demonstrates how to implement k-NN for multiclass classification using synthetic data, evaluate the model's performance, and visualize the decision boundary for three classes.
1. Import Libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report
This block imports the necessary libraries for data manipulation, plotting, and machine learning.
2. Generate Sample Data with 3 Classes
np.random.seed(42)  # For reproducibility
n_samples = 300

# Class 0: Cluster at the top-left corner
X0 = np.random.randn(n_samples // 3, 2) * 0.5 + [-2, 2]
# Class 1: Cluster at the top-right corner
X1 = np.random.randn(n_samples // 3, 2) * 0.5 + [2, 2]
# Class 2: Cluster at the bottom-center
X2 = np.random.randn(n_samples // 3, 2) * 0.5 + [0, -2]

# Combine all classes
X = np.vstack((X0, X1, X2))
y = np.array([0] * (n_samples // 3) + [1] * (n_samples // 3) + [2] * (n_samples // 3))
This block generates synthetic data for three classes located in different regions of the feature space.
3. Split the Dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
This block splits the dataset into training and testing sets for model evaluation.
4. Create and Train the k-NN Classifier
k = 5  # Number of neighbors
knn_classifier = KNeighborsClassifier(n_neighbors=k)
knn_classifier.fit(X_train, y_train)
This block initializes the k-NN classifier with the specified number of neighbors and trains it using the training dataset.
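If the distance-weighted voting described earlier is preferred, scikit-learn supports it directly through the weights parameter. A one-line variant of the classifier above (knn_weighted is an illustrative name):

# Distance-weighted k-NN: closer neighbors get proportionally more influence
knn_weighted = KNeighborsClassifier(n_neighbors=k, weights="distance")
knn_weighted.fit(X_train, y_train)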
5. Make Predictions
y_pred = knn_classifier.predict(X_test)
This block uses the trained model to make predictions on the test set.
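Beyond hard labels, the classifier can also report the fraction of the k neighbors belonging to each class via predict_proba, which is useful as a rough confidence estimate. For instance:

# Each row sums to 1: the share of the k neighbors in each of the 3 classes
proba = knn_classifier.predict_proba(X_test[:5])
print(proba)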
6. Evaluate the Model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
Output:
Accuracy: 1.00

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        22
           1       1.00      1.00      1.00        16
           2       1.00      1.00      1.00        22

    accuracy                           1.00        60
   macro avg       1.00      1.00      1.00        60
weighted avg       1.00      1.00      1.00        60
This block calculates and prints the accuracy and classification report, providing insights into the model's performance.
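A confusion matrix complements the report by showing exactly which classes are confused with which. A short, optional addition:

from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predicted classes;
# off-diagonal counts are misclassifications
print(confusion_matrix(y_test, y_pred))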
7. Visualize the Decision Boundary
h = 0.02  # Step size in the mesh
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

Z = knn_classifier.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.figure(figsize=(12, 8))
plt.contourf(xx, yy, Z, cmap=plt.cm.RdYlBu, alpha=0.8)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.RdYlBu, edgecolors='black')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title(f'k-NN Classification (k={k})')
plt.colorbar()
plt.show()
This block visualizes the decision boundaries created by the k-NN classifier, illustrating how the model separates the three classes in the feature space.
Output: a contour plot with three colored decision regions, one per class, and the training points overlaid on top.
This structured approach demonstrates how to implement, evaluate, and visualize k-NN for multiclass classification tasks, from data generation through decision-boundary plotting.