K-Nearest Neighbors (KNN)

K-Nearest Neighbors (KNN) is a simple yet effective supervised learning algorithm used for classification and regression tasks. It operates on the principle that similar data points are located close to each other in the feature space.

How KNN Works

  1. Choose the Number of Neighbors (K): Determine the number of nearest neighbors to consider for predictions.
  2. Calculate Distance: Compute the distance between the test instance and all training instances. A common metric is Euclidean distance. $$ d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} $$
  3. Identify Neighbors: Sort the distances and select the K closest data points from the training set.
  4. Vote for Classification: For classification tasks, the majority class among the neighbors determines the predicted class.
  5. Average for Regression: For regression tasks, the prediction is the average of the target values of the K neighbors (see the from-scratch sketch after this list).
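
To make these steps concrete, here is a minimal from-scratch sketch in NumPy. The function knn_predict and its parameters are illustrative names, not a library API; it walks steps 2 through 5 for a single query point.

# Minimal KNN for one query point, assuming NumPy arrays as input
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3, task="classification"):
    # Step 2: Euclidean distance from the query to every training point
    distances = np.sqrt(np.sum((X_train - x_query) ** 2, axis=1))
    # Step 3: indices of the K closest training points
    nearest = np.argsort(distances)[:k]
    neighbor_targets = y_train[nearest]
    if task == "classification":
        # Step 4: majority vote among the K neighbors
        return Counter(neighbor_targets).most_common(1)[0][0]
    # Step 5: average of the K neighbors' target values
    return neighbor_targets.mean()

# Toy usage: two clusters; the query point lands in the first one
X_train = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.2, 1.9]), k=3))  # -> 0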

KNN Using the Iris Dataset

# Import libraries
from sklearn.neighbors import KNeighborsClassifier
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Load Iris dataset
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Train-test split: 70% train, 30% test (no fixed random_state, so the sample predictions below vary between runs)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Create and train the model
classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(X_train, y_train)

# Predict on test data
y_predict = classifier.predict(X_test)

# Print sample predictions
for i in range(10):
    print(f"Predicted: {iris.target_names[y_predict[i]]}, Actual: {iris.target_names[y_test[i]]}")
        
Predicted: virginica, Actual: virginica
Predicted: versicolor, Actual: versicolor
Predicted: setosa, Actual: setosa
Predicted: virginica, Actual: virginica
Predicted: setosa, Actual: setosa
Predicted: versicolor, Actual: versicolor
Predicted: versicolor, Actual: versicolor
Predicted: setosa, Actual: setosa
Predicted: setosa, Actual: setosa
Predicted: versicolor, Actual: versicolor

Model Evaluation

Use classification metrics to evaluate the performance of the KNN model, starting with accuracy.

# Evaluate model accuracy
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_predict)
print(f"Accuracy: {(accuracy * 100):.2f}%")
Accuracy: 97.78%
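
Accuracy alone can hide per-class errors. As a follow-up sketch, scikit-learn's classification_report and confusion_matrix in sklearn.metrics break the result down by class, continuing from the variables above; the exact numbers will vary with the random split.

# Per-class precision, recall, and F1 score
from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_test, y_predict, target_names=iris.target_names))

# Rows are actual classes, columns are predicted classes
print(confusion_matrix(y_test, y_predict))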

Advantages and Disadvantages

Advantages

  • Simplicity: Easy to understand and implement.
  • No Training Phase: Directly stores the dataset for predictions.
  • Adaptability: Works for both classification and regression tasks.
  • Flexibility: Can handle multi-class problems and datasets of arbitrary shapes.

Disadvantages

  • Computational Complexity: High costs for large datasets due to distance calculations.
  • Storage Requirements: Requires storing the entire dataset in memory.
  • Sensitivity to Noise: Can be affected by noisy data or outliers.
  • Choosing K: A small K may lead to overfitting, while a large K may smooth over class boundaries (see the cross-validation sketch below).
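
A common way to address the Choosing K trade-off is to select K by cross-validation. The sketch below reuses X_train and y_train from the Iris example above and combines feature scaling with a grid search; the candidate K values are illustrative, not prescriptive. Scaling matters because KNN's distance metric lets large-scale features dominate otherwise.

# Choose K via 5-fold cross-validation; scaling first keeps
# any single feature from dominating the Euclidean distance
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])
param_grid = {"knn__n_neighbors": [1, 3, 5, 7, 9, 11]}  # candidate K values (illustrative)
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)
print(f"Best K: {search.best_params_['knn__n_neighbors']}")
print(f"Cross-validated accuracy: {search.best_score_:.3f}")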

Applications

  • Recommendation Systems: Suggesting items based on user preferences.
  • Image Classification: Categorizing images based on visual features.
  • Anomaly Detection: Identifying outliers in data.
  • Pattern Recognition: Recognizing handwriting, speech, etc.