KNN
Core idea: find the K training points nearest to a test point and assign the most common label among them as the test point's label.
Pseudocode for classification
function KNN(X_train, y_train, X_test, k):
    distances = compute_distances(X_train, X_test)             # distances between each test point and the training data
    nearest_neighbors = get_k_nearest_neighbors(distances, k)  # indices of the k nearest neighbours
    labels = y_train[nearest_neighbors]                        # labels of those neighbours
    return most_common_label(labels)                           # majority vote among the neighbours

Usage: KNN is computationally expensive for large datasets, since it computes the distance from every test point to every training point. It is most useful when the number of classes is small, the data is not very high-dimensional, and there are enough training samples.
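Below is a minimal NumPy sketch of this classifier, assuming a Euclidean distance metric; the helper names (euclidean_distances, knn_classify) and the toy data are illustrative, not part of the pseudocode above.

import numpy as np
from collections import Counter

def euclidean_distances(X_train, x_test):
    # distance from one test point to every training point
    return np.sqrt(((X_train - x_test) ** 2).sum(axis=1))

def knn_classify(X_train, y_train, x_test, k):
    distances = euclidean_distances(X_train, x_test)
    nearest = np.argsort(distances)[:k]          # indices of the k nearest neighbours
    labels = y_train[nearest]                    # their labels
    return Counter(labels).most_common(1)[0][0]  # majority vote

# toy usage
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_classify(X_train, y_train, np.array([4.9, 5.1]), k=3))  # -> 1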
Pseudocode for regression
function KNN(X_train, y_train, X_test, k):
    distances = compute_distances(X_train, X_test)             # distances between each test point and the training data
    nearest_neighbors = get_k_nearest_neighbors(distances, k)  # indices of the k nearest neighbours
    y_pred = average(y_train[nearest_neighbors])                # average target value of those neighbours
    return y_pred
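A corresponding NumPy sketch for the regression variant, again assuming Euclidean distance; the function and variable names are my own choices for illustration.

import numpy as np

def knn_regress(X_train, y_train, x_test, k):
    distances = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))  # distance to every training point
    nearest = np.argsort(distances)[:k]                         # indices of the k nearest neighbours
    return y_train[nearest].mean()                              # average of their target values

# toy usage: targets roughly follow y = 2 * x
X_train = np.array([[1.0], [2.0], [3.0], [10.0]])
y_train = np.array([2.0, 4.0, 6.0, 20.0])
print(knn_regress(X_train, y_train, np.array([2.5]), k=2))  # -> 5.0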
k-NN behaviour vs. the number of neighbours and the training sample size
- Number of neighbours k: if k is too small, the model is sensitive to noise and tends to overfit; if k is too large, it smooths over important local patterns and underfits (see the cross-validation sketch after this list).
- Training sample size: with a small training set it is difficult to generalise to new data, leading to high variance and overfitting; a larger training set captures the underlying patterns better, giving lower variance and better generalisation, but the computational cost increases.
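As one practical way to choose k, here is a small sketch using scikit-learn's KNeighborsClassifier with 5-fold cross-validation; the dataset (iris) and the candidate values of k are arbitrary choices for illustration.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
for k in (1, 3, 5, 15, 50):
    # mean accuracy over 5 folds for each candidate k
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(f"k={k:>2}: mean accuracy = {scores.mean():.3f}")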