KNN
Core idea: find the K training points nearest to a test point and assign the most common label among them as the test point's label.
Pseudocode for classification
function KNN(X_train, y_train, X_test, k):
    distances = compute_distances(X_train, X_test)             # distances between each test point and the training data
    nearest_neighbors = get_k_nearest_neighbors(distances, k)  # indices of the k nearest neighbours
    labels = y_train[nearest_neighbors]                        # labels of those neighbours
    return most_common_label(labels)                           # majority vote among the neighbours

Usage: KNN is computationally expensive for large datasets, since it computes the distance from every test point to every training point. It is most useful when the number of classes is small, the data is not very high-dimensional, and there are enough training samples.
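Below is a minimal NumPy sketch of this classifier, assuming a Euclidean distance metric; the helper names (euclidean_distances, knn_classify) and the toy data are illustrative, not part of the pseudocode above.

import numpy as np
from collections import Counter

def euclidean_distances(X_train, x_test):
    # distance from one test point to every training point
    return np.sqrt(((X_train - x_test) ** 2).sum(axis=1))

def knn_classify(X_train, y_train, x_test, k):
    distances = euclidean_distances(X_train, x_test)
    nearest = np.argsort(distances)[:k]          # indices of the k nearest neighbours
    labels = y_train[nearest]                    # their labels
    return Counter(labels).most_common(1)[0][0]  # majority vote

# toy usage
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_classify(X_train, y_train, np.array([4.9, 5.1]), k=3))  # -> 1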
Pseudocode for regression
function KNN(X_train, y_train, X_test, k):
    distances = compute_distances(X_train, X_test)             # distances between each test point and the training data
    nearest_neighbors = get_k_nearest_neighbors(distances, k)  # indices of the k nearest neighbours
    y_pred = average(y_train[nearest_neighbors])                # average target value of those neighbours
    return y_pred
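A corresponding NumPy sketch for the regression variant, again assuming Euclidean distance; the function and variable names are my own choices for illustration.

import numpy as np

def knn_regress(X_train, y_train, x_test, k):
    distances = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))  # distance to every training point
    nearest = np.argsort(distances)[:k]                         # indices of the k nearest neighbours
    return y_train[nearest].mean()                              # average of their target values

# toy usage: targets roughly follow y = 2 * x
X_train = np.array([[1.0], [2.0], [3.0], [10.0]])
y_train = np.array([2.0, 4.0, 6.0, 20.0])
print(knn_regress(X_train, y_train, np.array([2.5]), k=2))  # -> 5.0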
k-NN behaviour vs. the number of neighbours and the training sample size
- Number of neighbours k: if k is too small, the model is sensitive to noise and tends to overfit; if k is too large, it smooths over important local patterns and underfits (see the cross-validation sketch after this list).
- Training sample size: with a small training set it is difficult to generalise to new data, leading to high variance and overfitting; a larger training set captures the underlying patterns better, giving lower variance and better generalisation, but the computational cost increases.
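As one practical way to choose k, here is a small sketch using scikit-learn's KNeighborsClassifier with 5-fold cross-validation; the dataset (iris) and the candidate values of k are arbitrary choices for illustration.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
for k in (1, 3, 5, 15, 50):
    # mean accuracy over 5 folds for each candidate k
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(f"k={k:>2}: mean accuracy = {scores.mean():.3f}")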