ML-week2

KNN

  1. Core idea: for a test data point, find the K nearest neighbours in the training set and assign the most common label among them as the label of the test point

  2. Pseudocode for classification (a runnable sketch of both the classification and regression variants follows this list)

    function KNN(X_train, y_train, X_test, k):
        distances = compute_distances(X_train, X_test)             # calculate distances between training and test data
        nearest_neighbors = get_k_nearest_neighbors(distances, k)  # get indices of k nearest neighbors
        labels = y_train[nearest_neighbors]                        # get labels of nearest neighbors
        return most_common_label(labels)                           # return most common label among neighbors (majority vote)
  3. Usage: KNN is computationally expensive for large datasets because prediction requires computing the distance from a test point to every training point. It is most useful when the number of classes is small, the data is not very high-dimensional, and there are enough training samples.

  4. Pseudocode for regression (identical except that the neighbours' target values are averaged instead of taking a majority vote)

    function KNN(X_train, y_train, X_test, k):
        distances = compute_distances(X_train, X_test)             # calculate distances between training and test data
        nearest_neighbors = get_k_nearest_neighbors(distances, k)  # get indices of k nearest neighbors
        y_pred = average(y_train[nearest_neighbors])                # compute average output value of nearest neighbors
        return y_pred
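
The two variants above differ only in the final aggregation step: a majority vote for classification versus an average for regression. Below is a minimal runnable sketch of both in Python/NumPy; the function name knn_predict, the choice of Euclidean distance, and the toy data are illustrative assumptions, not part of the original notes.

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x_test, k, regression=False):
        # Euclidean distances from the single test point to every training point
        distances = np.linalg.norm(X_train - x_test, axis=1)
        # indices of the k nearest neighbours
        nearest = np.argsort(distances)[:k]
        if regression:
            # regression: average the neighbours' target values
            return y_train[nearest].mean()
        # classification: majority vote among the neighbours' labels
        return Counter(y_train[nearest]).most_common(1)[0][0]

    # toy usage: two well-separated clusters
    X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
    y_class = np.array([0, 0, 0, 1, 1, 1])
    y_value = np.array([0.1, 0.2, 0.15, 5.1, 5.3, 5.0])

    print(knn_predict(X_train, y_class, np.array([0.5, 0.5]), k=3))                   # -> 0
    print(knn_predict(X_train, y_value, np.array([5.5, 5.5]), k=3, regression=True))  # -> about 5.13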

k-NN behaviour vs. number of neighbours and training sample size

  1. Number of neighbours k: if k is too small, the model is sensitive to noise and tends to overfit; if k is too large, it smooths over important local patterns and tends to underfit (see the cross-validation sketch after this list)
  2. Training sample size: with a small training set it is difficult to generalise to new data, leading to high variance and overfitting; a larger training set captures the underlying patterns better, giving lower variance and better generalisation, but computational cost increases because every training point must be stored and compared at prediction time
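
A common way to see the effect of k in practice is to compare cross-validated accuracy across several values of k. The sketch below assumes scikit-learn and its built-in iris dataset, neither of which is mentioned in the notes; the particular k values are arbitrary.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)

    # Very small k tends to overfit (high variance); very large k tends to underfit (high bias).
    for k in (1, 3, 5, 15, 51, 101):
        scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
        print(f"k={k:3d}  mean CV accuracy = {scores.mean():.3f}")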