Lesson 03: From Clustering To Classification

Learning Objectives

  • remember how clustering works

  • apply clustering for classification

  • recognize that once a clustering is trained, it can be used for classification of new data

  • use accuracy, precision and recall to describe a classifier's performance

  • reflect that classification performance depends on the choice of hyperparameters

  • discuss that accuracy, precision and recall can depend on the size of the test set and on the choice of hyperparameters

Content

This lesson will be shared as a video.

  • for learners: a stub notebook to get you started can be obtained from the lesson03 repo.

  • for instructors: the video script is available here.

Check your Learning

The following questions are meant to help learners reflect on the content of the video. Answer at least one question. Ideally, discuss and answer these questions as a team.

Question 1

When using the k-Nearest-Neighbor (kNN) algorithm for classifying a query point x_q, the k stands for:

  1. the number of neighbors that must have a given label for the query point to get this label assigned

  2. the number of classes occurring in the data set

  3. the number of observations that define a neighborhood

  4. the number of clusters in the data set

Solution
1. no, this is implicitly determined once the size of the neighborhood is fixed
2. no, this is just a property of the data set which the algorithm does not rely on
3. yes, the number of observations that define a neighborhood (see also `KNeighborsClassifier documentation <https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier>`_)
4. no, similar to the number of classes present, this is a property of the data set
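
As a quick illustration of this parameter, here is a minimal sketch with made-up one-dimensional data (the feature values and labels below are purely for demonstration):

from sklearn.neighbors import KNeighborsClassifier

# k = 3: every prediction considers the 3 closest training observations
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit([[0.0], [1.0], [2.0], [10.0]], [0, 0, 0, 1])  # toy 1D training data
print(clf.predict([[1.5]]))  # the 3 nearest neighbors all carry label 0 -> [0]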

Question 2

When going through tutorials and exercises that discuss the k-Nearest-Neighbor (kNN) method, you observe that k is typically chosen to be an odd number. Checking the code, you find that sklearn also allows even numbers for k! Why do people tend to choose odd numbers?

  1. tradition that often works best in practice

  2. odd numbers prevent ties from happening with the majority vote

  3. this way, the total number of samples in the neighborhood is always even as one has to add the query sample

  4. odd numbers prevent ties from happening with the plurality vote

Solution
1. no, while this may sound plausible, it is not the reason
2. no, ``KNeighborsClassifier`` works with a plurality vote, not a majority vote
3. no, the query sample is never part of the voting as we want to predict its class in the first place
4. yes, ``KNeighborsClassifier`` relies on a plurality vote and odd-sized neighborhoods decrease the probability of ties (if a tie occurs, the winner is randomly chosen)

Question 3

What are the majority vote and the plurality vote if the 8 nearest neighbors to your unknown data point are of the following classes?

a)

  • class 1: 3

  • class 2: 2

  • class 3: 2

  • class 4: 1

majority vote: ____ plurality vote: ____

b)

  • class 1: 5

  • class 2: 2

  • class 3: 1

majority vote: ____ plurality vote: ____

Solution
a) majority vote: no class exceeds ``50%`` of the votes (a random choice would have to be made), plurality vote: ``class 1``
b) majority vote: ``class 1`` exceeds ``50%`` of votes, plurality vote: ``class 1`` as well
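
If you want to double-check such counts programmatically, here is a small sketch assuming the neighbor labels are given as a plain Python list (a tie between plurality winners is not handled here):

from collections import Counter

def majority_and_plurality(neighbor_labels):
    counts = Counter(neighbor_labels)
    winner, votes = counts.most_common(1)[0]     # plurality: most frequent class
    majority = winner if votes > len(neighbor_labels) / 2 else None
    return majority, winner

# case a): 3x class 1, 2x class 2, 2x class 3, 1x class 4
print(majority_and_plurality([1, 1, 1, 2, 2, 3, 3, 4]))  # (None, 1)
# case b): 5x class 1, 2x class 2, 1x class 3
print(majority_and_plurality([1, 1, 1, 1, 1, 2, 2, 3]))  # (1, 1)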

Question 4

Find the four hidden bugs!

from sklearn.neighbors import KNeighborsClassifier as knn
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# ... load data set ...

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                              test_size = 1.5,
                                              random_state = 42)
kmeans = knn(n_neighbors=5)
kmeans = kmeans.fit(X_train, y_train)
y_test_hat = kmeans.predict(X_train)

cm = confusion_matrix(y_train, y_test_hat)

accuracy = (cm[0,0]+cm[0,1]) / cm.sum()
Solution
from sklearn.neighbors import KNeighborsClassifier as knn
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# ... load data set ...

X_train, X_test, y_train, y_test = train_test_split(X, y,
#yikes, a float value above 1.0 for test_size! As a fraction it must lie between 0 and 1,
#better choose e.g. 0.5 or 0.1 depending on the data set size
                                                    test_size = 1.5,
                                                    random_state = 42)
kmeans = knn(n_neighbors=5)
kmeans = kmeans.fit(X_train, y_train)

#yikes, the prediction is performed on the training set!
#better: y_test_hat = kmeans.predict(X_test)
y_test_hat = kmeans.predict(X_train)

#yikes, the training set is used for the confusion matrix
#better: cm = confusion_matrix(y_test, y_test_hat)
cm = confusion_matrix(y_train, y_test_hat)

#yikes, accuracy is (tp + tn) / total
#better: accuracy = (cm[0,0]+cm[1,1]) / cm.sum()
#(with more than two classes, use the whole diagonal: cm.trace() / cm.sum())
accuracy = (cm[0,0]+cm[0,1]) / cm.sum()
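
Putting all four fixes together (and renaming the misleading ``kmeans`` variable), a corrected sketch could look as follows; the test fraction of 0.2 is just one reasonable choice:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# ... load data set into X (features) and y (labels) ...

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=42)

clf = KNeighborsClassifier(n_neighbors=5)
clf = clf.fit(X_train, y_train)

# predict on the held-out test set, not on the training data
y_test_hat = clf.predict(X_test)

# compare true test labels with the predictions
cm = confusion_matrix(y_test, y_test_hat)

# accuracy = correct predictions (diagonal of the confusion matrix) / all predictions
accuracy = cm.trace() / cm.sum()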

Exercises

Use the data set you used for lesson02 or be brave and choose a different one. Complete the following steps in order; a minimal starter sketch is shown after the discussion prompts below.

For part a

  • Split your data set into a train and a test set at a fixed ratio.

  • Train a kNN classifier on the training set with a fixed value of k.

For part b

  • Run the prediction and compute accuracy, precision, recall.

  • Let’s vary parameters now and recompute accuracy, precision, recall for each variant:

    • rerun everything with a smaller and a bigger test set for a fixed k

    • rerun everything with different values of k for a fixed test set

  • See for yourself: how do accuracy, precision and recall change?

  • Discuss your findings with the other team members. Some prompts for the discussion:

    • should accuracy, precision and recall depend on the size of the test set? What happens in the asymptotic case (an infinitely large test set)?

    • should accuracy, precision and recall depend on k?
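
If you get stuck setting things up, here is a minimal sketch of part a and the first step of part b, assuming your features and labels are already loaded into ``X`` and ``y``; the test fraction of 0.2 and k = 5 are arbitrary starting values meant to be varied, and ``average='macro'`` is one possible way to average precision and recall over several classes:

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

# part a: split at a fixed ratio and train kNN with a fixed k
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# part b: predict on the test set and compute the three metrics
y_pred = clf.predict(X_test)
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred, average="macro"))
print("recall   :", recall_score(y_test, y_pred, average="macro"))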

Data sets

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                  columns=iris['feature_names'] + ['target'])
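
From this dataframe, you can extract features and labels for the exercises, e.g. (column names as provided by ``load_iris``):

X = df[iris['feature_names']].values  # the four flower measurements
y = df['target'].values               # species encoded as 0, 1, 2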