Content
This lesson will be shared as a video.
for learners: a stub notebook to get you started can be obtained from the lesson04 repo.
for instructors: the video script is available here.
discuss true positive rate and false positive rate
explain how a cut-off value (for kNN classification) is used to produce a ROC curve
assert that a ROC curve is an envelope produced for a fixed test set
demo how to construct a ROC curve
contrast adding individual points to a ROC plot with producing a ROC curve
show RocCurveDisplay in sklearn
discuss extremes of a ROC curve
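The construction outlined above can be sketched in a few lines. This is a minimal illustration, not the code from the video: the labels and scores below are made-up toy values, and `rates_at` is a hypothetical helper name. Each cut-off value yields one (false positive rate, true positive rate) point, and sweeping the cut-off over all scores of a fixed test set traces the curve. The result is compared against `sklearn.metrics.roc_curve`.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Assumption: toy test-set labels and predicted probabilities for class 1.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9])


def rates_at(threshold):
    """Return (FPR, TPR) when every score >= threshold is predicted as class 1."""
    y_pred = y_score >= threshold
    tp = np.sum(y_pred & (y_true == 1))
    fp = np.sum(y_pred & (y_true == 0))
    fn = np.sum(~y_pred & (y_true == 1))
    tn = np.sum(~y_pred & (y_true == 0))
    return fp / (fp + tn), tp / (tp + fn)


# Sweep the cut-off from "nothing is positive" down to the smallest score.
thresholds = np.concatenate(([np.inf], np.sort(np.unique(y_score))[::-1]))
points = np.array([rates_at(t) for t in thresholds])
manual_fpr, manual_tpr = points[:, 0], points[:, 1]

# Reference: keep all intermediate points so both curves have the same length.
fpr, tpr, _ = roc_curve(y_true, y_score, drop_intermediate=False)

print(np.allclose(manual_fpr, fpr), np.allclose(manual_tpr, tpr))
```

The two extremes fall out of the sweep: a cut-off above every score predicts no positives and gives the point (0, 0), while a cut-off at the smallest score predicts everything as positive and gives (1, 1).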
The following questions help learners reflect on the content of the video. Answer at least one question, ideally as a team.
Question 1
The ROC acronym stands for
1. Receiver Operator Curve
2. Receiving Operates Curves
3. Receiver Operating Characteristic
4. Reception Occlusion Characteristic
3. yes, ``Receiver Operating Characteristic``
I made up the rest.
Question 2
Fill in the blanks!
A k-Nearest-Neighbor (kNN) classifier can produce a probability when predicting the class label of an unseen sample x_q. This can be achieved by counting class _______ in the training set neighborhood of this query point.
For a k=7 neighborhood, the threshold to decide for any given class in this neighborhood is calculated as 4/__. In the same setting (k=7), let’s assume we find 5 labels for class 1 and 2 labels for class 0. This means that we get two probabilities, which are _____ for class 1 and _____ for class 0.
A k-Nearest-Neighbor (kNN) classifier can produce a probability when predicting the class label of an unseen sample ``x_q``. This can be achieved by counting class ``frequencies`` in the training set neighborhood of this query point.
For a ``k=7`` neighborhood, the threshold to decide for any given class in this neighborhood is calculated as ``4/7``. In the same setting (``k=7``), let's assume we find ``5`` labels for class ``1`` and ``2`` labels for class ``0``. This means that we get two probabilities, which are ``5/7 = 0.7143`` for class ``1`` and ``2/7 = 0.2857`` for class ``0``.
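The arithmetic in this answer can be checked with a small sketch. The one-dimensional training points below are invented for illustration, chosen so that the 7 nearest neighbors of the query contain 5 labels of class 1 and 2 labels of class 0. It assumes scikit-learn's `KNeighborsClassifier` with its default uniform weighting.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Assumption: toy 1D training set. Around x = 3.0 there are five class-1
# points and two class-0 points; three more class-0 points lie far away.
X_train = np.array([[2.0], [2.5], [3.0], [3.5], [4.0],
                    [1.0], [5.0],
                    [20.0], [21.0], [22.0]])
y_train = np.array([1, 1, 1, 1, 1,
                    0, 0,
                    0, 0, 0])

knn = KNeighborsClassifier(n_neighbors=7).fit(X_train, y_train)

x_q = np.array([[3.0]])          # the unseen query sample
proba = knn.predict_proba(x_q)   # columns follow knn.classes_, here [0, 1]
label = knn.predict(x_q)

print(proba)  # class 0: 2/7 = 0.2857..., class 1: 5/7 = 0.7143...
print(label)  # class 1 wins because 5/7 >= 4/7
```

The probabilities are simply the class frequencies among the 7 neighbors, which is why they are restricted to multiples of 1/7.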
For this lesson, please complete the following steps in order:
produce a ROC curve for the classifier you trained in lesson03.
take another NN based classifier, e.g. sklearn.neighbors.RadiusNeighborsClassifier or sklearn.neighbors.NearestCentroid and train it
make predictions on the test set with it and produce a ROC curve
combine the 2 ROC curves in a plot and discuss which classifier is better!
Data sets for clustering. Each of the following synthetic data sets contains several features x1, x2, … and a label column with 2 classes.
iris plants data set. Use the columns petal_length vs. petal_width. The class label is provided as the target column. To obtain the data frame from this data set do the following:
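import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

# load the iris data and join features and class labels into one data frame;
# the feature columns are named e.g. 'petal length (cm)' and 'petal width (cm)'
iris = load_iris()
df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                  columns=iris['feature_names'] + ['target'])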