%% Cell type:code id: tags:
``` python
# IGNORE THIS CELL WHICH CUSTOMIZES LAYOUT AND STYLING OF THE NOTEBOOK !
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings = lambda *a, **kw: None
from IPython.core.display import HTML; HTML(open("custom.html", "r").read())
```
%% Output
<IPython.core.display.HTML object>
%% Cell type:markdown id: tags:
# Chapter 4: Metrics for evaluating the performance of a classifier
%% Cell type:code id: tags:
``` python
import sklearn.metrics as metrics
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
```
%% Cell type:markdown id: tags:
Up to now we used *accuracy*, the percentage of correct classifications, to evaluate the quality of a classifier.
Regrettably, *accuracy* can produce very misleading results.
This chapter will discuss other metrics used to assess the quality of a classifier, including possible pitfalls.
%% Cell type:markdown id: tags:
## The confusion matrix
%% Cell type:markdown id: tags:
Before we define the **confusion matrix** we must introduce some additional terms.
%% Cell type:markdown id: tags:
After applying a classifier to a data set with known labels `0` and `1`:
<div class="alert alert-block alert-warning">
<div style="font-size: 150%;"><i class="fa fa-info-circle"></i>&nbsp;Definition</div>
<ul>
<li><strong>TP (true positives)</strong>: labels which were predicted as <code>1</code> and actually are <code>1</code>. <br/><br/>
<li><strong>TN (true negatives)</strong>: labels which were predicted as <code>0</code> and actually are <code>0</code>.<br/><br/>
<li><strong>FP (false positives)</strong>: labels which were predicted as <code>1</code> and actually are <code>0</code>.<br/><br/>
<li><strong>FN (false negatives)</strong>: labels which were predicted as <code>0</code> and actually are <code>1</code>.<br/><br/>
</ul>
To memorize this:
<ul>
<li>The second word "positives"/"negatives" refers to the prediction computed by the classifier.
<li>The first word "true"/"false" expresses if the classification was correct or not.
</ul>
This is the so called <strong>Confusion Matrix</strong>:
<table style="border: 1px; font-family: 'Source Code Pro', monocco, Consolas, monocco, monospace;
font-size:110%;">
<tbody >
<tr>
<td style="padding: 10px; background:#f8f8f8;"> </td>
<td style="padding: 10px; background:#f8f8f8;">Actual P</td>
<td style="padding: 10px; background:#f8f8f8;">Actual N</td>
</tr>
<tr>
<td style="padding: 10px; background:#f8f8f8;">Predicted P</td>
<td style="padding: 10px; background:#fcfcfc; text-align:center; font-weight: bold">TP </td>
<td style="padding: 10px; background:#fcfcfc; text-align:center; font-weight: bold">FP </td>
</tr>
<tr>
<td style="padding: 10px; background:#f8f8f8;">Predicted N</td>
<td style="padding: 10px; background:#fcfcfc; text-align:center; font-weight: bold">FN </td>
<td style="padding: 10px; background:#fcfcfc; text-align:center; font-weight: bold">TN </td>
</tr>
</tbody>
</table>
</div>
<img src="./images/305c8j.jpg" title="made at imgflip.com" width=40%/>
%% Cell type:markdown id: tags:
- So the total number of predictions can be expressed as `TP` + `FP` + `FN` + `TN`.
- The number of correct predictions is `TP` + `TN`.
- `TP` + `FN` is the number of positive examples in our data set,
- `FP` + `TN` is the number of negative examples.
<div class="alert alert-block alert-warning">
<div style="font-size: 150%;"><i class="fa fa-info-circle"></i>&nbsp;Definition</div>
This allows us to define <strong>accuracy</strong> as (<code>TP</code> + <code>TN</code>) / (<code>TP</code> + <code>FP</code> + <code>FN</code> + <code>TN</code>).
</div>
%% Cell type:markdown id: tags:
## Pitfalls
<div class="alert alert-block alert-warning">
<i class="fa fa-info-circle"></i>&nbsp; Accuracy can be very misleading if classe sizes are imbalanced
</div>
Let us demonstrate this with an extreme example:
- On average 10 out of 10000 people are infected with a disease `X`.
- A medical test `Z` wrongly diagnoses 50 % of the infected people as `not infected`.
- The test is correct on all not-infected people.
Among $10000$ people
- $10$ will be infected, and only $5$ of them will get a correct (positive) test result.
- $9990$ will not be infected, and all of them will get a correct (negative) test result.
Thus the accuracy is $\frac{9995}{10000} = 99.95 \%$.
This is also called the **accuracy paradox** (<a href="https://en.wikipedia.org/wiki/Accuracy_paradox">see also here</a>).
<img src="./images/303wyp.jpg" title="made at imgflip.com" width=50%/>
To evaluate this test on such an unbalanced data set we need different numbers:
1. Does our test miss infected people: how many of the infected people are actually discovered to be infected?
2. Does our test flag people as infected who actually are not: how many of the positive diagnoses are correct?
We come back to this example later.
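The numbers above can be reproduced in a few lines (a sketch that just encodes the figures stated in the text):
%% Cell type:code id: tags:
``` python
# disease X example: 10 out of 10000 people are infected, the test finds
# only half of them and is always correct on the not-infected people.
n_people = 10000
n_infected = 10

TP = 5                        # infected, correctly diagnosed as infected
FN = n_infected - TP          # infected, but diagnosed as not infected
FP = 0                        # the test is correct on all not-infected people
TN = n_people - n_infected    # not infected, correctly diagnosed

accuracy = (TP + TN) / (TP + FP + FN + TN)
print("accuracy = {:.2%}".format(accuracy))   # 99.95 %
```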
%% Cell type:markdown id: tags:
## Exercise block 1
1. A classifier predicts labels `[0, 1, 0, 1, 1, 0, 1, 0]` whereas the true labels are `[0, 0, 1, 1, 1, 0, 1, 1]`. First write these values as a two-column table using pen & paper and assign `FP`, `TP`, ... to each row. Then create the confusion matrix and compute the accuracy.
2. A random classifier simply assigns a randomly chosen label `0` or `1` to a given sample. What is the average accuracy of such a classifier?
%% Cell type:markdown id: tags:solution
SOLUTION 1.1
<pre>
TRUE    PREDICTED    THIS IS
  0         0          TN
  0         1          FP
  1         0          FN
  1         1          TP
  1         1          TP
  0         0          TN
  1         1          TP
  1         0          FN

TP = 3   FP = 1
FN = 2   TN = 2

accuracy = 5 / 8 = 62.5 %
</pre>
SOLUTION 1.2
On average, all four fields of the confusion matrix contain the same value, thus the accuracy of a random classifier is 50 %.
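A quick simulation (a sketch) confirms this:
%% Cell type:code id: tags:solution
``` python
import numpy as np

np.random.seed(0)   # fixed seed, only for reproducibility of this sketch
true_labels = np.random.randint(0, 2, size=100000)
random_predictions = np.random.randint(0, 2, size=100000)

accuracy = np.mean(true_labels == random_predictions)
print("accuracy of a random classifier: {:.3f}".format(accuracy))   # close to 0.5
```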
%% Cell type:markdown id: tags:
### Optional exercise
Assume the previously described test also produces wrong results on not-infected people, such that 5 % of the not-infected people will be wrongly diagnosed as infected. Compute the confusion matrix and the accuracy of this test.
%% Cell type:markdown id: tags:solution
This is the new situation:
- On average 10 out of 10000 people are infected with a disease `X`.
- A medical test `Z` wrongly diagnoses 50 % of the infected people as `not infected`.
- The test wrongly diagnoses 5 % of the not-infected people as `infected`.
<pre>
Infected people = 10, diagnosed as X: 5 (TP)
Not infected people = 9990, diagnosed as X: 0.05 * 9990 = 499.5 (FP)
TP = 5 FP = 499.5
FN = 5 TN = 9490.5
accuracy = 9495.5 / 10000 = 94.96 %
</pre>
%% Cell type:markdown id: tags:
## Precision and Recall
In order to understand the concept of **precision** and **recall**, imagine the following scenario:
A few days before Thanksgiving you open an online recipe website and enter "turkey thanksgiving". You see some suitable recommendations, but also unusable results related to Turkish recipes.
Such a search engine works like a filter applied on a collection of documents.
As a scientist you want to assess the reliability of this service:
1. What fraction of relevant recipes stored in the underlying database do I see?
2. How many of the shown results are relevant recipes and not the recipes from Turkey?
In this context, **recall** is the fraction of all relevant documents in the database that the engine actually shows, and **precision** is the fraction of the shown results that are relevant.
%% Cell type:markdown id: tags:
### How to compute precision and recall for a classifier
To transfer this concept to classification, we can interpret a classifier as a filter. The classifier classifies every document in a collection as relevant or not relevant.
<div class="alert alert-block alert-warning">
<div style="font-size: 150%;"><i class="fa fa-info-circle"></i>&nbsp;Definition</div>
To remember:
<table style="border: 1px; font-family: 'Source Code Pro', monocco, Consolas, monocco, monospace;
font-size:110%;">
<tbody >
<tr>
<td style="padding: 10px; background:#f8f8f8;"> </td>
<td style="padding: 10px; background:#f8f8f8;">Actual P</td>
<td style="padding: 10px; background:#f8f8f8;">Actual N</td>
</tr>
<tr>
<td style="padding: 10px; background:#f8f8f8;">Predicted P</td>
<td style="padding: 10px; background:#fcfcfc; text-align:center; font-weight: bold">TP </td>
<td style="padding: 10px; background:#fcfcfc; text-align:center; font-weight: bold">FP </td>
</tr>
<tr>
<td style="padding: 10px; background:#f8f8f8;">Predicted N</td>
<td style="padding: 10px; background:#fcfcfc; text-align:center; font-weight: bold">FN </td>
<td style="padding: 10px; background:#fcfcfc; text-align:center; font-weight: bold">TN </td>
</tr>
</tbody>
</table>
The number of shown documents is <code>TP + FP </code>, the number of relevant documents is <code>TP + FN</code>
Thus:
- **precision** is computed as <code>TP / (TP + FP)</code>.
- **recall** is computed as <code>TP / (TP + FN)</code>.
</div>
The confusion matrix for the medical test `Z` is then:
<table style="border: 1px solid black">
<tr style="border: 1px black">
<td style="border: 1px solid black; background: white; padding: 1em">TP = 5</td>
<td style="border: 1px solid black; background: white; ">FP = 0</td>
</tr>
<tr style="border: 1px black">
<td style="border: 1px solid black; background: white; padding: 1em ">FN = 5</td>
<td style="border: 1px solid black; background: white; ">TN = 9900</td>
<td style="border: 1px solid black; background: white; ">TN = 9990</td>
</tr>
</table>
Here precision is `1.0` and recall is `0.5`.
### Trade-off between precision and recall
The more results the search engine delivers, the fewer relevant documents are missed (recall increases), but at the same time the fraction of irrelevant results grows (precision decreases).
### F1-score
Sometimes we want a single number instead of two to compare the performance of multiple classifiers.
<div class="alert alert-block alert-warning">
<div style="font-size: 150%;"><i class="fa fa-info-circle"></i>&nbsp;Definition</div>
The **F1 score** is computed as
<code>F1 = 2 * (precision * recall) / (precision + recall)</code>.
This is the *harmonic mean* of precision and recall.
</div>
For the medical test `Z` the `F1` score is `1 / 1.5 = 0.6666..`.
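As a check, a short sketch that reproduces these numbers from the confusion matrix of test `Z`:
%% Cell type:code id: tags:
``` python
# confusion matrix of the medical test Z (see the table above):
TP, FP = 5, 0
FN, TN = 5, 9990

precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)

print("precision = {:.2f}".format(precision))   # 1.00
print("recall    = {:.2f}".format(recall))      # 0.50
print("f1        = {:.3f}".format(f1))          # 0.667
```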
%% Cell type:markdown id: tags:
## Exercise block 2
Use your results from exercise 1.1 to compute precision, recall and F1 score.
%% Cell type:markdown id: tags:solution
<pre>
TP = 3   FP = 1
FN = 2   TN = 2

precision = 3 / (3 + 1) = 75 %
recall    = 3 / (3 + 2) = 60 %

F1 = 2 * (0.75 * 0.6) / (0.75 + 0.6) = 66.67 %
</pre>
%% Cell type:markdown id: tags:
### Optional exercise:
Compute precision, recall and the F1 score for the noisier test described in the optional exercise of block 1.
%% Cell type:markdown id: tags:solution
<pre>
TP = 5 FP = 499.5
FN = 5 TN = 9490.5
precision = 5 / (5 + 499.5) = 0.0099
recall    = 5 / (5 + 5)     = 0.5
F1 = 2 * (0.0099 * 0.5) / (0.0099 + 0.5) = 0.0194
</pre>
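A quick numeric check of these values:
%% Cell type:code id: tags:solution
``` python
# confusion matrix of the noisier test (optional exercise of block 1):
TP, FP = 5, 499.5
FN, TN = 5, 9490.5

precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)

# precision ≈ 0.0099, recall = 0.5, f1 ≈ 0.0194
print("precision = {:.4f}, recall = {:.1f}, f1 = {:.4f}".format(precision, recall, f1))
```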
%% Cell type:markdown id: tags:
## Other metrics
The discussion above was just a quick introduction to measuring the performance of a classifier. We skipped other metrics, such as `ROC` curves and `AUC`.
A good introduction to `ROC` <a href="https://classeval.wordpress.com/introduction/introduction-to-the-roc-receiver-operating-characteristics-plot/">can be found here.</a>
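For the curious, a minimal sketch of how `scikit-learn` computes `ROC` curves and `AUC` (the labels and scores here are made up; in practice the scores would come from a classifier's `predict_proba` or `decision_function`):
%% Cell type:code id: tags:
``` python
from sklearn.metrics import roc_curve, roc_auc_score

# made-up true labels and classifier scores, only for illustration:
labels = [0, 0, 1, 1, 0, 1, 1, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.55]

fpr, tpr, thresholds = roc_curve(labels, scores)
print("false positive rates:", fpr)
print("true positive rates: ", tpr)
print("AUC = {:.3f}".format(roc_auc_score(labels, scores)))
```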
%% Cell type:markdown id: tags:
## Metrics in scikit-learn
%% Cell type:markdown id: tags:
`sklearn.metrics` offers functions for many metrics, such as `precision_score`, `recall_score`, `f1_score` and `accuracy_score`; `confusion_matrix` computes the full confusion matrix.
%% Cell type:code id: tags:
``` python
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             confusion_matrix, accuracy_score)
# these numbers are from exercise 1.1:
predicted = [0, 1, 0, 1, 1, 0, 1, 0]
labels = [0, 0, 1, 1, 1, 0, 1, 1]
print(confusion_matrix(labels, predicted))
# Attention: different order of entries, e.g. TPs are bottom-right
print()
#
# The first argument of the metric functions is the true labels,
# the second argument is the predictions:
#
print("{:20s} {:.3f}".format("precision", precision_score(labels, predicted)))
print("{:20s} {:.3f}".format("recall", recall_score(labels, predicted)))
print("{:20s} {:.3f}".format("f1", f1_score(labels, predicted)))
print("{:20s} {:.3f}".format("accuracy", accuracy_score(labels, predicted)))
```
%% Output
[[2 1]
[2 3]]
precision 0.750
recall 0.600
f1 0.667
accuracy 0.625
%% Cell type:markdown id: tags:
The function `cross_val_score` (introduced in the last script) allows using metrics other than `accuracy`.
We demonstrate the usage of different metrics on two data sets:
- the familiar beer data set, in which the label distribution is almost 50:50;
- an unbalanced subset of the beer data set.
%% Cell type:code id: tags:
``` python
import pandas as pd
beer_data = pd.read_csv("data/beers.csv")
print(beer_data.shape)
```
%% Output
(225, 5)
%% Cell type:code id: tags:
``` python
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer, confusion_matrix
from sklearn.linear_model import LogisticRegression
def assess(classifier, beer_data):
    features = beer_data.iloc[:, :-1]
    labels = beer_data.iloc[:, -1]
    n = len(labels)
    print("{:.1f} % of the beers are yummy".format(100 * sum(labels == 1) / n))
    print()
    # NOTE: metrics are passed to `cross_val_score` as strings (names).
    # To use metric functions such as the ones imported from `sklearn.metrics`,
    # you first need to wrap them into a scorer function with
    # `sklearn.metrics.make_scorer()`, e.g. `make_scorer(f1_score)`.
    for metric in ["accuracy", "f1", "precision", "recall"]:
        scores = cross_val_score(classifier, features, labels, scoring=metric, cv=5)
        print("    {:12s}: mean value: {:.2f}".format(metric, scores.mean()))
    print()
classifier = LogisticRegression(C=1)
print("balanced data")
assess(classifier, beer_data)
# we sort by label, then removing samples of one class is easy:
beer_data = beer_data.sort_values(by="is_yummy")
print("unbalanced data")
beer_data_unbalanced = beer_data.iloc[:-80, :]
assess(classifier, beer_data_unbalanced)
```
%% Output
balanced data
52.9 % of the beers are yummy
accuracy : mean value: 0.80
f1 : mean value: 0.83
precision : mean value: 0.78
recall : mean value: 0.89
unbalanced data
26.9 % of the beers are yummy
accuracy : mean value: 0.79
f1 : mean value: 0.41
precision : mean value: 0.87
recall : mean value: 0.28
%% Cell type:markdown id: tags:
You can see that for the balanced data set the values for `f1` and for `accuracy` are almost equal, but differ significantly for the unbalanced data set.
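The metrics above were passed to `cross_val_score` by name. As mentioned in the code comment, metric functions from `sklearn.metrics` can also be used if they are wrapped with `make_scorer`. A minimal sketch, reusing `beer_data_unbalanced` from the cell above:
%% Cell type:code id: tags:
``` python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import cross_val_score

features = beer_data_unbalanced.iloc[:, :-1]
labels = beer_data_unbalanced.iloc[:, -1]

# equivalent to scoring="f1", but built explicitly from the metric function:
f1_scorer = make_scorer(f1_score)

scores = cross_val_score(LogisticRegression(C=1), features, labels,
                         scoring=f1_scorer, cv=5)
print("f1 (via make_scorer): {:.2f}".format(scores.mean()))
```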
%% Cell type:markdown id: tags:
## Exercise section 3
1. Play with the previous examples; for the beer data, try the `SVC` classifier with different `C` and `gamma` settings.
### Optional exercise
2. Modify the code from section *Training the final classifier* from the previous script to use different metrics.
%% Cell type:code id: tags:solution
``` python
beer_data = pd.read_csv("data/beers.csv")
# all columns up to the last one:
features = beer_data.iloc[:, :-1]
labels = beer_data.iloc[:, -1]
eval_data = pd.read_csv("data/beers_eval.csv")
eval_features = eval_data.iloc[:, :-1]
eval_labels = eval_data.iloc[:, -1]
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report
results = []
print("OPTIMIZE SETTINGS")
for C in (5, 10):
    for gamma in (.5, 1, 2):
        classifier = SVC(C=C, gamma=gamma)
        test_scores = cross_val_score(classifier, features, labels, scoring="f1", cv=5)
        print("f1 score = {:.3f} C={:.1f} gamma={:.1f}".format(test_scores.mean(), C, gamma))
        results.append((test_scores.mean(), C, gamma))
# max on a list of tuples compares the tuples element-wise, starting with the
# first entry. Thus we pick the settings with the largest test_scores.mean():
best_result = max(results)
best_score, C, gamma = best_result
print()
print("BEST RESULT CROSS VALIDATION")
print("score = {:.3f} C={:.1f} gamma={:.1f}".format(best_score, C, gamma))
print("f1 score = {:.3f} C={:.1f} gamma={:.1f}".format(best_score, C, gamma))
# EVALUATE CLASSIFIER ON VALIDATION DATASET
classifier = SVC(C=C, gamma=gamma)
classifier.fit(features, labels)
predicted = classifier.predict(eval_features)
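# one possible extension (sketch): classification_report, imported above,
# summarizes precision, recall and f1 for each class on the validation data:
print()
print("EVALUATION ON VALIDATION DATASET")
print(classification_report(eval_labels, predicted))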
```
%% Output
OPTIMIZE SETTINGS
f1 score = 0.921 C=5.0 gamma=0.5
f1 score = 0.913 C=5.0 gamma=1.0
f1 score = 0.925 C=5.0 gamma=2.0
f1 score = 0.943 C=10.0 gamma=0.5
f1 score = 0.933 C=10.0 gamma=1.0
f1 score = 0.933 C=10.0 gamma=2.0
BEST RESULT CROSS VALIDATION
f1 score = 0.943 C=10.0 gamma=0.5
%% Cell type:markdown id: tags:
Copyright (C) 2019 ETH Zurich, SIS ID