This allows us to define <strong>accuracy</strong> as (<code>TP</code> + <code>TN</code>) / (<code>TP</code> + <code>FP</code> + <code>FN</code> + <code>TN</code>).
</div>
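As a quick illustration of this formula, here is a minimal sketch in plain Python (the counts are made-up example values):

```python
# accuracy from the four confusion matrix counts
def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

# illustrative counts only:
print(accuracy(tp=40, fp=10, fn=5, tn=45))   # 0.85
```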
%% Cell type:markdown id: tags:
## Pitfalls
<div class="alert alert-block alert-warning">
<i class="fa fa-info-circle"></i> Accuracy can be very misleading if class sizes are imbalanced.
</div>
Let us demonstrate this with an extreme example:
- On average 10 out of 10000 people are infected with a disease `X`.
- A medical test `Z` diagnoses 50 % of infected people as `not infected`.
- The test is correct on all not-infected people.
Among $10000$ people:
- $10$ will be infected, of whom $5$ get a correct (positive) test result.
- $9990$ will not be infected, and all of them get a correct (negative) test result.
Thus accuracy is $\frac{9995}{10000} = 99.95 \% $
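The arithmetic above as a short Python check (numbers taken directly from the example):

```python
# test Z on 10000 people: 10 infected (5 detected), all 9990 not-infected classified correctly
tp, fn = 5, 5
tn, fp = 9990, 0

print((tp + tn) / (tp + fp + fn + tn))   # 0.9995, i.e. 99.95 % accuracy
```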
This effect is also called the **accuracy paradox** (<a href="https://en.wikipedia.org/wiki/Accuracy_paradox">see also here</a>).
<img src="https://i.imgflip.com/303wyp.jpg" title="made at imgflip.com" width="50%"/>
To evaluate this test on such an unbalanced dataset we need different numbers:
1. Does our test miss infected people: what fraction of the infected people is actually detected as infected?
2. Does our test flag people as infected who are actually not: what fraction of the positive diagnoses is correct?
We come back to this example later.
%% Cell type:markdown id: tags:
## Exercise block 1
1.1 A classifier predicts the labels `[0, 1, 0, 1, 1, 0, 1, 0]` whereas the true labels are `[0, 0, 1, 1, 1, 0, 1, 1]`. First write these values as a two-column table using pen & paper and assign `TN`, `FP`, `FN` or `TP` to each row. Then create the confusion matrix and compute the accuracy.
1.2 A random classifier just assigns a randomly chosen label `0` or `1` to any given feature vector. What is the average accuracy of such a classifier?
%% Cell type:markdown id: tags:solution
SOLUTION 1.1
<pre>
TRUE    PREDICTED    THIS IS

0       0            TN
0       1            FP
1       0            FN
1       1            TP
1       1            TP
0       0            TN
1       1            TP
1       0            FN

TP = 3    FP = 1
FN = 2    TN = 2

accuracy = (3 + 2) / 8 = 62.5 %
</pre>
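The result can also be checked with scikit-learn (a small sketch, assuming `scikit-learn` is installed; the variable names are ours):

```python
from sklearn.metrics import confusion_matrix, accuracy_score

labels_true      = [0, 0, 1, 1, 1, 0, 1, 1]
labels_predicted = [0, 1, 0, 1, 1, 0, 1, 0]

# rows = true class, columns = predicted class: [[TN, FP], [FN, TP]]
print(confusion_matrix(labels_true, labels_predicted))
print(accuracy_score(labels_true, labels_predicted))   # 0.625
```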
SOLUTION 1.2
On average the random classifier assigns the correct label to half of the samples of each class, thus the accuracy would be 50 %.
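A quick simulation supports this (a sketch using `numpy`; the sample size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)                      # fixed seed for reproducibility
true_labels = rng.integers(0, 2, size=100_000)
random_predictions = rng.integers(0, 2, size=100_000)

print((true_labels == random_predictions).mean())    # close to 0.5
```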
%% Cell type:markdown id: tags:
### Optional exercise
1.3 Assume the previously described test also produces wrong results on not-infected people, such that 5 % of the not-infected people will be diagnosed as infected. Compute the confusion matrix and the accuracy of this test.
%% Cell type:markdown id: tags:solution
SOLUTION 1.3
This is the new situation:
- On average 10 out of 10000 people are infected with a disease `X`.
- A medical test `Z` diagnoses 50 % of infected people as `not infected`.
- The test is correct on 95 % of the not-infected people.
<pre>
Infected people     = 10,    diagnosed as infected: 0.5 * 10    = 5      (TP)
Not infected people = 9990,  diagnosed as infected: 0.05 * 9990 = 499.5  (FP)

TP = 5    FP = 499.5
FN = 5    TN = 9490.5

accuracy = (5 + 9490.5) / 10000 = 94.96 %
</pre>
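The same arithmetic as a short Python check (numbers taken from the solution above):

```python
tp, fn = 5, 5                 # half of the 10 infected people are detected
fp = 0.05 * 9990              # 499.5 (an expected value, hence not an integer)
tn = 0.95 * 9990              # 9490.5

print((tp + tn) / (tp + fp + fn + tn))   # 0.94955, i.e. about 94.96 %
```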
%% Cell type:markdown id: tags:
## 2. Precision and Recall
In order to understand the concept of **precision** and **recall**, imagine the following scenario:
A few days before Thanksgiving you open an online recipe website and enter "turkey thanksgiving". You see some suitable recommendations but also unusable results related to Turkish recipes.
Such a search engine works like a filter applied on a collection of documents.
As a scientist you want to assess the reliability of this service:
1. What fraction of relevant recipes stored in the underlying database do I see?
2. How many of the shown results are relevant recipes and not the recipes from Turkey?
In this context, **recall** is the fraction of all relevant documents in the underlying database that the engine actually returns.
And **precision** is the fraction of the returned results that are relevant.
%% Cell type:markdown id: tags:
### How to compute precision and recall for a classifier
To transfer this concept to classification, we can interpret a classifier as a filter: it marks every document in a collection as relevant or not relevant.
The more results the search engine delivers, the fewer relevant documents are missed, so recall increases. But at the same time the fraction of wrong results grows, so precision decreases.
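Expressed with the counts from the confusion matrix, these two questions translate into the standard definitions:

$$\text{precision} = \frac{TP}{TP + FP} \qquad\qquad \text{recall} = \frac{TP}{TP + FN}$$

High recall means the classifier misses few positive (relevant) samples; high precision means few of its positive predictions are wrong.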
### F1-score
Sometimes we want a single number instead of two numbers to compare the performance of multiple classifiers.

The **F1 score** is defined as the *harmonic mean* of precision and recall:

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

For the medical test `Z` (precision = 1, recall = 0.5) the `F1` score is `1 / 1.5 = 0.6666..`.
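A quick check of this value in Python (counts of test `Z` taken from the pitfalls section above):

```python
tp, fp, fn = 5, 0, 5          # test Z: 5 of 10 infected detected, no false positives

precision = tp / (tp + fp)    # 1.0
recall = tp / (tp + fn)       # 0.5
f1 = 2 * precision * recall / (precision + recall)

print(f1)                     # 0.666...
```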
%% Cell type:markdown id: tags:
## Exercise block 2
Use your results from exercise 1.1 to compute precision, recall and F1 score.
%% Cell type:markdown id: tags:solution
<pre>
TP = 3    FP = 1
FN = 2    TN = 2

precision = 3 / (3 + 1) = 75 %
recall    = 3 / (3 + 2) = 60 %

F1 = 2 * (0.75 * 0.6) / (0.75 + 0.6) = 66.7 %
</pre>
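These values can again be checked with scikit-learn (a sketch, assuming `scikit-learn` is installed):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

labels_true      = [0, 0, 1, 1, 1, 0, 1, 1]
labels_predicted = [0, 1, 0, 1, 1, 0, 1, 0]

print(precision_score(labels_true, labels_predicted))   # 0.75
print(recall_score(labels_true, labels_predicted))      # 0.6
print(f1_score(labels_true, labels_predicted))          # 0.666...
```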
%% Cell type:markdown id: tags:
### Optional exercise:
Compute precision, recall and F1-score for the test described in exercise 1.3.
%% Cell type:markdown id: tags:solution
<pre>
TP = 5    FP = 499.5
FN = 5    TN = 9490.5

precision = 5 / (5 + 499.5) = 0.0099
recall    = 5 / (5 + 5)     = 0.5

F1 = 2 * (0.0099 * 0.5) / (0.0099 + 0.5) = 0.0194
</pre>
%% Cell type:markdown id: tags:
## Other metrics
The discussion above was just a quick introduction to measuring the performance of a classifier. We skipped other metrics such as `ROC` and `AUC`, amongst others.
A good introduction to `ROC` <a href="https://classeval.wordpress.com/introduction/introduction-to-the-roc-receiver-operating-characteristics-plot/">can be found here</a>.
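scikit-learn also provides functions for these metrics; a minimal sketch (the scores below are purely illustrative values, not from a real classifier):

```python
from sklearn.metrics import roc_auc_score, roc_curve

labels_true = [0, 0, 1, 1, 1, 0, 1, 1]
# scores as a classifier might report them (illustrative values):
predicted_scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.3]

fpr, tpr, thresholds = roc_curve(labels_true, predicted_scores)
print(roc_auc_score(labels_true, predicted_scores))
```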
%% Cell type:markdown id: tags:
## 3. Metrics in scikit-learn
%% Cell type:markdown id: tags:
`sklearn.metrics` contains many metric functions such as `accuracy_score`, `precision_score`, `recall_score` and `f1_score`, as well as `confusion_matrix`, which computes the full confusion matrix.
print(" {:12s}: mean value: {:.2f}".format(metric,scores.mean()))
print()
classifier=LogisticRegression(C=1)
print("balanced data")
assess(classifier,beer_data)
# we sort by label, then removing samples| is easier:
beer_data=beer_data.sort_values(by="is_yummy")
print("unbalanced data")
beer_data_unbalanced=beer_data.iloc[:-80,:]
assess(classifier,beer_data_unbalanced)
```
%% Output
balanced data
52.9 % of the beers are yummy
accuracy : mean value: 0.80
f1 : mean value: 0.83
precision : mean value: 0.78
recall : mean value: 0.89
unbalanced data
26.9 % of the beers are yummy
accuracy : mean value: 0.79
f1 : mean value: 0.41
precision : mean value: 0.87
recall : mean value: 0.28
%% Cell type:markdown id: tags:
You can see that for the balanced data set the values for `f1` and for `accuracy` are almost equal, but differ significantly for the unbalanced data set.
%% Cell type:markdown id: tags:
## Exercise section 3
1. Play with the previous examples: use different classifiers with different settings.
### Optional exercise
2. Modify the code from section 5 of the previous script ("Training the final classifier") to use different metrics.