%% Cell type:code id: tags:
``` python
# IGNORE THIS LINE WHICH MODIFIES LAYOUT AND STYLING OF THE NOTEBOOK !
from IPython.core.display import HTML; HTML(open("custom.html", "r").read())
```
%% Output
<IPython.core.display.HTML object>
%% Cell type:markdown id: tags:
# Chapter 4: Metrics for evaluating the performance of a classifier
%% Cell type:code id: tags:
``` python
import sklearn.metrics as metrics
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
```
%% Cell type:markdown id: tags:
Up to now we used _accuracy_, the percentage of correct classifications, to evaluate the quality of a classifier.
Regrettably, _accuracy_ can produce very misleading results.
This and the next chapter discuss other metrics for assessing the quality of a classifier, including possible pitfalls.
%% Cell type:markdown id: tags:
## 1. The confusion matrix
%% Cell type:markdown id: tags:
Before we define the **confusion matrix** we must introduce some additional terms.
After applying a classifier to a data set with known labels `0` and `1`:
**TP (true positives)**: samples which were predicted as `1` and actually are `1`.
**TN (true negatives)**: samples which were predicted as `0` and actually are `0`.
**FP (false positives)**: samples which were predicted as `1` but actually are `0`.
**FN (false negatives)**: samples which were predicted as `0` but actually are `1`.
%% Cell type:markdown id: tags:
To memorize this: the second word, "positives"/"negatives", refers to the prediction computed by the classifier.
The first word, "true"/"false", expresses whether the classification was correct or not.
Using these terms we can now define the so-called **confusion matrix**:
%% Cell type:code id: tags:
``` python
pd.DataFrame(np.array([["TP", "FP"], ["FN", "TN"]]),
             index=["Predicted T", "Predicted F"],
             columns=["Actual T", "Actual F"])
```
%% Output
            Actual T Actual F
Predicted T       TP       FP
Predicted F       FN       TN
%% Cell type:markdown id: tags:
So the total number of predictions can be expressed as `TP` + `FP` + `FN` + `TN`.
The number of correct predictions is `TP` + `TN`.
This allows us to define **accuracy** as (`TP` + `TN`) / (`TP` + `FP` + `FN` + `TN`).
Beyond that: `TP` + `FN` is the number of positive examples in our data set, `FP` + `TN` is the number of negative examples.
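%% Cell type:markdown id: tags:
As a minimal sketch of these definitions, the following cell counts `TP`, `TN`, `FP` and `FN` for some made-up labels and predictions and then evaluates the accuracy formula (the values are invented for illustration only):
%% Cell type:code id: tags:
``` python
# made-up actual labels and predictions, just to illustrate the definitions:
actual    = [1, 1, 0, 0, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 1, 0, 0, 1]

TP = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
TN = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
FP = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
FN = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)

print("TP, TN, FP, FN:", TP, TN, FP, FN)              # 3 3 1 1
print("accuracy:", (TP + TN) / (TP + TN + FP + FN))   # 0.75
```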
%% Cell type:markdown id: tags:
## Pitfalls
**Accuracy can be very misleading if class sizes are imbalanced.**
Let us demonstrate this with an extreme example:
- On average 10 out of 10000 people are infected with a disease `X`.
- A medical test `Z` diagnoses 50 % of infected people as `not infected`.
- The test is correct for all not-infected people.
Among $10000$ people:
- $10$ will be infected, and only $5$ of them get a correct result.
- $9990$ will not be infected, all with a correct test result.
Thus the accuracy is $\frac{9995}{10000} = 99.95\,\%$, even though the test misses half of all infections.
This is also called the **accuracy paradox** (<a href="https://en.wikipedia.org/wiki/Accuracy_paradox">see also here</a>).
To evaluate this test on such an imbalanced data set we need different numbers:
1. Does our test miss infected people: how many infected people are actually discovered to be infected?
2. Does our test declare people as infected who are actually not: how many positive diagnoses are correct?
We come back to this example later.
**TODO**: in a later chapter or in an extra box provide links to strategies for imbalanced data sets.
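%% Cell type:markdown id: tags:
Plugging the numbers of the medical test `Z` into the accuracy formula confirms the value claimed above (just a sketch of the arithmetic):
%% Cell type:code id: tags:
``` python
# counts for the medical test Z from the example above:
TP = 5      # infected, correctly diagnosed as infected
FN = 5      # infected, but diagnosed as not infected
FP = 0      # not infected, but diagnosed as infected
TN = 9990   # not infected, correctly diagnosed as not infected

accuracy = (TP + TN) / (TP + FP + FN + TN)
print("accuracy: {:.2f} %".format(100 * accuracy))   # 99.95 %
```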
%% Cell type:markdown id: tags:
## Exercise block 1
1.1 A classifier predicts labels `[0, 0, 1, 1, 1, 0, 1, 1]` whereas the true labels are `[0, 1, 0, 1, 1, 0, 1, 1]`. Write these values as a two-column table using pen & paper and assign `FP`, `TP`, ... to each pair. Determine the confusion matrix and the accuracy.
1.2 A random classifier just assigns a randomly chosen label `0` or `1` to a given feature. What is the average accuracy of such a classifier?
### Optional exercise
1.3 Assume the previously described test also produces wrong results on not-infected people, such that 5 out of 10000 will be wrongly diagnosed as infected. Compute the confusion matrix and the accuracy of this test.
%% Cell type:markdown id: tags:
## 2. Precision and Recall
In order to understand the concepts of **precision** and **recall**, imagine the following scenario:
A few days before Thanksgiving you open an online recipe website and enter "turkey thanksgiving". You see some suitable recommendations but also unusable results related to Turkish recipes.
Such a search engine works like a filter applied to a collection of documents.
As a scientist you want to assess the reliability of this service:
1. What fraction of the relevant recipes stored in the underlying database do I actually see?
2. How many of the shown results are relevant recipes and not recipes from Turkey?
In this context,
**recall** is the fraction of all relevant documents that the engine actually finds.
And
**precision** is the fraction of the shown results that are actually relevant.
### Trade-off between precision and recall
The more results the search engine delivers, the fewer relevant documents are missed. But at the same time the fraction of irrelevant results increases.
%% Cell type:markdown id: tags:
### How to compute precision and recall
To transfer this concept to classification, we can interpret the classifier as a filter: it classifies every document in a collection as relevant or not relevant.
The number of shown documents is then TP + FP, thus **precision** is computed as TP / (TP + FP).
The number of relevant documents is TP + FN, thus **recall** is computed as TP / (TP + FN).
The confusion matrix for the medical test `Z` is then:
<table style="border: 1px solid black">
<tr style="border: 1px black">
<td style="border: 1px solid black; background: white; padding: 1em">TP = 5</td>
<td style="border: 1px solid black; background: white; ">FP = 0</td>
</tr>
<tr style="border: 1px black">
<td style="border: 1px solid black; background: white; padding: 1em ">FN = 5</td>
<td style="border: 1px solid black; background: white; ">TN = 9990</td>
</tr>
</table>
Here precision is `1.0` and recall is `0.5`.
### F1-score
Sometimes we want a single number instead of two numbers to compare the performance of multiple classifiers.
A common approach to combine precision and recall is to compute their harmonic mean. This metric is called **F1 score**.
`F1 = 2 * (precision * recall) / (precision + recall)`
For the medical test `Z` the `F1` score is `1 / 1.5 = 0.6666..`.
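%% Cell type:markdown id: tags:
The following sketch recomputes precision, recall and the F1 score from the confusion matrix of the medical test `Z` shown above:
%% Cell type:code id: tags:
``` python
# counts of the medical test Z (see the confusion matrix above):
TP, FP, FN, TN = 5, 0, 5, 9990

precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)

print("precision: {:.4f}".format(precision))   # 1.0000
print("recall   : {:.4f}".format(recall))      # 0.5000
print("f1 score : {:.4f}".format(f1))          # 0.6667
```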
%% Cell type:markdown id: tags:
## Exercise block 2
Use your results from exercise 1.1 to compute precision, recall and F1 score.
### Optional exercise
Compute precision, recall and F1 score for the test described in exercise 1.3.
%% Cell type:markdown id: tags:
## Other metrics
The discussion above was just a quick introduction to measuring the performance of a classifier. We skipped other metrics such as `ROC` curves and `AUC`.
A good introduction to `ROC` <a href="https://classeval.wordpress.com/introduction/introduction-to-the-roc-receiver-operating-characteristics-plot/">can be found here.</a>
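%% Cell type:markdown id: tags:
As a small teaser, the sketch below shows how an ROC curve and the corresponding AUC could be computed with `sklearn.metrics` (the labels and scores are made up for illustration):
%% Cell type:code id: tags:
``` python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# made-up true labels and predicted scores (e.g. probabilities for class 1):
labels = [0, 0, 1, 1, 0, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7]

fpr, tpr, thresholds = roc_curve(labels, scores)
print("AUC: {:.3f}".format(roc_auc_score(labels, scores)))

plt.plot(fpr, tpr, marker="o")
plt.xlabel("false positive rate")
plt.ylabel("true positive rate");
```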
%% Cell type:markdown id: tags:
## 3. Metrics in scikit-learn
%% Cell type:markdown id: tags:
`sklearn.metrics` contains many metric functions like `precision_score`; `classification_report` prints an overall report.
%% Cell type:code id: tags:
``` python
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             confusion_matrix, accuracy_score, classification_report)
# these numbers are from exercise 1.1:
predicted = [0, 0, 1, 1, 1, 0, 1, 1]
labels = [0, 1, 0, 1, 1, 0, 1, 1]
print(confusion_matrix(labels, predicted))
print()
#
# The first argument of the metric functions is the true labels,
# the second argument is the predictions:
#
print("{:20s} {:.3f}".format("precision", precision_score(labels, predicted)))
print("{:20s} {:.3f}".format("recall", recall_score(labels, predicted)))
print("{:20s} {:.3f}".format("f1", f1_score(labels, predicted)))
print("{:20s} {:.3f}".format("accuracy", accuracy_score(labels, predicted)))
print()
print(classification_report(labels, predicted))
```
%% Output
[[2 1]
 [1 4]]

precision            0.800
recall               0.800
f1                   0.800
accuracy             0.750

              precision    recall  f1-score   support

           0       0.67      0.67      0.67         3
           1       0.80      0.80      0.80         5

   micro avg       0.75      0.75      0.75         8
   macro avg       0.73      0.73      0.73         8
weighted avg       0.75      0.75      0.75         8
%% Cell type:markdown id: tags:
Comment: the `micro avg` and `macro avg` rows summarize the per-class scores in different ways and matter when classes are imbalanced; in case you want to learn more about this, [read here](https://datascience.stackexchange.com/questions/15989/micro-average-vs-macro-average-performance-in-a-multiclass-classification-settin).
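%% Cell type:markdown id: tags:
As a small sketch of what these averages mean: `macro avg` is the unweighted mean of the per-class scores, whereas `weighted avg` weights every class by its support. Both can be requested directly via the `average` argument of the metric functions:
%% Cell type:code id: tags:
``` python
from sklearn.metrics import f1_score

# the labels and predictions from exercise 1.1 again:
predicted = [0, 0, 1, 1, 1, 0, 1, 1]
labels = [0, 1, 0, 1, 1, 0, 1, 1]

# per-class f1 scores (one value per class):
print(f1_score(labels, predicted, average=None))
# unweighted mean over the classes (the "macro avg" row above):
print(f1_score(labels, predicted, average="macro"))
# mean weighted by the number of samples per class (the "weighted avg" row above):
print(f1_score(labels, predicted, average="weighted"))
```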
%% Cell type:markdown id: tags:
The function `cross_val_score` (introduced in the last script) allows using metrics other than `accuracy`.
We demonstrate the usage of different metrics on two data sets:
- the familiar beer data set, in which the label distribution is almost 50:50.
- an unbalanced subset of the beer data set.
%% Cell type:code id: tags:
``` python
import pandas as pd
beer_data = pd.read_csv("beers.csv")
print(beer_data.shape)
```
%% Output
(225, 5)
%% Cell type:code id: tags:
``` python
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer, confusion_matrix
from sklearn.linear_model import LogisticRegression

def assess(classifier, beer_data):
    features = beer_data.iloc[:, :-1]
    labels = beer_data.iloc[:, -1]
    n = len(labels)
    print("{:.1f} % of the beers are yummy".format(100 * sum(labels == 1) / n))
    print()
    for metric in ["accuracy", "f1", "precision", "recall"]:
        scores = cross_val_score(classifier, features, labels, scoring=metric, cv=5)
        print(" {:12s}: mean value: {:.2f}".format(metric, scores.mean()))
    print()

classifier = LogisticRegression(C=1, solver="lbfgs")

print("balanced data")
assess(classifier, beer_data)

# we sort by label, then removing samples is easier:
beer_data = beer_data.sort_values(by="is_yummy")

print("unbalanced data")
beer_data_unbalanced = beer_data.iloc[:-80, :]
assess(classifier, beer_data_unbalanced)
```
%% Output
balanced data
52.9 % of the beers are yummy

 accuracy    : mean value: 0.91
 f1          : mean value: 0.92
 precision   : mean value: 0.89
 recall      : mean value: 0.96

unbalanced data
26.9 % of the beers are yummy

 accuracy    : mean value: 0.85
 f1          : mean value: 0.63
 precision   : mean value: 0.82
 recall      : mean value: 0.56
%% Cell type:markdown id: tags:
You can see that for the balanced data set the values for `f1` and for `accuracy` are almost equal, but differ significantly for the unbalanced data set.
%% Cell type:markdown id: tags:
## Exercise section 3
1. Play with the previous examples and try different classifiers with different settings.
### Optional exercise
2. Modify the code from section 5 of the previous script ("Training the final classifier") to use different metrics.
%% Cell type:code id: tags:
``` python
#REMOVEBEGIN
# THE LINES BELOW ARE JUST FOR STYLING THE CONTENT ABOVE !
from IPython import utils
from IPython.core.display import HTML
import os
def css_styling():
    """Load default custom.css file from ipython profile"""
    base = utils.path.get_ipython_dir()
    styles = """<style>
@import url('http://fonts.googleapis.com/css?family=Source+Code+Pro');
@import url('http://fonts.googleapis.com/css?family=Kameron');
@import url('http://fonts.googleapis.com/css?family=Crimson+Text');
@import url('http://fonts.googleapis.com/css?family=Lato');
@import url('http://fonts.googleapis.com/css?family=Source+Sans+Pro');
@import url('http://fonts.googleapis.com/css?family=Lora');
body {
font-family: 'Lora', Consolas, sans-serif;
-webkit-print-color-adjust: exact !important;
}
.alert-block {
width: 95%;
margin: auto;
}
.rendered_html code
{
color: black;
background: #eaf0ff;
background: #f5f5f5;
padding: 1pt;
font-family: 'Source Code Pro', Consolas, monocco, monospace;
}
p {
line-height: 140%;
}
strong code {
background: red;
}
.rendered_html strong code
{
background: #f5f5f5;
}
.CodeMirror pre {
font-family: 'Source Code Pro', monocco, Consolas, monocco, monospace;
}
.cm-s-ipython span.cm-keyword {
font-weight: normal;
}
strong {
background: #f5f5f5;
margin-top: 4pt;
margin-bottom: 4pt;
padding: 2pt;
border: 0.5px solid #a0a0a0;
font-weight: bold;
color: darkred;
}
div #notebook {
# font-size: 10pt;
line-height: 145%;
}
li {
line-height: 145%;
}
div.output_area pre {
background: #fff9d8 !important;
padding: 5pt;
-webkit-print-color-adjust: exact;
}
h1, h2, h3, h4 {
font-family: Kameron, arial;
}
div#maintoolbar {display: none !important;}
</style>"""
    return HTML(styles)
css_styling()
#REMOVEEND
```
%% Output
/Users/uweschmitt/Projects/machinelearning-introduction-workshop/venv37/lib/python3.7/site-packages/ipykernel_launcher.py:9: UserWarning: get_ipython_dir has moved to the IPython.paths module since IPython 4.0.
if __name__ == '__main__':
<IPython.core.display.HTML object>
%% Cell type:code id: tags:
``` python
```