"2. Modify the code from secton 5 of the previous script (\"Training the final classifier\") to use different metrics."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/uweschmitt/Projects/machinelearning-introduction-workshop/venv37/lib/python3.7/site-packages/ipykernel_launcher.py:9: UserWarning: get_ipython_dir has moved to the IPython.paths module since IPython 4.0.\n",
%% Cell type:markdown id: tags:
# Chapter 4: Metrics for evaluating the performance of a classifier
%% Cell type:code id: tags:
``` python
import sklearn.metrics as metrics

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

%matplotlib inline
%config InlineBackend.figure_format = 'retina'
```
%% Cell type:markdown id: tags:
Up to now we have used _accuracy_, the percentage of correct classifications, to evaluate the quality of a classifier.
Regrettably _accuracy_ can produce very misleading results.
This and the next chapter discuss other metrics to assess the quality of a classifier, including possible pitfalls.
%% Cell type:markdown id: tags:
## 1. The confusion matrix
%% Cell type:markdown id: tags:
Before we define the **confusion matrix** we must introduce some additional terms.
After applying a classifier to a data set with known labels `0` and `1`:
- **TP (true positives)**: labels which were predicted as `1` and actually are `1`.
- **TN (true negatives)**: labels which were predicted as `0` and actually are `0`.
- **FP (false positives)**: labels which were predicted as `1` and actually are `0`.
- **FN (false negatives)**: labels which were predicted as `0` and actually are `1`.
%% Cell type:markdown id: tags:
To memorize this: the second word "positives"/"negatives" refers to the prediction computed by the classifier.
The first word "true"/"false" expresses if the classification was correct or not.
Using these terms we can now define the so called **confusion matrix**:
%% Cell type:code id: tags:
``` python
pd.DataFrame(np.array([["TP", "FP"], ["FN", "TN"]]),
             index=["Predicted T", "Predicted F"],
             columns=["Actual T", "Actual F"])
```
%% Output
            Actual T Actual F
Predicted T       TP       FP
Predicted F       FN       TN
%% Cell type:markdown id: tags:
So the total number of predictions can be expressed as `TP` + `FP` + `FN` + `TN`.
The number of correct predictions is `TP` + `TN`.
This allows us to define **accuracy** as (`TP` + `TN`) / (`TP` + `FP` + `FN` + `TN`).
Beyond that: `TP` + `FN` is the number of positive examples in our data set, `FP` + `TN` is the number of negative examples.
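As a small sketch (the two label vectors below are made up purely for illustration), these counts and the accuracy can also be computed with `sklearn.metrics`; note that `confusion_matrix` orders rows by actual label and columns by predicted label, starting with label `0`:
%% Cell type:code id: tags:
``` python
from sklearn.metrics import confusion_matrix, accuracy_score

# made-up example labels, just to illustrate the counts:
labels_true      = [1, 0, 1, 1, 0, 0, 1, 0]
labels_predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# rows correspond to actual labels, columns to predicted labels, label 0 first,
# so flattening the 2 x 2 matrix yields tn, fp, fn, tp:
tn, fp, fn, tp = confusion_matrix(labels_true, labels_predicted).ravel()
print("TP =", tp, " TN =", tn, " FP =", fp, " FN =", fn)

# accuracy = (TP + TN) / (TP + FP + FN + TN):
print("accuracy by hand     :", (tp + tn) / (tp + fp + fn + tn))
print("accuracy_score yields:", accuracy_score(labels_true, labels_predicted))
```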
%% Cell type:markdown id: tags:
## Pitfalls
**Accuracy can be very misleading if class sizes are imbalanced.**
Let us demonstrate this with an extreme example:
- On average 10 out of 10000 people are infected with a disease `X`.
- A medical test `Z` diagnoses 50 % of the infected people as `not infected`.
- The test is correct on all not-infected people.
Among $10000$ people:
- $10$ will be infected, $5$ of them get a correct result.
- $9990$ will not be infected, all of them with a correct test result.
Thus the accuracy is $\frac{9995}{10000} = 99.95 \%$.
This is also called the **accuracy paradox** (<a href="https://en.wikipedia.org/wiki/Accuracy_paradox">see also here</a>).
To evaluate this test on such an unbalanced data set we need different numbers:
1. Does our test miss infected people: how many of the infected people are actually detected as infected?
2. Does our test predict people as infected who are actually not: how many of the positive diagnoses are correct?
We come back to this example later.
**TODO**: in a later chapter or in an extra box, provide links to strategies for imbalanced data sets.
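A minimal check of this arithmetic in code (the counts are the ones from the example above):
%% Cell type:code id: tags:
``` python
# counts for the medical test Z applied to 10000 people (numbers from above):
tp, fn = 5, 5        # 10 infected people, the test misses half of them
tn, fp = 9990, 0     # all not-infected people get a correct result

accuracy = (tp + tn) / (tp + fp + fn + tn)
print("accuracy = {:.2f} %".format(100 * accuracy))   # prints 99.95 %
```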
%% Cell type:markdown id: tags:
## Exercise block 1
1.1 A classifier predicts labels `[0, 0, 1, 1, 1, 0, 1, 1]` whereas the true labels are `[0, 1, 0, 1, 1, 0, 1, 1]`. Write these values as a two-column table using pen & paper and assign `FP`, `TP`, ... to each pair. Determine the confusion matrix and the accuracy.
1.2 A random classifier just assigns a randomly chosen label `0` or `1` to any given feature vector. What is the average accuracy of such a classifier?
### Optional exercise
1.3 Assume the previously described test also produces wrong results on not-infected people, such that 5 out of 10000 will be diagnosed as infected. Compute the confusion matrix and the accuracy of this test.
%% Cell type:markdown id: tags:
## 2. Precision and Recall
In order to understand the concept of **precision** and **recall**, imagine the following scenario:
A few days before Thanksgiving you open an online recipe website and enter "turkey thanksgiving". You see some suitable recommendations but also unusable results related to Turkish recipes.
Such a search engine works like a filter applied to a collection of documents.
As a scientist you want to assess the reliability of this service:
1. What fraction of relevant recipes stored in the underlying database do I see?
2. How many of the shown results are relevant recipes and not the recipes from Turkey?
In this context,
**recall** is the fraction of all the relevant documents that are found by the engine.
And
**precision** is the fraction of the shown results that are actually relevant.
### Trade-off between precision and recall
The more results the search engine delivers, the fewer relevant documents are missed, but at the same time the fraction of irrelevant results increases.
%% Cell type:markdown id: tags:
### How to compute precision and recall
To transfer this concept to classification, we can interpret a classifier as a filter: it classifies every document in a collection as relevant or not relevant.
The number of shown documents is `TP` + `FP`, thus **precision** is computed as `TP` / (`TP` + `FP`).
The number of relevant documents is `TP` + `FN`, thus **recall** is computed as `TP` / (`TP` + `FN`).
For the medical test `Z` the confusion matrix entries are `TP` = 5, `FN` = 5, `FP` = 0 and `TN` = 9990, so its precision is $\frac{5}{5} = 1$ and its recall is $\frac{5}{10} = 0.5$.
Both numbers are often combined into a single score, the **F1 score**, which is the harmonic mean of precision and recall: $F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$.
For the medical test `Z` the `F1` score is thus `2 * 1 * 0.5 / 1.5 = 1 / 1.5 = 0.6666..`.
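A quick sketch that checks these numbers in code (the counts are the ones of the medical test `Z`):
%% Cell type:code id: tags:
``` python
# confusion matrix entries of the medical test Z (see above):
tp, fp, fn, tn = 5, 0, 5, 9990

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print("precision: {:.2f}".format(precision))   # 1.00
print("recall   : {:.2f}".format(recall))      # 0.50
print("f1       : {:.4f}".format(f1))          # 0.6667
```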
%% Cell type:markdown id: tags:
## Exercise block 2
Use your results from exercise 1.1 to compute precision, recall and F1 score.
### Optional exercise:
Compute precision, recall and F1 score for the test described in exercise 1.3.
%% Cell type:markdown id: tags:
## Other metrics
The discussion above was just a quick introduction to measuring the performance of a classifier. We skipped other metrics such as `ROC` and `AUC`.
A good introduction to `ROC` <a href="https://classeval.wordpress.com/introduction/introduction-to-the-roc-receiver-operating-characteristics-plot/">can be found here</a>.
%% Cell type:markdown id: tags:
## 3. Metrics in scikit-learn
%% Cell type:markdown id: tags:
`sklearn.metrics` contains many metric functions like `precision_score` etc.; the function `classification_report` prints an overall report.
Comment: The `micro avg` and `macro avg` outputs account for class imbalances; in case you want to learn more about this [read here](https://datascience.stackexchange.com/questions/15989/micro-average-vs-macro-average-performance-in-a-multiclass-classification-settin).
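A small sketch of how these functions are called (the two label vectors are made up purely for illustration):
%% Cell type:code id: tags:
``` python
from sklearn.metrics import classification_report, f1_score, precision_score, recall_score

# made-up example labels, just to illustrate the API:
labels_true      = [1, 0, 1, 1, 0, 0, 1, 0]
labels_predicted = [1, 0, 0, 1, 0, 1, 1, 0]

print("precision:", precision_score(labels_true, labels_predicted))
print("recall   :", recall_score(labels_true, labels_predicted))
print("f1 score :", f1_score(labels_true, labels_predicted))
print()
# the report lists precision, recall and f1 for every class plus averaged values:
print(classification_report(labels_true, labels_predicted))
```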
%% Cell type:markdown id: tags:
The function `cross_val_score` (introduced in the last script) allows us to use metrics other than `accuracy`.
We demonstrate the usage of different metrics on two data sets:
- the known beer data set, in which the label distribution is almost 50:50,
- an unbalanced variant of the same data set, created below by removing 80 of the yummy beers.
%% Cell type:code id: tags:
``` python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# load the beer data used in the previous scripts (the file path here is an assumption):
beer_data = pd.read_csv("data/beers.csv")

def assess(classifier, data):
    # we assume the labels are in the last column "is_yummy", the other columns are features:
    features, labels = data.iloc[:, :-1], data.iloc[:, -1]
    print("{:.1f} % of the beers are yummy".format(100 * labels.mean()))
    for metric in ["accuracy", "f1", "precision", "recall"]:
        scores = cross_val_score(classifier, features, labels, scoring=metric)
        print(" {:12s}: mean value: {:.2f}".format(metric, scores.mean()))
    print()

classifier = LogisticRegression(C=1, solver="lbfgs")
print("balanced data")
assess(classifier, beer_data)

# we sort by label, then removing samples is easier:
beer_data = beer_data.sort_values(by="is_yummy")
print("unbalanced data")
beer_data_unbalanced = beer_data.iloc[:-80, :]
assess(classifier, beer_data_unbalanced)
```
%% Output
balanced data
52.9 % of the beers are yummy
 accuracy    : mean value: 0.91
 f1          : mean value: 0.92
 precision   : mean value: 0.89
 recall      : mean value: 0.96

unbalanced data
26.9 % of the beers are yummy
 accuracy    : mean value: 0.85
 f1          : mean value: 0.63
 precision   : mean value: 0.82
 recall      : mean value: 0.56
%% Cell type:markdown id: tags:
You can see that for the balanced data set the values for `f1` and for `accuracy` are almost equal, but differ significantly for the unbalanced data set.
%% Cell type:markdown id: tags:
## Exercise block 3
1. Play with the previous examples: use different classifiers with different settings.
### Optional exercise
2. Modify the code from section 5 of the previous script ("Training the final classifier") to use different metrics.