This allows us to define <strong>accuracy</strong> as (<code>TP</code> + <code>TN</code>) / (<code>TP</code> + <code>FP</code> + <code>FN</code> + <code>TN</code>).
</div>
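As a quick illustration of this formula, here is a minimal sketch in plain Python (the counts are made-up example values):

```python
# accuracy from the four confusion matrix counts
def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

# illustrative counts only:
print(accuracy(tp=40, fp=10, fn=5, tn=45))   # 0.85
```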
%% Cell type:markdown id: tags:
## Pitfalls
<div class="alert alert-block alert-warning">
<i class="fa fa-info-circle"></i> Accuracy can be very misleading if class sizes are imbalanced.
</div>
Let us demonstrate this with an extreme example:
- On average 10 out of 10000 people are infected with a disease `X`.
- A medical test `Z` diagnoses 50 % of infected people as `not infected`.
- The test is correct on all not-infected people.
Among $10000$ people:
- $10$ will be infected, of whom $5$ get a correct (positive) test result.
- $9990$ will not be infected, and all of them get a correct (negative) test result.
Thus accuracy is $\frac{9995}{10000} = 99.95 \% $
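The arithmetic above as a short Python check (numbers taken directly from the example):

```python
# test Z on 10000 people: 10 infected (5 detected), all 9990 not-infected classified correctly
tp, fn = 5, 5
tn, fp = 9990, 0

print((tp + tn) / (tp + fp + fn + tn))   # 0.9995, i.e. 99.95 % accuracy
```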
This effect is also called the **accuracy paradox** (<a href="https://en.wikipedia.org/wiki/Accuracy_paradox">see also here</a>).
<img src="https://i.imgflip.com/303wyp.jpg" title="made at imgflip.com" width="50%"/>
To evaluate this test on such an unbalanced dataset we need different numbers:
1. Does our test miss infected people: what fraction of the infected people is actually detected as infected?
2. Does our test flag people as infected who are actually not: what fraction of the positive diagnoses is correct?
We come back to this example later.
%% Cell type:markdown id: tags:
## Exercise block 1
1.1 A classifier predicts the labels `[0, 1, 0, 1, 1, 0, 1, 0]` whereas the true labels are `[0, 0, 1, 1, 1, 0, 1, 1]`. First write these values as a two-column table using pen & paper and assign `TN`, `FP`, `FN` or `TP` to each row. Then create the confusion matrix and compute the accuracy.
1.2 A random classifier just assigns a randomly chosen label `0` or `1` to any given feature vector. What is the average accuracy of such a classifier?
%% Cell type:markdown id: tags:solution
SOLUTION 1.1
<pre>
TRUE    PREDICTED    THIS IS

0       0            TN
0       1            FP
1       0            FN
1       1            TP
1       1            TP
0       0            TN
1       1            TP
1       0            FN

TP = 3    FP = 1
FN = 2    TN = 2

accuracy = (3 + 2) / 8 = 62.5 %
</pre>
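The result can also be checked with scikit-learn (a small sketch, assuming `scikit-learn` is installed; the variable names are ours):

```python
from sklearn.metrics import confusion_matrix, accuracy_score

labels_true      = [0, 0, 1, 1, 1, 0, 1, 1]
labels_predicted = [0, 1, 0, 1, 1, 0, 1, 0]

# rows = true class, columns = predicted class: [[TN, FP], [FN, TP]]
print(confusion_matrix(labels_true, labels_predicted))
print(accuracy_score(labels_true, labels_predicted))   # 0.625
```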
SOLUTION 1.2
On average the random classifier assigns the correct label to half of the samples of each class, thus the accuracy would be 50 %.
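A quick simulation supports this (a sketch using `numpy`; the sample size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)                      # fixed seed for reproducibility
true_labels = rng.integers(0, 2, size=100_000)
random_predictions = rng.integers(0, 2, size=100_000)

print((true_labels == random_predictions).mean())    # close to 0.5
```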
%% Cell type:markdown id: tags:
### Optional exercise
1.3 Assume the previously described test also produces wrong results on not-infected people, such that 5 % of the not-infected people will be diagnosed as infected. Compute the confusion matrix and the accuracy of this test.
%% Cell type:markdown id: tags:solution
SOLUTION 1.3
This is the new situation:
- On average 10 out of 10000 people are infected with a disease `X`.
- A medical test `Z` diagnoses 50 % of infected people as `not infected`.
- The test is correct on 95 % of the not-infected people.
<pre>
Infected people     = 10,    diagnosed as infected: 0.5 * 10    = 5      (TP)
Not infected people = 9990,  diagnosed as infected: 0.05 * 9990 = 499.5  (FP)

TP = 5    FP = 499.5
FN = 5    TN = 9490.5

accuracy = (5 + 9490.5) / 10000 = 94.96 %
</pre>
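The same arithmetic as a short Python check (numbers taken from the solution above):

```python
tp, fn = 5, 5                 # half of the 10 infected people are detected
fp = 0.05 * 9990              # 499.5 (an expected value, hence not an integer)
tn = 0.95 * 9990              # 9490.5

print((tp + tn) / (tp + fp + fn + tn))   # 0.94955, i.e. about 94.96 %
```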
%% Cell type:markdown id: tags:
## 2. Precision and Recall
In order to understand the concept of **precision** and **recall**, imagine the following scenario:
A few days before Thanksgiving you open an online recipe website and enter "turkey thanksgiving". You see some suitable recommendations but also unusable results related to Turkish recipes.
Such a search engine works like a filter applied on a collection of documents.
As a scientist you want to assess the reliability of this service:
1. What fraction of relevant recipes stored in the underlying database do I see?
2. How many of the shown results are relevant recipes and not the recipes from Turkey?
In this context, **recall** is the fraction of all relevant documents in the underlying database that the engine actually returns.
And **precision** is the fraction of the returned results that are relevant.
%% Cell type:markdown id: tags:
### How to compute precision and recall for a classifier
To transfer this concept to classification, we can interpret a classifier as a filter: it marks every document in a collection as relevant or not relevant.
The more results the search engine delivers, the fewer relevant documents are missed, so recall increases. But at the same time the fraction of wrong results grows, so precision decreases.
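Expressed with the counts from the confusion matrix, these two questions translate into the standard definitions:

$$\text{precision} = \frac{TP}{TP + FP} \qquad\qquad \text{recall} = \frac{TP}{TP + FN}$$

High recall means the classifier misses few positive (relevant) samples; high precision means few of its positive predictions are wrong.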
### F1-score
Sometimes we want a single number instead of two numbers to compare the performance of multiple classifiers.

The **F1 score** is defined as the *harmonic mean* of precision and recall:

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

For the medical test `Z` (precision = 1, recall = 0.5) the `F1` score is `1 / 1.5 = 0.6666..`.
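A quick check of this value in Python (counts of test `Z` taken from the pitfalls section above):

```python
tp, fp, fn = 5, 0, 5          # test Z: 5 of 10 infected detected, no false positives

precision = tp / (tp + fp)    # 1.0
recall = tp / (tp + fn)       # 0.5
f1 = 2 * precision * recall / (precision + recall)

print(f1)                     # 0.666...
```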
%% Cell type:markdown id: tags:
## Exercise block 2
Use your results from exercise 1.1 to compute precision, recall and F1 score.
%% Cell type:markdown id: tags:solution
<pre>
TP = 3    FP = 1
FN = 2    TN = 2

precision = 3 / (3 + 1) = 75 %
recall    = 3 / (3 + 2) = 60 %

F1 = 2 * (0.75 * 0.6) / (0.75 + 0.6) = 66.7 %
</pre>
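These values can again be checked with scikit-learn (a sketch, assuming `scikit-learn` is installed):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

labels_true      = [0, 0, 1, 1, 1, 0, 1, 1]
labels_predicted = [0, 1, 0, 1, 1, 0, 1, 0]

print(precision_score(labels_true, labels_predicted))   # 0.75
print(recall_score(labels_true, labels_predicted))      # 0.6
print(f1_score(labels_true, labels_predicted))          # 0.666...
```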
%% Cell type:markdown id: tags:
### Optional exercise:
Compute precision, recall and F1-score for the test described in exercise 1.3.
%% Cell type:markdown id: tags:solution
<pre>
TP = 5    FP = 499.5
FN = 5    TN = 9490.5

precision = 5 / (5 + 499.5) = 0.0099
recall    = 5 / (5 + 5)     = 0.5

F1 = 2 * (0.0099 * 0.5) / (0.0099 + 0.5) = 0.0194
</pre>
%% Cell type:markdown id: tags:
## Other metrics
The discussion above was just a quick introduction to measuring the performance of a classifier. We skipped other metrics such as `ROC` and `AUC`, amongst others.
A good introduction to `ROC` <a href="https://classeval.wordpress.com/introduction/introduction-to-the-roc-receiver-operating-characteristics-plot/">can be found here</a>.
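scikit-learn also provides functions for these metrics; a minimal sketch (the scores below are purely illustrative values, not from a real classifier):

```python
from sklearn.metrics import roc_auc_score, roc_curve

labels_true = [0, 0, 1, 1, 1, 0, 1, 1]
# scores as a classifier might report them (illustrative values):
predicted_scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.3]

fpr, tpr, thresholds = roc_curve(labels_true, predicted_scores)
print(roc_auc_score(labels_true, predicted_scores))
```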
%% Cell type:markdown id: tags:
## 3. Metrics in scikit-learn
%% Cell type:markdown id: tags:
`sklearn.metrics` contains many metric functions such as `accuracy_score`, `precision_score`, `recall_score` and `f1_score`, as well as `confusion_matrix`, which computes the full confusion matrix.
print(" {:12s}: mean value: {:.2f}".format(metric,scores.mean()))
print()
classifier=LogisticRegression(C=1)
print("balanced data")
assess(classifier,beer_data)
# we sort by label, then removing samples| is easier:
beer_data=beer_data.sort_values(by="is_yummy")
print("unbalanced data")
beer_data_unbalanced=beer_data.iloc[:-80,:]
assess(classifier,beer_data_unbalanced)
```
%% Output
balanced data
52.9 % of the beers are yummy
accuracy : mean value: 0.80
f1 : mean value: 0.83
precision : mean value: 0.78
recall : mean value: 0.89
unbalanced data
26.9 % of the beers are yummy
accuracy : mean value: 0.79
f1 : mean value: 0.41
precision : mean value: 0.87
recall : mean value: 0.28
%% Cell type:markdown id: tags:
You can see that for the balanced data set the values for `f1` and for `accuracy` are almost equal, but differ significantly for the unbalanced data set.
%% Cell type:markdown id: tags:
## Exercise section 3
1. Play with the previous examples: use different classifiers with different settings.
### Optional exercise
2. Modify the code from section 5 of the previous script ("Training the final classifier") to use different metrics.