    {
     "cells": [
    
      {
       "cell_type": "code",
    
       "execution_count": null,
    
       "metadata": {},
    
   "outputs": [],
   "source": [
        "# IGNORE THIS CELL WHICH CUSTOMIZES LAYOUT AND STYLING OF THE NOTEBOOK !\n",
        "%matplotlib inline\n",
        "%config InlineBackend.figure_format = 'retina'\n",
        "import warnings\n",
    
        "\n",
        "import matplotlib.pyplot as plt\n",
        "\n",
        "warnings.filterwarnings(\"ignore\", category=FutureWarning)\n",
        "warnings.filterwarnings(\"ignore\")  # , category=ConvergenceWarning)\n",
    
        "warnings.filterwarnings = lambda *a, **kw: None\n",
    
        "from IPython.core.display import HTML\n",
        "\n",
    
        "HTML(open(\"custom.html\", \"r\").read())"
    
      {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
    
        "# Chapter 3: Overfitting, underfitting and cross-validation"
    
       ]
      },
      {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
    
        "## What are overfitting and underfitting?\n",
    
        "\n",
    
        "Let us recall the `LogisticRegression`-based beer classfier we used in the first script. We've disovered that setting hyperparmeter `C=2` gave us good results (`C` controls `regularization`, lower `C` means higher `regularization` and vice-versa):"
    
       ]
      },
      {
       "cell_type": "code",
    
       "execution_count": null,
    
       "metadata": {
        "scrolled": true
       },
    
       "outputs": [],
    
       "source": [
        "import pandas as pd\n",
        "\n",
    
        "# reading the beer dataset\n",
    
        "beer_data = pd.read_csv(\"data/beers.csv\")\n",
    
        "print(beer_data.shape)\n",
        "\n",
        "# all columns up to the last one:\n",
        "input_features = beer_data.iloc[:, :-1]\n",
        "\n",
        "# only the last column:\n",
        "labels = beer_data.iloc[:, -1]\n",
        "\n",
        "from sklearn.linear_model import LogisticRegression\n",
    
        "\n",
    
        "classifier = LogisticRegression(C=2)\n",
    
        "\n",
        "classifier.fit(input_features, labels)\n",
        "\n",
        "# Predict\n",
        "predicted_labels = classifier.predict(input_features)\n",
    
        "print(\n",
        "    \"{:.2f} % labeled correctly\".format(\n",
        "        sum(predicted_labels == labels) / len(labels) * 100\n",
        "    )\n",
        ")"
    
       ]
      },
      {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
    
        "Here to train (fit) the model we only used 225 samples from the original data set of 300 beers.\n",
    
    schmittu's avatar
    schmittu committed
        "\n",
    
        "But if the above classifier works well, it should also show the same performance on the left out 75 beers.\n",
    
    schmittu's avatar
    schmittu committed
        "\n",
    
        "Let us check this on the left out data:"
    
    schmittu's avatar
    schmittu committed
       ]
      },
      {
       "cell_type": "code",
    
       "execution_count": null,
    
       "metadata": {},
    
       "outputs": [],
    
       "source": [
    
        "eval_data = pd.read_csv(\"data/beers_eval.csv\")\n",
    
        "print(eval_data.shape)"
       ]
      },
      {
       "cell_type": "code",
    
       "execution_count": null,
    
       "metadata": {},
    
       "outputs": [],
    
       "source": [
        "eval_features = eval_data.iloc[:, :-1]\n",
        "eval_labels = eval_data.iloc[:, -1]\n",
        "\n",
        "# Predict\n",
        "predicted_labels = classifier.predict(eval_features)\n",
    
        "print(\n",
        "    \"{:.2f} % labeled correctly\".format(\n",
        "        sum(predicted_labels == eval_labels) / len(eval_labels) * 100\n",
        "    )\n",
        ")"
    
       ]
      },
      {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
    
        "<div style=\"font-size:150%; font-weight: bold;\">\n",
        "           \n",
        "WHAT HAPPENED????\n",
        "<br/>\n",
        "<br/>\n",
    
        "Why is the accuracy on new data much lower?\n",
    
        "<br/>\n",
        "<br/>\n",
        "Answer: OVERFITTING !!\n",
        "\n",
        "</div>\n",
    
        "\n",
        "We observed a phenomenon called **\"overfitting\"**.\n",
        "\n",
        "\n",
    
        "<img src=\"./images/2qky90.jpg\" width=30% />"
       ]
      },
      {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
        "### Overfitting\n",
    
        "\n",
    
        "To explain the concept of \"overfitting\" let's use the circle data set:"
    
       ]
      },
      {
       "cell_type": "code",
    
       "execution_count": null,
    
       "metadata": {},
    
       "outputs": [],
    
       "source": [
    
        "data = pd.read_csv(\"data/circle.csv\")\n",
    
        "features = data.iloc[:, :-1]\n",
        "labels = data.iloc[:, -1]\n",
        "\n",
    
        "COLORS = [\"chocolate\", \"steelblue\"]\n",
        "\n",
    
        "plt.figure(figsize=(4, 4))\n",
        "ax = plt.subplot(1, 1, 1)\n",
    
        "plt.scatter(\n",
        "    features.iloc[:, 0], features.iloc[:, 1], c=[COLORS[l] for l in labels], marker=\"o\"\n",
        ");"
    
       ]
      },
      {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
    
        "We mentioned before that classifiers depend on (hyper)parameters (like `C`) which can be tuned to improve performance.\n",
    
        "\n",
    
    "Let us try to find out the purpose of the `gamma` parameter of the `SVC` classifier:"
    
       ]
      },
      {
       "cell_type": "code",
    
       "execution_count": null,
    
       "metadata": {},
       "outputs": [],
       "source": [
    
        "# utility functions copy-pasted from previous script\n",
    
    "\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "\n",
    "\n",
        "def plot_points(features_2d, labels, plt=plt, marker=\"o\"):\n",
    
        "    colors = [[\"steelblue\", \"chocolate\"][i] for i in labels]\n",
    
        "    plt.scatter(features_2d[:, 0], features_2d[:, 1], color=colors, marker=marker)\n",
        "\n",
    
        "\n",
        "def train_and_plot_decision_surface(\n",
    
        "    name, classifier, features_2d, labels, preproc=None, plt=plt, marker=\"o\", N=300\n",
    
        "):\n",
        "\n",
        "    features_2d = np.array(features_2d)\n",
    
    "\n",
        "    xmin, ymin = features_2d.min(axis=0)\n",
        "    xmax, ymax = features_2d.max(axis=0)\n",
        "\n",
        "    x = np.linspace(xmin, xmax, N)\n",
        "    y = np.linspace(ymin, ymax, N)\n",
        "    points = np.array(np.meshgrid(x, y)).T.reshape(-1, 2)\n",
        "\n",
        "    if preproc is not None:\n",
        "        points_for_classifier = preproc.fit_transform(points)\n",
        "        features_2d = preproc.fit_transform(features_2d)\n",
        "    else:\n",
        "        points_for_classifier = points\n",
        "\n",
        "    classifier.fit(features_2d, labels)\n",
        "    predicted = classifier.predict(features_2d)\n",
        "\n",
        "    if preproc is not None:\n",
        "        name += \" (w/ preprocessing)\"\n",
        "    print(name + \":\\t\", sum(predicted == labels), \"/\", len(labels), \"correct\")\n",
        "\n",
        "    classes = np.array(classifier.predict(points_for_classifier), dtype=bool)\n",
        "    plt.scatter(\n",
        "        points[~classes][:, 0],\n",
        "        points[~classes][:, 1],\n",
        "        color=\"steelblue\",\n",
        "        marker=marker,\n",
        "        s=1,\n",
        "        alpha=0.05,\n",
        "    )\n",
        "    plt.scatter(\n",
        "        points[classes][:, 0],\n",
        "        points[classes][:, 1],\n",
        "        color=\"chocolate\",\n",
        "        marker=marker,\n",
        "        s=1,\n",
        "        alpha=0.05,\n",
        "    )\n",
        "\n",
        "    plot_points(features_2d, labels)\n",
        "    plt.title(name)"
       ]
      },
      {
       "cell_type": "code",
    
       "execution_count": null,
    
       "metadata": {},
    
       "outputs": [],
    
       "source": [
        "from sklearn.svm import SVC\n",
        "\n",
    
        "df = pd.read_csv(\"data/circle.csv\")\n",
    
        "features = df.iloc[:, :-1]\n",
        "labels = df.iloc[:, -1]\n",
        "\n",
        "# three classifiers with different values for gamma:\n",
    
        "classifiers = [SVC(gamma=18), SVC(gamma=9), SVC(gamma=0.1)]\n",
    
        "\n",
    
        "plt.figure(figsize=(21, 6))\n",
    
        "\n",
    
    "for i, clf in enumerate(classifiers):\n",
    "\n",
        "    plt.subplot(1, len(classifiers), i + 1)\n",
    
        "    train_and_plot_decision_surface(\n",
        "        \"gamma = {}\".format(clf.gamma), clf, features, labels\n",
        "    )"
    
       ]
      },
      {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
    
        "#### Observation\n",
    
        "\n",
    
        "The parameter `gamma` of `SVC` has an effect on the flexibility/complexity of the decision surface. A large value allows a very flexible / \"irregular\" decision surface, for smaller values the surface gets smoother / \"stiffer\" / \"more regular\" (allowing more misclassifications).\n",
    
        "\n",
    
        "This is also coined **simple** resp. **complex** models.\n",
        "\n",
    
    schmittu's avatar
    schmittu committed
        "We see here also \n",
        "\n",
    
        "- that the smallest `gamma` value produces a classifier which seems to get the idea of a \"circle\", \n",
    
    Mikolaj Rybinski's avatar
    Mikolaj Rybinski committed
        "- whereas the large `gamma` value adapts the classifier more to the training data samples."
    
       ]
      },
      {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
    
        "Let's try an even larger `gamma` value:"
    
       ]
      },
      {
       "cell_type": "code",
    
       "execution_count": null,
    
       "metadata": {},
    
       "outputs": [],
    
       "source": [
    
        "clf = SVC(gamma=90)\n",
    
        "plt.figure(figsize=(6, 6))\n",
    
        "\n",
        "train_and_plot_decision_surface(\"gamma = {}\".format(clf.gamma), clf, features, labels)"
    
       ]
      },
      {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
    
        "The plot above shows an extreme example for the previously mentioned effect of overfitting.\n",
    
        "\n",
    
        "- If we evaluate performance of this classifier on the training data set we get an **accuracy of `~100%`**\n",
    
        "\n",
    
        "- But the classifier totally fails to learn the concept of a circle, and you can easily imagine how bad this classifier performs on new and unseen data.\n",
    
        "\n",
        "\n",
    
        "<div class=\"alert alert-block alert-warning\">\n",
    
        "<p style=\"font-weight: bold;\"><i class=\"fa fa-warning\"></i>&nbsp; Definitions</p>\n",
    
        "\n",
    
        "<ul>\n",
    
        "\n",
    
        "<li><strong>Overfitting</strong>: The classifier overfits if it too closely fits to/learns detail or noise in the training data instead of learning the underlying concept. Thus, the classifier does not generalize well and shows much worse performance on previously unseen new data.</li>\n",
        "<br/>\n",
        "<li><strong>Generalization</strong>: An ability of a classifier to learn the concept behind data. A classifier generalizes well if it shows similar performance on training and on new data.</li>\n",
        "<br/>\n",
        "<li><strong>Robust classifier</strong>: A classifier which is not or very little susceptible to overfitting when learning some data, i.e. a classfier which usually generalizes well.</li>\n",
    
        "\n",
        "</ul>\n",
    
        " \n",
    
        "</div>\n",
        "\n",
        "\n"
    
       ]
      },
      {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
    
        "#### More \"probabilistic\" definition\n",
    
        "\n",
        "- Our data is generated by a (usually unknown) model.\n",
        "- We have only samples from this model.\n",
        "- A classifier tries to approximate the underlying model based on the given samples.\n",
        "\n",
    
        "In this context the observed bad generalization performance of the classifier can be explained by computing a model which is to far away from the original model.\n",
        "\n",
        "The following graphics depicts our explanations: \n",
        "\n",
        "- The more \"complex\" a model gets the better it fits trainig data. Thus accuracy on the training data improves.\n",
    
    Mikolaj Rybinski's avatar
    Mikolaj Rybinski committed
        "- At a certain point the model is too adapted to the training data and gets worse and worse when evaluated later on previously unseen new data.\n"
    
       ]
      },
      {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
    
        "<img src=\"./images/accuracy_training_vs_eval.svg\" width=50%/>  "
    
       ]
      },
      {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
    
        "### Underfitting\n",
    
        "\n",
    
        "The other extreme of overfitting is called **underfitting**: the classifiers decision boundary deviates too far from the boundary in training data and produces a classifier which does not perform well even on training data.\n",
    
        "\n",
        "We can demonstrate this by choosing a \"too small\" value of `gamma`"
    
       ]
      },
      {
       "cell_type": "code",
    
       "execution_count": null,
    
       "metadata": {},
    
       "outputs": [],
    
       "source": [
    
        "# small gamma tries to build a \"safe\", \"perfect\" circle\n",
    
        "\n",
    
        "clf = SVC(gamma=0.06)\n",
        "plt.figure(figsize=(6, 6))\n",
    
        "\n",
    
        "train_and_plot_decision_surface(\"gamma = {}\".format(clf.gamma), clf, features, labels)\n",
        "# plt.scatter(features.iloc[:, 0], features.iloc[:, 1], color=c, marker='.');"
    
       ]
      },
      {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
    
        "## Diagnosing and solving the overfitting problem\n",
        "\n",
        "### How did we fall for overfitting? \n",
    
        "\n",
    
        "<div class=\"alert alert-block alert-warning\">\n",
        "\n",
    
        "<div style=\"font-size:150%;\">\n",
    
        "    <i class=\"fa fa-info-circle\"></i>\n",
        "    <center>\n",
    
        "Our fundamental mistake was to evaluate the performace <br/>of the classifier on the training data.\n",
    
        "\n",
    
        "</center>\n",
    
        "</div>\n",
    
        "</div>\n",
    
        "\n",
    
        "Repeat:\n",
    
        "\n",
        "<div class=\"alert alert-block alert-warning\">\n",
        "\n",
    
        "\n",
        "\n",
        "<div style=\"font-size:150%;\">\n",
        "     <i class=\"fa fa-info-circle\"></i>\n",
    
    "    <center>\n",
    "Our fundamental mistake was to evaluate the performance <br/>of the classifier on the training data.\n",
        "\n",
    
        "</center>\n",
    
        "</div>\n",
    
        "</div>\n"
    
       ]
      },
      {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
    
        "### How can we do better?\n",
        "\n",
    
        "There is no classifier which works out of the box in all situations. Depending on the \"geometry\" / \"shape\" of the data, classification algorithms and their settings can make a big difference.\n",
    
        "\n",
    
        "In our previous 2D examples we were able to visualize the data and classification results, this is not possible for higher dimensional data.\n",
    
        "\n",
        "The general way to handle this situation is as follows: \n",
        "\n",
    
        "- split our data into a learning data set and a test data set\n",
    
        "- train the classifier on the learning data set\n",
    
    "- assess the performance of the classifier on the test data set (see the sketch below)."
       ]
      },
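  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of these three steps, using the `train_test_split` helper (introduced in more detail towards the end of this script), with the circle data from above and an example `SVC` classifier:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.svm import SVC\n",
    "\n",
    "# 1. split our data into a learning and a test data set (here 75% / 25%)\n",
    "features_learn, features_test, labels_learn, labels_test = train_test_split(\n",
    "    features, labels, test_size=0.25, random_state=42\n",
    ")\n",
    "\n",
    "# 2. train the classifier on the learning data set only\n",
    "clf = SVC(gamma=0.1)\n",
    "clf.fit(features_learn, labels_learn)\n",
    "\n",
    "# 3. assess the performance of the classifier on the test data set\n",
    "predicted = clf.predict(features_test)\n",
    "print(\"test accuracy:\", sum(predicted == labels_test) / len(labels_test))"
   ]
  },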
      {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
        "## Cross-validation\n",
    
        "The procedure called *cross-validation* goes a step further in data splitting: In this procedure the full dataset is split into learn-/test-set in various ways. Statistics of the achieved test performance is computed to assess future performance of the classifier.\n",
    
        "\n",
        "A common approach is **K-fold cross-validation**:\n",
        "\n",
    
        "K-fold cross-validation has an advantage that we do not leave out part of our data from training. This is useful when we do not have a lot of data.\n",
        "\n",
        "<img src=\"./images/305azk.jpg\" title=\"made at imgflip.com\" width=40%/>\n",
    
        "\n",
        "### Example: 4-fold cross validation\n",
        "\n",
        "For 4-fold cross validation we split our data set into four equal sized partitions P1, P2, P3 and P4.\n",
        "\n",
        "We:\n",
        "\n",
        "- hold out `P1`: train the classifier on `P2 + P3 + P4`, compute accuracy `m1` on `P1`.\n",
        "\n",
    
        "<img src=\"./images/cross_val_0.svg\" />\n",
    
        "\n",
        "-  hold out `P2`: train the classifier on `P1 + P3 + P4`, compute accuracy `m2` on `P2`.\n",
        "\n",
    
        "<img src=\"./images/cross_val_1.svg\" />\n",
    
        "\n",
        "-  hold out `P3`: train the classifier on `P1 + P2 + P4`, compute accuray `m3` on `P3`.\n",
        "\n",
    
        "<img src=\"./images/cross_val_2.svg\" />\n",
    
        "\n",
        "-  hold out `P4`: train the classifier on `P1 + P2 + P3`, compute accuracy `m4` on `P4`.\n",
        "\n",
    
        "<img src=\"./images/cross_val_3.svg\" />\n",
    
        "\n",
        "Finally we can compute the average of `m1` .. `m4` as the final measure for accuracy.\n",
        "\n",
        "Some advice:\n",
        "\n",
        "- This can be done on the original data or on randomly shuffled data. It is recommended to shuffle the data first, as there might be some unknown underlying ordering in your dataset\n",
        "\n",
        "- Usually one uses 3- to 10-fold cross validation, depending on the amount of data available."
    
       ]
      },
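  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an illustration, here is this procedure written out by hand as a minimal sketch, using scikit-learn's `KFold` helper to generate the four hold-out splits; the `SVC(gamma=0.1)` classifier and the circle data from above are just example choices:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import KFold\n",
    "from sklearn.svm import SVC\n",
    "\n",
    "# four partitions; shuffle first, as recommended above\n",
    "folds = KFold(n_splits=4, shuffle=True, random_state=42)\n",
    "\n",
    "accuracies = []\n",
    "for train_indices, test_indices in folds.split(features):\n",
    "    clf = SVC(gamma=0.1)\n",
    "    # train on three partitions ...\n",
    "    clf.fit(features.iloc[train_indices], labels.iloc[train_indices])\n",
    "    # ... and compute the accuracy on the held-out partition\n",
    "    predicted = clf.predict(features.iloc[test_indices])\n",
    "    accuracies.append(sum(predicted == labels.iloc[test_indices]) / len(test_indices))\n",
    "\n",
    "print(\"fold accuracies:\", accuracies)\n",
    "print(\"mean accuracy:\", sum(accuracies) / len(accuracies))"
   ]
  },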
      {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
    
        "### Variant: randomized cross validation\n",
    
        "\n",
        "A randomized variant works like this:\n",
        "\n",
        "- Perform $n$ iterations:\n",
        "\n",
        "   - draw a fraction $p$ (e.g. 80%) from your full data set without replacement for the training data set.\n",
        "   - use the remaining fraction $1 - p$ as evaluation data set\n",
    
    "   - train the classifier and compute the performance score(s)."
    
       ]
      },
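  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In scikit-learn this randomized scheme is implemented by the `ShuffleSplit` cross-validator, which can be passed as `cv` to the `cross_val_score` helper introduced in the next section. A minimal sketch, again with the circle data and an example `SVC` classifier:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import ShuffleSplit, cross_val_score\n",
    "from sklearn.svm import SVC\n",
    "\n",
    "# n = 10 iterations, each drawing 80% for training and keeping 20% for evaluation\n",
    "randomized_cv = ShuffleSplit(n_splits=10, train_size=0.8, random_state=42)\n",
    "\n",
    "scores = cross_val_score(SVC(gamma=0.1), features, labels, scoring=\"accuracy\", cv=randomized_cv)\n",
    "print(scores)"
   ]
  },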
      {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
    
        "### Cross valiation with scikit-learn"
    
       ]
      },
      {
       "cell_type": "code",
    
       "execution_count": null,
    
       "metadata": {},
    
       "outputs": [],
    
       "source": [
        "import pandas as pd\n",
        "\n",
    
        "beer = pd.read_csv(\"data/beers.csv\")\n",
        "beer_eval = pd.read_csv(\"data/beers_eval.csv\")\n",
    
        "\n",
    
        "# Since we're using cross validation, let's use all data\n",
    
        "all_beer = pd.concat((beer, beer_eval))\n",
        "\n",
        "all_beer.shape"
       ]
      },
      {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
    
        "Let's use the familiar _accuracy_ score: a percentage of correctly classified samples. (More about other ways of assessing quality of a classifier in one of the following scripts.)\n"
    
       ]
      },
      {
       "cell_type": "code",
    
       "execution_count": null,
    
       "metadata": {},
    
       "outputs": [],
    
       "source": [
    
        "from sklearn.utils import shuffle\n",
    
        "\n",
        "all_beer = shuffle(all_beer, random_state=42)  # fix randomization for reproduciblity\n",
    
        "\n",
    
        "features = all_beer.iloc[:, :-1]\n",
        "labels = all_beer.iloc[:, -1]\n",
    
        "\n",
        "from sklearn.linear_model import LogisticRegression\n",
    
        "\n",
    
        "classifier = LogisticRegression(C=2)\n",
    
        "\n",
    
        "from sklearn.model_selection import cross_val_score\n",
    
        "\n",
    
        "# 4-fold cross validation with the way we've evaluated classifiers\n",
        "# up to now: \"accuracy\" score (the percentage of correct classification)\n",
    
        "scores = cross_val_score(classifier, features, labels, scoring=\"accuracy\", cv=4)\n",
    
        "\n",
    
        "for i, score in enumerate(scores):\n",
    
        "    print(\"Fold\", i + 1, \"score:\", score)"
    
       ]
      },
    
      {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
        "The `cross_val_score` as used in the previous code example works as follows:\n",
        "\n",
    
    "0. split the training data into four chunks\n",
    
        "- learn `classifier` on chunk `1, 2, 3`, apply classifier to chunk `4` and compute score `s1`\n",
        "- learn `classifier` on chunk `1, 2, 4`, apply classifier to chunk `3` and compute score `s2`\n",
        "- learn `classifier` on chunk `1, 3, 4`, apply classifier to chunk `2` and compute score `s3`\n",
        "- learn `classifier` on chunk `2, 3, 4`, apply classifier to chunk `1` and compute score `s4`\n",
        "\n",
        "`cross_val_score` finally returns `[s1, s2, s3, s4]`."
       ]
      },
    
      {
       "cell_type": "code",
    
       "execution_count": null,
    
       "metadata": {
        "scrolled": true
       },
    
       "outputs": [],
    
       "source": [
        "m = scores.mean()\n",
        "s = scores.std()\n",
        "\n",
        "low = m - 2 * s\n",
        "high = m + 2 * s\n",
        "\n",
        "print(\"mean test score is {:.3f}\".format(m))\n",
        "print(\"std dev of test score is {:.3f}\".format(s))\n",
    
        "# and, assuming normality of the scores\n",
    
        "print(\n",
        "    \"true test score is with 96% probability between {:.3f} and {:.3f}\".format(\n",
        "        low, high\n",
        "    )\n",
        ")"
    
       ]
      },
      {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
        "## Exercise section\n",
        "\n",
        "1. Play with the previous examples.\n",
    
        "2. Try out different number of cross validation folds for the beer data. What happens with the score?"
       ]
      },
      {
       "cell_type": "code",
    
       "execution_count": null,
    
       "metadata": {
        "tags": [
         "solution"
        ]
       },
    
       "outputs": [],
    
       "source": [
        "import pandas as pd\n",
        "\n",
        "beer = pd.read_csv(\"data/beers.csv\")\n",
        "beer_eval = pd.read_csv(\"data/beers_eval.csv\")\n",
        "\n",
        "all_beer = pd.concat((beer, beer_eval))\n",
        "\n",
        "from sklearn.utils import shuffle\n",
    
        "\n",
        "all_beer = shuffle(all_beer, random_state=42)  # fix randomization for reproduciblity\n",
    
        "\n",
        "features = all_beer.iloc[:, :-1]\n",
        "labels = all_beer.iloc[:, -1]\n",
        "\n",
        "from sklearn.linear_model import LogisticRegression\n",
    
        "\n",
    
        "classifier = LogisticRegression(C=2)\n",
    
        "from sklearn.model_selection import cross_val_score\n",
    
        "\n",
    
        "for k in [2, 5, 10, 25, 50, 150]:\n",
        "    scores = cross_val_score(classifier, features, labels, scoring=\"accuracy\", cv=k)\n",
        "    m = scores.mean()\n",
        "    s = scores.std()\n",
    
        "    print(\"{:3d}-fold accuracy score is {:.3f} +/- {:.3f}\".format(k, m, s))\n",
    
        "\n",
        "#\n",
        "# Q: What happens with the score?\n",
        "#\n",
        "# Mean score increases, very slightly from a certain number of folds (here, 25),\n",
        "# and variance of the score increases significantly.\n",
        "#\n",
        "# Intuitively, with very high number of folds models become similar across folds,\n",
        "# as they fit a big common set of samples, whereas single misclassifications in\n",
        "# the small testing sets result in much smaller accuracies, increasing variance.\n",
        "#"
       ]
      },
      {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
        "<div class=\"alert alert-block alert-info\">\n",
        "<p style=\"font-weight: bold;\"><i class=\"fa fa-info-circle\"></i>&nbsp;Rule of thumb</p>\n",
        "<p>Preffer 5- or 10- fold cross validation.</p>\n",
        "</div>"
       ]
      },
      {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
    
        "### Optional exercises\n",
        "\n",
    
        "1. Split the dataset `data/spiral.csv` in 300 features/labels for training and 100 features/labels for evaluation. Find a good classifier which reaches 100% accuracy on the training samples, then evaluate the trained classifier on the remaining 100 samples."
    
       ]
      },
      {
       "cell_type": "code",
    
       "execution_count": null,
    
       "metadata": {
        "tags": [
         "solution"
        ]
       },
    
       "outputs": [],
    
       "source": [
        "import pandas as pd\n",
        "from sklearn.svm import SVC\n",
        "\n",
        "df = pd.read_csv(\"data/spiral.csv\")\n",
    
        "n_train = 300\n",
        "features_learn = df.iloc[:n_train, :-1]\n",
        "features_eval = df.iloc[n_train:, :-1]\n",
    
        "labels_learn = df.iloc[:n_train, -1]\n",
        "labels_eval = df.iloc[n_train:, -1]\n",
    
        "clf = SVC(gamma=3, C=90)\n",
    
        "clf.fit(features_learn, labels_learn)\n",
        "\n",
    
        "predicted = clf.predict(features_learn)\n",
    
        "print(\n",
        "    \"training accuracy: {:3.1f}%\".format(\n",
        "        sum(predicted == labels_learn) * 100 / len(predicted)\n",
        "    )\n",
        ")\n",
    
        "\n",
        "predicted = clf.predict(features_eval)\n",
    
        "print(\n",
        "    \"testing accuracy: {:3.1f}%\".format(\n",
        "        sum(predicted == labels_eval) * 100 / len(predicted)\n",
        "    )\n",
        ")"
    
       ]
      },
      {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
    
        "## Some reasons for overfitting and how you might fight it.\n",
    
        "\n",
    
        "###  Small / insufficient data sets.\n",
    
        "\n",
    
        "The classifier fails to \"grab the concept\" because the \"concept\" is not represented strongly enough in the data set. \n",
    
        "\n",
        "Possible solutions:\n",
        "\n",
        "- Get more data.\n",
    
        "- Augment your data by creating artificial/synthetic data (e.g. for images: shift / scale / rotate images) if feasible.\n",
    
        "\n",
        "\n",
    
        "### Unsuitable classifier / classifier parameters used\n",
    
        "\n",
        "This is what we observed in the example before.\n",
        "\n",
        "Possible solutions:\n",
        "\n",
    
        "- Optimize parameters using cross-validation.\n",
    
        "\n",
    
        "- Evaluate other classification algorithms.\n",
    
        "\n",
    
    "### Noisy / uninformative features\n",
    
        "\n",
    
    "A classifier can in some situations use noisy or uninformative features to explain noise in the training data. In such cases, feature noise contributes to \"artificially\" good results on the training data.\n",
    
        "\n",
        "Possible solutions:\n",
        "\n",
    
    "- Use feature selection techniques:<br/><br/>\n",
    
        "\n",
    
        "    - Inspect your data to detect noisy or uninformative features.\n",
        "        - See e.g. [removing features with low variance in scikit-learn](https://scikit-learn.org/stable/modules/feature_selection.html#removing-features-with-low-variance)<br/><br/>\n",
    
        "\n",
    
    "    - Try learning the classifier with some features excluded.\n",
    "        - This can be automated, see [recursive feature elimination in scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html#sklearn.feature_selection.RFE).\n",
    "        - Random forest classifiers learn this way (more about them later), hence they support feature exclusion directly.<br/><br/>\n",
    
        "\n",
    
    "    - Penalize for using many features (prefer simpler models).\n",
    "        - So-called *sparse* learning methods do that (more about them later), and they can also be used just as a data pre-processing step, see [L1-based feature selection in scikit-learn](https://scikit-learn.org/stable/modules/feature_selection.html#l1-based-feature-selection)<br/><br/>\n",
    
        "\n",
    
        "- Use dimension reduction techniques like `PCA` (more about this later).\n",
    
        "\n",
    
        "### Strongly correlated / redundant features\n",
    
        "\n",
    
    "If the data set contains strongly, but not 100%, correlated features, their (weighted) difference behaves like random data. The effect is then similar to having noisy or uninformative features.\n",
    
        "\n",
    
        "Possible solutions:\n",
    
        "\n",
    
    "- Same as for noisy or uninformative features: feature selection or dimension reduction techniques. (A small inspection sketch follows below.)\n"
       ]
      },
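  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of such an inspection, assuming the combined beer data `all_beer` loaded above: pairwise correlations hint at redundant features, and per-feature variances hint at uninformative ones."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# pairwise correlations: off-diagonal values close to +/-1 hint at redundant features\n",
    "print(all_beer.iloc[:, :-1].corr())\n",
    "\n",
    "# per-feature variances: values close to 0 hint at uninformative features\n",
    "print(all_beer.iloc[:, :-1].var())"
   ]
  },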
      {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
    
        "The following code demonstrates the effect of noise and redundant features:"
       ]
      },
      {
       "cell_type": "code",
    
       "execution_count": null,
    
       "metadata": {},
    
       "outputs": [],
    
       "source": [
    
        "beer_data = pd.read_csv(\"data/beers.csv\")\n",
    
        "\n",
        "# all columns up to the last one:\n",
        "input_features = beer_data.iloc[:, :-1]\n",
        "input_labels = beer_data.iloc[:, -1]\n",
        "\n",
    
        "eval_data = pd.read_csv(\"data/beers_eval.csv\")\n",
    
        "\n",
        "eval_features = eval_data.iloc[:, :-1]\n",
        "eval_labels = eval_data.iloc[:, -1]\n",
        "\n",
        "\n",
        "def assess(classifier, input_features, eval_features):\n",
        "\n",
        "    predicted_labels = classifier.predict(input_features)\n",
    
        "    print(\n",
        "        \"{:.2f} % labeled correctly on training dataset\".format(\n",
        "            sum(predicted_labels == input_labels) / len(input_labels) * 100\n",
        "        )\n",
        "    )\n",
    
        "\n",
        "    # Predict\n",
        "    predicted_labels = classifier.predict(eval_features)\n",
    
        "    print(\n",
        "        \"{:.2f} % labeled correctly on evaluation dataset\".format(\n",
        "            sum(predicted_labels == eval_labels) / len(eval_labels) * 100\n",
        "        )\n",
        "    )\n",
        "\n",
    
    "\n",
    "from sklearn.svm import SVC\n",
    
        "\n",
    
        "classifier = SVC(C=2, gamma=2)\n",
    
        "\n",
    
        "classifier.fit(input_features, input_labels)\n",
        "\n",
    
        "print(\"ORIGINAL DATA\")\n",
    
        "assess(classifier, input_features, eval_features)\n",
        "\n",
        "print()\n",
        "print(\"WITH ADDED NOISY FEATURES\")\n",
        "np.random.seed(5)\n",
        "\n",
    
        "# Extend original data by adding new features:\n",
        "#\n",
        "# 1. alcohol_content with some random noise added\n",
        "# 2. pure random noise\n",
        "#\n",
        "# to both training data\n",
    
        "input_features[\"redundant\"] = input_features.loc[:, \"alcohol_content\"] + 1 * (\n",
        "    np.random.random((225,)) - 0.5\n",
        ")\n",
        "input_features[\"noise\"] = 0.1 * (np.random.random((225,)) - 0.5)\n",
    
        "# and evaluation data\n",
    
        "eval_features[\"redundant\"] = eval_features.loc[:, \"alcohol_content\"] + 1 * (\n",
        "    np.random.random((75,)) - 0.5\n",
        ")\n",
        "eval_features[\"noise\"] = 0.1 * (np.random.random((75,)) - 0.5)\n",
    
        "\n",
        "classifier.fit(input_features, input_labels)\n",
        "\n",
        "assess(classifier, input_features, eval_features)"
       ]
      },
      {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
        "You can see above that the classifier yields better accuracy on the extended training data set. But you also can see that the performance on the extended evaluation data set is worse than before.\n",
        "\n"
       ]
      },
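  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a sketch of one possible countermeasure from the list above, scikit-learn's `VarianceThreshold` removes features whose variance falls below a given threshold. The threshold `0.01` is an assumption chosen here to catch the artificial low-variance `noise` column (the `redundant` column is *not* caught this way, since its variance is large):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.feature_selection import VarianceThreshold\n",
    "\n",
    "# drop features with variance below 0.01; the artificial \"noise\" column\n",
    "# has a much smaller variance than the real measurements\n",
    "selector = VarianceThreshold(threshold=0.01)\n",
    "selector.fit(input_features)\n",
    "\n",
    "kept = input_features.columns[selector.get_support()]\n",
    "print(\"kept features:\", list(kept))\n",
    "\n",
    "classifier.fit(input_features[kept], input_labels)\n",
    "assess(classifier, input_features[kept], eval_features[kept])"
   ]
  },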
    
      {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
        "<div class=\"alert alert-block alert-info\">\n",
        "<p style=\"font-weight: bold;\"><i class=\"fa fa-info-circle\"></i>&nbsp;About applicability to regression</p>\n",
        "\n",
        "<p>We're talking here about overfitting, underfitting and cross-validation in context of classification/classifiers, but these problems or methods, and related workarounds, apply in general to supervised learning methods, so also to regression methods about which we will learn later on.</p>\n",
        "</div>"
       ]
      },
    
      {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
    
        "## Training the final classifier\n",
    
        "\n",
        "Cross-validation was helpful to determine and tune a good classifier. But how do we eventually build the classifier we want to use later \"in production\" ?\n",
        "\n",
        "A common procedure is:\n",
        "\n",
        "- Split your data 80% to 20% (or another fraction) from the beginning.\n",
        "\n",
    
        "\n",
    
        "- Use the 80% fraction for determining and tuning a classifier.\n",
        "\n",
    
        "\n",
    
        "- Train the final classifier on the 80% part.\n",
        "\n",
    
        "\n",
    
    "- Finally use the 20% fraction for a final validation of the classifier's accuracy.\n",
        "\n",
    
        "<img src=\"./images/cross_eval_and_test.svg?7\">\n",
    
        "\n",
        "Comment: Literature is not consistent in terms. Sometimes the terms \"validation data set\" and \"test data set\" are interchanged."
       ]
      },
      {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
    
        "### Demonstration\n",
        "\n",
        "We introduce the `train_test_split` function from `sklearn.model_selection` in the following example.\n",
        "\n",
    
    "It splits features and labels in a given proportion. Usually this is randomized, so you get different results for every function invocation. To get the same result every time, we use `random_state=..` (with an arbitrary number) below:"
    
       ]
      },
      {
       "cell_type": "code",
    
       "execution_count": null,
    
       "metadata": {},
       "outputs": [],
       "source": [
    
        "import pandas as pd\n",
        "\n",
        "beer = pd.read_csv(\"data/beers.csv\")\n",
        "beer_eval = pd.read_csv(\"data/beers_eval.csv\")\n",
        "all_beer = pd.concat((beer, beer_eval))\n",
    
        "\n",
    
        "features = all_beer.iloc[:, :-1]\n",
        "labels = all_beer.iloc[:, -1]"
       ]
      },
      {
       "cell_type": "code",
    
       "execution_count": null,
    
       "metadata": {},
    
       "outputs": [],
    
       "source": [
        "# SHUFFLE AND SPLIT DATA 80:20\n",
        "# with fixed randomization\n",
        "from sklearn.model_selection import train_test_split\n",
    
        "\n",
    
    "# Note 1: `shuffle=True` is the default, hence, unnecessary to specify\n",
    "# Note 2: using `stratify=labels` to preserve the class proportions of the original dataset after the split\n",
        "(\n",
    
        "    features_crosseval,\n",
        "    features_validation,\n",
        "    labels_crosseval,\n",
    
        "    labels_validation,\n",
        ") = train_test_split(features, labels, test_size=0.2, stratify=labels, random_state=42)\n",
    
        "\n",
    
        "print(\"# Whole dataset \")\n",
        "print(\"number of all samples:\", len(labels))\n",
    
        "print(\"proportion of yummy samples:\", sum(labels == 1) / len(labels))\n",
    
        "print()\n",
    
        "print(\"# Cross-validation dataset \")\n",
        "print(\"number of all samples:\", len(labels_crosseval))\n",
    
        "print(\n",
        "    \"proportion of yummy samples:\", sum(labels_crosseval == 1) / len(labels_crosseval)\n",
        ")\n",
    
        "print()\n",
    
        "print(\"# Validation dataset \")\n",
        "print(\"number of all samples:\", len(labels_validation))\n",
    
        "print(\n",
        "    \"proportion of yummy samples:\", sum(labels_validation == 1) / len(labels_validation)\n",
        ")"
    
       ]
      },
      {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
    
        "Moreover, we introduce use of explicit speficiation of a cross-validation method: `StratifiedKFold` from `sklearn.model_selection`. \n",
        "\n",
        "This allows us to spilt data during cross validation in the same way as we did with `train_test_split`, i.e. \n",
        "\n",
        "a) with data shufflling before split, and \n",
        "\n",
        "b) perserving class-proportions of samples, "
    
       ]
      },
      {
       "cell_type": "code",
    
       "execution_count": null,
    
       "metadata": {},
    
       "outputs": [],
    
       "source": [
        "# FIND A \"BEST\" CLASSIFIER\n",
    
    "# with fixed randomization\n",
    "\n",
    "# By default, a `cross_val_score(.., cv=n)` call with a classifier implicitly uses\n",
    "# a `StratifiedKFold(n_splits=n, shuffle=False)` cross-validator (plain `KFold` otherwise)\n",
        "from sklearn.model_selection import StratifiedKFold\n",
    
        "\n",
    
        "cross_validator = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)\n",
        "\n",
    
        "results = []\n",
        "\n",
    
        "print(\"OPTIMIZE HYPERPARAMETERS\")\n",
        "# selected classifier hyperparameters to optimize\n",
    
        "SVC_C_values = (0.1, 1, 10)\n",
        "SVC_gamma_values = (0.1, 1, 10, 100)\n",
    
        "\n",
    
        "for C in SVC_C_values:\n",
        "    for gamma in SVC_gamma_values:\n",
    
        "        classifier = SVC(C=C, gamma=gamma)\n",
    
        "        test_scores = cross_val_score(\n",
        "            classifier,\n",
        "            features_crosseval,\n",
        "            labels_crosseval,\n",
        "            scoring=\"accuracy\",\n",
        "            cv=cross_validator,\n",
        "        )  # cv arg is now different\n",
        "        print(\n",
        "            \"score = {:.3f} +/- {:.3f}, C = {:5.1f},  gamma = {:5.1f}\".format(\n",
        "                test_scores.mean(), test_scores.std(), C, gamma\n",
        "            )\n",
        "        )\n",
        "        results.append((test_scores.mean(), test_scores.std(), C, gamma))\n",
        "\n",