03_overfitting_and_cross_validation.ipynb

{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# IGNORE THIS CELL WHICH CUSTOMIZES LAYOUT AND STYLING OF THE NOTEBOOK !\n",
    "%matplotlib inline\n",
    "%config InlineBackend.figure_format = 'retina'\n",
    "import warnings\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "warnings.filterwarnings(\"ignore\", category=FutureWarning)\n",
    "warnings.filterwarnings(\"ignore\")  # , category=ConvergenceWarning)\n",
    "warnings.filterwarnings = lambda *a, **kw: None\n",
    "from IPython.core.display import HTML\n",
    "\n",
    "HTML(open(\"custom.html\", \"r\").read())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Chapter 3: Overfitting, underfitting and cross-validation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## What are overfitting and underfitting?\n",
    "\n",
    "Let us recall the `LogisticRegression`-based beer classfier we used in the first script. We've disovered that setting hyperparmeter `C=2` gave us good results (`C` controls `regularization`, lower `C` means higher `regularization` and vice-versa):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# reading the beer dataset\n",
    "beer_data = pd.read_csv(\"data/beers.csv\")\n",
    "print(beer_data.shape)\n",
    "\n",
    "# all columns up to the last one:\n",
    "input_features = beer_data.iloc[:, :-1]\n",
    "\n",
    "# only the last column:\n",
    "labels = beer_data.iloc[:, -1]\n",
    "\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "classifier = LogisticRegression(C=2)\n",
    "\n",
    "classifier.fit(input_features, labels)\n",
    "\n",
    "# Predict\n",
    "predicted_labels = classifier.predict(input_features)\n",
    "print(\n",
    "    \"{:.2f} % labeled correctly\".format(\n",
    "        sum(predicted_labels == labels) / len(labels) * 100\n",
    "    )\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here to train (fit) the model we only used 225 samples from the original data set of 300 beers.\n",
    "\n",
    "But if the above classifier works well, it should also show the same performance on the left out 75 beers.\n",
    "\n",
    "Let us check this on the left out data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "eval_data = pd.read_csv(\"data/beers_eval.csv\")\n",
    "print(eval_data.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "eval_features = eval_data.iloc[:, :-1]\n",
    "eval_labels = eval_data.iloc[:, -1]\n",
    "\n",
    "# Predict\n",
    "predicted_labels = classifier.predict(eval_features)\n",
    "print(\n",
    "    \"{:.2f} % labeled correctly\".format(\n",
    "        sum(predicted_labels == eval_labels) / len(eval_labels) * 100\n",
    "    )\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div style=\"font-size:150%; font-weight: bold;\">\n",
    "           \n",
    "WHAT HAPPENED????\n",
    "<br/>\n",
    "<br/>\n",
    "Why is the accuracy on new data much lower?\n",
    "<br/>\n",
    "<br/>\n",
    "Answer: OVERFITTING !!\n",
    "\n",
    "</div>\n",
    "\n",
    "We observed a phenomenon called **\"overfitting\"**.\n",
    "\n",
    "\n",
    "<img src=\"./images/2qky90.jpg\" width=30% />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Overfitting\n",
    "\n",
    "To explain the concept of \"overfitting\" let's use the circle data set:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = pd.read_csv(\"data/circle.csv\")\n",
    "features = data.iloc[:, :-1]\n",
    "labels = data.iloc[:, -1]\n",
    "\n",
    "COLORS = [\"chocolate\", \"steelblue\"]\n",
    "\n",
    "plt.figure(figsize=(4, 4))\n",
    "ax = plt.subplot(1, 1, 1)\n",
    "plt.scatter(\n",
    "    features.iloc[:, 0], features.iloc[:, 1], c=[COLORS[l] for l in labels], marker=\"o\"\n",
    ");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We mentioned before that classifiers depend on (hyper)parameters (like `C`) which can be tuned to improve performance.\n",
    "\n",
    "Let us try to find out the purpose of the `gamma` parameter of `SVC` classifier:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# utility functions copy-pasted from previous script\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "\n",
    "\n",
    "def plot_points(features_2d, labels, plt=plt, marker=\"o\"):\n",
    "    colors = [[\"steelblue\", \"chocolate\"][i] for i in labels]\n",
    "    plt.scatter(features_2d[:, 0], features_2d[:, 1], color=colors, marker=marker)\n",
    "\n",
    "\n",
    "def train_and_plot_decision_surface(\n",
    "    name, classifier, features_2d, labels, preproc=None, plt=plt, marker=\"o\", N=300\n",
    "):\n",
    "\n",
    "    features_2d = np.array(features_2d)\n",
    "\n",
    "    xmin, ymin = features_2d.min(axis=0)\n",
    "    xmax, ymax = features_2d.max(axis=0)\n",
    "\n",
    "    x = np.linspace(xmin, xmax, N)\n",
    "    y = np.linspace(ymin, ymax, N)\n",
    "    points = np.array(np.meshgrid(x, y)).T.reshape(-1, 2)\n",
    "\n",
    "    if preproc is not None:\n",
    "        points_for_classifier = preproc.fit_transform(points)\n",
    "        features_2d = preproc.fit_transform(features_2d)\n",
    "    else:\n",
    "        points_for_classifier = points\n",
    "\n",
    "    classifier.fit(features_2d, labels)\n",
    "    predicted = classifier.predict(features_2d)\n",
    "\n",
    "    if preproc is not None:\n",
    "        name += \" (w/ preprocessing)\"\n",
    "    print(name + \":\\t\", sum(predicted == labels), \"/\", len(labels), \"correct\")\n",
    "\n",
    "    classes = np.array(classifier.predict(points_for_classifier), dtype=bool)\n",
    "    plt.scatter(\n",
    "        points[~classes][:, 0],\n",
    "        points[~classes][:, 1],\n",
    "        color=\"steelblue\",\n",
    "        marker=marker,\n",
    "        s=1,\n",
    "        alpha=0.05,\n",
    "    )\n",
    "    plt.scatter(\n",
    "        points[classes][:, 0],\n",
    "        points[classes][:, 1],\n",
    "        color=\"chocolate\",\n",
    "        marker=marker,\n",
    "        s=1,\n",
    "        alpha=0.05,\n",
    "    )\n",
    "\n",
    "    plot_points(features_2d, labels)\n",
    "    plt.title(name)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.svm import SVC\n",
    "\n",
    "df = pd.read_csv(\"data/circle.csv\")\n",
    "features = df.iloc[:, :-1]\n",
    "labels = df.iloc[:, -1]\n",
    "\n",
    "# three classifiers with different values for gamma:\n",
    "classifiers = [SVC(gamma=18), SVC(gamma=9), SVC(gamma=0.1)]\n",
    "\n",
    "plt.figure(figsize=(21, 6))\n",
    "\n",
    "for i, clf in enumerate(classifiers):\n",
    "\n",
    "    plt.subplot(1, len(classifiers), i + 1)\n",
    "    train_and_plot_decision_surface(\n",
    "        \"gamma = {}\".format(clf.gamma), clf, features, labels\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Observation\n",
    "\n",
    "The parameter `gamma` of `SVC` has an effect on the flexibility/complexity of the decision surface. A large value allows a very flexible / \"irregular\" decision surface, for smaller values the surface gets smoother / \"stiffer\" / \"more regular\" (allowing more misclassifications).\n",
    "\n",
    "This is also coined **simple** resp. **complex** models.\n",
    "\n",
    "We see here also \n",
    "\n",
    "- that the smallest `gamma` value produces a classifier which seems to get the idea of a \"circle\", \n",
    "- whereas the large `gamma` value adapts the classifier more to the training data samples."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's try an even larger `gamma` value:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clf = SVC(gamma=90)\n",
    "plt.figure(figsize=(6, 6))\n",
    "\n",
    "train_and_plot_decision_surface(\"gamma = {}\".format(clf.gamma), clf, features, labels)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The plot above shows an extreme example for the previously mentioned effect of overfitting.\n",
    "\n",
    "- If we evaluate performance of this classifier on the training data set we get an **accuracy of `~100%`**\n",
    "\n",
    "- But the classifier totally fails to learn the concept of a circle, and you can easily imagine how bad this classifier performs on new and unseen data.\n",
    "\n",
    "\n",
    "<div class=\"alert alert-block alert-warning\">\n",
    "<p style=\"font-weight: bold;\"><i class=\"fa fa-warning\"></i>&nbsp; Definitions</p>\n",
    "\n",
    "<ul>\n",
    "\n",
    "<li><strong>Overfitting</strong>: The classifier overfits if it too closely fits to/learns detail or noise in the training data instead of learning the underlying concept. Thus, the classifier does not generalize well and shows much worse performance on previously unseen new data.</li>\n",
    "<br/>\n",
    "<li><strong>Generalization</strong>: An ability of a classifier to learn the concept behind data. A classifier generalizes well if it shows similar performance on training and on new data.</li>\n",
    "<br/>\n",
    "<li><strong>Robust classifier</strong>: A classifier which is not or very little susceptible to overfitting when learning some data, i.e. a classfier which usually generalizes well.</li>\n",
    "\n",
    "\n",
    "</ul>\n",
    " \n",
    "</div>\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### More \"probabilistic\" definition\n",
    "\n",
    "- Our data is generated by a (usually unknown) model.\n",
    "- We have only samples from this model.\n",
    "- A classifier tries to approximate the underlying model based on the given samples.\n",
    "\n",
    "In this context the observed bad generalization performance of the classifier can be explained by computing a model which is to far away from the original model.\n",
    "\n",
    "The following graphics depicts our explanations: \n",
    "\n",
    "- The more \"complex\" a model gets the better it fits trainig data. Thus accuracy on the training data improves.\n",
    "- At a certain point the model is too adapted to the training data and gets worse and worse when evaluated later on previously unseen new data.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"./images/accuracy_training_vs_eval.svg\" width=50%/>  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Underfitting\n",
    "\n",
    "The other extreme of overfitting is called **underfitting**: the classifiers decision boundary deviates too far from the boundary in training data and produces a classifier which does not perform well even on training data.\n",
    "\n",
    "We can demonstrate this by choosing a \"too small\" value of `gamma`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# small gamma tries to build a \"safe\", \"perfect\" circle\n",
    "\n",
    "clf = SVC(gamma=0.06)\n",
    "plt.figure(figsize=(6, 6))\n",
    "\n",
    "train_and_plot_decision_surface(\"gamma = {}\".format(clf.gamma), clf, features, labels)\n",
    "# plt.scatter(features.iloc[:, 0], features.iloc[:, 1], color=c, marker='.');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Diagnosing and solving the overfitting problem\n",
    "\n",
    "### How did we fall for overfitting? \n",
    "\n",
    "<div class=\"alert alert-block alert-warning\">\n",
    "\n",
    "<div style=\"font-size:150%;\">\n",
    "    <i class=\"fa fa-info-circle\"></i>\n",
    "    <center>\n",
    "Our fundamental mistake was to evaluate the performace <br/>of the classifier on the training data.\n",
    "\n",
    "</center>\n",
    "</div>\n",
    "</div>\n",
    "\n",
    "Repeat:\n",
    "\n",
    "<div class=\"alert alert-block alert-warning\">\n",
    "\n",
    "\n",
    "\n",
    "<div style=\"font-size:150%;\">\n",
    "     <i class=\"fa fa-info-circle\"></i>\n",
    "    <center>\n",
    "Our fundamental mistake was to evaluate the performace <br/>of the classifier on the training data.\n",
    "\n",
    "</center>\n",
    "</div>\n",
    "</div>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### How can we do better?\n",
    "\n",
    "There is no classifier which works out of the box in all situations. Depending on the \"geometry\" / \"shape\" of the data, classification algorithms and their settings can make a big difference.\n",
    "\n",
    "In our previous 2D examples we were able to visualize the data and classification results, this is not possible for higher dimensional data.\n",
    "\n",
    "The general way to handle this situation is as follows: \n",
    "\n",
    "- split our data into a learning data set and a test data set\n",
    "\n",
    "\n",
    "- train the classifier on the learning data set\n",
    "\n",
    "\n",
    "- assess performance of the classifier on the test data set."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Cross-validation\n",
    "\n",
    "The procedure called *cross-validation* goes a step further in data splitting: In this procedure the full dataset is split into learn-/test-set in various ways. Statistics of the achieved test performance is computed to assess future performance of the classifier.\n",
    "\n",
    "A common approach is **K-fold cross-validation**:\n",
    "\n",
    "K-fold cross-validation has an advantage that we do not leave out part of our data from training. This is useful when we do not have a lot of data.\n",
    "\n",
    "<img src=\"./images/305azk.jpg\" title=\"made at imgflip.com\" width=40%/>\n",
    "\n",
    "### Example: 4-fold cross validation\n",
    "\n",
    "For 4-fold cross validation we split our data set into four equal sized partitions P1, P2, P3 and P4.\n",
    "\n",
    "We:\n",
    "\n",
    "- hold out `P1`: train the classifier on `P2 + P3 + P4`, compute accuracy `m1` on `P1`.\n",
    "\n",
    "<img src=\"./images/cross_val_0.svg\" />\n",
    "\n",
    "-  hold out `P2`: train the classifier on `P1 + P3 + P4`, compute accuracy `m2` on `P2`.\n",
    "\n",
    "<img src=\"./images/cross_val_1.svg\" />\n",
    "\n",
    "-  hold out `P3`: train the classifier on `P1 + P2 + P4`, compute accuray `m3` on `P3`.\n",
    "\n",
    "<img src=\"./images/cross_val_2.svg\" />\n",
    "\n",
    "-  hold out `P4`: train the classifier on `P1 + P2 + P3`, compute accuracy `m4` on `P4`.\n",
    "\n",
    "<img src=\"./images/cross_val_3.svg\" />\n",
    "\n",
    "Finally we can compute the average of `m1` .. `m4` as the final measure for accuracy.\n",
    "\n",
    "Some advice:\n",
    "\n",
    "- This can be done on the original data or on randomly shuffled data. It is recommended to shuffle the data first, as there might be some unknown underlying ordering in your dataset\n",
    "\n",
    "- Usually one uses 3- to 10-fold cross validation, depending on the amount of data available."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Variant: randomized cross validation\n",
    "\n",
    "A randomized variant works like this:\n",
    "\n",
    "- Perform $n$ iterations:\n",
    "\n",
    "   - draw a fraction $p$ (e.g. 80%) from your full data set without replacement for the training data set.\n",
    "   - use the remaining fraction $1 - p$ as evaluation data set\n",
    "   - train classifier and compute performance score(s).\n",
    "  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Cross valiation with scikit-learn"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "beer = pd.read_csv(\"data/beers.csv\")\n",
    "beer_eval = pd.read_csv(\"data/beers_eval.csv\")\n",
    "\n",
    "# Since we're using cross validation, let's use all data\n",
    "all_beer = pd.concat((beer, beer_eval))\n",
    "\n",
    "all_beer.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's use the familiar _accuracy_ score: a percentage of correctly classified samples. (More about other ways of assessing quality of a classifier in one of the following scripts.)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.utils import shuffle\n",
    "\n",
    "all_beer = shuffle(all_beer, random_state=42)  # fix randomization for reproduciblity\n",
    "\n",
    "features = all_beer.iloc[:, :-1]\n",
    "labels = all_beer.iloc[:, -1]\n",
    "\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "classifier = LogisticRegression(C=2)\n",
    "\n",
    "from sklearn.model_selection import cross_val_score\n",
    "\n",
    "# 4-fold cross validation with the way we've evaluated classifiers\n",
    "# up to now: \"accuracy\" score (the percentage of correct classification)\n",
    "scores = cross_val_score(classifier, features, labels, scoring=\"accuracy\", cv=4)\n",
    "\n",
    "for i, score in enumerate(scores):\n",
    "    print(\"Fold\", i + 1, \"score:\", score)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `cross_val_score` as used in the previous code example works as follows:\n",
    "\n",
    "0. split training data in four chunks\n",
    "- learn `classifier` on chunk `1, 2, 3`, apply classifier to chunk `4` and compute score `s1`\n",
    "- learn `classifier` on chunk `1, 2, 4`, apply classifier to chunk `3` and compute score `s2`\n",
    "- learn `classifier` on chunk `1, 3, 4`, apply classifier to chunk `2` and compute score `s3`\n",
    "- learn `classifier` on chunk `2, 3, 4`, apply classifier to chunk `1` and compute score `s4`\n",
    "\n",
    "`cross_val_score` finally returns `[s1, s2, s3, s4]`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "m = scores.mean()\n",
    "s = scores.std()\n",
    "\n",
    "low = m - 2 * s\n",
    "high = m + 2 * s\n",
    "\n",
    "print(\"mean test score is {:.3f}\".format(m))\n",
    "print(\"std dev of test score is {:.3f}\".format(s))\n",
    "# and, assuming normality of the scores\n",
    "print(\n",
    "    \"true test score is with 96% probability between {:.3f} and {:.3f}\".format(\n",
    "        low, high\n",
    "    )\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise section\n",
    "\n",
    "1. Play with the previous examples.\n",
    "2. Try out different number of cross validation folds for the beer data. What happens with the score?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "beer = pd.read_csv(\"data/beers.csv\")\n",
    "beer_eval = pd.read_csv(\"data/beers_eval.csv\")\n",
    "\n",
    "all_beer = pd.concat((beer, beer_eval))\n",
    "\n",
    "from sklearn.utils import shuffle\n",
    "\n",
    "all_beer = shuffle(all_beer, random_state=42)  # fix randomization for reproduciblity\n",
    "\n",
    "features = all_beer.iloc[:, :-1]\n",
    "labels = all_beer.iloc[:, -1]\n",
    "\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "classifier = LogisticRegression(C=2)\n",
    "\n",
    "from sklearn.model_selection import cross_val_score\n",
    "\n",
    "for k in [2, 5, 10, 25, 50, 150]:\n",
    "    scores = cross_val_score(classifier, features, labels, scoring=\"accuracy\", cv=k)\n",
    "    m = scores.mean()\n",
    "    s = scores.std()\n",
    "    print(\"{:3d}-fold accuracy score is {:.3f} +/- {:.3f}\".format(k, m, s))\n",
    "\n",
    "#\n",
    "# Q: What happens with the score?\n",
    "#\n",
    "# Mean score increases, very slightly from a certain number of folds (here, 25),\n",
    "# and variance of the score increases significantly.\n",
    "#\n",
    "# Intuitively, with very high number of folds models become similar across folds,\n",
    "# as they fit a big common set of samples, whereas single misclassifications in\n",
    "# the small testing sets result in much smaller accuracies, increasing variance.\n",
    "#"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-info\">\n",
    "<p style=\"font-weight: bold;\"><i class=\"fa fa-info-circle\"></i>&nbsp;Rule of thumb</p>\n",
    "<p>Preffer 5- or 10- fold cross validation.</p>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Optional exercises\n",
    "\n",
    "1. Split the dataset `data/spiral.csv` in 300 features/labels for training and 100 features/labels for evaluation. Find a good classifier which reaches 100% accuracy on the training samples, then evaluate the trained classifier on the remaining 100 samples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from sklearn.svm import SVC\n",
    "\n",
    "df = pd.read_csv(\"data/spiral.csv\")\n",
    "n_train = 300\n",
    "features_learn = df.iloc[:n_train, :-1]\n",
    "features_eval = df.iloc[n_train:, :-1]\n",
    "\n",
    "labels_learn = df.iloc[:n_train, -1]\n",
    "labels_eval = df.iloc[n_train:, -1]\n",
    "\n",
    "clf = SVC(gamma=3, C=90)\n",
    "clf.fit(features_learn, labels_learn)\n",
    "\n",
    "predicted = clf.predict(features_learn)\n",
    "print(\n",
    "    \"training accuracy: {:3.1f}%\".format(\n",
    "        sum(predicted == labels_learn) * 100 / len(predicted)\n",
    "    )\n",
    ")\n",
    "\n",
    "predicted = clf.predict(features_eval)\n",
    "print(\n",
    "    \"testing accuracy: {:3.1f}%\".format(\n",
    "        sum(predicted == labels_eval) * 100 / len(predicted)\n",
    "    )\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Some reasons for overfitting and how you might fight it.\n",
    "\n",
    "###  Small / insufficient data sets.\n",
    "\n",
    "The classifier fails to \"grab the concept\" because the \"concept\" is not represented strongly enough in the data set. \n",
    "\n",
    "Possible solutions:\n",
    "\n",
    "- Get more data.\n",
    "- Augment your data by creating artificial/synthetic data (e.g. for images: shift / scale / rotate images) if feasible.\n",
    "\n",
    "\n",
    "### Unsuitable classifier / classifier parameters used\n",
    "\n",
    "This is what we observed in the example before.\n",
    "\n",
    "Possible solutions:\n",
    "\n",
    "- Optimize parameters using cross-validation.\n",
    "\n",
    "- Evaluate other classification algorithms.\n",
    "\n",
    "###  Noisy / uninformative features\n",
    "\n",
    "A classifier can in some situations use noisy or uninformative features to explain noise in the training data. In such cases features noise contributes to \"artificially\" good results on the training data.\n",
    "\n",
    "Possible solutions:\n",
    "\n",
    "- Use features selection techniques:<br/><br/>\n",
    "\n",
    "    - Inspect your data to detect noisy or uninformative features.\n",
    "        - See e.g. [removing features with low variance in scikit-learn](https://scikit-learn.org/stable/modules/feature_selection.html#removing-features-with-low-variance)<br/><br/>\n",
    "\n",
    "    - Try learning classifier with some features excluded.\n",
    "        - This can be automated, see [recursive feature elimination in scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html#sklearn.feature_selection.RFE).\n",
    "        - Random forest classifiers learn in such way (more about them later), hence, supporting features exclusion directly.<br/><br/>\n",
    "\n",
    "    - Penalize for using many features (prefer simpler models).\n",
    "        - So called *sparse* learning methods do that (more about them later) and they can be used only for data pre-processing step, see [L1-based feature selection in scikit-learn](https://scikit-learn.org/stable/modules/feature_selection.html#l1-based-feature-selection)<br/><br/>\n",
    "\n",
    "- Use dimension reduction techniques like `PCA` (more about this later).\n",
    "\n",
    "### Strongly correlated / redundant features\n",
    "\n",
    "In case the data set contains strongly, but not 100% correlated features, their (weighted) difference might be considered as random data. The effect is then similar to having noisy or uninformative features.\n",
    "\n",
    "Possible solutions:\n",
    "\n",
    "- Same as for noise or uninformative features: features selection or dimension reduction techniques.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following code demonstrates the effect of noise and redundant features:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "beer_data = pd.read_csv(\"data/beers.csv\")\n",
    "\n",
    "# all columns up to the last one:\n",
    "input_features = beer_data.iloc[:, :-1]\n",
    "input_labels = beer_data.iloc[:, -1]\n",
    "\n",
    "eval_data = pd.read_csv(\"data/beers_eval.csv\")\n",
    "\n",
    "eval_features = eval_data.iloc[:, :-1]\n",
    "eval_labels = eval_data.iloc[:, -1]\n",
    "\n",
    "\n",
    "def assess(classifier, input_features, eval_features):\n",
    "\n",
    "    predicted_labels = classifier.predict(input_features)\n",
    "    print(\n",
    "        \"{:.2f} % labeled correctly on training dataset\".format(\n",
    "            sum(predicted_labels == input_labels) / len(input_labels) * 100\n",
    "        )\n",
    "    )\n",
    "\n",
    "    # Predict\n",
    "    predicted_labels = classifier.predict(eval_features)\n",
    "    print(\n",
    "        \"{:.2f} % labeled correctly on evaluation dataset\".format(\n",
    "            sum(predicted_labels == eval_labels) / len(eval_labels) * 100\n",
    "        )\n",
    "    )\n",
    "\n",
    "\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "classifier = SVC(C=2, gamma=2)\n",
    "\n",
    "classifier.fit(input_features, input_labels)\n",
    "\n",
    "print(\"ORIGINAL DATA\")\n",
    "assess(classifier, input_features, eval_features)\n",
    "\n",
    "print()\n",
    "print(\"WITH ADDED NOISY FEATURES\")\n",
    "np.random.seed(5)\n",
    "\n",
    "# Extend original data by adding new features:\n",
    "#\n",
    "# 1. alcohol_content with some random noise added\n",
    "# 2. pure random noise\n",
    "#\n",
    "# to both training data\n",
    "input_features[\"redundant\"] = input_features.loc[:, \"alcohol_content\"] + 1 * (\n",
    "    np.random.random((225,)) - 0.5\n",
    ")\n",
    "input_features[\"noise\"] = 0.1 * (np.random.random((225,)) - 0.5)\n",
    "# and evaluation data\n",
    "eval_features[\"redundant\"] = eval_features.loc[:, \"alcohol_content\"] + 1 * (\n",
    "    np.random.random((75,)) - 0.5\n",
    ")\n",
    "eval_features[\"noise\"] = 0.1 * (np.random.random((75,)) - 0.5)\n",
    "\n",
    "classifier.fit(input_features, input_labels)\n",
    "\n",
    "assess(classifier, input_features, eval_features)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can see above that the classifier yields better accuracy on the extended training data set. But you also can see that the performance on the extended evaluation data set is worse than before.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-info\">\n",
    "<p style=\"font-weight: bold;\"><i class=\"fa fa-info-circle\"></i>&nbsp;About applicability to regression</p>\n",
    "\n",
    "<p>We're talking here about overfitting, underfitting and cross-validation in context of classification/classifiers, but these problems or methods, and related workarounds, apply in general to supervised learning methods, so also to regression methods about which we will learn later on.</p>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training the final classifier\n",
    "\n",
    "Cross-validation was helpful to determine and tune a good classifier. But how do we eventually build the classifier we want to use later \"in production\" ?\n",
    "\n",
    "A common procedure is:\n",
    "\n",
    "- Split your data 80% to 20% (or another fraction) from the beginning.\n",
    "\n",
    "\n",
    "- Use the 80% fraction for determining and tuning a classifier.\n",
    "\n",
    "\n",
    "- Train the final classifier on the 80% part.\n",
    "\n",
    "\n",
    "- Finally use the 20% fraction for a final validation of the classifiers accuracy.\n",
    "\n",
    "<img src=\"./images/cross_eval_and_test.svg?7\">\n",
    "\n",
    "Comment: Literature is not consistent in terms. Sometimes the terms \"validation data set\" and \"test data set\" are interchanged."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Demonstration\n",
    "\n",
    "We introduce the `train_test_split` function from `sklearn.model_selection` in the following example.\n",
    "\n",
    "It splits features and labels in a given proportion. Usually this is randomized, so that you get different results for every function invocation. To get the same result every time we use `random_state=..` (with arbitrary number) below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "beer = pd.read_csv(\"data/beers.csv\")\n",
    "beer_eval = pd.read_csv(\"data/beers_eval.csv\")\n",
    "all_beer = pd.concat((beer, beer_eval))\n",
    "\n",
    "features = all_beer.iloc[:, :-1]\n",
    "labels = all_beer.iloc[:, -1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# SHUFFLE AND SPLIT DATA 80:20\n",
    "# with fixed randomization\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "# Note 1: `shuffle=True` is default, hence, unnecessary to specify\n",
    "# Note 2: using `stratify=labels` to perserve classes proportion after split same as in the original dataset\n",
    "(\n",
    "    features_crosseval,\n",
    "    features_validation,\n",
    "    labels_crosseval,\n",
    "    labels_validation,\n",
    ") = train_test_split(features, labels, test_size=0.2, stratify=labels, random_state=42)\n",
    "\n",
    "print(\"# Whole dataset \")\n",
    "print(\"number of all samples:\", len(labels))\n",
    "print(\"proportion of yummy samples:\", sum(labels == 1) / len(labels))\n",
    "print()\n",
    "print(\"# Cross-validation dataset \")\n",
    "print(\"number of all samples:\", len(labels_crosseval))\n",
    "print(\n",
    "    \"proportion of yummy samples:\", sum(labels_crosseval == 1) / len(labels_crosseval)\n",
    ")\n",
    "print()\n",
    "print(\"# Validation dataset \")\n",
    "print(\"number of all samples:\", len(labels_validation))\n",
    "print(\n",
    "    \"proportion of yummy samples:\", sum(labels_validation == 1) / len(labels_validation)\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Moreover, we introduce use of explicit speficiation of a cross-validation method: `StratifiedKFold` from `sklearn.model_selection`. \n",
    "\n",
    "This allows us to spilt data during cross validation in the same way as we did with `train_test_split`, i.e. \n",
    "\n",
    "a) with data shufflling before split, and \n",
    "\n",
    "b) perserving class-proportions of samples, "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# FIND A \"BEST\" CLASSIFIER\n",
    "# with fixed randomization\n",
    "\n",
    "# By default `cross_val_score(.., cv=n)` call implicitly uses\n",
    "# `KFold(n_splits=n, shuffle=False)` cross-validator\n",
    "from sklearn.model_selection import StratifiedKFold\n",
    "\n",
    "cross_validator = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)\n",
    "\n",
    "results = []\n",
    "\n",
    "print(\"OPTIMIZE HYPERPARAMETERS\")\n",
    "# selected classifier hyperparameters to optimize\n",
    "SVC_C_values = (0.1, 1, 10)\n",
    "SVC_gamma_values = (0.1, 1, 10, 100)\n",
    "\n",
    "for C in SVC_C_values:\n",
    "    for gamma in SVC_gamma_values:\n",
    "        classifier = SVC(C=C, gamma=gamma)\n",
    "        test_scores = cross_val_score(\n",
    "            classifier,\n",
    "            features_crosseval,\n",
    "            labels_crosseval,\n",
    "            scoring=\"accuracy\",\n",
    "            cv=cross_validator,\n",
    "        )  # cv arg is now different\n",
    "        print(\n",
    "            \"score = {:.3f} +/- {:.3f}, C = {:5.1f},  gamma = {:5.1f}\".format(\n",
    "                test_scores.mean(), test_scores.std(), C, gamma\n",
    "            )\n",
    "        )\n",
    "        results.append((test_scores.mean(), test_scores.std(), C, gamma))\n",
    "\n",