    {
     "cells": [
    
      {
       "cell_type": "code",
    
       "metadata": {},
       "outputs": [],
    
       "source": [
        "# IGNORE THIS CELL WHICH CUSTOMIZES LAYOUT AND STYLING OF THE NOTEBOOK !\n",
        "%matplotlib inline\n",
        "%config InlineBackend.figure_format = 'retina'\n",
        "import warnings\n",
    
        "\n",
        "import matplotlib.pyplot as plt\n",
        "\n",
        "warnings.filterwarnings(\"ignore\", category=FutureWarning)\n",
        "warnings.filterwarnings(\"ignore\", message=\"X does not have valid feature names, but [a-zA-Z]+ was fitted with feature names\", category=UserWarning)\n",
        "                                  \n",
    
    schmittu's avatar
    schmittu committed
        "warnings.filterwarnings = lambda *a, **kw: None\n",
    
        "from IPython.core.display import HTML\n",
        "\n",
        "HTML(open(\"custom.html\", \"r\").read())"
    
    schmittu's avatar
    schmittu committed
      },
    
      {
       "cell_type": "markdown",
    
       "metadata": {},
    
       "source": [
        "# Chapter 2: Classification"
    
    schmittu's avatar
    schmittu committed
      },
      {
       "cell_type": "markdown",
    
       "metadata": {},
    
       "source": [
    
        "As we have learned in the previous chapter *classification* is a machine learning problem belonging to a group of *supervised learning* problems. In classification the aim is to learn how to predict a class of a categorical label, based on set of already labelled training examples (hence, supervised). Such labels (categories) and corresponding classes can be:\n",
    
        "\n",
    
        "- sick: yes / no,\n",
        "- character: good / bad / dont't know,\n",
        "- digit: 0 / ... / 9.\n",
    
        "\n",
    
        "In this chapter we introduce the core concepts of classification."
    
       ]
    
      },
      {
       "cell_type": "markdown",
    
       "metadata": {},
    
       "source": [
    
        "## How to build a simple classifier?"
    
       ]
    
      },
      {
       "cell_type": "markdown",
    
       "metadata": {},
    
       "source": [
    
        "Let's quickly look again at the beer example:"
    
       ]
    
      },
      {
       "cell_type": "code",
    
       "metadata": {},
       "outputs": [],
    
       "source": [
    
        "import pandas as pd\n",
    
        "pd.set_option('display.precision', 3)\n",
    
        "\n",
    
        "import seaborn as sns\n",
        "sns.set(style=\"ticks\")\n",
    
        "\n",
    
        "beer_data = pd.read_csv(\"data/beers.csv\")\n",
    
        "\n",
    
        "for_plot = beer_data.copy()\n",
    
        "\n",
    
        "# fixes seaborn labels issue\n",
        "def translate_label(value):\n",
        "    return \"no\" if value == 0 else \"yes\"\n",
    
        "\n",
    
        "for_plot[\"is_yummy\"] = for_plot[\"is_yummy\"].apply(translate_label)\n",
        "\n",
    
        "sns.pairplot(for_plot, hue=\"is_yummy\", diag_kind=\"hist\", diag_kws=dict(alpha=.7));\n",
    
        "beer_data.describe()"
    
    schmittu's avatar
    schmittu committed
      },
    
      {
       "cell_type": "markdown",
    
       "metadata": {},
    
       "source": [
        "We can assume that the person who rated these beers has preferences such as:\n",
        "* \"I don't like too low alcohol content\",\n",
        "* \"I like more fruity beers\", etc."
    
      },
      {
       "cell_type": "markdown",
    
       "metadata": {},
    
       "source": [
        "This means we could construct a score where high numbers relate to \"favorable beer\". One simple way to implement such a score is to use a weighted sum like:\n",
        "\n",
        "     score = 1.1 * alcohol_content + 4 * bitterness + 1.5 * darkness + 1.8 * fruitiness\n",
        "\n",
        "The actual weights here are guessed and serve as an example.\n",
        "\n",
        "The size of the numbers reflects the numerical ranges of the features: alcohol content is in the range 3 to 5.9, where as bitterness is between 0 and 1.08:"
    
    schmittu's avatar
    schmittu committed
      {
       "cell_type": "code",
    
       "metadata": {},
       "outputs": [],
    
       "source": [
    
        "scores =( 1.1 * beer_data[\"alcohol_content\"] + 4 * beer_data[\"bitterness\"] \n",
    
    schmittu's avatar
    schmittu committed
        "          + 1.5 * beer_data[\"darkness\"] + 1.8 * beer_data[\"fruitiness\"])\n",
        "scores.shape"
    
    schmittu's avatar
    schmittu committed
      },
      {
       "cell_type": "markdown",
    
       "metadata": {},
    
       "source": [
        "Now we can plot the histogram of the scores by classes:"
    
    schmittu's avatar
    schmittu committed
      },
      {
       "cell_type": "code",
    
       "metadata": {},
       "outputs": [],
    
       "source": [
        "scores_bad = scores[beer_data[\"is_yummy\"] == 0]\n",
    
        "scores_good = scores[beer_data[\"is_yummy\"] == 1]\n",
    
        "\n",
    
        "plt.hist(scores_bad, bins=25, color=\"steelblue\", alpha=.7) # alpha makes bars translucent\n",
        "plt.hist(scores_good,  bins=25, color=\"chocolate\", alpha=.7);"
    
    schmittu's avatar
    schmittu committed
      },
      {
       "cell_type": "markdown",
    
       "metadata": {},
    
       "source": [
    
        "Consequence: a simple classifier could use these scores and use a threshold around 10.5 to assign a class label."
    
    schmittu's avatar
    schmittu committed
      },
    
      {
       "cell_type": "code",
    
       "metadata": {},
       "outputs": [],
    
       "source": [
        "def classify(beer_feature):\n",
    
        "    scores = (1.1 * beer_feature[\"alcohol_content\"] + 4 * beer_feature[\"bitterness\"] \n",
        "              + 1.5 * beer_feature[\"darkness\"] + 1.8 * beer_feature[\"fruitiness\"])\n",
    
        "    if scores > 10.5:\n",
    
        "        return \"yummy\"\n",
        "    else:\n",
    
        "        return \"not yummy\"\n",
        "\n",
        "# check this for samples 5 .. 14:\n",
    
        "for i in range(5, 15):\n",
    
        "    is_yummy = translate_label(beer_data[\"is_yummy\"][i])\n",
    
        "    classified_as = classify(beer_data.iloc[i, :])\n",
        "    print(i, \n",
    
        "          \"is yummy?\", \"{:3s}\".format(is_yummy),\n",
    
        "          \".. classified as\", classified_as)"
    
       ]
    
      },
    
      {
       "cell_type": "markdown",
    
       "metadata": {},
    
       "source": [
    
        "**This is how \"linear\" classifiers work. The magic is in computing the weights and the final threshold to guarantee good results.**\n",
    
        "\n",
    
        "<div class=\"alert alert-block alert-info\">\n",
        "<i class=\"fa fa-info-circle\"></i>\n",
        "Although this seems to be a simplistic concept, linear classifiers can actually work very well, especially for problems with many features (high-dimensional problems).\n",
        "</div>\n"
    
    schmittu's avatar
    schmittu committed
      },
    
      {
       "cell_type": "markdown",
    
       "metadata": {},
    
       "source": [
        "## Exercise section 1\n",
        "\n",
    
        "- Modify the weights in the beer classifiers and check if you can improve separation in the histogram.\n",
    
        "\n",
    
        "\n",
    
        "- In `scikit-learn` the weights of a trained linear classifier are availble via the `coef_` attribute as a 2 dimensional `numpy` array. Extract the weights from the `LogisticRegression` classifier example from the last script and try them out in your weighted sum scoring function."
    
      },
      {
       "cell_type": "code",
    
       "metadata": {
        "tags": [
         "solution"
        ]
       },
       "outputs": [],
    
       "source": [
        "from sklearn.linear_model import LogisticRegression\n",
    
        "\n",
        "\n",
    
        "classifier = LogisticRegression()\n",
        "input_features = beer_data.iloc[:, :-1]\n",
        "labels = beer_data.iloc[:, -1]\n",
        "classifier.fit(input_features, labels)\n",
        "w = classifier.coef_[0]\n",
    
        "\n",
        "scores =( w[0] * beer_data[\"alcohol_content\"] + w[1] * beer_data[\"bitterness\"] \n",
        "          + w[2] * beer_data[\"darkness\"] + w[3] * beer_data[\"fruitiness\"])\n",
        "\n",
        "scores_bad = scores[beer_data[\"is_yummy\"] == 0]\n",
        "scores_good = scores[beer_data[\"is_yummy\"] == 1]\n",
        "\n",
        "plt.hist(scores_bad, bins=25, color=\"steelblue\", alpha=.7) # alpha makes bars translucent\n",
        "plt.hist(scores_good,  bins=25, color=\"chocolate\", alpha=.7);"
    
    schmittu's avatar
    schmittu committed
      },
    
      {
       "cell_type": "markdown",
    
   "metadata": {},
       "source": [
        "## Geometrical interpretation of feature vectors\n",
        "\n",
    
        "If you take the values of an input-feature vector you can imagine this as a point in a n-dimensional space.\n",
    
        "\n",
    
        "E.g. if a data set consists of feature vectors of length 2, you can interpret the first feature value as a x-coordinate and the second value as a y-coordinate.\n",
    
        "\n",
    
        "Classes then group points."
    
    schmittu's avatar
    schmittu committed
      },
      {
       "cell_type": "markdown",
    
       "metadata": {},
    
       "source": [
        "### Example\n",
        "\n",
        "For sake of simplicity we restrict our beer data set to two features: `alcohol_content` and `bitterness`.\n",
        "\n",
        "The following plot shows how these reduced feature vectors can be interpreted as point clouds. For every feature vector we color points in green or red to indicate the according classes:"
    
    schmittu's avatar
    schmittu committed
      },
      {
       "cell_type": "code",
    
       "metadata": {},
       "outputs": [],
    
       "source": [
        "xv = beer_data[\"alcohol_content\"]\n",
        "yv = beer_data[\"bitterness\"]\n",
        "\n",
    
        "colors = [[\"steelblue\", \"chocolate\"][i] for i in beer_data[\"is_yummy\"]]\n",
    
        "\n",
    
        "plt.scatter(xv, yv, color=colors, marker='o');\n",
    
        "plt.xlabel(\"alcohol_content\")\n",
        "plt.ylabel(\"bitterness\");"
    
    schmittu's avatar
    schmittu committed
      },
      {
       "cell_type": "markdown",
    
       "metadata": {},
    
       "source": [
        "\n",
        "What do we see here ?\n",
        "\n",
        "1. Both point clouds overlap, this tells us that the two features lack information for a 100% separation of classes. \n",
        "2. We could draw a line to separate most points of both clouds.\n",
    
        "3. Later we could use this line to make a guess for classifying a new feature vector."
    
      },
      {
       "cell_type": "markdown",
    
       "metadata": {},
    
       "source": [
        "<div class=\"alert alert-block alert-warning\">\n",
        "<p><i class=\"fa fa-warning\"></i>&nbsp;\n",
    
        "Eventually, classification is about finding a procedure to separate point clouds in a n-dimensional space.\n",
    
        "</div>\n",
        "\n",
    
        "\n",
        "<a href='https://imgflip.com/i/303vuc'><img src=\"./images/303vuc.jpg\" width=50%/></a><br/><a href='https://imgflip.com/gif-maker'> via Imgflip GIF Maker</a> "
    
    schmittu's avatar
    schmittu committed
      },
      {
       "cell_type": "markdown",
    
       "metadata": {},
    
       "source": [
    
        "Next, we illustrate how more features can support classification. We add the `darkness` feature as a third dimension.\n"
    
    schmittu's avatar
    schmittu committed
      },
      {
       "cell_type": "code",
    
       "metadata": {},
       "outputs": [],
    
       "source": [
        "from mpl_toolkits.mplot3d import Axes3D\n",
        "\n",
        "fig = plt.figure(figsize=(20, 7))\n",
        "\n",
        "xv = beer_data[\"alcohol_content\"]\n",
        "yv = beer_data[\"darkness\"]\n",
        "zv = beer_data[\"bitterness\"]\n",
        "\n",
    
        "colors = [[\"steelblue\", \"chocolate\"][i] for i in beer_data[\"is_yummy\"]]\n",
    
        "\n",
    
        "def plot3d(ax):\n",
        "    ax.scatter(xv, yv, zv, c=colors, marker='o') \n",
        "    \n",
    
        "    ax.set_xlabel('alcohol_content')\n",
        "    ax.set_ylabel('darkness')\n",
        "    ax.set_zlabel('bitterness');\n",
        "\n",
        "ax = fig.add_subplot(121, projection='3d')\n",
        "\n",
    
        "plot3d(ax)\n",
    
        "ax.view_init(3, 270)\n",
    
        "\n",
        "ax = fig.add_subplot(122, projection='3d')\n",
    
        "plot3d(ax)\n",
    
        "ax.view_init(3, 0);"
    
    schmittu's avatar
    schmittu committed
      },
      {
       "cell_type": "markdown",
    
       "metadata": {},
    
       "source": [
        "The first view is very similar to the scatter plot before as we don't see the effects of the third feature. \n",
        "\n",
    
        "The second view shows the same cube rotated by 90˚ horizontally. We see that the new dimension adds extra information which could improve separation by separating more points.\n",
    
        "\n",
        "Geometrically, the 1D line, which could separat samples in the previous example that used 2D samples, would be now a 2D plane. It would still look like a line in the first view, but rotating it using the third dimensions could separate more points.\n"
    
    schmittu's avatar
    schmittu committed
      },
      {
       "cell_type": "markdown",
    
       "metadata": {},
    
       "source": [
    
        "### Decision surfaces\n",
    
        "\n",
    
        "The concept of decision surfaces is crucial in classification.\n",
    
        "\n",
    
        "Lets start with an easy to visualize 2D features space."
    
      },
      {
       "cell_type": "markdown",
    
       "metadata": {},
    
       "source": [
        "#### Decision lines\n",
    
        "\n",
    
        "For a score based on a weighted sum for two features `x` and `y` the equation\n",
    
        "\n",
        "\n",
        "     weight_x * x + weight_y * y = threshold\n",
        "     \n",
        "\n",
    
        "can be rearranged to the form `y = a * x + b` and thus defines a line in 2D space. Points fulfilling\n",
    
        "\n",
        "     weight_x * x + weight_y * y < threshold\n",
        "      \n",
        "      \n",
        "vs\n",
        "\n",
        "     weight_x * x + weight_y * y > threshold\n",
        "      \n",
        "\n",
    
        "are located on opposite sides of this line. Such a classifier thus determines a line which separates the feature space in two parts according to the two classes."
    
    schmittu's avatar
    schmittu committed
      },
      {
       "cell_type": "markdown",
    
       "metadata": {},
    
       "source": [
        "Lets visualize this! \n",
        "\n",
        "We \n",
        "\n",
    
        "1. define a decision line (using weights and threshold),\n",
    
        "2. create random 2D samples,\n",
        "3. compute scores for the samples,\n",
        "4. split points according to their score compared to the threshold,\n",
        "5. plot samples in different colors,\n",
        "6. plot decision line.\n"
    
    schmittu's avatar
    schmittu committed
      },
      {
       "cell_type": "code",
    
       "metadata": {},
       "outputs": [],
    
       "source": [
        "import numpy as np\n",
        "\n",
    
        "# define a decision line\n",
    
        "wx = -.5\n",
        "wy = 1.2\n",
    
        "weights = np.array((wx, wy))\n",
    
        "threshold = 0.2\n",
        "\n",
    
        "# create random 2D samples\n",
    
        "points_2d = np.random.random((200, 2))\n",
        "\n",
    
        "# compute scores for the samples\n",
    
        "scores = points_2d  @ weights # (samples matrix) x (weights vector) = (scores vector)\n",
    
        "\n",
        "# split points according to their score compared to the threshold \n",
    
        "above_points = points_2d[scores > threshold]\n",
        "below_points = points_2d[scores < threshold]\n",
        "\n",
    
        "# plot samples in different colors\n",
        "plt.scatter(above_points[:, 0], above_points[:, 1], label=\"above\",\n",
        "            color=\"steelblue\", marker=\"o\")\n",
        "plt.scatter(below_points[:, 0], below_points[:, 1], label=\"below\",\n",
        "            color=\"chocolate\", marker=\"o\")\n",
    
        "\n",
        "# plot decision line\n",
        "x = np.linspace(-.01, 1.01, 2)\n",
        "y = threshold / weights[1] - weights[0] / weights[1] * x\n",
    
        "plt.plot(x, y, color='k', linestyle=':')\n",
    
        "plt.legend();"
    
    schmittu's avatar
    schmittu committed
      },
      {
       "cell_type": "markdown",
    
       "metadata": {},
    
       "source": [
    
        "#### Decision (hyper)plane"
    
      },
      {
       "cell_type": "markdown",
    
       "metadata": {},
    
       "source": [
        "For 3D samples a linear classifiers separates into classes by a 2D plane, and in general, for `n` dimensions we get `n-1` dimensional hyperplanes."
    
      },
      {
       "cell_type": "markdown",
    
       "metadata": {},
    
       "source": [
        "Let's visualize a decision plane the same way we did visualize the line."
    
      },
      {
       "cell_type": "code",
    
       "metadata": {},
       "outputs": [],
    
       "source": [
        "import numpy as np\n",
        "import matplotlib.pyplot as plt\n",
    
        "from mpl_toolkits.mplot3d import Axes3D\n",
    
        "\n",
        "# define a decision line\n",
        "wx = -.5\n",
        "wy = 1.2\n",
        "wz = .4\n",
        "weights = np.array((wx, wy, wz))\n",
        "threshold = 0.3\n",
        "\n",
        "# create random 3D samples\n",
        "points_3d = np.random.random((200, 3))\n",
        "\n",
        "# compute scores for the samples\n",
        "scores = points_3d @ weights   # (samples matrix) x (weights vector) = (scores vector)\n",
        "\n",
        "# split points according to their score compared to the threshold \n",
        "above_points = points_3d[scores > threshold]\n",
        "below_points = points_3d[scores < threshold]\n",
        "\n",
        "# plot samples in different colors\n",
        "fig = plt.figure(figsize=(10, 10))\n",
        "ax = fig.add_subplot(111, projection='3d')\n",
        "ax.set_xlabel('x')\n",
        "ax.set_ylabel('y')\n",
        "ax.set_zlabel('z');\n",
        "ax.set_xlim(0,1)\n",
        "ax.set_ylim(0,1)\n",
        "ax.set_zlim(0,1)\n",
        "\n",
        "sc_above = ax.scatter(above_points[:, 0], above_points[:, 1], above_points[:, 2],\n",
        "                      color=\"steelblue\", marker=\"o\")\n",
        "sc_below = ax.scatter(below_points[:, 0], below_points[:, 1], below_points[:, 2],\n",
        "                      color=\"chocolate\", marker=\"o\")\n",
        "ax.legend([sc_above, sc_below], ['above', 'below'], numpoints=1)\n",
        "\n",
        "# plot decision plane\n",
        "x = np.linspace(-1, 2, 50)\n",
        "y = np.linspace(-1, 2, 50)\n",
        "xx, yy = np.meshgrid(x, y)\n",
        "zz = (threshold - weights[0] * xx - weights[1] * yy) / weights[2]\n",
        "zz_cut = np.copy(zz) # don't plot points outside of the [0,1] range\n",
    
        "zz_cut[zz < 0.2] = np.nan\n",
        "zz_cut[zz > .8] = np.nan\n",
        "ax.plot_wireframe(xx, yy, zz_cut, color='k', linestyle=':', alpha=.5)\n",
    
        "# for readability, the view angles are chosen so that the 2D separation plane\n",
        "# looks like a line\n",
    
        "ax.view_init(20, 210);"
    
    schmittu's avatar
    schmittu committed
      },
      {
       "cell_type": "markdown",
    
       "metadata": {},
    
       "source": [
        "### Example\n",
        "\n",
    
        "In the initil beer example, features satisfying\n",
    
        "\n",
    
        "    1.1 * alcohol_content + 4 * bitterness + 1.5 * darkness + 1.8 * fruitiness == threshold\n",
    
        "    \n",
    
        "are located on a 3D hyperplane in a 4D features space.\n",
    
        "\n",
    
        "Samples, which we classified as \"not yummy\", satisfying:\n",
    
        "\n",
    
        "    1.1 * alcohol_content + 4 * bitterness + 1.5 * darkness + 1.8 * fruitiness < threshold\n",
    
        "    \n",
    
        "and samples, which we classfied as \"yummy class\", satisfying:\n",
    
        "\n",
    
        "    1.1 * alcohol_content + 4 * bitterness + 1.5 * darkness + 1.8 * fruitiness > threshold\n",
    
        "    \n",
        "are located on different sides of this plane.\n",
        "\n",
    
        "Again: **Here, the classifier separates the 4D space into two parts; the separation boundary is a 3D hyperplane in this space.**"
    
    schmittu's avatar
    schmittu committed
      },
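  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make this concrete, here is a minimal sketch (reusing the guessed weights and the threshold of 10.5 from above) that checks on which side of this hyperplane each beer sample lies:"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "threshold = 10.5\n",
    "\n",
    "# signed score: positive = \"yummy\" side of the hyperplane, negative = \"not yummy\" side\n",
    "side = (1.1 * beer_data[\"alcohol_content\"] + 4 * beer_data[\"bitterness\"]\n",
    "        + 1.5 * beer_data[\"darkness\"] + 1.8 * beer_data[\"fruitiness\"]) - threshold\n",
    "\n",
    "print(\"samples on the 'yummy' side:    \", np.sum(side > 0))\n",
    "print(\"samples on the 'not yummy' side:\", np.sum(side < 0))"
   ]
  },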
      {
       "cell_type": "markdown",
    
       "metadata": {},
    
       "source": [
    
        "<div class=\"alert alert-block alert-info\">\n",
        "\n",
    
        "<p style=\"font-size: 150%;\"><i class=\"fa fa-info-circle\"></i>&nbsp;About 2D examples</p>\n",
    
        "\n",
    
        "<p>For the sake of simplicity and visualisation we continue with 2 dimensional examples.</p>\n",
    
        "\n",
    
        "<p>It is clear that such examples only represent very small subset of realistic ML scenarios. But most concepts can be illustrated in 2D or 3D without loss of generality.</p>\n",
    
        "\n",
    
        "<p>The examples also might look artificial, this is because they highlight specific aspects or problems. At the end, general classifiers should work on all kind of problems.</p>\n",
        "</div>\n",
        "\n"
    
    schmittu's avatar
    schmittu committed
      },
      {
       "cell_type": "markdown",
    
       "metadata": {},
    
       "source": [
    
        "## Non-linear decision surfaces"
    
    schmittu's avatar
    schmittu committed
      },
      {
       "cell_type": "markdown",
    
       "metadata": {},
    
       "source": [
    
        "The next example data set can not be classified by a straight line, the decision line is curved:\n"
    
    schmittu's avatar
    schmittu committed
      },
      {
       "cell_type": "code",
    
       "metadata": {},
       "outputs": [],
    
       "source": [
    
        "df = pd.read_csv(\"data/circle.csv\")\n",
    
        "df.head(3)"
    
    schmittu's avatar
    schmittu committed
      },
      {
       "cell_type": "code",
    
       "metadata": {},
       "outputs": [],
    
       "source": [
    
        "xv = df[\"x\"]\n",
        "yv = df[\"y\"]\n",
    
        "\n",
    
        "colors = [[\"steelblue\", \"chocolate\"][i] for i in df[\"label\"]]\n",
    
        "plt.figure(figsize=(5, 5))\n",
    
        "plt.xlim([-2, 2])\n",
        "plt.ylim([-2, 2])\n",
    
        "plt.scatter(xv, yv, color=colors, marker=\"o\");"
    
    schmittu's avatar
    schmittu committed
      },
      {
       "cell_type": "markdown",
    
       "metadata": {},
    
       "source": [
    
        "In this case a suitable decision surface is a (closed) curve - it looks like a circle. A hand-crafted classifier could classify new points based on their distance to the center.\n",
    
        "\n",
    
        "It should be clear that a **linear classifier is not suitable for this problem**!"
    
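  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of such a hand-crafted, distance-based classifier (the radius threshold of 1 is a guess based on the plot, and we assume that label `1` marks the inner point cloud - if not, flip the comparison):"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def classify_by_distance(x, y, radius=1.0):\n",
    "    # points close to the origin are assigned to the inner class\n",
    "    return np.sqrt(x ** 2 + y ** 2) < radius\n",
    "\n",
    "predicted = classify_by_distance(df[\"x\"], df[\"y\"])\n",
    "print(\"fraction classified correctly:\", np.mean(predicted == df[\"label\"]))"
   ]
  },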
      {
       "cell_type": "markdown",
    
       "metadata": {},
    
       "source": [
        "## Feature engineering\n",
    
        "\n",
    
        "To improve ML performance we can try to create new feature by transformation of existing features. This process is called **feature engineering**.\n",
        "\n",
    
        "This is actually the opposite of \"garbage in / garbage out\".\n",
    
        "\n",
    
        "<a href='https://imgflip.com/i/303whl'><img src=\"./images/303whl.jpg\" width=50%/></a><br/><a href='https://imgflip.com/gif-maker'> via Imgflip GIF Maker</a> \n",
    
        "\n",
        "The general idea is to include / extract usefull information based on domain knowledge. \n",
        "\n",
    
        "E.g. to classify spam emails you can count the number of words written in capital letters only or group countries and add the group number."
    
      },
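  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a tiny illustration of the spam example, a hypothetical engineered feature counting all-caps words could look like this (a sketch, not taken from a real spam filter):"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "outputs": [],
   "source": [
    "def count_uppercase_words(text):\n",
    "    # count words written in capital letters only, e.g. \"BUY\" or \"FREE\"\n",
    "    return sum(1 for word in text.split() if word.isupper())\n",
    "\n",
    "print(count_uppercase_words(\"BUY these pills NOW\"))  # prints 2"
   ]
  },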
      {
       "cell_type": "markdown",
    
       "metadata": {},
    
        "### An example for feature engineering"
    
      },
      {
       "cell_type": "markdown",
    
       "metadata": {},
    
   "source": [
        "In the previous example we see that the distance of the origin of a point could be used to implement a classifier.\n",
        "\n",
        "Computing the distance of a point to the origin (0, 0) using the euclidian formula includes terms $x^2$ and $y^2$. \n",
        "\n",
        "Let us create a scatter plot for this transformation:"
    
    schmittu's avatar
    schmittu committed
      },
      {
       "cell_type": "code",
    
       "metadata": {},
       "outputs": [],
    
       "source": [
    
        "plt.figure(figsize=(5, 5))\n",
    
        "plt.scatter(xv ** 2, yv ** 2, color=colors, marker='o');\n",
        "plt.xlabel(\"$x^2$\")\n",
        "plt.ylabel(\"$y^2$\");"
    
    schmittu's avatar
    schmittu committed
      },
      {
       "cell_type": "markdown",
    
       "metadata": {},
    
       "source": [
    
        "As you can see both sets can be separated by a line now!"
    
    schmittu's avatar
    schmittu committed
      },
      {
       "cell_type": "markdown",
    
       "metadata": {},
    
       "source": [
    
        "### Another example for feature engineering"
    
    schmittu's avatar
    schmittu committed
      },
      {
       "cell_type": "markdown",
    
       "metadata": {},
    
       "source": [
        "The so called \"xor-problem\" is a typical benchmark problem for machine learning. The following example illustrates this problem:"
    
    schmittu's avatar
    schmittu committed
      },
      {
       "cell_type": "code",
    
       "metadata": {},
       "outputs": [],
    
       "source": [
    
        "xor = pd.read_csv(\"data/xor.csv\")\n",
    
        "xor.head()"
    
    schmittu's avatar
    schmittu committed
      },
      {
       "cell_type": "code",
    
       "metadata": {},
       "outputs": [],
    
       "source": [
        "xv = xor[\"x\"]\n",
        "yv = xor[\"y\"]\n",
        "\n",
    
        "colors = [[\"steelblue\", \"chocolate\"][i] for i in xor[\"label\"]]\n",
    
        "plt.figure(figsize=(5, 5))\n",
        "plt.xlim([-2, 2])\n",
        "plt.ylim([-2, 2])\n",
    
        "plt.plot([0, 0], [-2, 2], \"k:\")\n",
        "plt.plot([-2, 2], [0, 0], \"k:\")\n",
    
        "plt.title(\"Blue points are labeled False\")\n",
    
        "plt.scatter(xv, yv, color=colors, marker=\"o\");"
    
    schmittu's avatar
    schmittu committed
      },
      {
       "cell_type": "markdown",
    
       "metadata": {},
    
       "source": [
        "Again, this example data set can not be separated by a line. But we see that points where the sign of x and y are the same appear to form one class, and point with different signs for x and y belong to the other class.\n",
        "\n",
    
        "Let's engineer a new numerical feature corresponding to a descriptive feature \"x and y have the same sign\". How? \n",
        "\n",
        "Here we can use the fact that the product of two numbers is postive if and only if both numbers have the same sign.\n",
    
        "\n",
        "So lets plot a histogram over `x * y`:"
    
    schmittu's avatar
    schmittu committed
      },
      {
       "cell_type": "code",
    
       "metadata": {},
       "outputs": [],
    
       "source": [
        "products = xor[\"x\"] * xor[\"y\"]\n",
        "\n",
        "features_class_false = products[~xor[\"label\"]]\n",
    
        "features_class_true = products[xor[\"label\"]]\n",
    
        "\n",
    
        "plt.hist(features_class_false,  bins=30, color=\"steelblue\", alpha=.5, histtype=\"stepfilled\")\n",
        "plt.hist(features_class_true,  bins=30, color=\"chocolate\", alpha=.5, histtype=\"stepfilled\");"
    
    schmittu's avatar
    schmittu committed
      },
      {
       "cell_type": "markdown",
    
       "metadata": {},
    
       "source": [
    
        "Having such feature a simple classifier could just introduce a threshold of 0 to distinguish both classes."
    
    schmittu's avatar
    schmittu committed
      },
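  {
   "cell_type": "code",
   "metadata": {},
   "outputs": [],
   "source": [
    "# engineered feature: a positive product means x and y have the same sign\n",
    "predicted = (xor[\"x\"] * xor[\"y\"]) > 0\n",
    "\n",
    "print(\"fraction classified correctly:\", (predicted == xor[\"label\"]).mean())"
   ]
  },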
      {
       "cell_type": "markdown",
    
       "metadata": {},
    
       "source": [
    
        "### Feature engineering HOWTO\n",
    
        "Manual feature engineering is difficult. It requires understanding of data to extract useful features.\n",
    
        "\n",
    
        "However, feature engineering can boost the performance of a classifier significantly.\n",
    
        "\n",
        "Examples:\n",
        "\n",
    
        "- early classifiers to detect nudity in images used color histograms of full image and image patches (plus more features) to create suitable features.\n",
    
        "\n",
    
        "\n",
        "- spam classifieries depend on choice of dictionary, or counting words only in capital cases, or counting words with special characters (like \"pill$\")\n",
        "\n",
    
        "\n",
    
        "- to distinguish background noise from speach audio samples, the frequency distribution might help, as well as standard deviation, or a histogram, of loudness/energy of a sample.\n",
    
        "\n",
    
        "\n",
    
        "- to classify DNA sequences, n-gram histograms (n>=1) can be benefitial.\n",
    
        "\n",
    
        "\n",
    
        "- geopolitical data can be enhanced from a feature \"state\" by extra features of \"political system\" and/or \"gross national product (GNP)\".\n",
    
        "\n",
    
        "\n",
    
        "- sales data can be enhanced from a date feature by an extra feature \"is weekday\"."
    
    schmittu's avatar
    schmittu committed
      },
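  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a minimal sketch of the n-gram idea from the list above, here is how 2-gram counts of a toy DNA sequence could be computed (the sequence is made up for illustration):"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import Counter\n",
    "\n",
    "sequence = \"ATGCGATACGTT\"  # toy example sequence\n",
    "n = 2\n",
    "\n",
    "# all overlapping n-grams of the sequence\n",
    "ngrams = [sequence[i:i + n] for i in range(len(sequence) - n + 1)]\n",
    "print(Counter(ngrams))"
   ]
  },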
    
       "metadata": {},
    
        "Most cases have higher dimensions than 2 or 3 and visual inspection can be difficult. Thus, engineering features as we did in the 2D examples becomes tricky.\n",
        "\n",
        "<div class=\"alert alert-block alert-warning\"><p><i class=\"fa fa-warning\"></i>&nbsp;\n",
    
        "<span style=\"font-weight: bold; font-size: 125%;\">General recommendations for feature engineering</span>\n",
    
    "<ul>\n",
        "<li>use descriptive statistics (mean, standard deviations, higher order statistics), as well as histograms if applicable;</li>\n",
    
        "<li>polynomial features (e.g. extend <code>x, y</code> to <code>x, y, x * y, x ** 2, y ** 2</code>) (see below);</li>\n",
    
        "    <li>image classification: dig into computer vision to learn about image descriptors;</li>\n",
    
        "<li>audio classification: learn about FFT, wavelets, filter banks, power spectrum, ...;</li>\n",
        "<li>try to incorporate external data.</li>\n",
        "</ul>\n",
        "</p></div>\n",
        "\n",
        "<div class=\"alert alert-block alert-info\"><p><i class=\"fa fa-info-circle\"></i>&nbsp;\n",
    
        "Adding too many features (especially redundant features) can introduce other problems, such as, for instance, <strong>overfitting</strong> (we'll learn later about that). There are methods for selection of a subset of \"good-enough\" features (cf. <a href=\"https://scikit-learn.org/stable/modules/feature_selection.html\"><code>scikit-learn</code> feature selection module</a>).\n",
    
      },
      {
       "cell_type": "markdown",
    
       "metadata": {},
    
        "### Engineer polynomial features using `scikit-learn`\n",
    
        "*Polynomial features* are a way to (semi-)automatically engineere new non-linear features. These are all polynomial combinations of the features (up to given degree)."
    
    schmittu's avatar
    schmittu committed
      },
    
      {
       "cell_type": "markdown",
    
       "metadata": {},
    
       "source": [
    
        "In <code>scikit-learn</code> polynomial feature engineering is part of the `sklearn.preprocessing` module containing utilities for features preprocessing.\n",
        "\n",
        "<div class=\"alert alert-block alert-warning\">\n",
    
        "    <i class=\"fa fa-warning\"></i>&nbsp;<strong><code>scikit-learn</code> API</strong>\n",
    
        "\n",
        "In scikit-learn</code> preprocessing utilites have:\n",
        "<ul>\n",
    
        "    <li>a <strong><code>transform()</code></strong> method to appropriately transform data\n",
        "        \n",
        "</ul>\n",
        "\n",
        "and, if applicable<ul>\n",
    
        "    <li>a <strong><code>fit()</code></strong> and <strong><code>fit_transform()</code></strong> methods to learn the preprocessing from data or fit and transform in one step.</li>\n",
        "</ul>\n",
        "</div>"
    
    schmittu's avatar
    schmittu committed
      {
       "cell_type": "markdown",
    
       "metadata": {},
    
       "source": [
    
        "For instance, the \"sign\" feature `x * y` in the XOR dataset in the previous example is a polynomial feature of rank 2 (1+1). Let's see how to generate it among with other polynomial features up to rank 2 using `scikit-lern`:"
    
    schmittu's avatar
    schmittu committed
      },
      {
       "cell_type": "code",
    
       "metadata": {},
       "outputs": [],
    
       "source": [
        "import pandas as pd\n",
        "from sklearn.preprocessing import PolynomialFeatures\n",
        "\n",
        "# using first 10 samples from XOR data\n",
    
        "df = pd.read_csv(\"data/xor.csv\")\n",
    
        "features = df.iloc[:, :-1]\n",
        "features.head()"
    
       "metadata": {},
       "outputs": [],
    
       "source": [
    
        "preproc = PolynomialFeatures(degree=2, include_bias=False)\n",
    
        "data = preproc.fit_transform(features)\n",
    
        "pd.DataFrame(data).head()"
    
    schmittu's avatar
    schmittu committed
      },
      {
       "cell_type": "markdown",
    
       "metadata": {},
    
       "source": [
    
        "In this case \n",
    
        "- columns `0` and `1` are $x$ and $y$ from the original data set.\n",
        "- column `2` is $x^2$\n",
        "- column `3` is $x y$\n",
        "- column `4` is $y^2$.\n",
    
        "\n",
    
        "Setting `include_bias=False` omits the degree 0 polynomial, which is a constant column with value `1`. For a complete description see [docs for `sklearn.preprocessing.PolynomialFeatures`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html)."
    
    schmittu's avatar
    schmittu committed
      },
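  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To double-check which column is which, recent versions of `scikit-learn` can report the generated feature names (this should work from version 1.0 on; earlier versions used `get_feature_names` instead):"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(preproc.get_feature_names_out())"
   ]
  },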
      {
       "cell_type": "markdown",
    
       "metadata": {},
    
       "source": [
    
        "## Exercise section 2"
    
      },
      {
       "cell_type": "markdown",
    
       "metadata": {},
    
        "The following functions plot a 2D dataset points and a decision surface of classifier trained beforehand on that dataset."
    
    schmittu's avatar
    schmittu committed
      },
      {
       "cell_type": "code",
    
       "metadata": {},
       "outputs": [],
    
       "source": [
        "import numpy as np\n",
        "import matplotlib.pyplot as plt\n",
        "\n",
    
        "def plot_points(features_2d, labels, plt=plt, marker='o'):\n",
    
        "    colors = [[\"steelblue\", \"chocolate\"][i] for i in labels]\n",
    
        "    plt.scatter(features_2d[:, 0], features_2d[:, 1], color=colors, marker=marker);\n",
    
        "    \n",
    
        "def train_and_plot_decision_surface(\n",
    
        "    name, classifier, features_2d, labels, preproc=None, plt=plt, marker='o', N=200\n",
    
        "):\n",
        "\n",
        "    features_2d = np.array(features_2d)\n",
    
        "    \n",
    
        "    xmin, ymin = features_2d.min(axis=0)\n",
        "    xmax, ymax = features_2d.max(axis=0)\n",
        "\n",
        "    x = np.linspace(xmin, xmax, N)\n",
        "    y = np.linspace(ymin, ymax, N)\n",
        "    points = np.array(np.meshgrid(x, y)).T.reshape(-1, 2)\n",
        "\n",
        "    if preproc is not None:\n",
        "        points_for_classifier = preproc.fit_transform(points)\n",
        "        features_2d = preproc.fit_transform(features_2d)\n",
        "    else:\n",
        "        points_for_classifier = points\n",
        "\n",
        "    classifier.fit(features_2d, labels)\n",
        "    predicted = classifier.predict(features_2d)\n",
        "\n",
        "    if preproc is not None:\n",
        "        name += \" (w/ preprocessing)\"\n",
        "    print(name + \":\\t\", sum(predicted == labels), \"/\", len(labels), \"correct\")\n",
        "\n",
        "    classes = np.array(classifier.predict(points_for_classifier), dtype=bool)\n",
    
        "    plt.scatter(\n",
    
        "        points[~classes][:, 0],\n",
        "        points[~classes][:, 1],\n",
        "        color=\"steelblue\",\n",
        "        marker=marker,\n",
    
        "        s=1,\n",
        "        alpha=0.05,\n",
    
    schmittu's avatar
    schmittu committed
        "    plt.scatter(\n",
    
        "        points[classes][:, 0],\n",
        "        points[classes][:, 1],\n",
        "        color=\"chocolate\",\n",
        "        marker=marker,\n",
    
        "        s=1,\n",
        "        alpha=0.05,\n",
        "    )\n",
        "\n",
        "    plot_points(features_2d, labels)\n",
        "    plt.title(name)"
    
      },
      {
       "cell_type": "markdown",
    
       "metadata": {},
    
        "Note: `scikit-learn 1.0` introduced a class `sklearn.inspection.DecisionBoundaryDisplay` (see also [here](https://scikit-learn.org/stable/modules/generated/sklearn.inspection.DecisionBoundaryDisplay.html)) to plot decision surfaces.\n",
        "\n",
    
        "Let's use them to plot a decision surface of a logistic regression classifier trained on a XOR dataset:"
    
      },
      {
       "cell_type": "code",
    
       "metadata": {},
       "outputs": [],
    
       "source": [
        "from sklearn.linear_model import LogisticRegression\n",
    
        "from sklearn.preprocessing import PolynomialFeatures\n",
    
        "df = pd.read_csv(\"data/xor.csv\")\n",
    
        "features = df.iloc[:, :-1]\n",
        "labels = df.iloc[:, -1]\n",
        "\n",
    
        "clf = LogisticRegression()\n",
        "\n",
    
        "plt.figure(figsize=(6, 6))\n",
    
        "# preproc = PolynomialFeatures(2, include_bias=False)\n",
    
        "train_and_plot_decision_surface(\"Logistic regression\", clf, features, labels, preproc=None)"
    
      },
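  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For comparison, here is a minimal sketch using the built-in `DecisionBoundaryDisplay` mentioned above (assuming a sufficiently recent `scikit-learn`; the plotting options are illustrative choices):"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.inspection import DecisionBoundaryDisplay\n",
    "\n",
    "clf = LogisticRegression()\n",
    "clf.fit(features, labels)\n",
    "\n",
    "# plot the predicted class over a grid, then overlay the actual points\n",
    "disp = DecisionBoundaryDisplay.from_estimator(clf, features, response_method=\"predict\", alpha=0.3)\n",
    "disp.ax_.scatter(features[\"x\"], features[\"y\"], c=labels, edgecolor=\"k\");"
   ]
  },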
      {
       "cell_type": "markdown",
    
       "metadata": {},
    
        "### Logistic regression with polynomial features\n",
    
        "Train and plot decision surface for logistic regression classifier of XOR dataset with polynomial features engineered. What's the result and why?"
    
      },
      {
       "cell_type": "code",
    
       "metadata": {
        "tags": [
         "solution"
        ]
       },
       "outputs": [],
    
       "source": [
        "from sklearn.linear_model import LogisticRegression\n",
        "from sklearn.preprocessing import PolynomialFeatures\n",
    
        "df = pd.read_csv(\"data/xor.csv\")\n",
    
        "features = df.iloc[:, :-1]\n",
        "labels = df.iloc[:, -1]\n",
        "\n",
        "clf = LogisticRegression()\n",
        "\n",
    
        "plt.figure(figsize=(6, 6))\n",
    
        "preproc = PolynomialFeatures(2, include_bias=False)\n",
        "train_and_plot_decision_surface(\"Logistic regression\", clf, features, labels, preproc=preproc)"
    
      },
      {
       "cell_type": "markdown",
    
       "metadata": {},