02_classification.ipynb

{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# IGNORE THIS CELL WHICH CUSTOMIZES LAYOUT AND STYLING OF THE NOTEBOOK !\n",
    "%matplotlib inline\n",
    "%config InlineBackend.figure_format = 'retina'\n",
    "import warnings\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "warnings.filterwarnings(\"ignore\", category=FutureWarning)\n",
    "warnings.filterwarnings(\"ignore\", message=\"X does not have valid feature names, but [a-zA-Z]+ was fitted with feature names\", category=UserWarning)\n",
    "                                  \n",
    "warnings.filterwarnings = lambda *a, **kw: None\n",
    "from IPython.core.display import HTML\n",
    "\n",
    "HTML(open(\"custom.html\", \"r\").read())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Chapter 2: Classification"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we have learned in the previous chapter *classification* is a machine learning problem belonging to a group of *supervised learning* problems. In classification the aim is to learn how to predict a class of a categorical label, based on set of already labelled training examples (hence, supervised). Such labels (categories) and corresponding classes can be:\n",
    "\n",
    "- sick: yes / no,\n",
    "- character: good / bad / dont't know,\n",
    "- digit: 0 / ... / 9.\n",
    "\n",
    "In this chapter we introduce the core concepts of classification."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## How to build a simple classifier?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's quickly look again at the beer example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "pd.set_option('display.precision', 3)\n",
    "\n",
    "import seaborn as sns\n",
    "sns.set(style=\"ticks\")\n",
    "\n",
    "beer_data = pd.read_csv(\"data/beers.csv\")\n",
    "\n",
    "for_plot = beer_data.copy()\n",
    "\n",
    "# fixes seaborn labels issue\n",
    "def translate_label(value):\n",
    "    return \"no\" if value == 0 else \"yes\"\n",
    "\n",
    "for_plot[\"is_yummy\"] = for_plot[\"is_yummy\"].apply(translate_label)\n",
    "\n",
    "sns.pairplot(for_plot, hue=\"is_yummy\", diag_kind=\"hist\", diag_kws=dict(alpha=.7));\n",
    "beer_data.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can assume that the person who rated these beers has preferences such as:\n",
    "* \"I don't like too low alcohol content\",\n",
    "* \"I like more fruity beers\", etc."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This means we could construct a score where high numbers relate to \"favorable beer\". One simple way to implement such a score is to use a weighted sum like:\n",
    "\n",
    "     score = 1.1 * alcohol_content + 4 * bitterness + 1.5 * darkness + 1.8 * fruitiness\n",
    "\n",
    "The actual weights here are guessed and serve as an example.\n",
    "\n",
    "The size of the numbers reflects the numerical ranges of the features: alcohol content is in the range 3 to 5.9, where as bitterness is between 0 and 1.08:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "scores =( 1.1 * beer_data[\"alcohol_content\"] + 4 * beer_data[\"bitterness\"] \n",
    "          + 1.5 * beer_data[\"darkness\"] + 1.8 * beer_data[\"fruitiness\"])\n",
    "scores.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can plot the histogram of the scores by classes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "scores_bad = scores[beer_data[\"is_yummy\"] == 0]\n",
    "scores_good = scores[beer_data[\"is_yummy\"] == 1]\n",
    "\n",
    "plt.hist(scores_bad, bins=25, color=\"steelblue\", alpha=.7) # alpha makes bars translucent\n",
    "plt.hist(scores_good,  bins=25, color=\"chocolate\", alpha=.7);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Consequence: a simple classifier could use these scores and use a threshold around 10.5 to assign a class label."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def classify(beer_feature):\n",
    "    scores = (1.1 * beer_feature[\"alcohol_content\"] + 4 * beer_feature[\"bitterness\"] \n",
    "              + 1.5 * beer_feature[\"darkness\"] + 1.8 * beer_feature[\"fruitiness\"])\n",
    "    if scores > 10.5:\n",
    "        return \"yummy\"\n",
    "    else:\n",
    "        return \"not yummy\"\n",
    "\n",
    "# check this for samples 5 .. 14:\n",
    "for i in range(5, 15):\n",
    "    is_yummy = translate_label(beer_data[\"is_yummy\"][i])\n",
    "    classified_as = classify(beer_data.iloc[i, :])\n",
    "    print(i, \n",
    "          \"is yummy?\", \"{:3s}\".format(is_yummy),\n",
    "          \".. classified as\", classified_as)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**This is how \"linear\" classifiers work. The magic is in computing the weights and the final threshold to guarantee good results.**\n",
    "\n",
    "<div class=\"alert alert-block alert-info\">\n",
    "<i class=\"fa fa-info-circle\"></i>\n",
    "Although this seems to be a simplistic concept, linear classifiers can actually work very well, especially for problems with many features (high-dimensional problems).\n",
    "</div>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise section 1\n",
    "\n",
    "- Modify the weights in the beer classifiers and check if you can improve separation in the histogram.\n",
    "\n",
    "\n",
    "- In `scikit-learn` the weights of a trained linear classifier are availble via the `coef_` attribute as a 2 dimensional `numpy` array. Extract the weights from the `LogisticRegression` classifier example from the last script and try them out in your weighted sum scoring function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "\n",
    "classifier = LogisticRegression()\n",
    "input_features = beer_data.iloc[:, :-1]\n",
    "labels = beer_data.iloc[:, -1]\n",
    "classifier.fit(input_features, labels)\n",
    "w = classifier.coef_[0]\n",
    "\n",
    "scores =( w[0] * beer_data[\"alcohol_content\"] + w[1] * beer_data[\"bitterness\"] \n",
    "          + w[2] * beer_data[\"darkness\"] + w[3] * beer_data[\"fruitiness\"])\n",
    "\n",
    "scores_bad = scores[beer_data[\"is_yummy\"] == 0]\n",
    "scores_good = scores[beer_data[\"is_yummy\"] == 1]\n",
    "\n",
    "plt.hist(scores_bad, bins=25, color=\"steelblue\", alpha=.7) # alpha makes bars translucent\n",
    "plt.hist(scores_good,  bins=25, color=\"chocolate\", alpha=.7);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Geometrical interpretation of feature vectors\n",
    "\n",
    "If you take the values of an input-feature vector you can imagine this as a point in a n-dimensional space.\n",
    "\n",
    "E.g. if a data set consists of feature vectors of length 2, you can interpret the first feature value as a x-coordinate and the second value as a y-coordinate.\n",
    "\n",
    "Classes then group points."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Example\n",
    "\n",
    "For sake of simplicity we restrict our beer data set to two features: `alcohol_content` and `bitterness`.\n",
    "\n",
    "The following plot shows how these reduced feature vectors can be interpreted as point clouds. For every feature vector we color points in green or red to indicate the according classes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "xv = beer_data[\"alcohol_content\"]\n",
    "yv = beer_data[\"bitterness\"]\n",
    "\n",
    "colors = [[\"steelblue\", \"chocolate\"][i] for i in beer_data[\"is_yummy\"]]\n",
    "\n",
    "plt.scatter(xv, yv, color=colors, marker='o');\n",
    "plt.xlabel(\"alcohol_content\")\n",
    "plt.ylabel(\"bitterness\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "What do we see here ?\n",
    "\n",
    "1. Both point clouds overlap, this tells us that the two features lack information for a 100% separation of classes. \n",
    "2. We could draw a line to separate most points of both clouds.\n",
    "3. Later we could use this line to make a guess for classifying a new feature vector."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-warning\">\n",
    "<p><i class=\"fa fa-warning\"></i>&nbsp;\n",
    "Eventually, classification is about finding a procedure to separate point clouds in a n-dimensional space.\n",
    "</p>\n",
    "</div>\n",
    "\n",
    "<img src=\"./images/303vuc.jpg\" width=50%/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we illustrate how more features can support classification. We add the `darkness` feature as a third dimension.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from mpl_toolkits.mplot3d import Axes3D\n",
    "\n",
    "fig = plt.figure(figsize=(20, 7))\n",
    "\n",
    "xv = beer_data[\"alcohol_content\"]\n",
    "yv = beer_data[\"darkness\"]\n",
    "zv = beer_data[\"bitterness\"]\n",
    "\n",
    "colors = [[\"steelblue\", \"chocolate\"][i] for i in beer_data[\"is_yummy\"]]\n",
    "\n",
    "def plot3d(ax):\n",
    "    ax.scatter(xv, yv, zv, c=colors, marker='o') \n",
    "    \n",
    "    ax.set_xlabel('alcohol_content')\n",
    "    ax.set_ylabel('darkness')\n",
    "    ax.set_zlabel('bitterness');\n",
    "\n",
    "ax = fig.add_subplot(121, projection='3d')\n",
    "\n",
    "plot3d(ax)\n",
    "ax.view_init(3, 270)\n",
    "\n",
    "ax = fig.add_subplot(122, projection='3d')\n",
    "plot3d(ax)\n",
    "ax.view_init(3, 0);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first view is very similar to the scatter plot before as we don't see the effects of the third feature. \n",
    "\n",
    "The second view shows the same cube rotated by 90˚ horizontally. We see that the new dimension adds extra information which could improve separation by separating more points.\n",
    "\n",
    "Geometrically, the 1D line, which could separat samples in the previous example that used 2D samples, would be now a 2D plane. It would still look like a line in the first view, but rotating it using the third dimensions could separate more points.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Decision surfaces\n",
    "\n",
    "The concept of decision surfaces is crucial in classification.\n",
    "\n",
    "Lets start with an easy to visualize 2D features space."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Decision lines\n",
    "\n",
    "For a score based on a weighted sum for two features `x` and `y` the equation\n",
    "\n",
    "\n",
    "     weight_x * x + weight_y * y = threshold\n",
    "     \n",
    "\n",
    "can be rearranged to the form `y = a * x + b` and thus defines a line in 2D space. Points fulfilling\n",
    "\n",
    "     weight_x * x + weight_y * y < threshold\n",
    "      \n",
    "      \n",
    "vs\n",
    "\n",
    "     weight_x * x + weight_y * y > threshold\n",
    "      \n",
    "\n",
    "are located on opposite sides of this line. Such a classifier thus determines a line which separates the feature space in two parts according to the two classes."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Lets visualize this! \n",
    "\n",
    "We \n",
    "\n",
    "1. define a decision line (using weights and threshold),\n",
    "2. create random 2D samples,\n",
    "3. compute scores for the samples,\n",
    "4. split points according to their score compared to the threshold,\n",
    "5. plot samples in different colors,\n",
    "6. plot decision line.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# define a decision line\n",
    "wx = -.5\n",
    "wy = 1.2\n",
    "weights = np.array((wx, wy))\n",
    "threshold = 0.2\n",
    "\n",
    "# create random 2D samples\n",
    "points_2d = np.random.random((200, 2))\n",
    "\n",
    "# compute scores for the samples\n",
    "scores = points_2d  @ weights # (samples matrix) x (weights vector) = (scores vector)\n",
    "\n",
    "# split points according to their score compared to the threshold \n",
    "above_points = points_2d[scores > threshold]\n",
    "below_points = points_2d[scores < threshold]\n",
    "\n",
    "# plot samples in different colors\n",
    "plt.scatter(above_points[:, 0], above_points[:, 1], label=\"above\",\n",
    "            color=\"steelblue\", marker=\"o\")\n",
    "plt.scatter(below_points[:, 0], below_points[:, 1], label=\"below\",\n",
    "            color=\"chocolate\", marker=\"o\")\n",
    "\n",
    "# plot decision line\n",
    "x = np.linspace(-.01, 1.01, 2)\n",
    "y = threshold / weights[1] - weights[0] / weights[1] * x\n",
    "plt.plot(x, y, color='k', linestyle=':')\n",
    "plt.legend();"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Decision (hyper)plane"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For 3D samples a linear classifiers separates into classes by a 2D plane, and in general, for `n` dimensions we get `n-1` dimensional hyperplanes."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's visualize a decision plane the same way we did visualize the line."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "from mpl_toolkits.mplot3d import Axes3D\n",
    "\n",
    "# define a decision line\n",
    "wx = -.5\n",
    "wy = 1.2\n",
    "wz = .4\n",
    "weights = np.array((wx, wy, wz))\n",
    "threshold = 0.3\n",
    "\n",
    "# create random 3D samples\n",
    "points_3d = np.random.random((200, 3))\n",
    "\n",
    "# compute scores for the samples\n",
    "scores = points_3d @ weights   # (samples matrix) x (weights vector) = (scores vector)\n",
    "\n",
    "# split points according to their score compared to the threshold \n",
    "above_points = points_3d[scores > threshold]\n",
    "below_points = points_3d[scores < threshold]\n",
    "\n",
    "# plot samples in different colors\n",
    "fig = plt.figure(figsize=(10, 10))\n",
    "ax = fig.add_subplot(111, projection='3d')\n",
    "ax.set_xlabel('x')\n",
    "ax.set_ylabel('y')\n",
    "ax.set_zlabel('z');\n",
    "ax.set_xlim(0,1)\n",
    "ax.set_ylim(0,1)\n",
    "ax.set_zlim(0,1)\n",
    "\n",
    "sc_above = ax.scatter(above_points[:, 0], above_points[:, 1], above_points[:, 2],\n",
    "                      color=\"steelblue\", marker=\"o\")\n",
    "sc_below = ax.scatter(below_points[:, 0], below_points[:, 1], below_points[:, 2],\n",
    "                      color=\"chocolate\", marker=\"o\")\n",
    "ax.legend([sc_above, sc_below], ['above', 'below'], numpoints=1)\n",
    "\n",
    "# plot decision plane\n",
    "x = np.linspace(-1, 2, 50)\n",
    "y = np.linspace(-1, 2, 50)\n",
    "xx, yy = np.meshgrid(x, y)\n",
    "zz = (threshold - weights[0] * xx - weights[1] * yy) / weights[2]\n",
    "zz_cut = np.copy(zz) # don't plot points outside of the [0,1] range\n",
    "zz_cut[zz < 0.2] = np.nan\n",
    "zz_cut[zz > .8] = np.nan\n",
    "ax.plot_wireframe(xx, yy, zz_cut, color='k', linestyle=':', alpha=.5)\n",
    "\n",
    "# for readability, the view angles are chosen so that the 2D separation plane\n",
    "# looks like a line\n",
    "ax.view_init(20, 210);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Example\n",
    "\n",
    "In the initil beer example, features satisfying\n",
    "\n",
    "    1.1 * alcohol_content + 4 * bitterness + 1.5 * darkness + 1.8 * fruitiness == threshold\n",
    "    \n",
    "are located on a 3D hyperplane in a 4D features space.\n",
    "\n",
    "Samples, which we classified as \"not yummy\", satisfying:\n",
    "\n",
    "    1.1 * alcohol_content + 4 * bitterness + 1.5 * darkness + 1.8 * fruitiness < threshold\n",
    "    \n",
    "and samples, which we classfied as \"yummy class\", satisfying:\n",
    "\n",
    "    1.1 * alcohol_content + 4 * bitterness + 1.5 * darkness + 1.8 * fruitiness > threshold\n",
    "    \n",
    "are located on different sides of this plane.\n",
    "\n",
    "Again: **Here, the classifier separates the 4D space into two parts; the separation boundary is a 3D hyperplane in this space.**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-info\">\n",
    "\n",
    "<p style=\"font-size: 150%;\"><i class=\"fa fa-info-circle\"></i>&nbsp;About 2D examples</p>\n",
    "\n",
    "<p>For the sake of simplicity and visualisation we continue with 2 dimensional examples.</p>\n",
    "\n",
    "<p>It is clear that such examples only represent very small subset of realistic ML scenarios. But most concepts can be illustrated in 2D or 3D without loss of generality.</p>\n",
    "\n",
    "<p>The examples also might look artificial, this is because they highlight specific aspects or problems. At the end, general classifiers should work on all kind of problems.</p>\n",
    "</div>\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Non-linear decision surfaces"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The next example data set can not be classified by a straight line, the decision line is curved:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv(\"data/circle.csv\")\n",
    "df.head(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "xv = df[\"x\"]\n",
    "yv = df[\"y\"]\n",
    "\n",
    "colors = [[\"steelblue\", \"chocolate\"][i] for i in df[\"label\"]]\n",
    "plt.figure(figsize=(5, 5))\n",
    "plt.xlim([-2, 2])\n",
    "plt.ylim([-2, 2])\n",
    "plt.scatter(xv, yv, color=colors, marker=\"o\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this case a suitable decision surface is a (closed) curve - it looks like a circle. A hand-crafted classifier could classify new points based on their distance to the center.\n",
    "\n",
    "It should be clear that a **linear classifier is not suitable for this problem**!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Feature engineering\n",
    "\n",
    "To improve ML performance we can try to create new feature by transformation of existing features. This process is called **feature engineering**.\n",
    "\n",
    "This is actually the opposite of \"garbage in / garbage out\".\n",
    "\n",
    "<img src=\"./images/303whl.jpg\" width=50% title=\"made at imgflip.com\"/>\n",
    "\n",
    "The general idea is to include / extract usefull information based on domain knowledge. \n",
    "\n",
    "E.g. to classify spam emails you can count the number of words written in capital letters only or group countries and add the group number."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### An example for feature engineering"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the previous example we see that the distance of the origin of a point could be used to implement a classifier.\n",
    "\n",
    "Computing the distance of a point to the origin (0, 0) using the euclidian formula includes terms $x^2$ and $y^2$. \n",
    "\n",
    "Let us create a scatter plot for this transformation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure(figsize=(5, 5))\n",
    "plt.scatter(xv ** 2, yv ** 2, color=colors, marker='o');\n",
    "plt.xlabel(\"$x^2$\")\n",
    "plt.ylabel(\"$y^2$\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see both sets can be separated by a line now!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Another example for feature engineering"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The so called \"xor-problem\" is a typical benchmark problem for machine learning. The following example illustrates this problem:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "xor = pd.read_csv(\"data/xor.csv\")\n",
    "xor.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "xv = xor[\"x\"]\n",
    "yv = xor[\"y\"]\n",
    "\n",
    "colors = [[\"steelblue\", \"chocolate\"][i] for i in xor[\"label\"]]\n",
    "plt.figure(figsize=(5, 5))\n",
    "plt.xlim([-2, 2])\n",
    "plt.ylim([-2, 2])\n",
    "plt.plot([0, 0], [-2, 2], \"k:\")\n",
    "plt.plot([-2, 2], [0, 0], \"k:\")\n",
    "plt.title(\"Blue points are labeled False\")\n",
    "plt.scatter(xv, yv, color=colors, marker=\"o\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Again, this example data set can not be separated by a line. But we see that points where the sign of x and y are the same appear to form one class, and point with different signs for x and y belong to the other class.\n",
    "\n",
    "Let's engineer a new numerical feature corresponding to a descriptive feature \"x and y have the same sign\". How? \n",
    "\n",
    "Here we can use the fact that the product of two numbers is postive if and only if both numbers have the same sign.\n",
    "\n",
    "So lets plot a histogram over `x * y`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "products = xor[\"x\"] * xor[\"y\"]\n",
    "\n",
    "features_class_false = products[~xor[\"label\"]]\n",
    "features_class_true = products[xor[\"label\"]]\n",
    "\n",
    "plt.hist(features_class_false,  bins=30, color=\"steelblue\", alpha=.5, histtype=\"stepfilled\")\n",
    "plt.hist(features_class_true,  bins=30, color=\"chocolate\", alpha=.5, histtype=\"stepfilled\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Having such feature a simple classifier could just introduce a threshold of 0 to distinguish both classes."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Feature engineering HOWTO\n",
    "\n",
    "Manual feature engineering is difficult. It requires understanding of data to extract useful features.\n",
    "\n",
    "However, feature engineering can boost the performance of a classifier significantly.\n",
    "\n",
    "Examples:\n",
    "\n",
    "- early classifiers to detect nudity in images used color histograms of full image and image patches (plus more features) to create suitable features.\n",
    "\n",
    "\n",
    "- spam classifieries depend on choice of dictionary, or counting words only in capital cases, or counting words with special characters (like \"pill$\")\n",
    "\n",
    "\n",
    "- to distinguish background noise from speach audio samples, the frequency distribution might help, as well as standard deviation, or a histogram, of loudness/energy of a sample.\n",
    "\n",
    "\n",
    "- to classify DNA sequences, n-gram histograms (n>=1) can be benefitial.\n",
    "\n",
    "\n",
    "- geopolitical data can be enhanced from a feature \"state\" by extra features of \"political system\" and/or \"gross national product (GNP)\".\n",
    "\n",
    "\n",
    "- sales data can be enhanced from a date feature by an extra feature \"is weekday\"."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Most cases have higher dimensions than 2 or 3 and visual inspection can be difficult. Thus, engineering features as we did in the 2D examples becomes tricky.\n",
    "\n",
    "<div class=\"alert alert-block alert-warning\"><p><i class=\"fa fa-warning\"></i>&nbsp;\n",
    "<span style=\"font-weight: bold; font-size: 125%;\">General recommendations for feature engineering</span>\n",
    "<ul>\n",
    "<li>use descriptive statistics (mean, standard deviations, higher order statistics), as well as histograms if applicable;</li>\n",
    "<li>polynomial features (e.g. extend <code>x, y</code> to <code>x, y, x * y, x ** 2, y ** 2</code>) (see below);</li>\n",
    "    <li>image classification: dig into computer vision to learn about image descriptors;</li>\n",
    "<li>audio classification: learn about FFT, wavelets, filter banks, power spectrum, ...;</li>\n",
    "<li>try to incorporate external data.</li>\n",
    "</ul>\n",
    "</p></div>\n",
    "\n",
    "<div class=\"alert alert-block alert-info\"><p><i class=\"fa fa-info-circle\"></i>&nbsp;\n",
    "Adding too many features (especially redundant features) can introduce other problems, such as, for instance, <strong>overfitting</strong> (we'll learn later about that). There are methods for selection of a subset of \"good-enough\" features (cf. <a href=\"https://scikit-learn.org/stable/modules/feature_selection.html\"><code>scikit-learn</code> feature selection module</a>).\n",
    "</p></div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Engineer polynomial features using `scikit-learn`\n",
    "\n",
    "*Polynomial features* are a way to (semi-)automatically engineere new non-linear features. These are all polynomial combinations of the features (up to given degree)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In <code>scikit-learn</code> polynomial feature engineering is part of the `sklearn.preprocessing` module containing utilities for features preprocessing.\n",
    "\n",
    "<div class=\"alert alert-block alert-warning\">\n",
    "    <i class=\"fa fa-warning\"></i>&nbsp;<strong><code>scikit-learn</code> API</strong>\n",
    "\n",
    "In scikit-learn</code> preprocessing utilites have:\n",
    "<ul>\n",
    "    <li>a <strong><code>transform()</code></strong> method to appropriately transform data\n",
    "        \n",
    "</ul>\n",
    "\n",
    "and, if applicable<ul>\n",
    "    <li>a <strong><code>fit()</code></strong> and <strong><code>fit_transform()</code></strong> methods to learn the preprocessing from data or fit and transform in one step.</li>\n",
    "</ul>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For instance, the \"sign\" feature `x * y` in the XOR dataset in the previous example is a polynomial feature of rank 2 (1+1). Let's see how to generate it among with other polynomial features up to rank 2 using `scikit-lern`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from sklearn.preprocessing import PolynomialFeatures\n",
    "\n",
    "# using first 10 samples from XOR data\n",
    "df = pd.read_csv(\"data/xor.csv\")\n",
    "features = df.iloc[:, :-1]\n",
    "features.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "preproc = PolynomialFeatures(degree=2, include_bias=False)\n",
    "data = preproc.fit_transform(features)\n",
    "pd.DataFrame(data).head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this case \n",
    "- columns `0` and `1` are $x$ and $y$ from the original data set.\n",
    "- column `2` is $x^2$\n",
    "- column `3` is $x y$\n",
    "- column `4` is $y^2$.\n",
    "\n",
    "Setting `include_bias=False` omits the degree 0 polynomial, which is a constant column with value `1`. For a complete description see [docs for `sklearn.preprocessing.PolynomialFeatures`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise section 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following functions plot a 2D dataset points and a decision surface of classifier trained beforehand on that dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "def plot_points(features_2d, labels, plt=plt, marker='o'):\n",
    "    colors = [[\"steelblue\", \"chocolate\"][i] for i in labels]\n",
    "    plt.scatter(features_2d[:, 0], features_2d[:, 1], color=colors, marker=marker);\n",
    "\n",
    "    \n",
    "def train_and_plot_decision_surface(\n",
    "    name, classifier, features_2d, labels, preproc=None, plt=plt, marker='o', N=200\n",
    "):\n",
    "\n",
    "    features_2d = np.array(features_2d)\n",
    "    \n",
    "    xmin, ymin = features_2d.min(axis=0)\n",
    "    xmax, ymax = features_2d.max(axis=0)\n",
    "\n",
    "    x = np.linspace(xmin, xmax, N)\n",
    "    y = np.linspace(ymin, ymax, N)\n",
    "    points = np.array(np.meshgrid(x, y)).T.reshape(-1, 2)\n",
    "\n",
    "    if preproc is not None:\n",
    "        points_for_classifier = preproc.fit_transform(points)\n",
    "        features_2d = preproc.fit_transform(features_2d)\n",
    "    else:\n",
    "        points_for_classifier = points\n",
    "\n",
    "    classifier.fit(features_2d, labels)\n",
    "    predicted = classifier.predict(features_2d)\n",
    "\n",
    "    if preproc is not None:\n",
    "        name += \" (w/ preprocessing)\"\n",
    "    print(name + \":\\t\", sum(predicted == labels), \"/\", len(labels), \"correct\")\n",
    "\n",
    "    classes = np.array(classifier.predict(points_for_classifier), dtype=bool)\n",
    "    plt.scatter(\n",
    "        points[~classes][:, 0],\n",
    "        points[~classes][:, 1],\n",
    "        color=\"steelblue\",\n",
    "        marker=marker,\n",
    "        s=1,\n",
    "        alpha=0.05,\n",
    "    )\n",
    "    plt.scatter(\n",
    "        points[classes][:, 0],\n",
    "        points[classes][:, 1],\n",
    "        color=\"chocolate\",\n",
    "        marker=marker,\n",
    "        s=1,\n",
    "        alpha=0.05,\n",
    "    )\n",
    "\n",
    "    plot_points(features_2d, labels)\n",
    "    plt.title(name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note: `scikit-learn 1.0` introduced a class `sklearn.inspection.DecisionBoundaryDisplay` (see also [here](https://scikit-learn.org/stable/modules/generated/sklearn.inspection.DecisionBoundaryDisplay.html)) to plot decision surfaces.\n",
    "\n",
    "Let's use them to plot a decision surface of a logistic regression classifier trained on a XOR dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.preprocessing import PolynomialFeatures\n",
    "\n",
    "df = pd.read_csv(\"data/xor.csv\")\n",
    "features = df.iloc[:, :-1]\n",
    "labels = df.iloc[:, -1]\n",
    "\n",
    "clf = LogisticRegression()\n",
    "\n",
    "plt.figure(figsize=(6, 6))\n",
    "\n",
    "# preproc = PolynomialFeatures(2, include_bias=False)\n",
    "\n",
    "train_and_plot_decision_surface(\"Logistic regression\", clf, features, labels, preproc=None)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Logistic regression with polynomial features\n",
    "\n",
    "Train and plot decision surface for logistic regression classifier of XOR dataset with polynomial features engineered. What's the result and why?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.preprocessing import PolynomialFeatures\n",
    "\n",
    "df = pd.read_csv(\"data/xor.csv\")\n",
    "features = df.iloc[:, :-1]\n",
    "labels = df.iloc[:, -1]\n",
    "\n",
    "clf = LogisticRegression()\n",
    "\n",
    "plt.figure(figsize=(6, 6))\n",
    "\n",
    "preproc = PolynomialFeatures(2, include_bias=False)\n",
    "train_and_plot_decision_surface(\"Logistic regression\", clf, features, labels, preproc=preproc)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [