{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Chapter 2: Classification"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we have learned in the previous chapter, *classification* belongs to the field of *supervised learning*. In such problems the aim is to predict a category. Such categories can be\n",
    "\n",
    "- ok/not ok\n",
    "- good / bad / don't know\n",
    "- digit 0 ... / digit 9\n",
    "- etc \n",
    "\n",
    "In this chapter we introduce the core concepts of classification."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## How could we build a simple classifier?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we look at the beer example, we can assume that the person who rated the beers has preferences like \"I don't like high alcohol content\", \"I like fruity beer\", etc."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This means we could construct a score where high numbers relate to \"favorable beer\". One simple way to implement such a score is to use a weighted sum like\n",
    "\n",
    "\n",
    "     score = -0.1 * alcohol_content + 3 * bitterness + 0.8 * darkness + 1.8 * fruitiness\n",
    "\n",
    "Positive weights contribute to a higher score and negative weights to a lower one.\n",
    "\n",
    "The actual weights here are guessed and serve as an example.\n",
    "\n",
    "The size of the weights also reflects the numerical ranges of the features: alcohol content is in the range of roughly 3 to 6, whereas bitterness is between 0 and 1.08:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>alcohol_content</th>\n",
       "      <th>bitterness</th>\n",
       "      <th>darkness</th>\n",
       "      <th>fruitiness</th>\n",
       "      <th>is_yummy</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>225.000000</td>\n",
       "      <td>225.000000</td>\n",
       "      <td>225.000000</td>\n",
       "      <td>225.000000</td>\n",
       "      <td>225.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>4.711873</td>\n",
       "      <td>0.463945</td>\n",
       "      <td>2.574963</td>\n",
       "      <td>0.223111</td>\n",
       "      <td>0.528889</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>0.437040</td>\n",
       "      <td>0.227366</td>\n",
       "      <td>1.725916</td>\n",
       "      <td>0.117272</td>\n",
       "      <td>0.500278</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>3.073993</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>4.429183</td>\n",
       "      <td>0.281291</td>\n",
       "      <td>1.197640</td>\n",
       "      <td>0.135783</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>4.740846</td>\n",
       "      <td>0.488249</td>\n",
       "      <td>2.026548</td>\n",
       "      <td>0.242396</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>5.005170</td>\n",
       "      <td>0.631056</td>\n",
       "      <td>4.043995</td>\n",
       "      <td>0.311874</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>5.955272</td>\n",
       "      <td>1.080170</td>\n",
       "      <td>7.221285</td>\n",
       "      <td>0.535315</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       alcohol_content  bitterness    darkness  fruitiness    is_yummy\n",
       "count       225.000000  225.000000  225.000000  225.000000  225.000000\n",
       "mean          4.711873    0.463945    2.574963    0.223111    0.528889\n",
       "std           0.437040    0.227366    1.725916    0.117272    0.500278\n",
       "min           3.073993    0.000000    0.000000    0.000000    0.000000\n",
       "25%           4.429183    0.281291    1.197640    0.135783    0.000000\n",
       "50%           4.740846    0.488249    2.026548    0.242396    1.000000\n",
       "75%           5.005170    0.631056    4.043995    0.311874    1.000000\n",
       "max           5.955272    1.080170    7.221285    0.535315    1.000000"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "\n",
    "# read some data\n",
    "beer_data = pd.read_csv(\"beers.csv\")\n",
    "\n",
    "beer_data.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "# weighted sum of the features for every beer, using the guessed weights\n",
    "scores = (-0.1 * beer_data[\"alcohol_content\"] + 3 * beer_data[\"bitterness\"]\n",
    "          + 0.8 * beer_data[\"darkness\"] + 1.8 * beer_data[\"fruitiness\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can plot a histogram of the scores for each class:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXQAAAD8CAYAAABn919SAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvIxREBQAADd9JREFUeJzt3W+MZfVdx/H3pyykhWKp2RGRZdyNaTZpiBEyqVYMbtjSbC2BPugDNoG0FTM+sAjahECNYX1moqltotFsgIIpLip/YtM0FdIywSYUyy4gf5b+ESksQhdCDEVNEPv1wVzidjoz9885M/fOr+9XMtl7zz0757Ozu5/97bnnfCdVhSRp63vbtANIkvphoUtSIyx0SWqEhS5JjbDQJakRFrokNcJCl6RGWOiS1AgLXZIasW0zD7Z9+/bauXPnZh5Skra8w4cPv1JVc8P229RC37lzJw8//PBmHlKStrwk3xtlP0+5SFIjLHRJaoSFLkmNsNAlqREWuiQ1YmihJ7klyfEkT6zy2qeSVJLtGxNPkjSqUVbotwL7Vm5Mcg7wQeC5njNJkiYwtNCr6gHg1VVe+jPgOsDvYSdJM2Cic+hJLgNeqKrHes4jSZrQ2HeKJjkV+DTLp1tG2X8RWASYn58f93DSug4c6Pa61JJJVui/AOwCHkvyLLADOJLkZ1fbuaoOVtVCVS3MzQ0dRSBJmtDYK/Sqehz4mbeeD0p9oape6TGXJGlMo1y2eAh4ENid5FiSqzY+liRpXENX6FW1f8jrO3tLI0mamHeKSlIjLHRJaoSFLkmNsNAlqREWuiQ1wkKXpEZY6JLUCAtdkhphoUtSIyx0SWrE2MO5tEWNMkfWWbPSluYKXZIaYaFLUiMsdElqhIUuSY2w0CWpERa6JDXCQpekRljoktQIC12SGmGhS1IjhhZ6kluSHE/yxAnb/iTJ00n+Jck9Sc7Y2JiSpGFGWaHfCuxbse0+4Nyq+kXg28ANPeeSJI1paKFX1QPAqyu23VtVbw6efgPYsQHZJElj6GPa4m8Cf7vWi0kWgUWA+fn5Hg4nrW/P0oH/f3Jgrb1wuqSa0+lN0SR/ALwJ3L7WPlV1sKoWqmphbm6uy+EkSeuYeIWe5OPAJcDeqqreEkmSJjJRoSfZB1wH/HpV/Ve/kSRJkxjlssVDwIPA7iTHklwF/DlwOnBfkkeT/NUG55QkDTF0hV5V+1fZfPMGZJEkdeCdopLUCAtdkhphoUtSIyx0SWqEhS5JjbDQJakRFrokNcJCl6RGWOiS1Ig+xufqJ82oY2cdTyttKlfoktQIC12SGmGhS1IjLHRJaoSFLkmNsNAlqREWuiQ1wkKXpEZY6JLUCAtdkhoxtNCT3JLkeJInTtj200nuS/KdwY/v3tiYkqRhRlmh3wrsW7HteuCrVfUe4KuD55KkKRpa6FX1APDqis2XAbcNHt8GfKTnXJKkMU16Dv3Mqnpx8Pgl4Mye8kiSJtR5fG5VVZJa6/Uki8AiwPz8fNfDSWNZWlrntQNO+FVbJl2hfz/JWQCDH4+vtWNVHayqhapamJubm/BwkqRhJi30LwIfGzz+GPAP/cSRJE1qlMsWDwEPAruTHEtyFfDHwMVJvgN8YPBckjRFQ8+hV9X+NV7a23MWSVIH3ikqSY2w0CWpERa6JDXCQpekRljoktQIC12SGmGhS1IjLHRJaoSFLkmNsNAlqRGdx+dKaxp1Nq0zbKVeuEKXpEZY6JLUCAtdkhphoUtSIyx0SWqEhS5JjbDQJakRFrokNcJCl6RGWOiS1IhOhZ7k95I8meSJJIeSvL2vYJKk8Uxc6EnOBn4XWKiqc4GTgMv7CiZJGk/XUy7bgHck2QacCvx790iSpElMPG2xql5I8qfAc8B/A/dW1b0r90uyCCwCzM/PT3o4bYYZnHo47UjDjj/tfLA1MmpzdDnl8m7gMmAX8HPAaUmuWLlfVR2s
qoWqWpibm5s8qSRpXV1OuXwA+Leqermq/ge4G/jVfmJJksbVpdCfA34lyalJAuwFjvYTS5I0rokLvaoeAu4EjgCPDz7XwZ5ySZLG1Olb0FXVjcCNPWWRJHXgnaKS1AgLXZIaYaFLUiMsdElqhIUuSY2w0CWpERa6JDXCQpekRljoktSITneKSr1YZ77rnqWpHVraclyhS1IjLHRJaoSFLkmNsNAlqREWuiQ1wkKXpEZY6JLUCAtdkhphoUtSIyx0SWpEp0JPckaSO5M8neRokvf3FUySNJ6us1w+B3ylqj6a5BTg1B4ySZImMHGhJ3kXcCHwcYCqegN4o59YkqRxdTnlsgt4Gfh8kkeS3JTktJ5ySZLG1OWUyzbgfODqqnooyeeA64E/PHGnJIvAIsD8/HyHw/2EGXWu6xaf/7q0NO0E3Yzy5R+2T9fXN/r42jq6rNCPAceq6qHB8ztZLvgfUVUHq2qhqhbm5uY6HE6StJ6JC72qXgKeT7J7sGkv8FQvqSRJY+t6lcvVwO2DK1yeAT7RPZIkaRKdCr2qHgUWesoiSerAO0UlqREWuiQ1wkKXpEZY6JLUCAtdkhphoUtSIyx0SWqEhS5JjbDQJakRFrokNaLrLJetq5XxtDOeb6uPx50F0/4tdvzu1uEKXZIaYaFLUiMsdElqhIUuSY2w0CWpERa6JDXCQpekRljoktQIC12SGmGhS1IjOhd6kpOSPJLkS30EkiRNpo8V+jXA0R4+jySpg06FnmQH8GHgpn7iSJIm1XWF/lngOuCHPWSRJHUw8fjcJJcAx6vqcJI96+y3CCwCzM/PT3q46el7zG5js0a38njcPUsHRtpvac9o+62lsd9yzbAuK/QLgEuTPAvcAVyU5Asrd6qqg1W1UFULc3NzHQ4nSVrPxIVeVTdU1Y6q2glcDnytqq7oLZkkaSxehy5JjejlW9BV1RKw1MfnkiRNxhW6JDXCQpekRljoktQIC12SGmGhS1IjLHRJaoSFLkmNsNAlqREWuiQ1opc7RSWNbrOmPG6WUaZJtj5xcla+Bq7QJakRFrokNcJCl6RGWOiS1AgLXZIaYaFLUiMsdElqhIUuSY2w0CWpERa6JDXCQpekRkxc6EnOSXJ/kqeSPJnkmj6DSZLG02U415vAp6rqSJLTgcNJ7quqp3rKJkkaw8Qr9Kp6saqODB7/ADgKnN1XMEnSeHoZn5tkJ3Ae8NAqry0CiwDz8/OTH2TU2ZPTmtPZ+nxQNWsz/uh2PcZGZxz2+bfKX+/Ob4omeSdwF3BtVb228vWqOlhVC1W1MDc31/VwkqQ1dCr0JCezXOa3V9Xd/USSJE2iy1UuAW4GjlbVZ/qLJEmaRJcV+gXAlcBFSR4dfPxGT7kkSWOa+E3Rqvo6kB6zSJI68E5RSWqEhS5JjbDQJakRFrokNcJCl6RGWOiS1AgLXZIaYaFLUiMsdElqRC/jc2fKVplzKQ2xZ+lAb59raU9/n2sco/4ahuXzr/VoXKFLUiMsdElqhIUuSY2w0CWpERa6JDXCQpekRljoktQIC12SGmGhS1IjLHRJakSnQk+yL8m3knw3yfV9hZIkjW/iQk9yEvAXwIeA9wL7k7y3r2CSpPF0WaG/D/huVT1TVW8AdwCX9RNLkjSuLoV+NvD8Cc+PDbZJkqYgVTXZT0w+Cuyrqt8aPL8S+OWq+uSK/RaBxcHT3cC3xjzUduCViUJuLnP2y5z9Mme/Njvnz1fV3LCdusxDfwE454TnOwbbfkRVHQQOTnqQJA9X1cKkP3+zmLNf5uyXOfs1qzm7nHL5JvCeJLuSnAJcDnyxn1iSpHFNvEKvqjeTfBL4R+Ak4JaqerK3ZJKksXT6FnRV9WXgyz1lWcvEp2s2mTn7Zc5+mbNfM5lz4jdFJUmzxVv/JakRM1voW2WsQJJbkhxP8sS0s6wlyTlJ7k/yVJInk1wz7UyrSfL2JP+c5LFBzj+adqb1JDkpySNJvjTtLOtJ8mySx5M8muTh
aedZS5IzktyZ5OkkR5O8f9qZVkqye/B1fOvjtSTXTjvXW2bylMtgrMC3gYtZvmHpm8D+qnpqqsFWkeRC4HXgr6vq3GnnWU2Ss4CzqupIktOBw8BHZu3rmSTAaVX1epKTga8D11TVN6YcbVVJfh9YAH6qqi6Zdp61JHkWWKiqmb6+O8ltwD9V1U2DK+dOrar/mHautQx66gWW77/53rTzwOyu0LfMWIGqegB4ddo51lNVL1bVkcHjHwBHmcG7emvZ64OnJw8+Zm/FASTZAXwYuGnaWVqQ5F3AhcDNAFX1xiyX+cBe4F9npcxhdgvdsQIbJMlO4DzgoekmWd3gNMajwHHgvqqayZzAZ4HrgB9OO8gICrg3yeHBnduzaBfwMvD5wWmsm5KcNu1QQ1wOHJp2iBPNaqFrAyR5J3AXcG1VvTbtPKupqv+tql9i+c7j9yWZudNYSS4BjlfV4WlnGdGvVdX5LE9G/Z3BacJZsw04H/jLqjoP+E9glt87OwW4FPj7aWc50awW+khjBTS6wTnpu4Dbq+ruaecZZvDf7fuBfdPOsooLgEsH56bvAC5K8oXpRlpbVb0w+PE4cA/LpzRnzTHg2An/I7uT5YKfVR8CjlTV96cd5ESzWuiOFejR4M3Gm4GjVfWZaedZS5K5JGcMHr+D5TfFn55uqh9XVTdU1Y6q2snyn82vVdUVU461qiSnDd4IZ3AK44PAzF2RVVUvAc8n2T3YtBeYqTftV9jPjJ1ugY53im6UrTRWIMkhYA+wPckx4Maqunm6qX7MBcCVwOOD89MAnx7c6TtLzgJuG1w98Dbg76pqpi8J3ALOBO5Z/jedbcDfVNVXphtpTVcDtw8Wcc8An5hynlUN/mG8GPjtaWdZaSYvW5QkjW9WT7lIksZkoUtSIyx0SWqEhS5JjbDQJakRFrokNcJCl6RGWOiS1Ij/A5u1TdGt1ADcAAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "scores_good = scores[beer_data[\"is_yummy\"] == 1]\n",
    "scores_bad = scores[beer_data[\"is_yummy\"] == 0]\n",
    "\n",
    "plt.hist(scores_good,  bins=25, color=\"blue\", alpha=.5) # alpha makes bars translucent\n",
    "plt.hist(scores_bad, bins=25, color=\"red\", alpha=.5);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Consequence: A simple classifier could compute this score and compare it against a threshold of around 3.5 to assign a class label."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "not yummy\n",
      "yummy\n"
     ]
    }
   ],
   "source": [
    "def classify(beer_feature):\n",
    "    # weighted sum of the features of a single beer\n",
    "    score = (-0.1 * beer_feature[\"alcohol_content\"] + 3 * beer_feature[\"bitterness\"]\n",
    "             + 0.8 * beer_feature[\"darkness\"] + 1.8 * beer_feature[\"fruitiness\"])\n",
    "    if score > 3.5:\n",
    "        return \"yummy\"\n",
    "    else:\n",
    "        return \"not yummy\"\n",
    "    \n",
    "print(classify(beer_data.iloc[0]))\n",
    "print(classify(beer_data.iloc[1]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**This is how so-called linear classifiers work. The magic is in computing the weights and the final threshold so that the classifier produces good results.**\n",
    "\n",
    "*Comment*: although this seems to be a simple concept, linear classifiers can work very well, especially when the number of features is high or very high."
   ]
  },
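  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The same idea can be written in vectorized form. The following is only a minimal sketch, assuming that `beer_data` is the data frame loaded above. The names `feature_names`, `weights`, `threshold` and `predicted` are our own choices for illustration, and the weights and the threshold are the guessed values from this section, not optimized ones. The sketch computes the scores for all beers at once and reports the fraction of beers for which the thresholded score agrees with `is_yummy`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# assumption: beer_data was loaded from beers.csv in an earlier cell\n",
    "feature_names = [\"alcohol_content\", \"bitterness\", \"darkness\", \"fruitiness\"]\n",
    "weights = np.array([-0.1, 3.0, 0.8, 1.8])  # guessed weights, one per feature\n",
    "threshold = 3.5                            # guessed threshold from the histogram\n",
    "\n",
    "# one row per beer, one column per feature\n",
    "features = beer_data[feature_names].values\n",
    "\n",
    "# the weighted sum is a matrix-vector product\n",
    "scores = features @ weights\n",
    "\n",
    "# 1 means \"yummy\", 0 means \"not yummy\"\n",
    "predicted = (scores > threshold).astype(int)\n",
    "\n",
    "accuracy = (predicted == beer_data[\"is_yummy\"].values).mean()\n",
    "print(\"fraction of correctly classified beers:\", accuracy)"
   ]
  },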
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise section 1\n",
    "\n",
    "Modify the weights in the beer classifier and check if you can improve the separation in the histogram.\n",
    "\n",
    "Try the weights `[-0.05837955, 3.69479038, 0.6666397, 1.62751838]` in the beer classifier. These are the weights the `LogisticRegression` classifier in the previous script computed.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Geometrical interpretation of feature vectors\n",
    "\n",
    "If you take the values of an input feature vector, you can imagine them as the coordinates of a point in a d-dimensional space.\n",
    "\n",
    "\n",
    "E.g. if a data set consists of feature vectors of length 2, you can interpret the first feature value as an x-coordinate and the second value as a y-coordinate.\n",
    "\n",
    "The labels then group these points into different point clouds.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Example\n",
    "\n",
    "For the sake of simplicity we restrict our beer data set to two features: `alcohol_content` and `bitterness`.\n",
    "\n",
    "The following plot shows how these reduced feature vectors can be interpreted as point clouds. We color each point blue (`is_yummy` is 1) or red (`is_yummy` is 0) to indicate its class:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "xv = beer_data[\"alcohol_content\"]\n",
    "yv = beer_data[\"bitterness\"]\n",
    "\n",
    "# map class label 0 to red (\"r\") and class label 1 to blue (\"b\")\n",
    "colors = [\"rb\"[i] for i in beer_data[\"is_yummy\"]]\n",
    "\n",
    "plt.scatter(xv, yv, color=colors, marker='.');\n",
    "plt.xlabel(\"alcohol_content\")\n",
    "plt.ylabel(\"bitterness\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "What do we see here?\n",
    "\n",
    "1. The two point clouds overlap. This tells us that these two features alone do not carry enough information for a 100% separation of the classes.\n",
    "2. We could draw a line to separate most points of both clouds.\n",
    "3. Later we could use this line to make a guess for classifying a new feature vector.\n",
    "\n",
    "Ultimately, **classification is about finding a procedure to separate point clouds in an n-dimensional space.**"
   ]
  },
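  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A separating line like the one mentioned in point 2 could, for example, be obtained from a linear classifier fitted on the two features. The following is a rough sketch, assuming that scikit-learn is installed and that `beer_data` is the data frame loaded above. Using `LogisticRegression` with default settings is our choice for illustration, and all variable names are made up for this example. The fitted classifier defines the line `w0 * alcohol_content + w1 * bitterness + b = 0`, which we draw on top of the scatter plot:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "# assumption: beer_data was loaded from beers.csv in an earlier cell\n",
    "X = beer_data[[\"alcohol_content\", \"bitterness\"]].values\n",
    "y = beer_data[\"is_yummy\"].values\n",
    "\n",
    "clf = LogisticRegression()\n",
    "clf.fit(X, y)\n",
    "\n",
    "# weights and offset of the fitted linear score\n",
    "w0, w1 = clf.coef_[0]\n",
    "b = clf.intercept_[0]\n",
    "\n",
    "# two points on the line w0 * x + w1 * y + b = 0 (this assumes w1 is not zero)\n",
    "xs = np.array([X[:, 0].min(), X[:, 0].max()])\n",
    "ys = -(w0 * xs + b) / w1\n",
    "\n",
    "colors = [\"rb\"[i] for i in y]\n",
    "plt.scatter(X[:, 0], X[:, 1], color=colors, marker=\".\")\n",
    "plt.plot(xs, ys, color=\"black\")\n",
    "# keep the view on the data, even if the line leaves the plotted range\n",
    "plt.ylim(X[:, 1].min() - 0.1, X[:, 1].max() + 0.1)\n",
    "plt.xlabel(\"alcohol_content\")\n",
    "plt.ylabel(\"bitterness\");"
   ]
  },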
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next we illustrate how more features can support classification. We add the `darkness` feature as a third dimension.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "