02_classification.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Chapter 2: Classification"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we have learned in the previous chapter *classification* belongs to the field of *supervised learning*. In such problems the aim is to predict a category. Such categories can be \n",
    "\n",
    "- ok/not ok\n",
    "- good / bad / dont't know\n",
    "- digit 0 ... / digit 9\n",
    "- etc \n",
    "\n",
    "In this chapter we introduce the core concepts of classification."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## How could we  build  a simple classifier  ?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we look at the beer example, we can assume that the person who rated tbe beers has preferences like, \"I don't like high alcohol content\", \"I like fruity beer\", etc."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This means we could construct a score where high numbers relate to \"favorable beer\". One simple way to implement such a score is to use a weighted sum like\n",
    "\n",
    "\n",
    "     score = -0.1 * alcohol_content + 4 * bitterness + 0.8 * darkness + 1.8 * fruitiness \n",
    "\n",
    "Positive weights contribute to a heigher score and negative weights to a lower.\n",
    "\n",
    "The actual weights here are guessed and serve as an example.\n",
    "\n",
    "The size of the numbers also reflects the numerical ranges of the features: alcohol content is in the range 3 to 5.9, where as bitterness is between 0 and 1.08:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>alcohol_content</th>\n",
       "      <th>bitterness</th>\n",
       "      <th>darkness</th>\n",
       "      <th>fruitiness</th>\n",
       "      <th>is_yummy</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>225.000000</td>\n",
       "      <td>225.000000</td>\n",
       "      <td>225.000000</td>\n",
       "      <td>225.000000</td>\n",
       "      <td>225.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>4.711873</td>\n",
       "      <td>0.463945</td>\n",
       "      <td>2.574963</td>\n",
       "      <td>0.223111</td>\n",
       "      <td>0.528889</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>0.437040</td>\n",
       "      <td>0.227366</td>\n",
       "      <td>1.725916</td>\n",
       "      <td>0.117272</td>\n",
       "      <td>0.500278</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>3.073993</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>4.429183</td>\n",
       "      <td>0.281291</td>\n",
       "      <td>1.197640</td>\n",
       "      <td>0.135783</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>4.740846</td>\n",
       "      <td>0.488249</td>\n",
       "      <td>2.026548</td>\n",
       "      <td>0.242396</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>5.005170</td>\n",
       "      <td>0.631056</td>\n",
       "      <td>4.043995</td>\n",
       "      <td>0.311874</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>5.955272</td>\n",
       "      <td>1.080170</td>\n",
       "      <td>7.221285</td>\n",
       "      <td>0.535315</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       alcohol_content  bitterness    darkness  fruitiness    is_yummy\n",
       "count       225.000000  225.000000  225.000000  225.000000  225.000000\n",
       "mean          4.711873    0.463945    2.574963    0.223111    0.528889\n",
       "std           0.437040    0.227366    1.725916    0.117272    0.500278\n",
       "min           3.073993    0.000000    0.000000    0.000000    0.000000\n",
       "25%           4.429183    0.281291    1.197640    0.135783    0.000000\n",
       "50%           4.740846    0.488249    2.026548    0.242396    1.000000\n",
       "75%           5.005170    0.631056    4.043995    0.311874    1.000000\n",
       "max           5.955272    1.080170    7.221285    0.535315    1.000000"
      ]
     },
     "execution_count": 56,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "%config InlineBackend.figure_format = 'retina'\n",
    "\n",
    "# read some data\n",
    "beer_data = pd.read_csv(\"beers.csv\")\n",
    "\n",
    "beer_data.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [],
   "source": [
    "scores =( -0.1 * beer_data[\"alcohol_content\"] + 3 * beer_data[\"bitterness\"] \n",
    "          + 0.8 * beer_data[\"darkness\"] + 1.8 * beer_data[\"fruitiness\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can plot the histogram of the scores by classes:"
   ]