{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Chapter 2: Classification"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we have learned in the previous chapter *classification* belongs to the field of *supervised learning*. In such problems the aim is to predict a category. Such categories can be \n",
    "\n",
    "- ok/not ok\n",
    "- good / bad / dont't know\n",
    "- digit 0 ... / digit 9\n",
    "- etc \n",
    "\n",
    "In this chapter we introduce the core concepts of classification."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## How could we  build  a simple classifier  ?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we look at the beer example, we can assume that the person who rated tbe beers has preferences like, \"I don't like high alcohol content\", \"I like fruity beer\", etc."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This means we could construct a score where high numbers relate to \"favorable beer\". One simple way to implement such a score is to use a weighted sum like\n",
    "\n",
    "\n",
    "     score = -0.1 * alcohol_content + 4 * bitterness + 0.8 * darkness + 1.8 * fruitiness \n",
    "\n",
    "Positive weights contribute to a heigher score and negative weights to a lower.\n",
    "\n",
    "The actual weights here are guessed and serve as an example.\n",
    "\n",
    "The size of the numbers also reflects the numerical ranges of the features: alcohol content is in the range 3 to 5.9, where as bitterness is between 0 and 1.08:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>alcohol_content</th>\n",
       "      <th>bitterness</th>\n",
       "      <th>darkness</th>\n",
       "      <th>fruitiness</th>\n",
       "      <th>is_yummy</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>225.000000</td>\n",
       "      <td>225.000000</td>\n",
       "      <td>225.000000</td>\n",
       "      <td>225.000000</td>\n",
       "      <td>225.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>4.711873</td>\n",
       "      <td>0.463945</td>\n",
       "      <td>2.574963</td>\n",
       "      <td>0.223111</td>\n",
       "      <td>0.528889</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>0.437040</td>\n",
       "      <td>0.227366</td>\n",
       "      <td>1.725916</td>\n",
       "      <td>0.117272</td>\n",
       "      <td>0.500278</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>3.073993</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>4.429183</td>\n",
       "      <td>0.281291</td>\n",
       "      <td>1.197640</td>\n",
       "      <td>0.135783</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>4.740846</td>\n",
       "      <td>0.488249</td>\n",
       "      <td>2.026548</td>\n",
       "      <td>0.242396</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>5.005170</td>\n",
       "      <td>0.631056</td>\n",
       "      <td>4.043995</td>\n",
       "      <td>0.311874</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>5.955272</td>\n",
       "      <td>1.080170</td>\n",
       "      <td>7.221285</td>\n",
       "      <td>0.535315</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       alcohol_content  bitterness    darkness  fruitiness    is_yummy\n",
       "count       225.000000  225.000000  225.000000  225.000000  225.000000\n",
       "mean          4.711873    0.463945    2.574963    0.223111    0.528889\n",
       "std           0.437040    0.227366    1.725916    0.117272    0.500278\n",
       "min           3.073993    0.000000    0.000000    0.000000    0.000000\n",
       "25%           4.429183    0.281291    1.197640    0.135783    0.000000\n",
       "50%           4.740846    0.488249    2.026548    0.242396    1.000000\n",
       "75%           5.005170    0.631056    4.043995    0.311874    1.000000\n",
       "max           5.955272    1.080170    7.221285    0.535315    1.000000"
      ]
     },
     "execution_count": 56,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "%config InlineBackend.figure_format = 'retina'\n",
    "\n",
    "# read some data\n",
    "beer_data = pd.read_csv(\"beers.csv\")\n",
    "\n",
    "beer_data.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [],
   "source": [
    "scores =( -0.1 * beer_data[\"alcohol_content\"] + 3 * beer_data[\"bitterness\"] \n",
    "          + 0.8 * beer_data[\"darkness\"] + 1.8 * beer_data[\"fruitiness\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can plot the histogram of the scores by classes:"
   ]