Mikolaj Rybinski
committed
"execution_count": null,
"source": [
"# IGNORE THIS CELL WHICH CUSTOMIZES LAYOUT AND STYLING OF THE NOTEBOOK !\n",
"%matplotlib inline\n",
"%config InlineBackend.figure_format = 'retina'\n",
"import warnings\n",
"\n",
"import matplotlib.pyplot as plt\n",
"\n",
"warnings.filterwarnings(\"ignore\", category=FutureWarning)\n",
"warnings.filterwarnings(\"ignore\", message=\"X does not have valid feature names, but [a-zA-Z]+ was fitted with feature names\", category=UserWarning)\n",
" \n",
"from IPython.core.display import HTML\n",
"\n",
"HTML(open(\"custom.html\", \"r\").read())"
"As we have learned in the previous chapter, *classification* is a machine learning problem belonging to the group of *supervised learning* problems. In classification the aim is to learn how to predict the class of a categorical label, based on a set of already labelled training examples (hence, supervised). Such labels (categories) and corresponding classes can be:\n",
"- sick: yes / no,\n",
"- character: good / bad / don't know,\n",
"- digit: 0 / ... / 9.\n",
"\n",
"In this chapter we introduce the core concepts of classification."
"## How to build a simple classifier?"
"Let's quickly look again at the beer example:"
"execution_count": null,
"import pandas as pd\n",
"import seaborn as sns\n",
"\n",
"sns.set(style=\"ticks\")\n",
"\n",
"beer_data = pd.read_csv(\"data/beers.csv\")\n",
"for_plot = beer_data.copy()\n",
"# fixes seaborn labels issue\n",
"def translate_label(value):\n",
" return \"no\" if value == 0 else \"yes\"\n",
"for_plot[\"is_yummy\"] = for_plot[\"is_yummy\"].apply(translate_label)\n",
"\n",
"sns.pairplot(for_plot, hue=\"is_yummy\", diag_kind=\"hist\", diag_kws=dict(alpha=.7));\n",
"source": [
"We can assume that the person who rated these beers has preferences such as:\n",
"* \"I don't like too low alcohol content\",\n",
"* \"I like more fruity beers\", etc."
},
{
"cell_type": "markdown",
"source": [
"This means we could construct a score where high numbers relate to \"favorable beer\". One simple way to implement such a score is to use a weighted sum like:\n",
"\n",
" score = 1.1 * alcohol_content + 4 * bitterness + 1.5 * darkness + 1.8 * fruitiness\n",
"\n",
"The actual weights here are guessed and serve as an example.\n",
"\n",
"The sizes of the weights reflect the numerical ranges of the features: alcohol content is in the range 3 to 5.9, whereas bitterness is between 0 and 1.08:"
"execution_count": null,
"scores = (1.1 * beer_data[\"alcohol_content\"] + 4 * beer_data[\"bitterness\"]\n",
"          + 1.5 * beer_data[\"darkness\"] + 1.8 * beer_data[\"fruitiness\"])\n",
"scores.shape"
"source": [
"Now we can plot the histogram of the scores by classes:"
"execution_count": null,
"source": [
"scores_bad = scores[beer_data[\"is_yummy\"] == 0]\n",
"scores_good = scores[beer_data[\"is_yummy\"] == 1]\n",
"plt.hist(scores_bad, bins=25, color=\"steelblue\", alpha=.7) # alpha makes bars translucent\n",
"plt.hist(scores_good, bins=25, color=\"chocolate\", alpha=.7);"
"Consequence: a simple classifier could compute this score and apply a threshold of around 10.5 to assign a class label."
"execution_count": null,
"def classify(beer_feature):\n",
"    score = (1.1 * beer_feature[\"alcohol_content\"] + 4 * beer_feature[\"bitterness\"]\n",
"             + 1.5 * beer_feature[\"darkness\"] + 1.8 * beer_feature[\"fruitiness\"])\n",
"    # threshold chosen from the histogram above\n",
"    if score > 10.5:\n",
"        return \"yummy\"\n",
"    return \"not yummy\"\n",
"\n",
"# check this for samples 5 .. 14:\n",
"for i in range(5, 15):\n",
"    is_yummy = translate_label(beer_data[\"is_yummy\"][i])\n",
"    classified_as = classify(beer_data.iloc[i, :])\n",
"    print(i,\n",
"          \"is yummy?\", \"{:3s}\".format(is_yummy),\n",
"          \"classified as:\", classified_as)"
"**This is how \"linear\" classifiers work. The crucial part is computing the weights and the final threshold so that the predictions are good.**\n",
"<div class=\"alert alert-block alert-info\">\n",
"<i class=\"fa fa-info-circle\"></i>\n",
"Although this seems to be a simplistic concept, linear classifiers can actually work very well, especially for problems with many features (high-dimensional problems).\n",
"</div>\n"
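The weighted-sum-plus-threshold rule above is exactly what a trained linear classifier encodes. As a minimal sketch (using synthetic data and `scikit-learn`'s `LogisticRegression`, not the beer dataset; the synthetic weights are made up for illustration), the learned `coef_` and `intercept_` reproduce the classifier's predictions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# synthetic, linearly separable data: labels come from a known weighted sum
rng = np.random.RandomState(42)
X = rng.normal(size=(100, 4))            # 100 samples, 4 features
true_w = np.array([1.1, 4.0, 1.5, 1.8])  # illustrative weights, like those guessed above
y = (X @ true_w > 0).astype(int)

clf = LogisticRegression(max_iter=1000).fit(X, y)

w = clf.coef_[0]       # learned weights, one per feature
b = clf.intercept_[0]  # learned offset

# the classifier predicts class 1 exactly when the weighted sum exceeds -b
manual = (X @ w + b > 0).astype(int)
print("matches clf.predict:", (manual == clf.predict(X)).all())
```

So "training" a linear classifier amounts to choosing the weights `w` and the offset `b`; prediction is then just the scoring-plus-threshold rule we built by hand.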
"- Modify the weights in the beer classifiers and check if you can improve separation in the histogram.\n",
"- In `scikit-learn` the weights of a trained linear classifier are available via the `coef_` attribute as a two-dimensional `numpy` array. Extract the weights from the `LogisticRegression` classifier example from the last script and try them out in your weighted sum scoring function."
},
{
"cell_type": "code",
"execution_count": null,