{
 "cells": [
  {
   "cell_type": "code",
   "metadata": {},
   "outputs": [],
   "source": [
    "# IGNORE THIS CELL WHICH CUSTOMIZES LAYOUT AND STYLING OF THE NOTEBOOK !\n",
    "%matplotlib inline\n",
    "%config InlineBackend.figure_format = 'retina'\n",
    "import warnings\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "warnings.filterwarnings(\"ignore\", category=FutureWarning)\n",
    "warnings.filterwarnings(\"ignore\", message=\"X does not have valid feature names, but [a-zA-Z]+ was fitted with feature names\", category=UserWarning)\n",
    "                                  \n",
    "warnings.filterwarnings = lambda *a, **kw: None\n",
    "from IPython.core.display import HTML\n",
    "\n",
    "HTML(open(\"custom.html\", \"r\").read())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Chapter 2: Classification"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we learned in the previous chapter, *classification* is a machine learning problem belonging to the group of *supervised learning* problems. In classification the aim is to learn how to predict the class of a categorical label, based on a set of already labelled training examples (hence, supervised). Such labels (categories) and corresponding classes can be:\n",
    "\n",
    "- sick: yes / no,\n",
    "- character: good / bad / don't know,\n",
    "- digit: 0 / ... / 9.\n",
    "\n",
    "In this chapter we introduce the core concepts of classification."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## How to build a simple classifier?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's quickly look again at the beer example:"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "pd.set_option('display.precision', 3)\n",
    "\n",
    "import seaborn as sns\n",
    "sns.set(style=\"ticks\")\n",
    "\n",
    "beer_data = pd.read_csv(\"data/beers.csv\")\n",
    "\n",
    "for_plot = beer_data.copy()\n",
    "\n",
    "# fixes seaborn labels issue\n",
    "def translate_label(value):\n",
    "    return \"no\" if value == 0 else \"yes\"\n",
    "\n",
    "for_plot[\"is_yummy\"] = for_plot[\"is_yummy\"].apply(translate_label)\n",
    "\n",
    "sns.pairplot(for_plot, hue=\"is_yummy\", diag_kind=\"hist\", diag_kws=dict(alpha=.7));\n",
    "beer_data.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can assume that the person who rated these beers has preferences such as:\n",
    "* \"I don't like too low alcohol content\",\n",
    "* \"I like more fruity beers\", etc."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This means we could construct a score where high numbers relate to \"favorable beer\". One simple way to implement such a score is to use a weighted sum like:\n",
    "\n",
    "     score = 1.1 * alcohol_content + 4 * bitterness + 1.5 * darkness + 1.8 * fruitiness\n",
    "\n",
    "The actual weights here are guesses and serve only as an example.\n",
    "\n",
    "The size of the numbers reflects the numerical ranges of the features: alcohol content is in the range 3 to 5.9, whereas bitterness is between 0 and 1.08:"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "outputs": [],
   "source": [
    "scores = (1.1 * beer_data[\"alcohol_content\"] + 4 * beer_data[\"bitterness\"]\n",
    "          + 1.5 * beer_data[\"darkness\"] + 1.8 * beer_data[\"fruitiness\"])\n",
    "scores.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can plot the histogram of the scores by classes:"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "outputs": [],
   "source": [
    "scores_bad = scores[beer_data[\"is_yummy\"] == 0]\n",
    "scores_good = scores[beer_data[\"is_yummy\"] == 1]\n",
    "\n",
    "plt.hist(scores_bad, bins=25, color=\"steelblue\", alpha=.7) # alpha makes bars translucent\n",
    "plt.hist(scores_good,  bins=25, color=\"chocolate\", alpha=.7);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Consequence: a simple classifier could compute these scores and apply a threshold of around 10.5 to assign a class label."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "outputs": [],
   "source": [
    "def classify(beer_feature):\n",
    "    scores = (1.1 * beer_feature[\"alcohol_content\"] + 4 * beer_feature[\"bitterness\"] \n",
    "              + 1.5 * beer_feature[\"darkness\"] + 1.8 * beer_feature[\"fruitiness\"])\n",
    "    if scores > 10.5:\n",
    "        return \"yummy\"\n",
    "    else:\n",
    "        return \"not yummy\"\n",
    "\n",
    "# check this for samples 5 .. 14:\n",
    "for i in range(5, 15):\n",
    "    is_yummy = translate_label(beer_data[\"is_yummy\"][i])\n",
    "    classified_as = classify(beer_data.iloc[i, :])\n",
    "    print(i, \n",
    "          \"is yummy?\", \"{:3s}\".format(is_yummy),\n",
    "          \".. classified as\", classified_as)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**This is how \"linear\" classifiers work. The magic is in computing the weights and the final threshold to achieve good results.**\n",
    "\n",
    "<div class=\"alert alert-block alert-info\">\n",
    "<i class=\"fa fa-info-circle\"></i>\n",
    "Although this seems to be a simplistic concept, linear classifiers can actually work very well, especially for problems with many features (high-dimensional problems).\n",
    "</div>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise section 1\n",
    "\n",
    "- Modify the weights in the beer classifier and check whether you can improve the separation in the histogram.\n",
    "\n",
    "\n",
    "- In `scikit-learn` the weights of a trained linear classifier are available via the `coef_` attribute as a 2-dimensional `numpy` array. Extract the weights from the `LogisticRegression` classifier example from the last script and try them out in your weighted sum scoring function."
   ]
  },
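A minimal sketch of the second task, using synthetic stand-in data (the actual exercise would use the features from `data/beers.csv` and the classifier trained in the previous chapter); the point is the shape of `coef_` and how the extracted weights feed a manual weighted sum:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# synthetic stand-in for the four beer features and the yummy/not-yummy labels:
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))
y = (X.sum(axis=1) > 0).astype(int)

classifier = LogisticRegression()
classifier.fit(X, y)

# for a binary problem coef_ has shape (1, n_features),
# so coef_[0] yields one weight per feature:
weights = classifier.coef_[0]
print(weights.shape)  # (4,)

# manual weighted-sum score using the learnt weights plus the intercept;
# a score above 0 corresponds to the positive class:
scores = X @ weights + classifier.intercept_[0]
predicted = (scores > 0).astype(int)

# the manual scores reproduce the classifier's own predictions:
print((predicted == classifier.predict(X)).all())  # True
```

The threshold of the hand-built beer classifier (10.5) plays the role of the negated `intercept_` here: `scikit-learn` folds the threshold into the model as a bias term, so the decision boundary sits at score 0.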
  {
   "cell_type": "code",
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "outputs": [],