{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-block alert-warning\">\n",
"<i class=\"fa fa-warning\"></i> <strong>Built-in documentation</strong>\n",
"\n",
"If you want to learn more about <code>LogisticRegression</code> you can use <code>help(LogisticRegression)</code> or <code>?LogisticRegression</code> to see the related documentation. The latter version works only in Jupyter Notebooks (or in the IPython shell).\n",
"</div>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-block alert-warning\">\n",
" <i class=\"fa fa-warning\"></i> <strong><code>scikit-learn</code> API</strong>\n",
"In <code>scikit-learn</code> all classifiers have:\n",
"<ul>\n",
" <li>a <strong><code>fit()</code></strong> method to learn from data, and</li>\n",
"  <li>a subsequent <strong><code>predict()</code></strong> method for predicting classes from input features.</li>\n",
"</ul>\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sanity check: can't predict if not fitted (trained)\n",
"classifier.predict(input_features)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Fit\n",
"classifier.fit(input_features, labels)\n",
"\n",
"# Predict\n",
"predicted_labels = classifier.predict(input_features)\n",
"print(predicted_labels.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we've just re-classified our training data. Let's check our result with a few examples:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for i in range(3):\n",
"    print(labels[i], \"predicted as\", predicted_labels[i])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What, \"0 predicted as 1\"? This looks suspicious! Let's count how many labels were predicted correctly:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(len(labels), \"examples\")\n",
"print(sum(predicted_labels == labels), \"labeled correctly\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-block alert-info\">\n",
"<i class=\"fa fa-info-circle\"></i>\n",
"<code>predicted_labels == labels</code> evaluates to a vector of Boolean values (<code>True</code> or <code>False</code>). When used as numbers, Python treats <code>True</code> as <code>1</code> and <code>False</code> as <code>0</code>, so <code>sum(...)</code> simply counts the correctly predicted labels.\n",
"</div>"
]
},
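{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of this counting trick, using small hypothetical example arrays (not the beer data), assuming `numpy` is available:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"predicted = np.array([0, 1, 1, 0])\n",
"actual = np.array([0, 1, 0, 0])\n",
"\n",
"# elementwise comparison yields a Boolean vector:\n",
"print(predicted == actual)\n",
"\n",
"# True counts as 1, False as 0, so this counts the matches (here: 3):\n",
"print(sum(predicted == actual))"
]
},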
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div style=\"font-weight: bold; font-size: 200%;\">What happened?</div>\n",
"Why weren't all labels predicted correctly?\n",
"\n",
"Neither `Python` nor `scikit-learn` is broken. What we observed above is very typical for machine-learning applications.\n",
"\n",
"Possible reasons:\n",
"- we have incomplete information: other features of beer which also contribute to the rating (like \"maltiness\") were not measured or cannot be measured.\n",
"- the classifier we used might not be suitable for the given problem.\n",
"- noise in the data, such as incorrectly assigned labels, also affects results.\n",
"\n",
"**Finding sufficient features and clean data is crucial for the performance of ML algorithms!**\n",
"\n",
"Another important requirement is clean data: input features might be corrupted by flawed entries, and feeding such data into a ML algorithm will usually reduce performance."
]
},
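{
"cell_type": "markdown",
"metadata": {},
"source": [
"To illustrate the point about noisy labels, here is a small sketch on synthetic data (not the beer data; `make_classification` only generates a hypothetical example): flipping a fraction of the labels before training typically lowers the \"re-classification\" score."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"X, y = make_classification(n_samples=200, random_state=0)\n",
"\n",
"# flip 20% of the labels to simulate incorrectly assigned labels\n",
"rng = np.random.RandomState(0)\n",
"y_noisy = y.copy()\n",
"flipped = rng.choice(len(y), size=len(y) // 5, replace=False)\n",
"y_noisy[flipped] = 1 - y_noisy[flipped]\n",
"\n",
"for name, labels_ in [(\"clean\", y), (\"noisy\", y_noisy)]:\n",
"    clf = LogisticRegression().fit(X, labels_)\n",
"    print(name, sum(clf.predict(X) == labels_), \"of\", len(labels_), \"labeled correctly\")"
]
},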
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Compare with alternative machine learning method from `scikit-learn`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, using the previously loaded and prepared beer data, train a different `scikit-learn` classifier, the so-called **Support Vector Classifier** `SVC`, and evaluate its \"re-classification\" performance again.\n",
"\n",
"<div class=\"alert alert-block alert-info\">\n",
"<i class=\"fa fa-info-circle\"></i>\n",
"<code>SVC</code> belongs to a class of algorithms named \"Support Vector Machines\" (SVMs). Again, it will be discussed in more detail in the following scripts.\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.svm import SVC\n",
"classifier = SVC()\n",
"# ..."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"solution"
]
},
"outputs": [],
"source": [
"classifier = SVC()\n",
"classifier.fit(input_features, labels)\n",
"\n",
"predicted_labels = classifier.predict(input_features)\n",
"\n",
"print(len(labels), \"examples\")\n",
"print(sum(predicted_labels == labels), \"labeled correctly\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-block alert-info\">\n",
"<i class=\"fa fa-info-circle\"></i>\n",
"Better re-classification in our example does not mean that <code>SVC</code> is better than <code>LogisticRegression</code> in all cases. The performance of a classifier strongly depends on the data set.\n",
"</div>\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Experiment with hyperparameters of ML methods"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Both `LogisticRegression` and `SVC` classifiers have a hyperparameter `C` which allows enforcing a \"simplification\" (often called **regularization**) of the resulting model. Test the beers data \"re-classification\" with different values of this parameter."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Recall: ?LogisticRegression"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"solution"
]
},
"outputs": [],
"source": [
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"# try different values of C, e.g.:\n",
"classifier = LogisticRegression(C=100)\n",
"classifier.fit(input_features, labels)\n",
"predicted_labels = classifier.predict(input_features)\n",
"print(len(labels), \"examples\")\n",
"print(sum(predicted_labels == labels), \"labeled correctly\")\n",
"print(sum(predicted_labels == labels) / len(labels) * 100, \"% labeled correctly\")"
]
},
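{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible way to compare several values systematically, assuming `input_features` and `labels` from the beer-data cells above, is a simple loop over `C`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"for C in (0.01, 0.1, 1.0, 10.0, 100.0):\n",
"    classifier = LogisticRegression(C=C)\n",
"    classifier.fit(input_features, labels)\n",
"    n_correct = sum(classifier.predict(input_features) == labels)\n",
"    print(\"C =\", C, \":\", n_correct, \"of\", len(labels), \"labeled correctly\")"
]
},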
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-block alert-warning\">\n",
"<i class=\"fa fa-warning\"></i> <strong>Classifiers have hyper-parameters</strong>\n",
"All classifiers have hyper-parameters, e.g. the <code>C</code> we have seen before. It is a coincidence that both <code>LogisticRegression</code> and <code>SVC</code> have a parameter named <code>C</code>. Some classifiers have more than one hyper-parameter, e.g. <code>SVC</code> also has a parameter <code>gamma</code>. But more about these details later.\n",
"</div>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Load and inspect the canonical Fisher's \"Iris\" data set, which is included in `scikit-learn`: see [docs for `sklearn.datasets.load_iris`](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html). What's conceptually different?\n",
"\n",
"Apply `LogisticRegression` or `SVC` classifiers. Is it easier or more difficult than classification of the beers data?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import load_iris\n",
"\n",
"data = load_iris()\n",
"\n",
"# labels as text\n",
"print(data.target_names)\n",
"\n",
"# (rows, columns) of the feature matrix:\n",
"print(data.data.shape)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# transform the scikit-learn data structure into a data frame:\n",
"df = pd.DataFrame(data.data, columns=data.feature_names)\n",
"df[\"class\"] = data.target\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import seaborn as sns\n",
"sns.set(style=\"ticks\")\n",
"\n",
"for_plot = df.copy()\n",
"\n",
"def transform_label(class_):\n",
" return data.target_names[class_]\n",
"\n",
"# seaborn does not work here if we use numeric values in the class\n",
"# column, or strings which represent numbers. To fix this we\n",
"# create textual class labels\n",
"for_plot[\"class\"] = for_plot[\"class\"].apply(transform_label)\n",
"sns.pairplot(for_plot, hue=\"class\", diag_kind=\"hist\");"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"solution"
]
},
"outputs": [],
"source": [
"features = df.iloc[:, :-1]\n",
"labels = df.iloc[:, -1]\n",
"\n",
"# use one of the classifiers from above, e.g.:\n",
"classifier = SVC()\n",
"classifier.fit(features, labels)\n",
"predicted_labels = classifier.predict(features)\n",
"\n",
"print(len(labels), \"examples\")\n",
"print(sum(predicted_labels == labels), \"labeled correctly\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (C) 2019-2021 ETH Zurich, SIS ID"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autoclose": false,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": true,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {
"height": "calc(100% - 180px)",
"left": "10px",
"top": "150px"
},
"toc_section_display": true,
"toc_window_display": true
}
},
"nbformat": 4,
"nbformat_minor": 4
}