01_introduction.ipynb 45 KiB
        "<div class=\"alert alert-block alert-warning\">\n",
        "<i class=\"fa fa-warning\"></i>&nbsp;<strong>Built-in documentation</strong>\n",
        "\n",
        "If you want to learn more about <code>LogisticRegression</code> you can use <code>help(LogisticRegression)</code> or <code>?LogisticRegression</code> to see the related documentation. The latter form works only in Jupyter notebooks or in the IPython shell.\n",
        "</div>"
    
    schmittu's avatar
    schmittu committed
      {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
        "<div class=\"alert alert-block alert-warning\">\n",
        "    <i class=\"fa fa-warning\"></i>&nbsp;<strong><code>scikit-learn</code> API</strong>\n",
        "In <code>scikit-learn</code>, all classifiers have:\n",
        "<ul>\n",
        "    <li>a <strong><code>fit()</code></strong> method to learn from data, and</li>\n",
        "    <li>a subsequent <strong><code>predict()</code></strong> method for predicting classes from input features.</li>\n",
        "</ul>\n",
        "</div>"
       ]
      },
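The `fit()` / `predict()` convention described above can be illustrated without `scikit-learn` at all. The following is only a sketch: `MajorityClassifier` and the `toy_*` values are hypothetical names and data made up for this illustration, not part of the beer example or of any library.

```python
from collections import Counter


class MajorityClassifier:
    """Hypothetical toy classifier: always predicts the most frequent training label."""

    def fit(self, features, labels):
        # "Learning" here is just remembering the most common label.
        self.majority_label_ = Counter(labels).most_common(1)[0][0]
        return self  # returning self mirrors the scikit-learn convention

    def predict(self, features):
        # Predicting before fitting is an error, as with real classifiers.
        if not hasattr(self, "majority_label_"):
            raise RuntimeError("call fit() before predict()")
        return [self.majority_label_ for _ in features]


# Made-up data: three samples with two features each.
toy_features = [[4.2, 0.1], [5.0, 0.3], [4.8, 0.2]]
toy_labels = [1, 0, 1]

toy_classifier = MajorityClassifier()
toy_predictions = toy_classifier.fit(toy_features, toy_labels).predict(toy_features)
print(toy_predictions)
```

Real classifiers learn far more than a single label, but they are driven through the same two calls.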
    
      {
    
       "cell_type": "code",
    
       "execution_count": null,
    
       "metadata": {},
    
       "outputs": [],
    
       "source": [
        "# Sanity check: predict() fails if the classifier has not been fitted (trained) yet\n",
        "classifier.predict(input_features)"
    
       ]
      },
      {
       "cell_type": "code",
    
       "execution_count": null,
    
       "metadata": {},
    
       "outputs": [],
    
       "source": [
    
        "# Fit\n",
        "classifier.fit(input_features, labels)\n",
        "\n",
        "# Predict\n",
        "predicted_labels = classifier.predict(input_features)\n",
        "print(predicted_labels.shape)"
    
       ]
      },
      {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
        "Here we've just re-classified our training data. Let's check our result with a few examples:"
    
       ]
      },
      {
       "cell_type": "code",
    
       "execution_count": null,
    
       "metadata": {},
    
       "outputs": [],
    
       "source": [
    
        "for i in range(3):\n",
        "    print(labels[i], \"predicted as\", predicted_labels[i])"
    
       ]
      },
      {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
    
        "What, \"0 predicted as 1\"? This looks suspicious!\n",
    
        "\n",
        "Let's investigate this further:"
    
       ]
      },
      {
       "cell_type": "code",
    
       "execution_count": null,
    
       "metadata": {},
    
       "outputs": [],
    
       "source": [
        "print(len(labels), \"examples\")\n",
        "print(sum(predicted_labels == labels), \"labeled correctly\")"
       ]
      },
      {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
        "<div class=\"alert alert-block alert-info\">\n",
        "<i class=\"fa fa-info-circle\"></i>\n",
        "<code>predicted_labels == labels</code> evaluates to a vector of <code>True</code> or <code>False</code> Boolean values. When used as numbers, Python handles <code>True</code> as <code>1</code> and <code>False</code> as <code>0</code>. So, <code>sum(...)</code> simply counts the correctly predicted labels.\n",
        "</div>"
    
       ]
      },
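The counting trick above also works with plain Python lists. This is a small sketch with made-up values; `true_labels` and `guessed_labels` are hypothetical stand-ins, not the beer data.

```python
# Made-up labels, only to illustrate counting with Boolean values.
true_labels = [0, 1, 1, 0, 1]
guessed_labels = [1, 1, 1, 0, 0]

# Element-wise comparison yields one Boolean per example.
matches = [guess == truth for guess, truth in zip(guessed_labels, true_labels)]
print(matches)  # [False, True, True, True, False]

# True counts as 1 and False as 0, so sum() counts the matches.
n_correct = sum(matches)
print(n_correct, "of", len(true_labels), "labeled correctly")
```

With NumPy arrays the comparison is already element-wise, which is why `sum(predicted_labels == labels)` needs no explicit loop.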
      {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
    
        "<div style=\"font-weight: bold; font-size: 200%;\">What happened?</div>\n",
    
        "\n",
    
        "Why were not all labels predicted correctly?\n",
    
        "\n",
        "Neither `Python` nor `scikit-learn` is broken. What we observed above is very typical for machine-learning applications.\n",
        "\n",
    
        "\n",
        "- We have incomplete information: other features of beer that also contribute to the rating (like \"maltiness\") were not measured or cannot be measured.\n",
        "- The classifier used might not be suitable for the given problem.\n",
        "\n",
        "- Noise in the data, such as incorrectly assigned labels, also affects the results.\n",
    
        "\n",
    
        "\n",
    
        "**Finding sufficient features and clean data is crucial for the performance of ML algorithms!**\n",
    
        "\n",
        "\n",
        "Another important requirement is to make sure that you have clean data: input features might be corrupted by flawed entries, and feeding such data into an ML algorithm will usually lead to reduced performance."
    
       ]
      },
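One way to see why missing features or label noise put a ceiling on re-classification: if two examples have identical measured features but different labels, no classifier that only sees those features can get both right. The sketch below uses made-up `samples` (hypothetical values, not the beer data) and computes the best score any such classifier could reach.

```python
from collections import Counter, defaultdict

# Made-up (features, label) pairs; the first three share identical features.
samples = [
    ((4.5, 0.2), 1),
    ((4.5, 0.2), 0),  # same features, different label
    ((4.5, 0.2), 1),
    ((5.1, 0.4), 0),
    ((5.1, 0.4), 0),
]

# Count how often each label occurs for every distinct feature vector.
labels_by_features = defaultdict(Counter)
for feats, label in samples:
    labels_by_features[feats][label] += 1

# A classifier must give one answer per feature vector, so at best it
# matches the most frequent label within each group.
best_possible = sum(
    counts.most_common(1)[0][1] for counts in labels_by_features.values()
)
print(best_possible, "of", len(samples), "can be labeled correctly at best")
```

Real data rarely contains exact duplicates, but near-identical examples with different labels limit performance in the same way.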
      {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
    
        "## Exercise section"
    
       ]
      },
    
      {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
        "### Compare with an alternative machine learning method from `scikit-learn`"
    
       ]
      },
      {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
        "Now, using the previously loaded and prepared beer data, train a different `scikit-learn` classifier, the so-called **Support Vector Classifier** `SVC`, and evaluate its \"re-classification\" performance again.\n",
        "\n",
        "<div class=\"alert alert-block alert-info\">\n",
        "<i class=\"fa fa-info-circle\"></i>\n",
        "<code>SVC</code> belongs to a class of algorithms named \"Support Vector Machines\" (SVMs). Again, it will be discussed in more detail in the following scripts.\n",
        "</div>"
       ]
    
      },
    
      {
       "cell_type": "code",
    
       "execution_count": null,
    
       "metadata": {},
       "outputs": [],
       "source": [
        "from sklearn.svm import SVC\n",
    
        "\n",
    
        "classifier = SVC()\n",
        "# ..."
       ]
      },
    
      {
       "cell_type": "code",
    
       "execution_count": null,
    
       "metadata": {
        "tags": [
         "solution"
        ]
       },
    
       "outputs": [],
    
       "source": [
    
        "classifier = SVC()\n",
        "classifier.fit(input_features, labels)\n",
        "\n",
        "predicted_labels = classifier.predict(input_features)\n",
        "\n",
    
        "assert predicted_labels.shape == labels.shape\n",
    
        "print(len(labels), \"examples\")\n",
        "print(sum(predicted_labels == labels), \"labeled correctly\")"
    
       ]
      },
    
      {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
    
        "\n",
        "<div class=\"alert alert-block alert-info\">\n",
        "<i class=\"fa fa-info-circle\"></i>\n",
        "Better re-classification in our example does not mean that <code>SVC</code> is better than <code>LogisticRegression</code> in all cases. The performance of a classifier strongly depends on the data set.\n",
        "</div>\n",
        "\n",
        "\n"
    
       ]
      },
      {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
        "### Experiment with hyperparameters of ML methods"
    
       ]
      },
      {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
    
        "Both `LogisticRegression` and `SVC` classifiers have a hyperparameter `C` which allows you to enforce a \"simplification\" (often called **regularization**) of the resulting model. Test the beer data \"re-classification\" with different values of this parameter."
    
       ]
      },
    
      {
       "cell_type": "code",
    
       "execution_count": null,
    
       "metadata": {},
    
       "outputs": [],
       "source": [
    
        "# Recall: ?LogisticRegression\n",
    
       "cell_type": "code",
    
       "execution_count": null,
    
       "metadata": {
        "tags": [
         "solution"
        ]
       },
    
       "outputs": [],
       "source": [
    
        "from sklearn.linear_model import LogisticRegression\n",
    
        "\n",
    
        "classifier = LogisticRegression(C=2)\n",
    
        "classifier.fit(input_features, labels)\n",
    
        "predicted_labels = classifier.predict(input_features)\n",
    
        "assert predicted_labels.shape == labels.shape\n",
    
        "print(len(labels), \"examples\")\n",
    
        "print(sum(predicted_labels == labels), \"labeled correctly\")\n",
        "print(sum(predicted_labels == labels) / len(labels) * 100, \"% labeled correctly\")"
    
       ]
      },
      {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
    
        "<div class=\"alert alert-block alert-warning\">\n",
    
        "<i class=\"fa fa-warning\"></i>&nbsp;<strong>Classifiers have hyper-parameters</strong>\n",
    
        "All classifiers have hyper-parameters, e.g. the `C` we have seen before. It is a coincidence that both `LogisticRegression` and `SVC` have a parameter named `C`. Beyond that, some classifiers have more than one hyper-parameter; e.g. `SVC` also has a parameter `gamma`. More about these details later.\n",
        "</div>"
    
       ]
      },
      {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
    
        "## Optional exercise"
    
       ]
      },
      {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
        "Load and inspect the canonical Fisher's \"Iris\" data set, which is included in `scikit-learn`: see [docs for `sklearn.datasets.load_iris`](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html). What's conceptually different?\n",
    
        "Inspect the data using scatter plots.\n",
        "\n",
        "Apply `LogisticRegression` or `SVC` classifiers. Is it easier or more difficult than classification of the beer data?"
       ]
      },
    
      {
       "cell_type": "code",
       "execution_count": null,
    
       "metadata": {},
    
       "outputs": [],
    
       "source": [
        "from sklearn.datasets import load_iris\n",
        "\n",
        "data = load_iris()\n",
        "\n",
        "# labels as text\n",
    
        "print(data.target_names)\n",
    
        "\n",
        "# (rows, columns) of the feature matrix:\n",
    
        "print(data.data.shape)"
    
       ]
      },
      {
       "cell_type": "code",
       "execution_count": null,
    
       "metadata": {},
       "outputs": [],
    
       "source": [
        "# transform the scikit-learn data structure into a data frame:\n",
        "df = pd.DataFrame(data.data, columns=data.feature_names)\n",
    
        "\n",
        "# add new column\n",
    
        "df[\"class\"] = data.target\n",
        "df.head()"
       ]
      },
      {
       "cell_type": "code",
    
       "execution_count": null,
       "metadata": {},
       "outputs": [],
       "source": [
        "# SOLUTION STARTS HERE"
       ]
      },
      {
       "cell_type": "code",
       "execution_count": null,
    
       "metadata": {
    
        "scrolled": true,
        "tags": [
         "solution"
        ]
    
       },
       "outputs": [],
    
       "source": [
        "import seaborn as sns\n",
    
        "\n",
    
        "sns.set(style=\"ticks\")\n",
        "\n",
        "for_plot = df.copy()\n",
        "\n",
    
        "\n",
    
        "def transform_label(class_):\n",
        "    return data.target_names[class_]\n",
        "\n",
    
        "\n",
    
        "# seaborn does not work here if we use numeric values in the class\n",
        "# column, or strings which represent numbers. To fix this we\n",
        "# create textual class labels\n",
        "for_plot[\"class\"] = for_plot[\"class\"].apply(transform_label)\n",
    
        "sns.pairplot(for_plot, hue=\"class\", diag_kind=\"hist\");"
    
       ]
      },
      {
       "cell_type": "code",
       "execution_count": null,
    
       "metadata": {
        "tags": [
         "solution"
        ]
       },
    
       "outputs": [],
    
       "source": [
    
        "features = df.iloc[:, :-1]\n",
        "labels = df.iloc[:, -1]\n",
    
        "\n",
    
        "# classifier = SVC()\n",
    
        "classifier = LogisticRegression(max_iter=200)\n",
    
        "classifier.fit(features, labels)\n",
    
        "\n",
    
        "predicted_labels = classifier.predict(features)\n",
        "\n",
    
        "assert predicted_labels.shape == labels.shape\n",
    
        "print(len(labels), \"examples\")\n",
        "print(sum(predicted_labels == labels), \"labeled correctly\")"
    
       ]
    
      },
      {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
    
        "Copyright (C) 2019-2021 ETH Zurich, SIS ID"
    
       ]
      }
     ],
     "metadata": {
    
      "celltoolbar": "Tags",
    
      "hide_input": false,
    
      "kernelspec": {
       "display_name": "Python 3",
       "language": "python",
       "name": "python3"
      },
      "language_info": {
       "codemirror_mode": {
        "name": "ipython",
        "version": 3
       },
       "file_extension": ".py",
       "mimetype": "text/x-python",
       "name": "python",
       "nbconvert_exporter": "python",
       "pygments_lexer": "ipython3",
    
       "version": "3.7.7"
    
      },
      "latex_envs": {
       "LaTeX_envs_menu_present": true,
       "autoclose": false,
       "autocomplete": true,
       "bibliofile": "biblio.bib",
       "cite_by": "apalike",
       "current_citInitial": 1,
       "eqLabelWithNumbers": true,
       "eqNumInitial": 1,
       "hotkeys": {
        "equation": "Ctrl-E",
        "itemize": "Ctrl-I"
       },
       "labels_anchors": false,
       "latex_user_defs": false,
       "report_style_numbering": false,
       "user_envs_cfg": false
    
      },
      "toc": {
       "base_numbering": 1,
       "nav_menu": {},
       "number_sections": true,
       "sideBar": true,
       "skip_h1_title": true,
       "title_cell": "Table of Contents",
       "title_sidebar": "Contents",
       "toc_cell": false,
    
       "toc_position": {
        "height": "calc(100% - 180px)",
        "left": "10px",
        "top": "150px",
    
        "width": "178.6666717529297px"
       },
    
       "toc_section_display": true,
       "toc_window_display": true
    
      }
     },
     "nbformat": 4,
    
     "nbformat_minor": 4
    
    }