From 8b19517ecdbe90803f7f04051689830f86ebe83d Mon Sep 17 00:00:00 2001
From: Mikolaj Rybinski <mikolaj.rybinski@id.ethz.ch>
Date: Fri, 23 Aug 2019 11:32:33 +0200
Subject: [PATCH] script 01 review: minor corrections/improvements

---
 01_introduction.ipynb | 138 ++++++++++++++++++++++++------------------
 1 file changed, 79 insertions(+), 59 deletions(-)

diff --git a/01_introduction.ipynb b/01_introduction.ipynb
index 85c4b5c..b484e03 100644
--- a/01_introduction.ipynb
+++ b/01_introduction.ipynb
@@ -161,22 +161,24 @@
     "\n",
     "- We have no explicit formula for such a task.\n",
     "\n",
-    "\n",
     "- We have a vague understanding of the problem domain, e.g. we know that some words are specific to spam emails and others are specific to my personal and work-related emails.\n",
     "\n",
-    "\n",
     "- We have enough example data, as my mailbox is full of both spam and non-spam emails.\n",
     "\n",
     "\n",
-    "We could handcraft a personal spam classifier by hard coding rules, like \"mail contains 'no prescription' and comes from russia or china\" plus some statistics which would be very tedious\n",
+    "We could handcraft a personal spam classifier by hard-coding rules, like _\"mail contains 'no prescription' and comes from Russia or China\"_, plus some statistics. This would be very tedious.\n",
     "\n",
     "<div class=\"alert alert-block alert-info\">\n",
     "<i class=\"fa fa-info-circle\"></i>\n",
     "    Systems with such hard coded rules are called <strong>expert systems</strong>\n",
     "</div>\n",
     "\n",
-    "**In such cases machine learning offers approaches to automatically build predictive models based on example data.**\n",
+    "In such cases, machine learning is a better approach.\n",
     "\n",
+    "<div class=\"alert alert-block alert-warning\">\n",
+    "<i class=\"fa fa-info-circle\"></i>\n",
+    "<strong>Machine learning</strong> offers approaches to automatically build predictive models based on example data.\n",
+    "</div>\n",
     "\n",
     "<div class=\"alert alert-block alert-info\">\n",
     "<i class=\"fa fa-info-circle\"></i>\n",
@@ -195,12 +197,19 @@
     "</div>\n",
     "\n",
     "\n",
-    "All ML algorithms have in common that they rely on internal data structures and/or parameters. Learning then builds up such data structures or adjusts parameters based on the given data. After that such models can be used to explain observations or to answer questions.\n",
+    "All ML algorithms have in common that they rely on internal data structures and/or parameters.\n",
+    "\n",
+    "<div class=\"alert alert-block alert-warning\">\n",
+    "<i class=\"fa fa-info-circle\"></i>\n",
+    "<strong>Learning</strong> builds up internal data structures or adjusts parameters of an ML method, based on the given data.\n",
+    "</div>\n",
+    "\n",
+    "After an ML method has learned from the data, it can be used to explain observations or to answer questions.\n",
     "\n",
     "The important difference between explicit models and models learned from data:\n",
     "\n",
-    "- Explicit models usually offer exact answers to questions\n",
-    "- Models we learn from data usually come with inherent uncertainty."
+    "- Explicit models usually offer exact answers to questions, whereas\n",
+    "- models learned from data usually come with inherent uncertainty."
    ]
   },
   {
@@ -474,7 +483,7 @@
       "    :Date: July; 1998\n",
       "\n",
       "This is a copy of the test set of the UCI ML hand-written digits datasets\n",
-      "http://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits\n",
+      "https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits\n",
       "\n",
       "The data set contains images of hand-written digits: 10 classes where\n",
       "each class refers to a digit.\n",
@@ -521,7 +530,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 7,
+   "execution_count": 5,
    "metadata": {},
    "outputs": [
     {
@@ -565,7 +574,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 20,
+   "execution_count": 6,
    "metadata": {},
    "outputs": [
     {
@@ -574,7 +583,7 @@
        "array([0, 1, 2, ..., 8, 9, 8])"
       ]
      },
-     "execution_count": 20,
+     "execution_count": 6,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -585,7 +594,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 9,
+   "execution_count": 7,
    "metadata": {},
    "outputs": [
     {
@@ -621,7 +630,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 10,
+   "execution_count": 8,
    "metadata": {},
    "outputs": [
     {
@@ -700,7 +709,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 11,
+   "execution_count": 9,
    "metadata": {},
    "outputs": [
     {
@@ -851,12 +860,12 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "### 1. Load the data and show the overall structure using `pandas`"
+    "### Step 1: Load the data and show the overall structure using `pandas`"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 12,
+   "execution_count": 10,
    "metadata": {},
    "outputs": [
     {
@@ -877,7 +886,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 13,
+   "execution_count": 11,
    "metadata": {},
    "outputs": [
     {
@@ -962,7 +971,7 @@
        "4         4.148710    0.570586  1.461568    0.260218         0"
       ]
      },
-     "execution_count": 13,
+     "execution_count": 11,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -974,7 +983,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 14,
+   "execution_count": 12,
    "metadata": {},
    "outputs": [
     {
@@ -1086,7 +1095,7 @@
        "max           5.955272    1.080170    7.221285    0.535315    1.000000"
       ]
      },
-     "execution_count": 14,
+     "execution_count": 12,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -1100,7 +1109,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "### 2. Visualy inspect data using `seaborn`\n",
+    "### Step 2: Visually inspect data using `seaborn`\n",
     "\n",
     "Such checks are very useful before you start throwning ML on your data. Some vague understanding how features are distributed and correlate can later be very helpfull to optimize performance of ML procedures.\n",
     "\n"
@@ -1108,7 +1117,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 15,
+   "execution_count": 13,
    "metadata": {},
    "outputs": [
     {
@@ -1161,12 +1170,12 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "### 3. Prepare data: split features and labels"
+    "### Step 3: Prepare data: split features and labels"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 16,
+   "execution_count": 14,
    "metadata": {},
    "outputs": [
     {
@@ -1217,7 +1226,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "### 4. Start machine learning using `scikit-learn`"
+    "### Step 4: Start machine learning using `scikit-learn`"
    ]
   },
   {
@@ -1234,19 +1243,20 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 17,
+   "execution_count": 15,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
        "LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n",
-       "          intercept_scaling=1, max_iter=100, multi_class='warn',\n",
-       "          n_jobs=None, penalty='l2', random_state=None, solver='warn',\n",
-       "          tol=0.0001, verbose=0, warm_start=False)"
+       "                   intercept_scaling=1, l1_ratio=None, max_iter=100,\n",
+       "                   multi_class='warn', n_jobs=None, penalty='l2',\n",
+       "                   random_state=None, solver='warn', tol=0.0001, verbose=0,\n",
+       "                   warm_start=False)"
       ]
      },
-     "execution_count": 17,
+     "execution_count": 15,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -1285,7 +1295,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 18,
+   "execution_count": 16,
    "metadata": {},
    "outputs": [
     {
@@ -1295,9 +1305,9 @@
      "traceback": [
       "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
       "\u001b[0;31mNotFittedError\u001b[0m                            Traceback (most recent call last)",
-      "\u001b[0;32m<ipython-input-18-9e1ed3d39774>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;31m# Sanity check: can't predict if not fitted (trained)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mclassifier\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpredict\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0minput_features\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
-      "\u001b[0;32m~/Projects/machinelearning-introduction-workshop/venv37/lib/python3.7/site-packages/sklearn/linear_model/base.py\u001b[0m in \u001b[0;36mpredict\u001b[0;34m(self, X)\u001b[0m\n\u001b[1;32m    279\u001b[0m             \u001b[0mPredicted\u001b[0m \u001b[0;32mclass\u001b[0m \u001b[0mlabel\u001b[0m \u001b[0mper\u001b[0m \u001b[0msample\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    280\u001b[0m         \"\"\"\n\u001b[0;32m--> 281\u001b[0;31m         \u001b[0mscores\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdecision_function\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    282\u001b[0m         \u001b[0;32mif\u001b[0m \u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mscores\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mshape\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    283\u001b[0m             \u001b[0mindices\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0mscores\u001b[0m \u001b[0;34m>\u001b[0m \u001b[0;36m0\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mastype\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mint\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
-      "\u001b[0;32m~/Projects/machinelearning-introduction-workshop/venv37/lib/python3.7/site-packages/sklearn/linear_model/base.py\u001b[0m in \u001b[0;36mdecision_function\u001b[0;34m(self, X)\u001b[0m\n\u001b[1;32m    253\u001b[0m         \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mhasattr\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'coef_'\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcoef_\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    254\u001b[0m             raise NotFittedError(\"This %(name)s instance is not fitted \"\n\u001b[0;32m--> 255\u001b[0;31m                                  \"yet\" % {'name': type(self).__name__})\n\u001b[0m\u001b[1;32m    256\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    257\u001b[0m         \u001b[0mX\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcheck_array\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maccept_sparse\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'csr'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;32m<ipython-input-16-9e1ed3d39774>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;31m# Sanity check: can't predict if not fitted (trained)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mclassifier\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpredict\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0minput_features\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+      "\u001b[0;32m~/Workspace/courses/machinelearning-introduction-workshop/.venv/lib/python3.7/site-packages/sklearn/linear_model/base.py\u001b[0m in \u001b[0;36mpredict\u001b[0;34m(self, X)\u001b[0m\n\u001b[1;32m    287\u001b[0m             \u001b[0mPredicted\u001b[0m \u001b[0;32mclass\u001b[0m \u001b[0mlabel\u001b[0m \u001b[0mper\u001b[0m \u001b[0msample\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    288\u001b[0m         \"\"\"\n\u001b[0;32m--> 289\u001b[0;31m         \u001b[0mscores\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdecision_function\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    290\u001b[0m         \u001b[0;32mif\u001b[0m \u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mscores\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mshape\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    291\u001b[0m             \u001b[0mindices\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0mscores\u001b[0m \u001b[0;34m>\u001b[0m \u001b[0;36m0\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mastype\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mint\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;32m~/Workspace/courses/machinelearning-introduction-workshop/.venv/lib/python3.7/site-packages/sklearn/linear_model/base.py\u001b[0m in \u001b[0;36mdecision_function\u001b[0;34m(self, X)\u001b[0m\n\u001b[1;32m    261\u001b[0m         \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mhasattr\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'coef_'\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcoef_\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    262\u001b[0m             raise NotFittedError(\"This %(name)s instance is not fitted \"\n\u001b[0;32m--> 263\u001b[0;31m                                  \"yet\" % {'name': type(self).__name__})\n\u001b[0m\u001b[1;32m    264\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    265\u001b[0m         \u001b[0mX\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcheck_array\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maccept_sparse\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'csr'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
       "\u001b[0;31mNotFittedError\u001b[0m: This LogisticRegression instance is not fitted yet"
      ]
     }
@@ -1309,7 +1319,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 21,
+   "execution_count": 17,
    "metadata": {},
    "outputs": [
     {
@@ -1338,38 +1348,36 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 22,
+   "execution_count": 18,
    "metadata": {},
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "0 0\n",
-      "0 1\n",
-      "1 1\n",
-      "1 1\n",
-      "0 0\n"
+      "0 predicted as 0\n",
+      "0 predicted as 1\n",
+      "1 predicted as 1\n"
      ]
     }
    ],
    "source": [
-    "for i in range(5):\n",
-    "    print(labels[i], predicted_labels[i])"
+    "for i in range(3):\n",
+    "    print(labels[i], \"predicted as\", predicted_labels[i])"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "This looks suspicious !\n",
+    "What, \"0 predicted as 1\"? This looks suspicious!\n",
     "\n",
     "Lets investigate this further:"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 23,
+   "execution_count": 19,
    "metadata": {},
    "outputs": [
     {
@@ -1400,9 +1408,9 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## What happened?\n",
+    "##### What happened?\n",
     "\n",
-    "Why were not  all labels  predicted correctly?\n",
+    "Why were not all labels predicted correctly?\n",
     "\n",
     "Neither `Python` nor `scikit-learn` is broken. What we observed above is very typical for machine-learning applications.\n",
     "\n",
@@ -1449,7 +1457,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 24,
+   "execution_count": 20,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -1460,7 +1468,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 25,
+   "execution_count": 21,
    "metadata": {
     "tags": [
      "solution"
@@ -1517,7 +1525,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 26,
+   "execution_count": 22,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -1527,7 +1535,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 27,
+   "execution_count": 23,
    "metadata": {
     "tags": [
      "solution"
@@ -1591,7 +1599,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 28,
+   "execution_count": 24,
    "metadata": {},
    "outputs": [
     {
@@ -1617,7 +1625,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 29,
+   "execution_count": 25,
    "metadata": {},
    "outputs": [
     {
@@ -1709,7 +1717,7 @@
        "4      0  "
       ]
      },
-     "execution_count": 29,
+     "execution_count": 25,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -1725,7 +1733,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 1,
+   "execution_count": 26,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -1734,7 +1742,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 31,
+   "execution_count": 27,
    "metadata": {
     "scrolled": true,
     "tags": [
@@ -1776,7 +1784,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 32,
+   "execution_count": 28,
    "metadata": {
     "tags": [
      "solution"
@@ -1806,6 +1814,13 @@
     "print(len(labels), \"examples\")\n",
     "print(sum(predicted_labels == labels), \"labeled correctly\")"
    ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Copyright (C) 2019 ETH Zurich, SIS ID"
+   ]
   }
  ],
  "metadata": {
@@ -1825,7 +1840,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.7.2"
+   "version": "3.7.4"
   },
   "latex_envs": {
    "LaTeX_envs_menu_present": true,
@@ -1854,7 +1869,12 @@
    "title_cell": "Table of Contents",
    "title_sidebar": "Contents",
    "toc_cell": false,
-   "toc_position": {},
+   "toc_position": {
+    "height": "calc(100% - 180px)",
+    "left": "10px",
+    "top": "150px",
+    "width": "288px"
+   },
    "toc_section_display": true,
    "toc_window_display": true
   }
-- 
GitLab