From 5b7fa812460e830803000a0289994cf2dc6334b2 Mon Sep 17 00:00:00 2001
From: Uwe Schmitt <uwe.schmitt@id.ethz.ch>
Date: Mon, 6 May 2019 17:44:18 +0200
Subject: [PATCH] fixed broken explanation for cross-validation

---
 03_overfitting_and_cross_validation.ipynb | 102 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 99 insertions(+), 3 deletions(-)

diff --git a/03_overfitting_and_cross_validation.ipynb b/03_overfitting_and_cross_validation.ipynb
index 290636a..dedacd8 100644
--- a/03_overfitting_and_cross_validation.ipynb
+++ b/03_overfitting_and_cross_validation.ipynb
@@ -612,8 +612,13 @@
     "\n",
     "\n",
     "## 2. How can we do better ?\n",
-    "\n",
-    "\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
     "There is no classifier which works out of the box in all situations. Depending on the \"geometry\" / \"shape\" of the data, classification algorithms and their settings can make a big difference.\n",
     "\n",
     "In our previous 2D examples we were able to visualize the data and classification results, this is not possible for higher dimensional data.\n",
@@ -621,8 +626,99 @@
     "The general way to handle this situation is as follows: \n",
     "\n",
     "- split our data into a learning data set and a test data set\n",
+    "\n",
+    "\n",
     "- train the classifier on the learning data set\n",
-    "- assess performance of the classifier on the test data set."
+    "\n",
+    "\n",
+    "- assess performance of the classifier on the test data set.\n",
+    "\n",
+    "\n",
+    "### Cross-validation\n",
+    "\n",
+    "<img src=\"https://i.imgflip.com/305azk.jpg\" title=\"made at imgflip.com\" width=40%/>\n",
+    "\n",
+    "\n",
+    "The procedure called *cross-validation* goes a step further: In this procedure the full dataset is split into learn-/test-set in various ways and statistics of the achieved metrics is computed to assess the classifier.\n",
+    "\n",
+    "A common approach is **K-fold cross-validation**:\n",
+    "\n",
+    "K-fold cross-validation has an advantage that we do not leave out part of our data from training. This is useful when we do not have a lot of data. \n",
+    "\n",
+    "### Example: 4-fold cross validation\n",
+    "\n",
+    "For 4-fold cross validation we split our data set into four equal sized partitions P1, P2, P3 and P4.\n",
+    "\n",
+    "We:\n",
+    "\n",
+    "- hold out `P1`: train the classifier on `P2 + P3 + P4`, compute accuracy `m1` on `P1`.\n",
+    "\n",
+    "<img src=\"cross_val_0.svg?2\" />\n",
+    "\n",
+    "-  hold out `P2`: train the classifier on `P1 + P3 + P4`, compute accuracy `m2` on `P2`.\n",
+    "\n",
+    "<img src=\"cross_val_1.svg?2\" />\n",
+    "\n",
+    "-  hold out `P3`: train the classifier on `P1 + P2 + P4`, compute accuray `m3` on `P3`.\n",
+    "\n",
+    "<img src=\"cross_val_2.svg?2\" />\n",
+    "\n",
+    "-  hold out `P4`: train the classifier on `P1 + P2 + P3`, compute accuracy `m4` on `P4`.\n",
+    "\n",
+    "<img src=\"cross_val_3.svg?2\" />\n",
+    "\n",
+    "Finally we can compute the average of `m1` .. `m4` as the final measure for accuracy.\n",
+    "\n",
+    "Some advice:\n",
+    "\n",
+    "- This can be done on the original data or on randomly shuffled data. It is recommended to shuffle the data first, as there might be some unknown underlying ordering in your dataset\n",
+    "\n",
+    "- Usually one uses 3- to 10-fold cross validation, depending on the amount of data available."
    ]
   },
   {
-- 
GitLab