From 246066717171539dd54955106a7de85f52fcdb99 Mon Sep 17 00:00:00 2001
From: Mikolaj Rybinski <mikolaj.rybinski@id.ethz.ch>
Date: Fri, 12 Feb 2021 10:18:26 +0100
Subject: [PATCH] Extend overview of preprocessing techniques; cosmetics on
 ColumnTransformer note

---
 ...ines_and_hyperparameter_optimization.ipynb | 34 ++++++++++++-------
 1 file changed, 22 insertions(+), 12 deletions(-)

diff --git a/05_preprocessing_pipelines_and_hyperparameter_optimization.ipynb b/05_preprocessing_pipelines_and_hyperparameter_optimization.ipynb
index 42309bf..6a40532 100644
--- a/05_preprocessing_pipelines_and_hyperparameter_optimization.ipynb
+++ b/05_preprocessing_pipelines_and_hyperparameter_optimization.ipynb
@@ -188,7 +188,7 @@
    "source": [
     "We've seen before that adding polynomial features to the 2D `xor` and `circle` problem made both tasks treatable by a simple linear classifier.\n",
     "\n",
-    "Comment: we use *transformation* and *preprocessing* interchangably.\n",
+    "Note: we use data *transformation* and *preprocessing* interchangeably.\n",
     "\n",
     "Beyond adding polynomial features, there are other important preprocessors / transformers to mention:\n",
     "\n",
@@ -197,20 +197,22 @@
     "\n",
     "A scaler applies a linear transformation on every feature. Those transformations are individual per column.\n",
     "\n",
-    "The two most important ones in `scikit-learn` are\n",
+    "The two most important scalers in `sklearn.preprocessing` module are:\n",
     "\n",
-    "- `MinMaxScaler`:  after applying this scaler, the minumum in every column is 0, the maximum is 1.\n",
+    "- `MinMaxScaler`: after applying this scaler, the minimum in every column is 0, the maximum is 1.\n",
     "\n",
     "- `StandardScaler`: scales columns to mean value 0 and standard deviation 1.\n",
     "\n",
-    "The reason to use a scaler is to compensate for different orders of magnitudes of the features. Some classifiers like `SVC` and `KNeighborsClassifier` use eucledian distances between features internally which would impose more weight on features having large values. So **don't forget to scale your features when using SVC or KNeighborsClassifier** !\n",
+    "The reason to use a scaler is to compensate for different orders of magnitude of the features. Some classifiers like `SVC` and `KNeighborsClassifier` internally use Euclidean distances between samples, which would put more weight on features having large values. So **don't forget to scale features when using `SVC` or `KNeighborsClassifier`**!\n",
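+    "\n",
+    "As a minimal sketch (the small feature matrix `X` below is made up for illustration), both scalers follow the usual `fit`/`transform` pattern:\n",
+    "\n",
+    "```python\n",
+    "import numpy as np\n",
+    "from sklearn.preprocessing import MinMaxScaler, StandardScaler\n",
+    "\n",
+    "# two features with very different orders of magnitude\n",
+    "X = np.array([[1.0, 1000.0], [2.0, 3000.0], [3.0, 2000.0]])\n",
+    "\n",
+    "X_minmax = MinMaxScaler().fit_transform(X)  # every column now spans [0, 1]\n",
+    "X_std = StandardScaler().fit_transform(X)   # every column now has mean 0, std 1\n",
+    "```\n",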
     "\n",
     "\n",
-    "### PCA\n",
+    "### Dimensionality reduction (PCA)\n",
     "\n",
-    "Principal component analysis is a technique to reduce the dimensionality of a multi variate data set. One benefit of PCA is to remove redundancy in your data set, such as correlating columns or linear dependencies between columns.\n",
+    "Reducing the dimensionality of a multivariate data set removes redundancies in it, such as highly correlated columns. We've discussed before that reducing redundancy and noise can help to avoid overfitting.\n",
     "\n",
-    "We've discussed before that reducing redundancy and noise can help to avoid overfitting.\n",
+    "One of the most effective techniques for dimensionality reduction is Principal Component Analysis (PCA). Its biggest downside is that the resulting new features (principal components) cannot be directly interpreted in terms of the original features.\n",
+    "\n",
+    "The `sklearn.decomposition` module contains the standard `PCA` utility, as well as many of its variants, other dimensionality reduction techniques, and more general feature matrix decomposition techniques.\n",
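+    "\n",
+    "As a hedged sketch with made-up data: two highly correlated columns are compressed into a single principal component with almost no loss of variance:\n",
+    "\n",
+    "```python\n",
+    "import numpy as np\n",
+    "from sklearn.decomposition import PCA\n",
+    "\n",
+    "rng = np.random.RandomState(42)\n",
+    "x = rng.normal(size=100)\n",
+    "# second column is (almost) a linear function of the first => redundant\n",
+    "X = np.column_stack([x, 2 * x + 0.1 * rng.normal(size=100)])\n",
+    "\n",
+    "pca = PCA(n_components=1)\n",
+    "X_reduced = pca.fit_transform(X)       # shape (100, 1): one principal component\n",
+    "print(pca.explained_variance_ratio_)   # close to 1.0: almost no information lost\n",
+    "```\n",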
     "\n",
     "\n",
     "### Function transformers\n",
@@ -219,9 +221,16 @@
     "\n",
     "Lets assume you want to forecast the outcome of car crash experiments and one variable is the time $t$ needed for the distance $l$ from start to crash. Transforming this to the actual speed $\\frac{l}{t}$ could be a more informative feature then $t$.\n",
     "\n",
-    "### Imputing missing values\n",
+    "Use the `FunctionTransformer` utility from `sklearn.preprocessing` to define and apply such a custom transformation.\n",
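+    "\n",
+    "A minimal sketch for the car crash example above (the distance/time columns and values are made up):\n",
+    "\n",
+    "```python\n",
+    "import numpy as np\n",
+    "from sklearn.preprocessing import FunctionTransformer\n",
+    "\n",
+    "# columns: distance l, time t\n",
+    "X = np.array([[100.0, 5.0], [200.0, 8.0]])\n",
+    "\n",
+    "# derive the speed l/t as a new single feature\n",
+    "to_speed = FunctionTransformer(lambda X: (X[:, 0] / X[:, 1]).reshape(-1, 1))\n",
+    "X_speed = to_speed.fit_transform(X)\n",
+    "```\n",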
+    "\n",
+    "### Missing value imputers\n",
+    "\n",
+    "Sometimes data contain missing values. Data imputation is a strategy to fill in missing values. The Missing (Completely/Not) At Random (MCAR/MAR/MNAR) approaches, standard in statistics, are not well-suited for machine learning tasks. Instead, in the `sklearn.impute` module you will find:\n",
     "\n",
-    "Sometimes data contain missing values. Data imputation is a strategy to fill up missing values, e.g. by the columnwise mean or by applying another strategy.\n"
+    "* `SimpleImputer`: a columnwise mean/median/most frequent value approach that works well with a good classifier and a lot of non-missing data; otherwise use\n",
+    "* (semi-supervised) machine learning imputers:\n",
+    "    * `KNNImputer`: mean value from k-Nearest Neighbors (closest samples by non-missing feature values); note: do scale features before using it,\n",
+    "    * `IterativeImputer`: regresses each feature with missing values on other features, in an iterated round-robin fashion over each feature.\n"
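+    "\n",
+    "A hedged sketch of `SimpleImputer` and `KNNImputer` on a tiny made-up matrix with one missing value:\n",
+    "\n",
+    "```python\n",
+    "import numpy as np\n",
+    "from sklearn.impute import SimpleImputer, KNNImputer\n",
+    "\n",
+    "X = np.array([[1.0, 2.0], [np.nan, 4.0], [3.0, 6.0]])\n",
+    "\n",
+    "X_mean = SimpleImputer(strategy=\"mean\").fit_transform(X)  # NaN -> column mean\n",
+    "X_knn = KNNImputer(n_neighbors=2).fit_transform(X)        # NaN -> mean of 2 nearest rows\n",
+    "```\n"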
    ]
   },
   {
@@ -1035,9 +1044,10 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "<div class=\"alert alert-block alert-info\">\n",
-    "<i class=\"fa fa-info-circle\"></i>&nbsp;Up to now we applied preprocessing to the full feature table, whereas <code>scikit-learn</code> also allows preprocessing of single columns or a subset of them (e.g. to encode columns with categorical/string values). The concept in <code>scikit-learn</code> is called <code>ColumnTransformer</code>; a good overview is given in <a href=\"https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html\">a tutorial on applying <code>ColumnTransformer</code>s to mixed-type columns</a>\n",
-    "</div>\n"
+    "<div class=\"alert alert-block alert-info\"><p>\n",
+    "<i class=\"fa fa-info-circle\"></i>&nbsp;\n",
+    "Up to now we've applied preprocessing only to the full feature table. <strong>To preprocess single columns or a subset of them, e.g. to apply function transformers, impute missing values, or encode categorical columns, use the <code>ColumnTransformer</code> utility</strong>. A good overview is given in <a href=\"https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html\">a tutorial on applying <code>ColumnTransformer</code>s to mixed-type columns</a>.\n",
+    "</p></div>\n"
    ]
   },
   {
-- 
GitLab