From 246066717171539dd54955106a7de85f52fcdb99 Mon Sep 17 00:00:00 2001 From: Mikolaj Rybinski <mikolaj.rybinski@id.ethz.ch> Date: Fri, 12 Feb 2021 10:18:26 +0100 Subject: [PATCH] Extend overview of preprocessing techniques; cosmetics on ColumnTransformer note --- ...ines_and_hyperparameter_optimization.ipynb | 34 ++++++++++++------- 1 file changed, 22 insertions(+), 12 deletions(-) diff --git a/05_preprocessing_pipelines_and_hyperparameter_optimization.ipynb b/05_preprocessing_pipelines_and_hyperparameter_optimization.ipynb index 42309bf..6a40532 100644 --- a/05_preprocessing_pipelines_and_hyperparameter_optimization.ipynb +++ b/05_preprocessing_pipelines_and_hyperparameter_optimization.ipynb @@ -188,7 +188,7 @@ "source": [ "We've seen before that adding polynomial features to the 2D `xor` and `circle` problem made both tasks treatable by a simple linear classifier.\n", "\n", - "Comment: we use *transformation* and *preprocessing* interchangably.\n", + "Note: we use data *transformation* and *preprocessing* interchangeably.\n", "\n", "Beyond adding polynomial features, there are other important preprocessors / transformers to mention:\n", "\n", @@ -197,20 +197,22 @@ "\n", "A scaler applies a linear transformation on every feature. Those transformations are individual per column.\n", "\n", - "The two most important ones in `scikit-learn` are\n", + "The two most important scalers in the `sklearn.preprocessing` module are:\n", "\n", - "- `MinMaxScaler`: after applying this scaler, the minumum in every column is 0, the maximum is 1.\n", + "- `MinMaxScaler`: after applying this scaler, the minimum in every column is 0, the maximum is 1.\n", "\n", "- `StandardScaler`: scales columns to mean value 0 and standard deviation 1.\n", "\n", - "The reason to use a scaler is to compensate for different orders of magnitudes of the features. 
Some classifiers like `SVC` and `KNeighborsClassifier` use eucledian distances between features internally which would impose more weight on features having large values. So **don't forget to scale your features when using SVC or KNeighborsClassifier** !\n", + "The reason to use a scaler is to compensate for different orders of magnitudes of the features. Some classifiers like `SVC` and `KNeighborsClassifier` internally use euclidean distances between samples, which would put more weight on features having large values. So **don't forget to scale features when using `SVC` or `KNeighborsClassifier`**!\n", "\n", "\n", - "### PCA\n", + "### Dimensionality reduction (PCA)\n", "\n", - "Principal component analysis is a technique to reduce the dimensionality of a multi variate data set. One benefit of PCA is to remove redundancy in your data set, such as correlating columns or linear dependencies between columns.\n", + "Reducing the dimensionality of a multivariate data set removes redundancies in it, such as highly correlated columns. We've discussed before that reducing redundancy and noise can help to avoid overfitting.\n", "\n", - "We've discussed before that reducing redundancy and noise can help to avoid overfitting.\n", + "One of the most effective techniques for dimensionality reduction is principal component analysis (PCA). Its biggest downside is that the resulting new features (principal components) cannot be directly interpreted in terms of the original features.\n", "\n", + "The `sklearn.decomposition` module contains the standard `PCA` utility, as well as many of its variants, other dimensionality reduction techniques, and more general feature matrix decomposition techniques.\n", "\n", "\n", "### Function transformers\n", @@ -219,9 +221,16 @@ "\n", "Lets assume you want to forecast the outcome of car crash experiments and one variable is the time $t$ needed for the distance $l$ from start to crash. 
Transforming this to the actual speed $\frac{l}{t}$ could be a more informative feature then $t$.\n", - "### Imputing missing values\n", + "Use the `FunctionTransformer` utility from `sklearn.preprocessing` to define and apply a function transformer.\n", + "\n", + "### Missing values imputers\n", + "\n", + "Sometimes data contain missing values. Data imputation is a strategy to fill up missing values. The Missing (Completely/Not) At Random (MCAR/MAR/MNAR) approaches, standard in statistics, are not well-suited for machine learning tasks. Instead, in the `sklearn.impute` module you will find:\n", - "Sometimes data contain missing values. Data imputation is a strategy to fill up missing values, e.g. by the columnwise mean or by applying another strategy.\n" + "* `SimpleImputer`: a columnwise mean/median/most frequent value approach that works well with a good classifier and a lot of non-missing data; otherwise use\n", + "* (semi-supervised) machine learning imputers:\n", + " * `KNNImputer`: mean value from k-Nearest Neighbors (closest samples by non-missing feature values); note: do scale features before using it,\n", + " * `IterativeImputer`: regresses each feature with missing values on the other features, in an iterated round-robin fashion.\n" ] }, { @@ -1035,9 +1044,10 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "<div class=\"alert alert-block alert-info\">\n", - "<i class=\"fa fa-info-circle\"></i> Up to now we applied preprocessing to the full feature table, whereas <code>scikit-learn</code> also allows preprocessing of single columns or a subset of them (e.g. to encode columns with categorical/string values). 
The concept in <code>scikit-learn</code> is called <code>ColumnTransformer</code>; a good overview is given in <a href=\"https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html\">a tutorial on applying <code>ColumnTransformer</code>s to mixed-type columns</a>\n", - "</div>\n" + "<div class=\"alert alert-block alert-info\"><p>\n", + "<i class=\"fa fa-info-circle\"></i> \n", + "Up to now we've applied preprocessing only to the full feature table. <strong>To preprocess single columns or a subset of them, e.g. to apply function transformers, to impute missing values, or to encode categorical columns, use the <code>ColumnTransformer</code> utility</strong>. A good overview is given in <a href=\"https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html\">a tutorial on applying <code>ColumnTransformer</code>s to mixed-type columns</a>.\n", + "</p></div>\n" ] }, { -- GitLab
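The scaler behaviour the patch describes can be sketched on toy data (a minimal illustration, not part of the notebook; the column values are made up to differ by orders of magnitude, as in the `SVC`/`KNeighborsClassifier` warning):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy feature matrix whose two columns differ by three orders of magnitude.
X = np.array([[1.0, 1000.0],
              [2.0, 2000.0],
              [3.0, 3000.0]])

# Each column rescaled linearly to span [0, 1].
X_minmax = MinMaxScaler().fit_transform(X)

# Each column shifted/scaled to mean 0 and standard deviation 1.
X_standard = StandardScaler().fit_transform(X)
```

After either transformation both columns contribute on comparable scales, so distance-based classifiers no longer over-weight the large-valued column.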
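The redundancy-removal claim in the PCA paragraph can be illustrated with a small sketch (synthetic data; the shapes and the near-duplicate second column are assumptions for the example):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(seed=0)
x = rng.normal(size=200)
# Three features, of which the second is (almost) a linear copy of the first.
X = np.column_stack([x,
                     2.0 * x + 0.01 * rng.normal(size=200),
                     rng.normal(size=200)])

# Project the 3 correlated features onto 2 principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
```

The two components retain almost all of the variance of the three original columns, but each component is a linear mixture of them, which is the interpretability downside mentioned above.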
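The car-crash example could look like this with `FunctionTransformer` (the column layout `[l, t]` and the helper name are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

def add_speed(X):
    # Assumed layout: X[:, 0] = distance l, X[:, 1] = time t;
    # append the derived feature l / t (average speed).
    return np.column_stack([X, X[:, 0] / X[:, 1]])

speed_transformer = FunctionTransformer(add_speed)

X = np.array([[100.0, 5.0],
              [200.0, 8.0]])
X_new = speed_transformer.fit_transform(X)  # third column: 20.0 and 25.0
```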
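The imputers listed in the patch can be compared on a toy matrix with one missing entry (a sketch; `IterativeImputer` is omitted here since it additionally requires the `from sklearn.experimental import enable_iterative_imputer` import):

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [np.nan, 6.0]])

# Columnwise mean strategy: the nan becomes mean([1, 3]) = 2.
X_simple = SimpleImputer(strategy="mean").fit_transform(X)

# Mean over the 2 nearest neighbours, measured on the non-missing column;
# on real multi-scale data, scale the features before using KNNImputer.
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)
```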
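The `ColumnTransformer` note could be accompanied by a minimal sketch (the column indices and the choice of encoders are assumptions; the linked tutorial covers the full mixed-types workflow):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Mixed-type features: a categorical string column and a numeric column.
X = np.array([["red", 1.0],
              ["blue", 2.0],
              ["red", 3.0]], dtype=object)

preprocess = ColumnTransformer([
    ("encode", OneHotEncoder(), [0]),   # one-hot encode only column 0
    ("scale", StandardScaler(), [1]),   # scale only column 1
])
X_out = preprocess.fit_transform(X)  # (3, 3): two one-hot columns + one scaled
```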