%% Cell type:code id: tags:
``` python
# IGNORE THIS CELL WHICH CUSTOMIZES LAYOUT AND STYLING OF THE NOTEBOOK!
import warnings

import matplotlib.pyplot as plt
from IPython.core.display import HTML

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning)
# silence any further filterwarnings configuration done by imported code:
warnings.filterwarnings = lambda *a, **kw: None

HTML(open("custom.html", "r").read())
```
%% Output
<IPython.core.display.HTML object>
%% Cell type:markdown id: tags:
# Chapter 5: Preprocessing, pipelines and hyperparameter optimization
%% Cell type:markdown id: tags:
## About transformations / preprocessing
%% Cell type:markdown id: tags:
We've seen before that adding polynomial features to the 2D `xor` and `circle` problems made both tractable for a simple linear classifier.
Note: we use the terms data *transformation* and *preprocessing* interchangeably.
Beyond adding polynomial features, there are other important preprocessors / transformers worth mentioning:
### Scaler
A scaler applies a linear transformation to every feature. These transformations are computed individually per column.
The two most important scalers in the `sklearn.preprocessing` module are:
- `MinMaxScaler`: after applying this scaler, the minimum in every column is 0 and the maximum is 1.
- `StandardScaler`: scales columns to mean value 0 and standard deviation 1.
The reason to use a scaler is to compensate for different orders of magnitude of the features. Some classifiers like `SVC` and `KNeighborsClassifier` internally use Euclidean distances between feature vectors, which would give more weight to features with large values. So **don't forget to scale features when using `SVC` or `KNeighborsClassifier`**!
#### Why scaling?
Let us assume we have two features `x` and `y` with different ranges: `x` lies in the range `0` to `3` and `y` in `0` to `1`. We have sampled both variables with the same resolution:
<img src="images/different_scales.png">
%% Cell type:markdown id: tags:
Some classifiers like `SVC` and `KNeighborsClassifier` internally use Euclidean distances between feature vectors. They assume that samples which are close in distance also have similar classes or target values.
So let us check which points have a Euclidean distance to `(1.5, 0.5)` below `0.2`:
<img src="images/before_rescaling.png"/>
As you can see, the points in the circle cover a large range of `x` values but only a small range of `y` values. So the feature `x` would have a strong influence on how such a classifier works, and `y` much less.
You can also see that on absolute scales the circle is not a circle:
<img src="images/before_rescaling_skewed_circle.png"/>
When we scale `x` to the range `0` to `1` the situation is as follows:
<img src="images/after_rescaling.png" >
The reason to use a scaler is to compensate
- for different orders of magnitude
- and different physical units
of the features. So **don't forget to scale your features when using `SVC` or `KNeighborsClassifier`**!
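To make this concrete, here is a minimal numeric sketch (with made-up points, not taken from the plots above) showing how the unscaled `x` dominates the Euclidean distance, and how scaling `x` into the range `0` to `1` evens out the contributions:
%% Cell type:code id: tags:
``` python
import numpy as np

a = np.array([3.0, 0.5])
b = np.array([0.0, 1.0])

# unscaled: the difference in x (range 0..3) dominates the distance
print("unscaled distance:", np.linalg.norm(a - b))

# after scaling x into 0..1, both features contribute comparably
a_scaled = np.array([3.0 / 3, 0.5])
b_scaled = np.array([0.0 / 3, 1.0])
print("scaled distance:", np.linalg.norm(a_scaled - b_scaled))
```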
%% Cell type:markdown id: tags:
### Dimensionality reduction (PCA)
Reducing the dimensionality of a multivariate data set removes redundancies in it, such as highly correlated columns. We've discussed before that reducing redundancy and noise can help to avoid overfitting.
One of the most effective techniques for dimensionality reduction is Principal Component Analysis (PCA). Its biggest downside is that the resulting few new features (the principal components) cannot be directly interpreted in terms of the original features.
The `sklearn.decomposition` module contains the standard `PCA` utility, as well as many of its variants, other dimensionality reduction techniques, and some more general feature matrix decomposition techniques.
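As a minimal sketch (on made-up random data, just to show the API), `PCA` follows the same fit / transform pattern as the scalers:
%% Cell type:code id: tags:
``` python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# 50 samples, 3 informative columns plus 2 noisy copies of the first two
# columns, i.e. highly correlated / redundant features:
X = rng.random_sample((50, 3))
X = np.hstack([X, X[:, :2] + 0.01 * rng.random_sample((50, 2))])

pca = PCA(n_components=3).fit(X)
X_reduced = pca.transform(X)
print("reduced shape:", X_reduced.shape)
# fraction of the total variance captured by each principal component:
print("explained variance ratios:", pca.explained_variance_ratio_)
```
%% Cell type:markdown id: tags: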
### Function transformers
Applying functions like `log`, `exp` or `1/x` to features can improve classification performance.
Let's assume you want to forecast the outcome of car crash experiments and one variable is the time $t$ needed for the distance $l$ from start to crash. Transforming this to the actual speed $\frac{l}{t}$ could be a more informative feature than $t$.
Use the `FunctionTransformer` utility from `sklearn.preprocessing` to define and apply a function transformer, as sketched below.
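A minimal sketch for the car crash example (with a hypothetical column layout: distance $l$ first, time $t$ second):
%% Cell type:code id: tags:
``` python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# made-up data: column 0 is the distance l, column 1 is the time t
X = np.array([[10.0, 2.0], [20.0, 2.5], [15.0, 3.0]])

def add_speed(features):
    # append the derived feature l / t as an extra column
    speed = features[:, 0] / features[:, 1]
    return np.column_stack([features, speed])

transformer = FunctionTransformer(add_speed)
print(transformer.fit_transform(X))
```
%% Cell type:markdown id: tags: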
### Missing values imputers
Sometimes data contains missing values. Data imputation is a strategy to fill in such missing values. The Missing (Completely / Not) At Random (MCAR; MAR; MNAR) approaches that are standard in statistics are not well-suited for machine learning tasks. Instead, in the `sklearn.impute` module you will find (see the sketch after this list):
* `SimpleImputer`: a columnwise mean/median/most-frequent-value approach that works well with a good classifier and a lot of non-missing data; otherwise use
* (semi-supervised) machine learning imputers:
    * `KNNImputer`: mean value from the k nearest neighbors (closest samples by non-missing feature values); note: do scale features before using it,
    * `IterativeImputer`: regresses each feature with missing values on the other features, in an iterated round-robin fashion over each feature.
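A minimal sketch of `SimpleImputer` on a made-up matrix with missing values:
%% Cell type:code id: tags:
``` python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [5.0, np.nan],
              [7.0, 8.0]])

# replace each missing entry by the columnwise mean of the non-missing values
imputer = SimpleImputer(strategy="mean")
print(imputer.fit_transform(X))
```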
%% Cell type:markdown id: tags:
## About scaling
%% Cell type:markdown id: tags:
As an example, we demonstrate how a scaler can be implemented. Our scaling strategy will scale given values to the range 0 to 1.
First we create a random data vector and compute its min and max values:
%% Cell type:code id: tags:
``` python
import numpy as np
# for reproducible numbers:
np.random.seed(42)
values = np.random.random((5,)) * 20 - 10
min_value = values.min()
max_value = values.max()
print("values:", values)
print()
print("min value:", min_value)
print("max value:", max_value)
```
%% Output
values: [-2.50919762 9.01428613 4.63987884 1.97316968 -6.87962719]
min value: -6.87962719115127
max value: 9.014286128198322
%% Cell type:markdown id: tags:
The strategy for scaling is as follows: Our values $v$ are in the range $v_{min}$ to $v_{max}$:
$$
v_{min} \le v \le v_{max}
$$
Then subtracting $v_{min}$ results in
$$
0 \le v - v_{min} \le v_{max} - v_{min}
$$
Finally dividing by the right hand side delivers the property we are looking for:
$$
0 \le \frac{v - v_{min}}{v_{max} - v_{min}} \le 1
$$
In Python:
%% Cell type:code id: tags:
``` python
scaled_values = (values - min_value) / (max_value - min_value)
print("scaled values:", scaled_values)
```
%% Output
scaled values: [0.27497505 1. 0.72477469 0.5569929 0. ]
%% Cell type:markdown id: tags:
You can see that all values are now scaled as intended.
To apply the same strategy column per column to a feature matrix, `scikit-learn` offers a `MinMaxScaler`:
%% Cell type:code id: tags:
``` python
features = np.random.random((5, 3)) * 20 - 10
print(features)
```
%% Output
[[-6.88010959 -8.83832776 7.32352292]
[ 2.02230023 4.16145156 -9.58831011]
[ 9.39819704 6.64885282 -5.75321779]
[-6.36350066 -6.3319098 -3.91515514]
[ 0.49512863 -1.36109963 -4.1754172 ]]
%% Cell type:code id: tags:
``` python
from sklearn.preprocessing import MinMaxScaler
# learning -> determine columnwise min/max values
scaler = MinMaxScaler().fit(features)
# transformation! -> apply the linear transformation based on the min/max values:
print(scaler.transform(features))
```
%% Output
[[0. 0. 1. ]
[0.54688796 0.83938966 0. ]
[1. 1. 0.22676976]
[0.03173604 0.16183823 0.33545476]
[0.45307159 0.48280112 0.32006542]]
%% Cell type:code id: tags:
``` python
# shorter: .fit_transform combines both steps
print(scaler.fit_transform(features))
```
%% Output
[[0. 0. 1. ]
[0.54688796 0.83938966 0. ]
[1. 1. 0.22676976]
[0.03173604 0.16183823 0.33545476]
[0.45307159 0.48280112 0.32006542]]
%% Cell type:markdown id: tags:
We can divide preprocessing into two classes:
1. Preprocessing which depends on the full data set, e.g.
    - scaling,
    - PCA,
    - many variants of missing values imputation.
2. Preprocessing which can be applied row per row individually, e.g.
    - adding polynomial features,
    - function transformers,
    - row-wise scaling (e.g. when a row represents an image and we want to compensate for different illumination).
<div class="alert alert-block alert-warning">
<div style="font-size: 150%;">
<i class="fa fa-info-circle"></i>&nbsp;Important
</div>
When we include preprocessing in a classification approach, we must later **apply exactly the same preprocessing to new incoming data**!
For preprocessors which depend on the full data set this implies that we must never fit them on the full data set before cross-validation!
Running such preprocessors on the full data set lets information from "unseen" data sneak into the classifier.
</div>
%% Cell type:markdown id: tags:
<div style="font-size: 150%;">This is how we must proceed instead:</div>
In the case of the `MinMaxScaler` preprocessor:
1. Determine column-wise minimum and maximum values of the training features.
2. Use these min/max values to scale the training data.
3. Learn classifier `C` on the scaled training data.
4. Use the values from 1. to scale the evaluation data (thus, we might create values outside `0..1`).
5. Apply classifier `C` to the scaled evaluation data.
6. Assess the performance of `C`.
In general:
1. Learn preprocessor `P` on the training data.
2. Apply `P` to the training data.
3. Learn classifier `C` on the preprocessed training data.
4. Apply `P` from 1. to the evaluation data.
5. Apply classifier `C` to the preprocessed evaluation data.
6. Assess the performance of `C`.
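Here is a minimal sketch of this recipe with `MinMaxScaler` and `SVC`, using the beer data set (`data/beers.csv`, also used below) and an illustrative 70:30 train/evaluation split:
%% Cell type:code id: tags:
``` python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

beer_data = pd.read_csv("data/beers.csv")
features = beer_data.iloc[:, :-1]
labels = beer_data.iloc[:, -1]

features_train, features_eval, labels_train, labels_eval = train_test_split(
    features, labels, test_size=0.3, random_state=42
)

# 1. + 2. learn min/max values on the training data only, then scale it:
scaler = MinMaxScaler().fit(features_train)
features_train_scaled = scaler.transform(features_train)

# 3. learn the classifier on the scaled training data:
classifier = SVC().fit(features_train_scaled, labels_train)

# 4. scale the evaluation data with the min/max values from step 1
#    (values outside 0..1 may occur here):
features_eval_scaled = scaler.transform(features_eval)

# 5. + 6. predict on the scaled evaluation data and assess performance:
print("accuracy:", classifier.score(features_eval_scaled, labels_eval))
```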
%% Cell type:markdown id: tags:
<img src="./images/2xi5wt.jpg" width=50%/>
%% Cell type:markdown id: tags:
## The scikit-learn API (quick recap)
We've seen before that we can swap `scikit-learn` classifiers easily without changing much code.
This is possible because all classifiers have `.fit` and `.predict` methods with the same function signatures (the number and meaning of the arguments are the same for every implementation of `.fit` and `.predict`, respectively).
This consistent design within `scikit-learn` also applies to preprocessors / transformers, which all have `.fit`, `.transform` and `.fit_transform` methods.
This consistent API allows setting up **processing pipelines**:
%% Cell type:markdown id: tags:
## Pipelines
A so-called classification pipeline consists of zero or more preprocessors plus a final classifier.
Let us start with the following pipeline:
1. Use PCA to reduce the data to 3 dimensions.
2. Apply scaling to mean 0 and standard deviation 1.
3. Train an `SVC` classifier.
%% Cell type:code id: tags:
``` python
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
p = make_pipeline(PCA(3), StandardScaler(), SVC())
```
%% Cell type:markdown id: tags:
<div class="alert alert-block alert-warning">
<p><i class="fa fa-info-circle"></i>
A pipeline "behaves" like a single classifier - it implements <code>.fit()</code> and <code>.predict()</code> methods.</p>
</div>
%% Cell type:code id: tags:
``` python
print("p.fit ", p.fit is not None)
print("p.predict", p.predict is not None)
```
%% Output
p.fit True
p.predict True
%% Cell type:markdown id: tags:
Because of this we can also use cross-validation in the same way as we did before:
%% Cell type:code id: tags:
``` python
import pandas as pd
beer_data = pd.read_csv("data/beers.csv")
features = beer_data.iloc[:, :-1]
labels = beer_data.iloc[:, -1]
from sklearn.model_selection import cross_val_score
print(cross_val_score(p, features, labels, scoring="accuracy", cv=5).mean())
```
%% Output
0.928888888888889
%% Cell type:markdown id: tags:
<div class="alert alert-block alert-warning">
<i class="fa fa-info-circle"></i>&nbsp;One benefit of using a pipeline is that you will not mistakenly scale the full data set first, instead we follow the strategy we've described above automatically.
</div>
%% Cell type:markdown id: tags:
Bonus: you can easily visualize more complex pipelines:
%% Cell type:code id: tags:
``` python
from sklearn import set_config
set_config(display="diagram")
p
```
%% Output
Pipeline(steps=[('pca', PCA(n_components=3)),
('standardscaler', StandardScaler()), ('svc', SVC())])
%% Cell type:markdown id: tags:
### How to set up a good pipeline?
Regrettably, there is no recipe for setting up a well-performing classification pipeline beyond reasonable preprocessing, especially feature engineering. After that it is up to experimentation and the advice on how to choose classifiers given in the last script.
Let us try out different pipelines and evaluate them:
%% Cell type:code id: tags:
``` python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures, StandardScaler
from sklearn.svm import SVC
for p in [
    make_pipeline(KNeighborsClassifier()),
    make_pipeline(StandardScaler(), KNeighborsClassifier()),
    make_pipeline(SVC()),
    make_pipeline(StandardScaler(), SVC()),
    make_pipeline(MinMaxScaler(), SVC()),
    make_pipeline(LogisticRegression()),
    make_pipeline(StandardScaler(), PCA(3), LogisticRegression()),
    make_pipeline(StandardScaler(), PCA(2), LogisticRegression()),
    make_pipeline(PolynomialFeatures(), SVC()),
    # tech: max_iter to prevent convergence warning
    make_pipeline(PolynomialFeatures(), LogisticRegression(max_iter=10000)),
]:
    print(
        "{:.3f}".format(
            cross_val_score(p, features, labels, scoring="accuracy", cv=5).mean()
        ),
        end=" ",
    )
    print([pi[0] for pi in p.steps])
```
%% Output
0.840 ['kneighborsclassifier']
0.938 ['standardscaler', 'kneighborsclassifier']
0.827 ['svc']
0.947 ['standardscaler', 'svc']
0.947 ['minmaxscaler', 'svc']
0.911 ['logisticregression']
0.920 ['standardscaler', 'pca', 'logisticregression']
0.893 ['standardscaler', 'pca', 'logisticregression']
0.791 ['polynomialfeatures', 'svc']
0.956 ['polynomialfeatures', 'logisticregression']
%% Cell type:markdown id: tags:
### Exercise session
1. Can you come up with a better performing classification pipeline?
%% Cell type:code id: tags:solution
``` python
# SOLUTION
for p in [
    make_pipeline(StandardScaler(), SVC()),  # previously best pipeline
    make_pipeline(StandardScaler(), SVC(C=25, gamma=0.05)),  # better!
]:
    print(
        "{:.3f}".format(
            cross_val_score(p, features, labels, scoring="accuracy", cv=5).mean()
        ),
        end=" ",
    )
    print([pi[0] for pi in p.steps])
```
%% Output
0.947 ['standardscaler', 'svc']
0.969 ['standardscaler', 'svc']
%% Cell type:markdown id: tags:
#### Optional exercises
1. Build classification pipelines to classify the 2D xor and circle data sets with linear classifiers. Also assess their performance.
2. Build a classification pipeline for the digits data set. This data set was described in the first script, where we've shown how to flatten images to feature vectors. Experiment with `PCA` preprocessing, `StandardScaler` and `SVC`. Show a few of the correctly and incorrectly labeled images.
%% Cell type:code id: tags:solution
``` python
# SOLUTION
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC
def check_pipelines(data):
    features = data.iloc[:, :-1]
    labels = data.iloc[:, -1]
    for p in [
        make_pipeline(StandardScaler(), LogisticRegression()),
        make_pipeline(StandardScaler(), LinearSVC()),
        make_pipeline(PolynomialFeatures(2), LogisticRegression()),
        make_pipeline(PolynomialFeatures(2), LinearSVC()),
        make_pipeline(PolynomialFeatures(4), StandardScaler(), LogisticRegression()),
        make_pipeline(PolynomialFeatures(4), StandardScaler(), LinearSVC()),
    ]:
        print(
            "{:.3f}".format(
                cross_val_score(p, features, labels, scoring="accuracy", cv=5).mean()
            ),
            end=" ",
        )
        print([pi[0] for pi in p.steps])

xor_data = pd.read_csv("data/xor.csv")
check_pipelines(xor_data)
print()
circle_data = pd.read_csv("data/circle.csv")
check_pipelines(circle_data)
```
%% Output
0.608 ['standardscaler', 'logisticregression']
0.608 ['standardscaler', 'linearsvc']
0.964 ['polynomialfeatures', 'logisticregression']
0.962 ['polynomialfeatures', 'linearsvc']
0.970 ['polynomialfeatures', 'standardscaler', 'logisticregression']
0.966 ['polynomialfeatures', 'standardscaler', 'linearsvc']
0.757 ['standardscaler', 'logisticregression']
0.757 ['standardscaler', 'linearsvc']
0.987 ['polynomialfeatures', 'logisticregression']
0.983 ['polynomialfeatures', 'linearsvc']
0.977 ['polynomialfeatures', 'standardscaler', 'logisticregression']
0.983 ['polynomialfeatures', 'standardscaler', 'linearsvc']
%% Cell type:code id: tags:solution
``` python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures, StandardScaler
from sklearn.svm import SVC
# classifying digits
data_set = load_digits()
features = data_set.data
labels = data_set.target
for p in [
    make_pipeline(StandardScaler(), SVC()),
    make_pipeline(PCA(10), SVC()),
    make_pipeline(PCA(10), StandardScaler(), SVC()),
    make_pipeline(PCA(15), StandardScaler(), SVC()),
    make_pipeline(PCA(20), StandardScaler(), SVC()),
    make_pipeline(PCA(20), StandardScaler(), SVC()),
    make_pipeline(PCA(17), StandardScaler(), SVC(C=2)),
    make_pipeline(PCA(17), SVC(C=2)),
    make_pipeline(PCA(17), StandardScaler(), SVC(C=6)),
    make_pipeline(PCA(17), StandardScaler(), SVC(C=4)),
]:
    print(
        "{:.3f}".format(
            cross_val_score(p, features, labels, scoring="accuracy", cv=5).mean()
        ),
        end=" ",
    )
    print([pi[0] for pi in p.steps])

# split 80:20 with fixed randomization:
(features_train, features_test, labels_train, labels_test) = train_test_split(
    features, labels, test_size=0.2, random_state=42
)

p.fit(features_train, labels_train)
predicted = p.predict(features_test)
incorrect = np.where(predicted != labels_test)[0]
print(incorrect)

def show(indices):
    plt.figure(figsize=(10, 7))
    for i, idx in enumerate(indices):
        plt.subplot(1, len(indices), i + 1)
        img = features_test[idx].reshape(8, 8)
        plt.imshow(img, cmap="gray")
        plt.title("is {}, predicted as {}".format(labels_test[idx], predicted[idx]))

show(incorrect)

import random

correct = np.where(predicted == labels_test)[0]
# random.sample needs a list, not a numpy array
correct = random.sample(list(correct), 4)
show(correct)
```
%% Output
0.946 ['standardscaler', 'svc']
0.948 ['pca', 'svc']
0.947 ['pca', 'standardscaler', 'svc']
0.960 ['pca', 'standardscaler', 'svc']
0.958 ['pca', 'standardscaler', 'svc']
0.959 ['pca', 'standardscaler', 'svc']
0.963 ['pca', 'standardscaler', 'svc']
0.971 ['pca', 'svc']
0.962 ['pca', 'standardscaler', 'svc']
0.963 ['pca', 'standardscaler', 'svc']
[133 159 244 339]
%% Cell type:markdown id: tags:
<div class="alert alert-block alert-info"><p>
<i class="fa fa-info-circle"></i>&nbsp;
Up to now we've applied preprocessing only to the full features table. <strong>To preprocess single columns or a subset of them, e.g. to apply function transformers, to impute missing values, or to encode categorical columns, use the <code>ColumnTransformer</code> utility</strong>. A good overview is given in <a href="https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html">a tutorial on applying <code>ColumnTransformer</code>s to mixed-type columns</a>.
</p></div>
%% Cell type:markdown id: tags:
## Reusing pipelines: persist on disk
Learning and finding a well-performing pipeline can be time intensive. It makes sense to **store a pipeline and reuse it later for predictions**.
To that end, **use the standard library Python module `pickle` to serialize and deserialize a pipeline object** (or any other "serializable" Python object, including a classifier itself, or a hyperparameters search result object):
%% Cell type:code id: tags:
``` python
import os
import pickle
import tempfile
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
beer_data = pd.read_csv("data/beers.csv")
features = beer_data.iloc[:, :-1]
labels = beer_data.iloc[:, -1]
p = make_pipeline(PCA(3), StandardScaler(), SVC())
p.fit(features, labels)
print("SVC learnt intercept:\n", p.named_steps["svc"].intercept_)
# ".pkl" is the standard file extension for "pickled" objects
p_path = os.path.join(tempfile.mkdtemp(), "pipeline.pkl")
# open file to _w_rite _b_inary data
with open(p_path, "wb") as p_file:
    pickle.dump(p, p_file)
```
%% Output
SVC learnt intercept:
[-0.12318867]
%% Cell type:code id: tags:
``` python
# open file to _r_ead _b_inary data
with open(p_path, "rb") as p_file:
    p_loaded = pickle.load(p_file)

print("SVC learnt intercept (loaded):\n", p_loaded.named_steps["svc"].intercept_)
```
%% Output
SVC learnt intercept (loaded):
[-0.12318867]
%% Cell type:markdown id: tags:
### Caching pipeline data transformers
Another common use case is **reusing preprocessing steps that are already fitted to the data - the data transformers**. This is useful when trying out pipeline composition variations, or when optimizing hyperparameters of the classifier itself.
To reuse fitted data transformers (preprocessors, decompositions etc.), you can **cache the pipeline on disk using the `memory=...` argument of the pipeline constructor**.
Whenever you call `.fit(...)` with new inputs, the fitted pipeline transformers will be cached:
%% Cell type:code id: tags:
``` python
import glob
import os
import tempfile
from pprint import pprint
# Utility function to recursively get file names in a folder.
def get_fns_rec(path):
    for f in glob.glob(f"{path}/**/*.*", recursive=True):
        yield os.path.relpath(f, path)
tempdir = tempfile.mkdtemp()
print("Caching to directory:", tempdir)
# memory argument takes a path to a directory
# (or a `joblib.Memory` object that represents a directory)
p = make_pipeline(PCA(3), StandardScaler(), SVC(), memory=tempdir)
print("Files before fit:")
pprint(list(get_fns_rec(tempdir)))
print()
p.fit(features, labels)
print("Files after fit 1:")
pprint(list(get_fns_rec(tempdir)))
print()
```
%% Output
Caching to directory: /var/folders/tx/tp1pccjd5bzcrbbxkwcwt2v40000gn/T/tmpfimbtazm
Files before fit:
[]
Files after fit 1:
['joblib/sklearn/pipeline/_fit_transform_one/func_code.py',
'joblib/sklearn/pipeline/_fit_transform_one/edf51e4c11d2f9b854e54238406fb1b0/output.pkl',
'joblib/sklearn/pipeline/_fit_transform_one/9199153f0632c4c87f34ae75929e13e1/metadata.json',
'joblib/sklearn/pipeline/_fit_transform_one/c293d3d37bc228d7f937f75d09a83fb7/metadata.json',
'joblib/sklearn/pipeline/_fit_transform_one/26378566154f4c9bba5aa07414d6702e/output.pkl']
%% Cell type:markdown id: tags:
Each data transformer in our pipeline has a corresponding:
* "pickle" (binary `.pkl` file) containing the serialized transformer object, and
* metadata file (human-readable `.json` file) containing 1) a description of the transformer constructor and 2) the input data used for training - both of these identify a cached transformer.
The cache is extended when fitting with different input data and re-used when fitting with the same input data:
%% Cell type:code id: tags:
``` python
print("Files after fit 2:")
# alter the input data by dropping the last 10 samples
p.fit(features.iloc[:-10, :], labels.iloc[:-10])
pprint(list(get_fns_rec(tempdir)))
print()
# if inputs are the same - cache will be used
print("Files after fit 1 repeat:")
p.fit(features, labels)
pprint(list(get_fns_rec(tempdir)))
print()
```
%% Output
Files after fit 2:
['joblib/sklearn/pipeline/_fit_transform_one/func_code.py',
'joblib/sklearn/pipeline/_fit_transform_one/a87386a658c4ae56cb17fe42f15e2de9/metadata.json',
'joblib/sklearn/pipeline/_fit_transform_one/edf51e4c11d2f9b854e54238406fb1b0/output.pkl',
'joblib/sklearn/pipeline/_fit_transform_one/9199153f0632c4c87f34ae75929e13e1/metadata.json',
'joblib/sklearn/pipeline/_fit_transform_one/c293d3d37bc228d7f937f75d09a83fb7/metadata.json',
'joblib/sklearn/pipeline/_fit_transform_one/6a8e9a5081624368a9f55c58703a8e12/output.pkl',
'joblib/sklearn/pipeline/_fit_transform_one/7e8194c954970b433409f60b893977c3/output.pkl',
'joblib/sklearn/pipeline/_fit_transform_one/727184d08f9be55af24daabd650267e1/metadata.json',
'joblib/sklearn/pipeline/_fit_transform_one/26378566154f4c9bba5aa07414d6702e/output.pkl']
Files after fit 1 repeat:
['joblib/sklearn/pipeline/_fit_transform_one/func_code.py',
'joblib/sklearn/pipeline/_fit_transform_one/a87386a658c4ae56cb17fe42f15e2de9/metadata.json',
'joblib/sklearn/pipeline/_fit_transform_one/edf51e4c11d2f9b854e54238406fb1b0/output.pkl',
'joblib/sklearn/pipeline/_fit_transform_one/9199153f0632c4c87f34ae75929e13e1/metadata.json',
'joblib/sklearn/pipeline/_fit_transform_one/c293d3d37bc228d7f937f75d09a83fb7/metadata.json',
'joblib/sklearn/pipeline/_fit_transform_one/6a8e9a5081624368a9f55c58703a8e12/output.pkl',
'joblib/sklearn/pipeline/_fit_transform_one/7e8194c954970b433409f60b893977c3/output.pkl',
'joblib/sklearn/pipeline/_fit_transform_one/727184d08f9be55af24daabd650267e1/metadata.json',
'joblib/sklearn/pipeline/_fit_transform_one/26378566154f4c9bba5aa07414d6702e/output.pkl']
%% Cell type:markdown id: tags:
<div class="alert alert-block alert-info"><p>
<i class="fa fa-info-circle"></i>&nbsp;<strong>Beware</strong> that creating a pipeline with <code>memory=...</code> argument clones the transformer objects passed in a constructor. In turn, to inspect fitted transformers you must use <code>steps</code> or <code>named_steps</code> attribute of the pipeline object and not a previous explicit reference to a transformer used in pipeline (in particular, previous references won't be fitted after the pipeline was fitted).
</p></div>
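For instance, a minimal sketch of inspecting the fitted `PCA` step via `named_steps` (assuming the cached pipeline `p` fitted above is still in scope):
%% Cell type:code id: tags:
``` python
# access the fitted PCA step through the pipeline itself,
# not through an outside reference created before the fit:
print(p.named_steps["pca"].explained_variance_ratio_)
```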
%% Cell type:markdown id: tags:
## Hyperparameter optimization
Classifiers and pipelines have parameters which must be tuned to improve performance (e.g. `gamma` or `C`). Finding good parameters is also called *hyperparameter optimization*, to distinguish it from the optimization done during the learning phase of many classification algorithms.
<br/>
<div style="font-size: 120%;">Up to now we adapted such hyperparameters manually, but there are more systematic approaches!</div>
%% Cell type:markdown id: tags:
The simplest approach is to specify valid values for each parameter involved and then try out all possible combinations. This is called *grid search*:
%% Cell type:code id: tags:
``` python
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
beer_data = pd.read_csv("data/beers.csv")
features = beer_data.iloc[:, :-1]
labels = beer_data.iloc[:, -1]
svc = SVC()
# optimize parameters of one single classifier
# the keys in the dictionary match the argument names for SVC:
parameters = {"kernel": ("linear", "rbf", "poly"), "C": [1, 5, 10, 15]}
# run grid search, using CV to assess quality and to determine the best
# parameter set; tries all 3 x 4 = 12 combinations:
search = GridSearchCV(svc, parameters, cv=5)
search.fit(features, labels)
print(search.best_score_, search.best_params_)
```
%% Output
0.9733333333333334 {'C': 15, 'kernel': 'poly'}
%% Cell type:markdown id: tags:
Such an optimization can also be applied to a full pipeline:
%% Cell type:code id: tags:
``` python
import tempfile
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
# using temporary transformer caching for somewhat better speed
cache_dir = tempfile.mkdtemp()
p = make_pipeline(
PolynomialFeatures(),
StandardScaler(),
LogisticRegression(max_iter=10000),
memory=cache_dir,
)
```
%% Cell type:markdown id: tags:
The specification of the grid is now a bit more complicated, using the pattern `PROCESSOR__ARGUMENT`:
- first the name of the processor / classifier in lower case letters,
- then two underscores `__`,
- finally the name of the argument of the processor / classifier.
For example, `StandardScaler` has the parameters `with_mean` and `with_std`, which can be `True` or `False`:
%% Cell type:code id: tags:
``` python
param_grid = {
"polynomialfeatures__degree": [1, 2, 3, 4],
"standardscaler__with_mean": [True, False],
"standardscaler__with_std": [True, False],
"logisticregression__C": [0.01, 0.1, 1, 10, 100],
}
```
%% Cell type:markdown id: tags:
This grid has `4 x 2 x 2 x 5`, thus `80` points, so we must run cross-validation for 80 different classifiers.
To speed this up, we can specify `n_jobs=2` to use `2` processor cores to run the grid search in parallel (you might want to use more cores depending on your computer):
%% Cell type:code id: tags:
``` python
search = GridSearchCV(
p,
param_grid,
cv=4,
scoring="accuracy",
n_jobs=2,
)
search.fit(features, labels)
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)
```
%% Output
Best parameter (CV score=0.982):
{'logisticregression__C': 100, 'polynomialfeatures__degree': 2, 'standardscaler__with_mean': True, 'standardscaler__with_std': False}
%% Cell type:markdown id: tags:
If you have more complicated pipelines, you can also explicitly assign names to the steps, which can then be used in the parameter grid. However, you need to use the `Pipeline(...)` constructor directly to do so:
%% Cell type:code id: tags:
``` python
from sklearn.pipeline import Pipeline
p_names = Pipeline(
steps=[
("poly", PolynomialFeatures()),
("scale", StandardScaler()),
("clf", LogisticRegression(max_iter=10000)),
],
memory=cache_dir,
)
param_grid_short_names = {
"poly__degree": [1, 2, 3, 4],
"scale__with_mean": [True, False],
"scale__with_std": [True, False],
"clf__C": [0.01, 0.1, 1, 10, 100],
}
search = GridSearchCV(
p_names,
param_grid_short_names,
cv=4,
scoring="accuracy",
n_jobs=2,
)
# using a built-in notebook line magic to measure the fit time
%time search.fit(features, labels)
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)
```
%% Output
CPU times: user 211 ms, sys: 8.42 ms, total: 220 ms
Wall time: 12.8 s
Best parameter (CV score=0.982):
{'clf__C': 100, 'poly__degree': 2, 'scale__with_mean': True, 'scale__with_std': False}
%% Cell type:markdown id: tags:
These grid searches took some time. A less systematic but much more efficient approach to finding a well-performing classifier is `RandomizedSearchCV`.
In this case, instead of a whole grid, we specify random distributions for the parameters to optimize:
%% Cell type:code id: tags:
``` python
from scipy.stats import loguniform, randint
param_dist_short_names = {
"poly__degree": randint(1, 4), # random integer from 1 to 4
"scale__with_mean": [True, False], # random value from explicit set of values
"scale__with_std": [True, False],
"clf__C": loguniform(0.01, 100), # log random number from .01 to 100
}
```
%% Cell type:markdown id: tags:
We now run 30 search iterations:
%% Cell type:code id: tags:
``` python
from sklearn.model_selection import RandomizedSearchCV
search = RandomizedSearchCV(
p_names,
param_dist_short_names,
cv=4,
scoring="accuracy",
n_jobs=2,
n_iter=30,
    random_state=42,  # fix randomization for reproducibility
)
%time search.fit(features, labels)
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)
```
%% Output
CPU times: user 184 ms, sys: 7.77 ms, total: 192 ms
Wall time: 1.53 s
Best parameter (CV score=0.982):
{'clf__C': 75.79479953347995, 'poly__degree': 2, 'scale__with_mean': False, 'scale__with_std': False}
%% Cell type:markdown id: tags:
<div class="alert alert-block alert-warning">
<p><i class="fa fa-info-circle"></i>
Hyperparameter search methods also "behave" like a single classifier - they implement <code>.fit()</code> and <code>.predict()</code> methods (*).</p>
</div>
<div>
<p>(*) Prediction is done with the best parameters found. The underlying model or pipeline with the best parameters is available via the <code>.best_estimator_</code> property. Importantly, the <strong>refit with the best parameters is done at the end</strong> of the CV-based search, <strong>using the whole training data set</strong>.</p>
<p style="font-size: 80%;">The automatic refitting can be disabled by passing the <code>refit=False</code> argument when specifying the search method. Then neither <code>.predict()</code> nor <code>.best_estimator_</code> will be available.</p>
</div>
%% Cell type:code id: tags:
``` python
print("Best estimator:")
print(search.best_estimator_)
print()
print()
print("Training set accuracy:", sum(search.predict(features) == labels) / len(labels))
```
%% Output
Best estimator:
Pipeline(memory='/var/folders/tx/tp1pccjd5bzcrbbxkwcwt2v40000gn/T/tmp1b6guogr',
steps=[('poly', PolynomialFeatures()),
('scale', StandardScaler(with_mean=False, with_std=False)),
('clf',
LogisticRegression(C=75.79479953347995, max_iter=10000))])
Training set accuracy: 0.9955555555555555
%% Cell type:markdown id: tags:
The <code>sklearn.model_selection</code> module also includes drop-in replacements for both cross-validation searches that use the successive halving technique: `HalvingGridSearchCV` and `HalvingRandomSearchCV`. They discard the less promising hyperparameter candidates while increasing the number of training samples used at each iteration. Conceptually, the search starts very coarse-grained and becomes finer for the best candidates.
For a large hyperparameter space/grid the results are usually as accurate as for the standard searches, but are found in much less time:
%% Cell type:code id: tags:
``` python
# halving searches are still an experimental feature (scikit-learn 0.24)
# => explicitly enable it
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingGridSearchCV
search = HalvingGridSearchCV(
p_names,
param_grid_short_names,
cv=4,
scoring="f1",
n_jobs=2,
    factor=2,  # keep the best half of candidates each iteration; the default factor of 3 keeps a third
    random_state=42,  # fix randomization for reproducibility
)
%time search.fit(features, labels)
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)
```
%% Output
CPU times: user 486 ms, sys: 34.1 ms, total: 520 ms
Wall time: 7.32 s
Best parameter (CV score=0.993):
{'clf__C': 10, 'poly__degree': 4, 'scale__with_mean': True, 'scale__with_std': True}
%% Cell type:markdown id: tags:
## Exercise section
1. Try to find good parameters for the following two pipelines applied to the beer data set. Use grid search for the first one and randomized search for the second one. Cache the data transformers while doing the grid search. Does it pay off to cache the data transformers during the randomized search? Save the search results to disk.
`make_pipeline(StandardScaler(), SVC(gamma=..., C=...), memory=...)`
`make_pipeline(StandardScaler(), PolynomialFeatures(degree=..), PCA(n_components=...), LinearSVC())`
%% Cell type:code id: tags:solution
``` python
import os
import pickle
import tempfile
cache_dir = tempfile.mkdtemp()
beer_data = pd.read_csv("data/beers.csv")
features = beer_data.iloc[:, :-1]
labels = beer_data.iloc[:, -1]
p = make_pipeline(StandardScaler(), SVC(), memory=cache_dir)
param_grid = {
"standardscaler__with_mean": [True, False],
"standardscaler__with_std": [True, False],
"svc__C": [1, 10, 15, 20, 25],
"svc__gamma": [0.01, 0.05, 0.1, 0.5, 1],
}
search = GridSearchCV(p, param_grid, cv=5, scoring="accuracy", n_jobs=5)
search.fit(features, labels)
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)
search_path = os.path.join(tempdir, "search_1.pkl")
with open(search_path, "wb") as search_file:
    pickle.dump(search, search_file)
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import LinearSVC
p = make_pipeline(StandardScaler(), PolynomialFeatures(), PCA(), LinearSVC())
param_grid = {
"polynomialfeatures__degree": randint(2, 5),
"pca__n_components": randint(4, 15),
}
# Note: not using cache w/ randomized search for this pipeline.
# Only the fast StandardScaler would benefit from caching;
# overhead for caching all transformers w/ random values is bigger.
search = RandomizedSearchCV(
p,
param_grid,
cv=5,
scoring="accuracy",
n_jobs=5,
    random_state=42,  # fix randomization for reproducibility
)
search.fit(features, labels)
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)
search_path = os.path.join(tempdir, "search_2.pkl")
with open(search_path, "wb") as search_file:
    pickle.dump(search, search_file)
```
%% Output
Best parameter (CV score=0.969):
{'standardscaler__with_mean': True, 'standardscaler__with_std': True, 'svc__C': 15, 'svc__gamma': 0.1}
Best parameter (CV score=0.978):
{'pca__n_components': 10, 'polynomialfeatures__degree': 2}
%% Cell type:markdown id: tags:
### Optional exercises
Using randomized search, optimize configurations for:
- the spiral data set
- the digits data set
%% Cell type:code id: tags:solution
``` python
from scipy.stats import uniform
data = pd.read_csv("data/spiral.csv")
features = data.iloc[:, :-1]
labels = data.iloc[:, -1]
import matplotlib.pyplot as plt

plt.figure(figsize=(5, 5))
plt.scatter(features.iloc[:, 0], features.iloc[:, 1], color=["rb"[l] for l in labels])
clf = SVC()
param_grid = {
"C": uniform(0.1, 70),
"gamma": uniform(0.01, 10),
}
search = RandomizedSearchCV(
clf,
param_grid,
cv=5,
scoring="accuracy",
n_jobs=5,
    random_state=42,  # fix randomization for reproducibility
)
search.fit(features, labels)
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)
```
%% Output
Best parameter (CV score=0.920):
{'C': 12.827747704497042, 'gamma': 1.8440450985343382}
%% Cell type:code id: tags:solution
``` python
import random
from scipy.stats import randint, uniform
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures, StandardScaler
from sklearn.svm import SVC
random.seed(42)
# classifying digits
data_set = load_digits()
features = data_set.data
labels = data_set.target
param_grid = {
"pca__n_components": randint(2, 30),
"svc__C": uniform(0.1, 50),
"svc__gamma": uniform(0.01, 0.5),
}
p = make_pipeline(PCA(), StandardScaler(), SVC())
search = RandomizedSearchCV(
p,
param_grid,
cv=5,
scoring="accuracy",
n_jobs=5,
n_iter=20,
    random_state=42,  # fix randomization for reproducibility
)
search.fit(features, labels)
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)
```
%% Output
Best parameter (CV score=0.966):
{'pca__n_components': 22, 'svc__C': 22.62496259847715, 'svc__gamma': 0.016632480579933266}
%% Cell type:markdown id: tags:
Copyright (C) 2019-2021 ETH Zurich, SIS ID