%% Cell type:code id: tags:
``` python
# IGNORE THIS CELL WHICH CUSTOMIZES LAYOUT AND STYLING OF THE NOTEBOOK !
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings = lambda *a, **kw: None  # make any further filterwarnings calls no-ops
from IPython.core.display import HTML; HTML(open("custom.html", "r").read())
```
%% Output
<IPython.core.display.HTML object>
%% Cell type:markdown id: tags:
# Chapter 6: Preprocessing, pipelines and hyperparameter optimization
%% Cell type:markdown id: tags:
## About transformations / preprocessing
%% Cell type:markdown id: tags:
We've seen before that adding polynomial features to the 2D `xor` and `circle` problems made both tasks treatable by a simple linear classifier.
Comment: we use *transformation* and *preprocessing* interchangeably.
Beyond adding polynomial features, there are other important preprocessors / transformers to mention:
### Scaler
A scaler applies a linear transformation to every feature. These transformations are determined individually per column.
The two most important ones in `scikit-learn` are
- `MinMaxScaler`: after applying this scaler, the minimum in every column is 0 and the maximum is 1.
- `StandardScaler`: scales columns to mean value 0 and standard deviation 1.
The reason to use a scaler is to compensate for different orders of magnitude of the features. Some classifiers, like `SVC` and `KNeighborsClassifier`, internally use Euclidean distances between samples, which would put more weight on features having large values. So **don't forget to scale your features when using `SVC` or `KNeighborsClassifier`**!
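A quick sketch of `StandardScaler` on a small random matrix (not part of the original text):
%% Cell type:code id: tags:
``` python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.random.random((100, 3)) * 20 - 10
X_scaled = StandardScaler().fit_transform(X)

# every column now has mean 0 and standard deviation 1 (up to rounding):
print("column means:", X_scaled.mean(axis=0).round(12))
print("column stds: ", X_scaled.std(axis=0).round(12))
```
%% Cell type:markdown id: tags: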
### PCA
Principal component analysis (PCA) is a technique to reduce the dimensionality of a multivariate data set. One benefit of PCA is that it removes redundancy from your data set, such as correlated columns or linear dependencies between columns.
We've discussed before that reducing redundancy and noise can help to avoid overfitting.
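A minimal sketch (with a synthetic data set containing a redundant third column; not part of the original text):
%% Cell type:code id: tags:
``` python
import numpy as np
from sklearn.decomposition import PCA

np.random.seed(0)
X = np.random.random((100, 2))
# append a redundant third column which is a linear combination of the first two:
X = np.hstack([X, (X[:, 0] + 2 * X[:, 1]).reshape(-1, 1)])

pca = PCA(n_components=2).fit(X)
# two components capture (almost) all variance of the redundant 3-column data:
print("explained variance ratios:", pca.explained_variance_ratio_)
print("reduced shape:", pca.transform(X).shape)
```
%% Cell type:markdown id: tags: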
### Function transformers
It can help to apply functions like `log`, `exp` or `1/x` to features to improve classification performance.
Let's assume you want to forecast the outcome of car crash experiments and one variable is the time $t$ needed for the distance $l$ from start to crash. Transforming this to the actual speed $\frac{l}{t}$ could be a more informative feature than $t$.
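In `scikit-learn`, such a transformation can be wrapped in a `FunctionTransformer` so that it fits the usual API (a minimal sketch with an assumed strictly positive toy matrix; not part of the original text):
%% Cell type:code id: tags:
``` python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

X = np.array([[1.0, 10.0],
              [2.0, 100.0],
              [3.0, 1000.0]])

log_transformer = FunctionTransformer(np.log)  # applies np.log element-wise
print(log_transformer.fit_transform(X))
```
%% Cell type:markdown id: tags: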
### Imputing missing values
Sometimes data contains missing values. Data imputation is a strategy to fill in missing values, e.g. with the column-wise mean or by applying another strategy.
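A minimal sketch using `scikit-learn`'s `SimpleImputer` to fill missing values with the column-wise mean (the tiny example matrix is an assumption; not part of the original text):
%% Cell type:code id: tags:
``` python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

imputer = SimpleImputer(strategy="mean")  # replace each NaN with its column's mean
print(imputer.fit_transform(X))
```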
%% Cell type:markdown id: tags:
## About scaling
%% Cell type:markdown id: tags:
As an example we demonstrate how a scaler can be implemented. Our scaling strategy will scale given values to the range 0 to 1.
First we create some random values and compute their min and max:
%% Cell type:code id: tags:
``` python
import numpy as np
# for reproducible numbers:
np.random.seed(42)
values = np.random.random((5,)) * 20 - 10
min_value = values.min()
max_value = values.max()
print("values:", values)
print()
print("min value:", min_value)
print("max value:", max_value)
```
%% Output
values: [-2.50919762 9.01428613 4.63987884 1.97316968 -6.87962719]
min value: -6.87962719115127
max value: 9.014286128198322
%% Cell type:markdown id: tags:
The strategy for scaling is as follows: Our values $v$ are in the range $v_{min}$ to $v_{max}$:
$$
v_{min} \le v \le v_{max}
$$
Then subtracting $v_{min}$ results in
$$
0 \le v - v_{min} \le v_{max} - v_{min}
$$
Finally, dividing by the right-hand side delivers the property we are looking for:
$$
0 \le \frac{v - v_{min}}{v_{max} - v_{min}} \le 1
$$
In Python:
%% Cell type:code id: tags:
``` python
scaled_values = (values - min_value) / (max_value - min_value)
print("scaled values:", scaled_values)
```
%% Output
scaled values: [0.27497505 1. 0.72477469 0.5569929 0. ]
%% Cell type:markdown id: tags:
You can see that all values are now scaled as intended.
To apply the same strategy column by column to a feature matrix, `scikit-learn` offers the `MinMaxScaler`:
%% Cell type:code id: tags:
``` python
features = np.random.random((5, 3)) * 20 - 10
print(features)
```
%% Output
[[-6.88010959 -8.83832776 7.32352292]
[ 2.02230023 4.16145156 -9.58831011]
[ 9.39819704 6.64885282 -5.75321779]
[-6.36350066 -6.3319098 -3.91515514]
[ 0.49512863 -1.36109963 -4.1754172 ]]
%% Cell type:code id: tags:
``` python
from sklearn.preprocessing import MinMaxScaler
# learning -> determine columnwise min/max values
scaler = MinMaxScaler().fit(features)
# transformation ! -> apply linear transformation based on min/max values:
print(scaler.transform(features))
```
%% Output
[[0. 0. 1. ]
[0.54688796 0.83938966 0. ]
[1. 1. 0.22676976]
[0.03173604 0.16183823 0.33545476]
[0.45307159 0.48280112 0.32006542]]
%% Cell type:code id: tags:
``` python
# shorter: .fit_transform does both steps in one call
print(scaler.fit_transform(features))
```
%% Output
[[0. 0. 1. ]
[0.54688796 0.83938966 0. ]
[1. 1. 0.22676976]
[0.03173604 0.16183823 0.33545476]
[0.45307159 0.48280112 0.32006542]]
%% Cell type:markdown id: tags:
We can divide preprocessing into two classes:
1. Preprocessing which depends on the full data set. E.g.
- Scaling
- PCA
- Many variants for imputation of missing values
2. Preprocessing which can be applied row per row individually. E.g.
- Adding polynomial features
- Functional transforms
- Row-wise scaling (e.g. when a row represents an image and we want to compensate for different illumination).
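An example of the second class is `scikit-learn`'s `Normalizer`, which scales every row individually to unit norm (a quick sketch, not part of the original text):
%% Cell type:code id: tags:
``` python
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[1.0, 2.0, 2.0],
              [10.0, 20.0, 20.0]])

# each row is scaled independently of all other rows:
print(Normalizer(norm="l2").fit_transform(X))
```
%% Cell type:markdown id: tags: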
<div class="alert alert-block alert-warning">
<h3><i class="fa fa-info-circle"></i>&nbsp;Important</h3>
When we include preprocessing in a classification approach, we must later **apply exactly the same preprocessing to new incoming data**!
For preprocessors which depend on the full data set, this implies that we must never preprocess data before cross-validation!
Running such preprocessors on the full data set lets information from "unseen" data sneak into the classifier.
</div>
%% Cell type:markdown id: tags:
### This is how we must proceed instead:
For the `MinMaxScaler` preprocessor:
1. Determine column-wise minimum and maximum values of the training features.
2. Use these min/max values to scale the training data.
3. Learn classifier `C` on the scaled training data.
4. Use the values from step 1 to scale the evaluation data (thus, we might create values outside `0..1`).
5. Apply classifier `C` to the scaled evaluation data.
6. Assess `C` performance.
In general:
1. Learn preprocessor `P` on the training data.
2. Apply `P` to the training data.
3. Learn classifier `C` on the preprocessed training data.
4. Apply `P` from step 1 to the evaluation data.
5. Apply classifier `C` to the preprocessed evaluation data.
6. Assess `C` performance.
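These general steps look as follows in code (a minimal sketch with synthetic data, since the beer data set is only loaded further below; the split and classifier are illustrative assumptions):
%% Cell type:code id: tags:
``` python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X = np.random.random((100, 3)) * 20 - 10    # synthetic features
y = (X[:, 0] + X[:, 1] > 0).astype(int)     # synthetic labels

X_train, X_eval, y_train, y_eval = train_test_split(X, y, random_state=0)

scaler = MinMaxScaler().fit(X_train)        # 1. learn P on the training data only
X_train_scaled = scaler.transform(X_train)  # 2. apply P to the training data
clf = SVC().fit(X_train_scaled, y_train)    # 3. learn C on the preprocessed training data
X_eval_scaled = scaler.transform(X_eval)    # 4. apply the same P to the evaluation data
print("accuracy:", clf.score(X_eval_scaled, y_eval))  # 5./6. apply C and assess performance
```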
%% Cell type:markdown id: tags:
<img src="https://i.imgflip.com/2xi5wt.jpg" width=50%/>
%% Cell type:markdown id: tags:
## The scikit-learn API (quick recap)
We've seen before that we can swap `scikit-learn` classifiers easily without changing much code.
This is possible because all classifiers have methods `.fit` and `.predict` with the same function signature (this means the number and meaning of arguments is the same for every implementation of `.fit` and `.predict`, respectively).
This consistent design within `scikit-learn` also applies to preprocessors / transformers, which all have methods `.fit`, `.transform` and `.fit_transform`.
This consistent API allows setting up **processing pipelines**:
%% Cell type:markdown id: tags:
## Pipelines
A so-called classification pipeline consists of zero or more preprocessors plus a final classifier.
Let us start with the following pipeline:
1. Use PCA to reduce data to 3 dimensions.
2. Apply scaling to mean 0 and standard deviation 1.
3. Train an `SVC` classifier.
%% Cell type:code id: tags:
``` python
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
p = make_pipeline(PCA(3), StandardScaler(), SVC())
```
%% Cell type:markdown id: tags:
Such a pipeline now "behaves" like a single classifier, as it implements `.fit` and `.predict`:
%% Cell type:code id: tags:
``` python
print("p.fit ", p.fit is not None)
print("p.predict", p.predict is not None)
```
%% Output
p.fit True
p.predict True
%% Cell type:markdown id: tags:
Because of this we can also use cross-validation in the same way as we did before:
%% Cell type:code id: tags:
``` python
import pandas as pd
beer_data = pd.read_csv("beers.csv")
features = beer_data.iloc[:, :-1]
labels = beer_data.iloc[:, -1]
from sklearn.model_selection import cross_val_score
print(cross_val_score(p, features, labels, scoring="accuracy", cv=5).mean())
```
%% Output
0.9330127360562145
%% Cell type:markdown id: tags:
<div class="alert alert-block alert-warning">
<i class="fa fa-info-circle"></i>&nbsp;One benefit of using a pipeline is that you will not mistakenly scale the full data set first; instead, the strategy we've described above is followed automatically.
</div>
%% Cell type:markdown id: tags:
### How to set up a good pipeline?
Regrettably, there is no recipe for setting up a well-performing classification pipeline beyond reasonable preprocessing, especially feature engineering. After that, it is up to experimentation and the advice on choosing classifiers we gave in the last script.
Let us try out different pipelines and evaluate them:
%% Cell type:code id: tags:
``` python
from sklearn.preprocessing import StandardScaler, MinMaxScaler, PolynomialFeatures
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
for p in [
    make_pipeline(KNeighborsClassifier()),
    make_pipeline(StandardScaler(), KNeighborsClassifier()),
    make_pipeline(SVC()),
    make_pipeline(StandardScaler(), SVC()),
    make_pipeline(MinMaxScaler(), SVC()),
    make_pipeline(LogisticRegression()),
    make_pipeline(StandardScaler(), PCA(3), LogisticRegression()),
    make_pipeline(StandardScaler(), PCA(2), LogisticRegression()),
    make_pipeline(PolynomialFeatures(), SVC()),
    make_pipeline(PolynomialFeatures(), LogisticRegression()),
]:
    print("{:.3f}".format(cross_val_score(p, features, labels, scoring="accuracy", cv=5).mean()), end=" ")
    print([pi[0] for pi in p.steps])
```
%% Output
0.844 ['kneighborsclassifier']
0.937 ['standardscaler', 'kneighborsclassifier']
0.863 ['svc']
0.947 ['standardscaler', 'svc']
0.915 ['minmaxscaler', 'svc']
0.804 ['logisticregression']
0.924 ['standardscaler', 'pca', 'logisticregression']
0.893 ['standardscaler', 'pca', 'logisticregression']
0.840 ['polynomialfeatures', 'svc']
0.925 ['polynomialfeatures', 'logisticregression']
%% Cell type:markdown id: tags:
## Exercise session:
1. Can you come up with a better performing classification pipeline?
%% Cell type:code id: tags:solution
``` python
# SOLUTION
for p in [
    make_pipeline(StandardScaler(), SVC()),                 # previously best pipeline
    make_pipeline(StandardScaler(), SVC(C=25, gamma=.05)),  # better!
]:
    print("{:.3f}".format(cross_val_score(p, features, labels, scoring="accuracy", cv=5).mean()), end=" ")
    print([pi[0] for pi in p.steps])
```
%% Output
0.947 ['standardscaler', 'svc']
0.978 ['standardscaler', 'svc']
%% Cell type:markdown id: tags:
### (*) Optional exercise:
Build classification pipelines to classify the 2D xor and circle data sets with linear classifiers. Also assess their performance.
%% Cell type:code id: tags:solution
``` python
# SOLUTION
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

def check_pipelines(data):
    features = data.iloc[:, :-1]
    labels = data.iloc[:, -1]
    for p in [
        make_pipeline(StandardScaler(), LogisticRegression()),
        make_pipeline(StandardScaler(), LinearSVC()),
        make_pipeline(PolynomialFeatures(2), LogisticRegression()),
        make_pipeline(PolynomialFeatures(2), LinearSVC()),
        make_pipeline(PolynomialFeatures(4), StandardScaler(), LogisticRegression()),
        make_pipeline(PolynomialFeatures(4), StandardScaler(), LinearSVC()),
    ]:
        print("{:.3f}".format(cross_val_score(p, features, labels, scoring="accuracy", cv=5).mean()), end=" ")
        print([pi[0] for pi in p.steps])

xor_data = pd.read_csv("xor.csv")
check_pipelines(xor_data)
print()
circle_data = pd.read_csv("2d_points.csv")
check_pipelines(circle_data)
```
%% Output
0.616 ['standardscaler', 'logisticregression']
0.616 ['standardscaler', 'linearsvc']
0.964 ['polynomialfeatures', 'logisticregression']
0.962 ['polynomialfeatures', 'linearsvc']
0.968 ['polynomialfeatures', 'standardscaler', 'logisticregression']
0.966 ['polynomialfeatures', 'standardscaler', 'linearsvc']
0.757 ['standardscaler', 'logisticregression']
0.757 ['standardscaler', 'linearsvc']
0.980 ['polynomialfeatures', 'logisticregression']
0.977 ['polynomialfeatures', 'linearsvc']
0.980 ['polynomialfeatures', 'standardscaler', 'logisticregression']
0.987 ['polynomialfeatures', 'standardscaler', 'linearsvc']
%% Cell type:markdown id: tags:
<div class="alert alert-block alert-info">
<i class="fa fa-info-circle"></i>&nbsp;Up to now we applied preprocessing to the full feature table. `scikit-learn` also allows preprocessing of single columns or a subset of them. The concept in `scikit-learn` is called `ColumnTransformer`; more about it
[can be found here](https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html).
</div>
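A minimal sketch of a `ColumnTransformer` (the toy `DataFrame` is a hypothetical example; not part of the original text):
%% Cell type:code id: tags:
``` python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({"a": [1.0, 2.0, 3.0],
                   "b": [10.0, 20.0, 30.0],
                   "c": [0.1, 0.4, 0.2]})

# scale column "a" to mean 0 / std 1, and columns "b" and "c" to the range 0..1:
ct = ColumnTransformer([("standardize", StandardScaler(), ["a"]),
                        ("minmax", MinMaxScaler(), ["b", "c"])])
print(ct.fit_transform(df))
```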
%% Cell type:markdown id: tags:
## Hyperparameter optimization
Classifiers and pipelines have parameters which must be adapted to improve performance (e.g. `gamma` or `C`). Finding good parameters is also called *hyperparameter optimization*, to distinguish it from the optimization done during the learning phase of many classification algorithms.
### Up to now we adapted such hyperparameters manually, but there are more systematic approaches!
<img src="https://i.imgflip.com/3040hg.jpg" title="made at imgflip.com" width=50%/>
%% Cell type:markdown id: tags:
The simplest approach is to specify valid values for each parameter involved and then try out all possible combinations. This is called *grid search*:
%% Cell type:code id: tags:
``` python
from sklearn.model_selection import GridSearchCV
# optimize parameters of one single classifier
parameters = {'kernel': ('linear', 'rbf', 'poly'),
              'C': [1, 5, 10, 15],
              }
svc = SVC()
# run grid search; use CV to assess quality and determine the best parameter set.
# this tries all 3 x 4 = 12 combinations:
search = GridSearchCV(svc, parameters, cv=5)
search.fit(features, labels)
print(search.best_score_, search.best_params_)
```
%% Output
0.9822222222222222 {'C': 5, 'kernel': 'poly'}
%% Cell type:markdown id: tags:
Such an optimization can also be applied to a full pipeline:
%% Cell type:code id: tags:
``` python
p = make_pipeline(PolynomialFeatures(), StandardScaler(), LogisticRegression())
```
%% Cell type:markdown id: tags:
The specification of the grid is now a bit more complicated. Parameter names follow the pattern `PROCESSOR__ARGUMENT`:
- first the name of the processor / classifier in lower case letters,
- then two underscores `__`,
- finally the name of the argument of the processor / classifier.
`StandardScaler` e.g. has parameters `with_mean` and `with_std` which can be `True` or `False`:
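If you are unsure which names are valid, the pipeline's `get_params()` method lists them all (a quick sketch, not part of the original text):
%% Cell type:code id: tags:
``` python
# all tunable parameters of the pipeline p follow the PROCESSOR__ARGUMENT pattern:
print([name for name in p.get_params() if "__" in name])
```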
%% Cell type:code id: tags:
``` python
param_grid = {'polynomialfeatures__degree': [1, 2, 3, 4],
              'standardscaler__with_mean': [True, False],
              'standardscaler__with_std': [True, False],
              'logisticregression__C': [.1, .5, 1, 10, 20],
              }
```
%% Cell type:markdown id: tags:
This grid has `4 x 2 x 2 x 5 = 80` points, so we must run cross-validation for 80 different classifiers.
To speed this up, we can specify `n_jobs=2` to use `2` processor cores to run the grid search in parallel (you might want to use more cores depending on your computer):
%% Cell type:code id: tags:
``` python
search = GridSearchCV(p, param_grid, cv=4, scoring="accuracy", n_jobs=2)
search.fit(features, labels)
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)
```
%% Output
Best parameter (CV score=0.982):
{'logisticregression__C': 10, 'polynomialfeatures__degree': 3, 'standardscaler__with_mean': True, 'standardscaler__with_std': True}
%% Cell type:markdown id: tags:
A more efficient approach is `RandomizedSearchCV`.
In this case we can also specify random distributions for the parameters to optimize:
%% Cell type:code id: tags:
``` python
from scipy.stats import uniform, randint
param_dist = {'polynomialfeatures__degree': randint(1, 4),  # random integer from 1 to 3 (upper bound is exclusive)
              'standardscaler__with_mean': [True, False],   # random value from an explicit set of values
              'standardscaler__with_std': [True, False],
              'logisticregression__C': uniform(.1, 20),     # random number from .1 to 20.1 (loc=.1, scale=20)
              }
```
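%% Cell type:markdown id: tags:
These `scipy.stats` objects are "frozen" distributions from which `RandomizedSearchCV` draws candidate values. A quick sketch of such draws, illustrating the bounds (not part of the original text):
%% Cell type:code id: tags:
``` python
# randint(1, 4) draws integers 1..3 (the upper bound is exclusive):
print(randint(1, 4).rvs(size=10, random_state=0))
# uniform(.1, 20) draws floats in [0.1, 20.1] (loc=0.1, scale=20):
print(uniform(.1, 20).rvs(size=3, random_state=0))
```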
%% Cell type:markdown id: tags:
We now run 30 iterations of the randomized search:
%% Cell type:code id: tags:
``` python
from sklearn.model_selection import RandomizedSearchCV
search = RandomizedSearchCV(p, param_dist, cv=4, n_jobs=2, n_iter=30)
search.fit(features, labels)
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)
```
%% Output
Best parameter (CV score=0.982):
{'logisticregression__C': 4.675963309832449, 'polynomialfeatures__degree': 3, 'standardscaler__with_mean': True, 'standardscaler__with_std': True}
%% Cell type:markdown id: tags:
## Exercise section 2
1. Try to find good parameters for the following two pipelines applied to the beer data set. Use grid search as well as randomized search for both.
`make_pipeline(StandardScaler(), SVC(gamma=..., C=...))`
`make_pipeline(StandardScaler(), PolynomialFeatures(degree=..), PCA(n_components=...), LinearSVC())`
%% Cell type:code id: tags:solution
``` python
beer_data = pd.read_csv("beers.csv")
features = beer_data.iloc[:, :-1]
labels = beer_data.iloc[:, -1]
p = make_pipeline(StandardScaler(), SVC())
param_grid = {
    'standardscaler__with_mean': [True, False],
    'standardscaler__with_std': [True, False],
    'svc__C': [1, 10, 15, 20, 25],
    'svc__gamma': [.01, .05, .1, .5, 1],
}
search = GridSearchCV(p, param_grid, cv=5, scoring="accuracy", n_jobs=5)
search.fit(features, labels)
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)
from sklearn.svm import LinearSVC
p = make_pipeline(StandardScaler(), PolynomialFeatures(), PCA(), LinearSVC())
param_grid = {
    'polynomialfeatures__degree': [2, 3, 4],
    'pca__n_components': [4, 6, 8, 10, 12],
}
search = GridSearchCV(p, param_grid, cv=5, scoring="accuracy", n_jobs=5)
search.fit(features, labels)
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)
```
%% Output
Best parameter (CV score=0.978):
{'standardscaler__with_mean': True, 'standardscaler__with_std': True, 'svc__C': 15, 'svc__gamma': 0.1}
Best parameter (CV score=0.978):
{'pca__n_components': 10, 'polynomialfeatures__degree': 2}
/Users/uweschmitt/Projects/machinelearning-introduction-workshop/venv37/lib/python3.7/site-packages/sklearn/svm/base.py:931: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
"the number of iterations.", ConvergenceWarning)
%% Cell type:markdown id: tags:
Copyright (C) 2019 ETH Zurich, SIS ID