Ghost User · 34eb55da
--- a/06_classifiers_overview-part_2.ipynb

+ 114

− 30
+++ b/06_classifiers_overview-part_2.ipynb

 %% Cell type:code id: tags:

 ``` python
 # IGNORE THIS CELL WHICH CUSTOMIZES LAYOUT AND STYLING OF THE NOTEBOOK !
 %matplotlib inline
 # `sklearn.tree.plot_tree` does not work well with "retina" backend - use "svg" instead
 %config InlineBackend.figure_format = "svg"
 import warnings

 import matplotlib.pyplot as plt

 warnings.filterwarnings("ignore", category=FutureWarning)
 warnings.filterwarnings = lambda *a, **kw: None
 from IPython.display import HTML

 HTML(open("custom.html", "r").read())
 ```

 %% Cell type:markdown id: tags:

 # Chapter 6: An overview of classifiers, Part 2

 <span style="font-size: 150%;">Decision trees, ensemble methods and summary</span>

 %% Cell type:markdown id: tags:

 Let's repeat our helper functions from the previous part:

 %% Cell type:code id: tags:

 ``` python
 import matplotlib
 import matplotlib.pyplot as plt
 import numpy as np


 def samples_color(ilabels, colors=["steelblue", "chocolate"]):
    """Return colors list from labels list given as indices."""
    return [colors[int(i)] for i in ilabels]


 def plot_decision_surface(
    features_2d,
    labels,
    classifier,
    preprocessing=None,
    plt=plt,
    marker=".",
    N=100,
    alpha=0.2,
    colors=["steelblue", "chocolate"],
    title=None,
    test_features_2d=None,
    test_labels=None,
    test_s=60,
 ):
    """Plot a 2D decision surface for an already trained classifier."""

    # sanity check
    assert len(features_2d.columns) == 2

    # pandas to numpy array; get min/max values
    xy = np.array(features_2d)
    min_x, min_y = xy.min(axis=0)
    max_x, max_y = xy.max(axis=0)

    # create mesh of NxN points; tech: `N*1j` is spec for including max value
    XX, YY = np.mgrid[min_x : max_x : N * 1j, min_y : max_y : N * 1j]
    points = np.c_[XX.ravel(), YY.ravel()]  # shape: (N*N)x2

    # apply scikit-learn API preprocessing
    if preprocessing is not None:
        points = preprocessing.transform(points)

    # classify grid points
    classes = classifier.predict(points)

    # plot classes color mesh
    ZZ = classes.reshape(XX.shape)  # shape: NxN
    plt.pcolormesh(
        XX,
        YY,
        ZZ,
        alpha=alpha,
        cmap=matplotlib.colors.ListedColormap(colors),
        shading="auto",
    )
    # plot points
    plt.scatter(
        xy[:, 0],
        xy[:, 1],
        marker=marker,
        color=samples_color(labels, colors=colors),
    )
    # set title
    if title:
        if hasattr(plt, "set_title"):
            plt.set_title(title)
        else:
            plt.title(title)
    # plot test points
    if test_features_2d is not None:
        assert test_labels is not None
        assert len(test_features_2d.columns) == 2
        test_xy = np.array(test_features_2d)
        plt.scatter(
            test_xy[:, 0],
            test_xy[:, 1],
            s=test_s,
            facecolors="none",
            linewidths=2,
            color=samples_color(test_labels, colors=colors),
        );
 ```

 %% Cell type:markdown id: tags:

 ## Decision trees

 Let's see what a decision tree is by looking at an (artificial) example:

 <table>
    <tr><td><img src="./images/decision_tree-work.png" width=600px></td></tr>
 </table>

 %% Cell type:markdown id: tags:

 ### How are the decision tree splits selected?

 Starting from the top, the decision tree is built by selecting the **best split of the dataset using a single feature**. The best feature and its split value are the ones that make the resulting **subsets more pure** in terms of the variety of classes they contain (i.e. that minimize the misclassification error, the Gini index/impurity, or the entropy; minimizing the entropy is the same as maximizing the information gain).

 <table>
    <tr><td><img src="./images/decision_tree-split.png" width=600px></td></tr>
 </table>

 Features can repeat within a sub-tree (and there is no way to control it in scikit-learn), but usually categorical features appear at most once on each path. They do, however, repeat across different tree branches.
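
To make the split selection concrete, here is a minimal sketch of how a single split could be chosen using the Gini impurity. The helper names `gini_impurity` and `best_split` and the toy data are made up for illustration; this is not scikit-learn's actual implementation, which is considerably more optimized.

```python
import numpy as np


def gini_impurity(labels):
    """Gini impurity 1 - sum_k p_k^2 of a set of class labels."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions**2)


def best_split(feature, labels):
    """Return (threshold, weighted impurity) of the best split on one feature."""
    order = np.argsort(feature)
    feature, labels = feature[order], labels[order]
    best_threshold, best_impurity = None, np.inf
    # candidate thresholds: midpoints between consecutive distinct values
    for i in range(1, len(feature)):
        if feature[i] == feature[i - 1]:
            continue
        threshold = (feature[i] + feature[i - 1]) / 2
        left, right = labels[:i], labels[i:]
        # impurity of the two subsets, weighted by their sizes
        impurity = (
            len(left) * gini_impurity(left) + len(right) * gini_impurity(right)
        ) / len(labels)
        if impurity < best_impurity:
            best_threshold, best_impurity = threshold, impurity
    return best_threshold, best_impurity


# toy data (hypothetical): class 1 for feature values above 2.5
toy_feature = np.array([1.0, 2.0, 3.0, 4.0])
toy_labels = np.array([0, 0, 1, 1])
toy_threshold, toy_impurity = best_split(toy_feature, toy_labels)
print(f"best threshold: {toy_threshold}, weighted Gini impurity: {toy_impurity}")
```

A full tree learner would repeat this search over all features at every node and recurse into the two resulting subsets.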

 %% Cell type:markdown id: tags:

 ### XOR decision tree

 Let's try out decision trees with the XOR dataset, in which samples have class `True` when the two coordinates `x` and `y` have different signs, otherwise they have class `False`.

 %% Cell type:code id: tags:

 ``` python
 import pandas as pd


 df = pd.read_csv("data/xor.csv")
 features_2d = df.loc[:, ("x", "y")]
 labelv = df["label"]

 plt.figure(figsize=(5, 5))
 plt.xlabel("x")
 plt.ylabel("y")
 plt.title("Orange is True, blue is False")
 plt.scatter(features_2d.iloc[:, 0], features_2d.iloc[:, 1], color=samples_color(labelv));
 ```

 %% Cell type:markdown id: tags:

 Decision trees live in the `sklearn.tree` module.

 %% Cell type:code id: tags:

 ``` python
 from sklearn.model_selection import train_test_split
 from sklearn.tree import DecisionTreeClassifier


 # Note: split randomness picked manually for educational purpose
 X_train, X_test, y_train, y_test = train_test_split(
    features_2d, labelv, random_state=10
 )

 # Note: features are permuted randomly in case equally good splits are found;
 # fix the randomization for reproducibility
 classifier = DecisionTreeClassifier(random_state=0)
 classifier.fit(X_train, y_train)

 print("train score: {:.2f}%".format(100 * classifier.score(X_train, y_train)))
 print("test score: {:.2f}%".format(100 * classifier.score(X_test, y_test)))

 plt.figure(figsize=(5, 5))
 plot_decision_surface(
    features_2d,
    labelv,
    classifier,
    test_features_2d=X_test,
    test_labels=y_test,
 )
 ```

 %% Cell type:markdown id: tags:

 About the plot: **the points surrounded with a circle are from the test data set** (not used for learning), all other points belong to the training data.

 This surface seems a bit rough on the edges. One of the biggest advantages of decision trees is the interpretability of the model. Let's **inspect the model by looking at the tree that was built**:

 %% Cell type:code id: tags:

 ``` python
 from sklearn.tree import plot_tree


 fig = plt.figure(figsize=(12, 8))
 fig.suptitle("XOR Decision Tree")
 plot_tree(classifier, feature_names=["x", "y"], class_names=["False", "True"]);
 ```

 %% Cell type:markdown id: tags:

 <span style="font-size: 150%">Whoaaa .. what happened here?</span>

 XOR is the **anti-example** for DTs: they cannot make the "natural" split at value `0` because splits are selected to promote more pure sub-nodes. We're fitting data representation noise here.

 Moreover, the tree is quite deep because, by default, it is built until all nodes are "pure" (`gini = 0.0`). This tree is **overfitted**.

 %% Cell type:markdown id: tags:

 ### How to avoid overfitting?

 There is no regularization penalty like in logistic regression or SVM methods when building a decision tree. Instead we can set learning hyperparameters such as:
 * tree pruning (based on minimal cost-complexity; `ccp_alpha`) - this is actually done only after the tree has been built, or
 * maximum tree depth (`max_depth`), or
 * a minimum number of samples required at a node or at a leaf node (`min_samples_split`, `min_samples_leaf`), or
 * an early stopping criterion based on a minimum value of impurity or on a minimum decrease in impurity (`min_impurity_split`, which was removed in scikit-learn 1.0, and `min_impurity_decrease`),
 * ... and a few more - see the `DecisionTreeClassifier` docs.
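
As a rough illustration of how these hyperparameters limit tree growth, the sketch below compares tree sizes on a synthetic noisy dataset. The `make_moons` data and the specific parameter values are assumptions picked for the demo, not recommendations and not part of this script's datasets.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic noisy 2D data (an assumption made for this demo)
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# hypothetical settings, one hyperparameter at a time
settings = {
    "default": {},
    "max_depth=3": {"max_depth": 3},
    "min_samples_leaf=10": {"min_samples_leaf": 10},
    "ccp_alpha=0.01": {"ccp_alpha": 0.01},
}

tree_sizes = {}
for name, params in settings.items():
    tree = DecisionTreeClassifier(random_state=0, **params)
    tree.fit(X_train, y_train)
    tree_sizes[name] = tree.tree_.node_count
    print(
        f"{name:>20}: nodes={tree.tree_.node_count:3d}, "
        f"train={100 * tree.score(X_train, y_train):.1f}%, "
        f"test={100 * tree.score(X_test, y_test):.1f}%"
    )
```

Comparing the train and test scores per setting gives a first impression of which constraint reduces the train/test gap for a given dataset.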

 %% Cell type:markdown id: tags:

 ### Exercise section

 1. In theory, for the XOR dataset it should suffice to use each feature exactly once with splits at `0`, but the decision tree learning algorithm is unable to find such a solution. Play around with `max_depth` to get a smaller but similarly performing decision tree for the XOR dataset.<br/>
  Bonus question: which other hyperparameter could you have used to get the same result?

 2. Build a decision tree for the beers dataset. Use maximum depth and tree pruning strategies to get a much smaller tree that performs as well as the default tree.<br/>
  Note: `classifier.tree_` instance has attributes such as `max_depth`, `node_count`, or `n_leaves`, which measure size of the tree.

 %% Cell type:code id: tags:

 ``` python
 import pandas as pd
 from sklearn.model_selection import train_test_split
 from sklearn.tree import DecisionTreeClassifier, plot_tree


 df = pd.read_csv("data/xor.csv")
 features_2d = df.loc[:, ("x", "y")]
 labelv = df["label"]

 max_depths = [2, 3, 4]
 # ...
 ```

 %% Cell type:code id: tags:

 ``` python
 import pandas as pd
 from sklearn.model_selection import train_test_split
 from sklearn.pipeline import make_pipeline
 from sklearn.preprocessing import StandardScaler
 from sklearn.tree import DecisionTreeClassifier, plot_tree

-
 df = pd.read_csv("data/beers.csv")
 features = df.iloc[:, :-1]
 labelv = df.iloc[:, -1]
 # ...
 ```

 %% Cell type:code id: tags:solution

 ``` python
 # SOLUTION 1
 import pandas as pd
 from sklearn.model_selection import train_test_split
 from sklearn.tree import DecisionTreeClassifier, plot_tree


 df = pd.read_csv("data/xor.csv")
 features_2d = df.loc[:, ("x", "y")]
 labelv = df["label"]

 X_train, X_test, y_train, y_test = train_test_split(
    features_2d, labelv, random_state=10
 )

 max_depths = [2, 3, 4]

 n_params = len(max_depths)
 fig, ax_arr = plt.subplots(ncols=n_params, nrows=2, figsize=(7 * n_params, 7 * 2))
 fig.suptitle("smaller XOR Decision Trees")
 for i, max_depth in enumerate(max_depths):

    classifier = DecisionTreeClassifier(
        max_depth=max_depth,
        random_state=0,
    )
    classifier.fit(X_train, y_train)

    ax = ax_arr[0, i]
    plot_tree(
        classifier,
        feature_names=features_2d.columns.values,
        class_names=["False", "True"],
        ax=ax,
        fontsize=7,
    )
    ax.set_title(
        (
            f"max depth = {max_depth}\n"
            f"train score: {100 * classifier.score(X_train, y_train):.2f}%\n"
            f"test score: {100 * classifier.score(X_test, y_test):.2f}%"
        )
    )

    ax = ax_arr[1, i]
    plot_decision_surface(
        features_2d,
        labelv,
        classifier,
        test_features_2d=X_test,
        test_labels=y_test,
        plt=ax,
    )

 # We could have used equivalently `min_impurity_split` early stopping criterium with any (gini) value between 0.15 and 0.4
 ```

 %% Cell type:code id: tags:

 ``` python
 # SOLUTION 2
 import pandas as pd
 from sklearn.model_selection import train_test_split
 from sklearn.pipeline import make_pipeline
 from sklearn.preprocessing import StandardScaler
 from sklearn.tree import DecisionTreeClassifier, plot_tree


 df = pd.read_csv("data/beers.csv")
 print(df.head(2))

 features = df.iloc[:, :-1]
 labelv = df.iloc[:, -1]

 X_train, X_test, y_train, y_test = train_test_split(features, labelv, random_state=10)

 # default
 classifier = DecisionTreeClassifier(random_state=0)
 pipeline = make_pipeline(StandardScaler(), classifier)
 pipeline.fit(X_train, y_train)
 print()
 print("#### default Beers Decision Tree")
 print(
    f"depth: {classifier.tree_.max_depth}, ",
    f"#nodes: {classifier.tree_.node_count}, ",
    f"#leaves: {classifier.tree_.n_leaves}",
 )
 print(f"train score: {100 * pipeline.score(X_train, y_train):.2f}%")
 print(f" test score: {100 * pipeline.score(X_test, y_test):.2f}%")

 # smaller
 classifier = DecisionTreeClassifier(max_depth=4, ccp_alpha=0.02, random_state=0)
 pipeline = make_pipeline(StandardScaler(), classifier)
 pipeline.fit(X_train, y_train)
 print()
 print("#### smaller Beers Decision Tree")
 print(
    f"depth: {classifier.tree_.max_depth}, ",
    f"#nodes: {classifier.tree_.node_count}, ",
    f"#leaves: {classifier.tree_.n_leaves}",
 )
 print(f"train score: {100 * pipeline.score(X_train, y_train):.2f}%")
 print(f" test score: {100 * pipeline.score(X_test, y_test):.2f}%")

 fig = plt.figure(figsize=(10, 6))
 plot_tree(classifier, feature_names=features.columns.values);
 ```

 %% Cell type:markdown id: tags:

 One **issue with decision trees is their instability** - small changes in the training data usually result in a completely different order of splits (a different tree structure).

 %% Cell type:markdown id: tags:

 ## Ensemble Averaging: Random Forests

 The idea of the Random Forest method is to generate an **ensemble of many "weak" decision trees** and to **average out their probabilistic predictions**. (The original Random Forests method used voting.)


 The weak classifiers here are **shallow trees with feature splits picked only out of random subsets of features** (*feature bagging*). A random subset of features is selected for each split, not for the whole classifier.

 <table>
    <tr><td><img src="./images/random_forest.png" width=800px></td></tr>
    <tr><td><center><sub>Source: <a href="https://towardsdatascience.com/random-forests-and-decision-trees-from-scratch-in-python-3e4fa5ae4249">https://towardsdatascience.com/random-forests-and-decision-trees-from-scratch-in-python-3e4fa5ae4249</a></sub></center></td></tr>
 </table>

 %% Cell type:markdown id: tags:

 ### Demonstration

 You will find an implementation of the Random Forest method in the `sklearn.ensemble` module.

 The main parameters are:
 * number of trees (`n_estimators`),
 * maximum depth of each tree (`max_depth`), and
 * maximum number of randomly selected features to pick from when looking for each split (`max_features`).

 Let's build a small Random Forest and have a look at its trees, available under the `.estimators_` property.

 %% Cell type:code id: tags:

 ``` python
 import pandas as pd
 from sklearn.ensemble import RandomForestClassifier
 from sklearn.model_selection import train_test_split
 from sklearn.pipeline import make_pipeline
 from sklearn.preprocessing import StandardScaler
 from sklearn.tree import plot_tree


 df = pd.read_csv("data/beers.csv")
 print(df.head(2))

 features = df.iloc[:, :-1]
 labelv = df.iloc[:, -1]

 X_train, X_test, y_train, y_test = train_test_split(features, labelv, random_state=10)

 # 4 shallow (depth 2) trees, each split considering only 3 randomly selected features
 # total: up to 4*3 decision nodes, up to 4*4 class nodes
 n_trees = 4
 classifier = RandomForestClassifier(
    max_depth=2,
    n_estimators=n_trees,
    max_features=3,
    random_state=0,
 )
 pipeline = make_pipeline(StandardScaler(), classifier)
 pipeline.fit(X_train, y_train)

 print()
 print("#### Random Forest")
 print(f"train score: {100 * pipeline.score(X_train, y_train):.2f}%")
 print(f" test score: {100 * pipeline.score(X_test, y_test):.2f}%")

 # to evaluate ensemble estimators, we need to use transformed data
 X_train_trans = pipeline[:-1].transform(X_train)
 X_test_trans = pipeline[:-1].transform(X_test)

 fig, ax_arr = plt.subplots(ncols=n_trees, nrows=1, figsize=(7 * n_trees, 5))
 for i, internal_classifier in enumerate(classifier.estimators_):
    ax = ax_arr[i]
    plot_tree(internal_classifier, feature_names=features.columns.values, ax=ax)
    ax.set_title(
        (
            f"Tree #{i}\n"
            f"train score: {100 * internal_classifier.score(X_train_trans, y_train):.2f}%\n"
            f" test score: {100 * internal_classifier.score(X_test_trans, y_test):.2f}%"
        )
    )
 ```

 %% Cell type:markdown id: tags:

 Random forests are fast and shine with high dimensional data (many features).

 <div class="alert alert-block alert-info">
    <p><i class="fa fa-info-circle"></i>
        Random Forest can estimate <em>out-of-bag error</em> (OOB) while learning; set <code>oob_score=True</code>. (The out-of-bag (OOB) error is the average error for each data sample, calculated using predictions from the trees that do not contain that sample in their respective bootstrap samples.)
    OOB is a generalisation/predictive error that, together with <code>warm_start=True</code>, can be used for efficient search for a good-enough number of trees, i.e. the <code>n_estimators</code> hyperparameter value (see: <a href=https://scikit-learn.org/stable/auto_examples/ensemble/plot_ensemble_oob.html>OOB Errors for Random Forests</a>).
    </p>
 </div>
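
A minimal sketch of the out-of-bag estimate follows. It uses a synthetic dataset from `make_classification` (an assumption made for the demo). Note that the `oob_score_` attribute reports an accuracy, so the OOB error is one minus that value.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic data (an assumption made for this demo)
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

# bootstrap sampling is on by default, which the OOB estimate requires
forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X, y)

oob_accuracy = forest.oob_score_
print(f"OOB accuracy: {oob_accuracy:.3f}, OOB error: {1 - oob_accuracy:.3f}")
```

No separate validation split is needed here, because every sample is scored only by the trees that did not see it during training.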

 %% Cell type:markdown id: tags:

 ## Boosting: AdaBoost

 <span style="font-size: 125%;">What is it?</span>

 Boosting is another sub-type of ensemble learning. Same as in averaging, the idea is to generate many **weak classifiers to create a single strong classifier**, but in contrast to averaging, the classifiers are learnt **iteratively**.

 <span style="font-size: 125%;">How does it work?</span>

 Each iteration focuses more on **previously misclassified samples**. To that end, **data samples are weighted**, and after each learning iteration the data weights are readjusted.

 <table>
    <tr><td><img src="./images/AdaBoost.png" width=800px></td></tr>
    <tr><td><center><sub>Source: Marsh, B., (2016), <em>Multivariate Analysis of the Vector Boson Fusion Higgs Boson</em>.</sub></center></td></tr>
 </table>

 The final prediction is a weighted majority vote or weighted sum of predictions of the weighted weak classifiers.

 Boosting works very well out of the box. There is usually no need to fine tune method hyperparameters to get good performance.

 <span style="font-size: 125%;">Where do I start?</span>

 **AdaBoost (“Adaptive Boosting”) is a baseline boosting algorithm** that originally used decision trees as weak classifiers, but, in principle, works with any classification method (`base_estimator` parameter, renamed to `estimator` in scikit-learn 1.2).

 In each AdaBoost learning iteration, in addition to the sample weights, the **weak classifiers are weighted**. Their weights are readjusted, such that **the more accurate a weak classifier is, the larger its weight is**.
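
The weighting can be sketched for the two-class discrete AdaBoost as follows. This is a simplified textbook-style illustration with made-up helper names and toy numbers; scikit-learn's implementation differs in details (e.g. learning rate and multi-class handling).

```python
import math


def classifier_weight(error):
    """Weight of a weak classifier with weighted train error in (0, 1)."""
    return math.log((1 - error) / error)


def update_sample_weights(weights, misclassified, alpha):
    """Up-weight misclassified samples and renormalize the weights to sum to 1."""
    new_weights = [
        w * math.exp(alpha) if miss else w for w, miss in zip(weights, misclassified)
    ]
    total = sum(new_weights)
    return [w / total for w in new_weights]


# toy example: 4 equally weighted samples, the last one is misclassified
sample_weights = [0.25, 0.25, 0.25, 0.25]
misclassified = [False, False, False, True]

# weighted error of the weak classifier = total weight of misclassified samples
error = sum(w for w, miss in zip(sample_weights, misclassified) if miss)
alpha = classifier_weight(error)
new_weights = update_sample_weights(sample_weights, misclassified, alpha)

print(f"weighted error: {error:.2f}, classifier weight: {alpha:.3f}")
print("new sample weights:", [round(w, 3) for w in new_weights])
```

A classifier that is no better than random guessing (error 0.5) gets weight 0, and after the update the misclassified samples carry half of the total weight, which forces the next weak classifier to focus on them.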

 %% Cell type:markdown id: tags:

 ### Demonstration

 You will find an implementation of the AdaBoost algorithm in the `sklearn.ensemble` module.

 We'll use the `n_estimators` parameter to determine the number of weak classifiers. By default, these are single-node decision trees (`base_estimator = DecisionTreeClassifier(max_depth=1)`). We can examine them via the `.estimators_` property of a trained method.

 For presentation, in order to weight the classifiers, we will use the original discrete AdaBoost learning method (`algorithm="SAMME"`). Because the classifiers learn iteratively on differently weighted samples, to understand the weights we have to look at internal train errors and not at the final scores on the training data.

 %% Cell type:code id: tags:

 ``` python
 from math import ceil, floor

 import pandas as pd
 from sklearn.ensemble import AdaBoostClassifier
 from sklearn.model_selection import train_test_split
 from sklearn.pipeline import make_pipeline
 from sklearn.preprocessing import StandardScaler
 from sklearn.tree import plot_tree


 df = pd.read_csv("data/beers.csv")
 print(df.head(2))

 features = df.iloc[:, :-1]
 labelv = df.iloc[:, -1]

 X_train, X_test, y_train, y_test = train_test_split(features, labelv, random_state=10)

 # 9 single node decision trees
 # total: 9*1 decision nodes, 9*2 class nodes
 # (Note: with default real AdaBoost "SAMME.R" algorithm all weights are 1 at the end)
 n_trees = 9
 classifier = AdaBoostClassifier(n_estimators=n_trees, algorithm="SAMME", random_state=0)
 pipeline = make_pipeline(StandardScaler(), classifier)
 pipeline.fit(X_train, y_train)

 print()
 print("AdaBoost")
 print(f"train score: {100 * pipeline.score(X_train, y_train):.2f}%")
 print(f"test score: {100 * pipeline.score(X_test, y_test):.2f}%")

 # to evaluate ensemble estimators, we need to use transformed data
 X_train_trans = pipeline[:-1].transform(X_train)
 X_test_trans = pipeline[:-1].transform(X_test)

 fig, ax_arr = plt.subplots(ncols=n_trees, nrows=1, figsize=(5 * n_trees, 4))
 for i, internal_classifier in enumerate(classifier.estimators_):
    ax = ax_arr[i]
    plot_tree(internal_classifier, feature_names=features.columns.values, ax=ax)
    ax.set_title(
        (
            f"Tree #{i}, weight: {classifier.estimator_weights_[i]:.2f}\n"
            f"train error: {classifier.estimator_errors_[i]:.2f}\n"
            f"(train score: {100 * internal_classifier.score(X_train_trans, y_train):.2f}%)\n"
            f"test score: {100 * internal_classifier.score(X_test_trans, y_test):.2f}%"
        )
    )
 ```

 %% Cell type:markdown id: tags:

 ### Other boosting methods

 In practice you will mostly want to use boosting methods other than AdaBoost.

 ####  Gradient Tree Boosting (GTB)

 It reformulates the boosting problem as an optimization problem, which is solved with a gradient-descent-style procedure: each new tree is fit to the negative gradient of the loss of the current ensemble (more on gradient descent in the neural networks script).

 In contrast to AdaBoost, GTB relies on using decision trees.

 In particular, try out [XGBoost](https://xgboost.readthedocs.io/en/latest/); it's a package that won many competitions, cf. [XGBoost@Kaggle](https://www.kaggle.com/dansbecker/xgboost). It is not part of scikit-learn, but it offers a `scikit-learn` API (see https://www.kaggle.com/stuarthallows/using-xgboost-with-scikit-learn ); a `scikit-learn` equivalent is [`GradientBoostingClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html).

 #### Histogram-based Gradient Boosting Classification Tree

 A newer `scikit-learn` implementation of boosting based on decision trees is [`HistGradientBoostingClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingClassifier.html). It is much faster than `GradientBoostingClassifier` for big datasets (`n_samples >= 10 000`).



 %% Cell type:markdown id: tags:

 ## Ensemble Stacking: an honorable mention

 Stacking is often used with base models of different types, when it's not clear which type of model will perform best.

 **The base models learn in parallel and their (cross-validated) predictions are used to train a meta-model** (as opposed e.g. to selecting only one model or doing a naive voting). The meta-model (also called combiner, blender, or generalizer) never "sees" the input data.

 <table>
    <tr><td><img src="./images/ensemble-learning-stacking.png" width="400px"></td></tr>
    <tr><td><center><sub><a href="https://data-science-blog.com/blog/2017/12/03/ensemble-learning/">https://data-science-blog.com/blog/2017/12/03/ensemble-learning/</a></sub></center></td></tr>
 </table>

 Stacking combines the strengths of different models and usually slightly outperforms the best individual model. In practice, multiple stacking layers are often used, with groups of different but repeating types of classifiers.

 <table>
    <tr><td><center><img src="./images/ensemble-learning-stacking-kdd_2015_winner.png" width="800px"></center></td></tr>
    <tr><td><center><sub>KDD Cup 2015 winner</sub></center></td></tr>
    <tr><td><center><sub>GBM: Gradient Boosting Machine; NN: Neural Network; FM: Factorization Machine; LR: Logistic Regression; KRR: Kernel Ridge Regression; ET: Extra Trees; RF: Random Forests; KNN: K-Nearest Neighbors</sub></center></td></tr>
    <tr><td><center><sub><a href="https://www.slideshare.net/jeongyoonlee/winning-data-science-competitions-74391113"> Jeong-Yoon Lee, <em>Winning Data Science Competitions</em>, Apr 2017</a></sub></center></td></tr>
 </table>

 In the `sklearn.ensemble` module, stacking is implemented by `StackingClassifier` and `StackingRegressor`.
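
A minimal single-layer stacking sketch with `StackingClassifier` could look as follows. The choice of base models, the logistic regression meta-model and the synthetic data are assumptions made for the demo.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# synthetic data (an assumption made for this demo)
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# base models of different types; each gets a name
base_models = [
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("svm", make_pipeline(StandardScaler(), SVC(random_state=0))),
]

# the meta-model is trained on cross-validated predictions of the base models
stacking = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(),
    cv=5,
)
stacking.fit(X_train, y_train)

stacking_test_accuracy = stacking.score(X_test, y_test)
print(f"stacking test accuracy: {100 * stacking_test_accuracy:.2f}%")
```

The fitted base models are available under `stacking.estimators_` and the fitted meta-model under `stacking.final_estimator_`, so each layer can be inspected separately.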

 %% Cell type:markdown id: tags:

 ## Why does ensemble learning work?

 * The probability that the majority of the classifiers in the ensemble makes an error is much lower than the error that each of the weak classifiers makes alone.

 * An ensemble classifier is more robust (has lower variance) with respect to the training data.

 * The weak classifiers are small, fast to learn, and, in case of averaging, they can be learnt in parallel.

 In general, **an ensemble classifier usually performs better than any of the weak classifiers in the ensemble**.
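
The first point can be checked with a short calculation, under the strong assumption that the weak classifiers make their errors independently (real ensemble members are correlated, so the actual gain is smaller). The function name `majority_error` and the error rate 0.35 are made up for illustration.

```python
from math import comb


def majority_error(n_classifiers, error):
    """Probability that more than half of n independent classifiers err."""
    first_majority = n_classifiers // 2 + 1
    return sum(
        comb(n_classifiers, k) * error**k * (1 - error) ** (n_classifiers - k)
        for k in range(first_majority, n_classifiers + 1)
    )


# hypothetical weak classifiers, each wrong 35% of the time
for n_classifiers in [1, 5, 25, 101]:
    print(
        f"{n_classifiers:3d} classifiers: "
        f"majority error = {majority_error(n_classifiers, 0.35):.4f}"
    )
```

The majority error shrinks as the number of classifiers grows, but only as long as each classifier is better than random guessing; with an individual error above 0.5 the majority vote gets worse instead.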

 %% Cell type:markdown id: tags:

 ## Coding session

-For the beers data compare mean cross validation accuracy, precision, recall and f1 scores for all classifiers shown so far. Try to squeeze better than default performance out of the classifiers by tuning their hyperparameters. Which ones perform best?
+Apply all classifiers seen so far to the handwritten digits images dataset, a small MNIST-like collection of 8x8 pixel images (`from sklearn.datasets import load_digits`). To that end:
+
+1. split your data into training data and test "holdout" data (`from sklearn.model_selection import train_test_split`) in 80:20 proportion; stratify the 10 target classes (`stratify=...`);
+2. compare cross validation mean f1 scores; use the stratified k-fold CV strategy (`from sklearn.model_selection import cross_val_score, StratifiedKFold`; attention: this is a non-binary multi-class problem, you will have to use an adjusted f1 score, e.g. unweighted per class mean via `scoring="f1_macro"` keyword arg - see [the `scoring` parameter predefined values](https://scikit-learn.org/stable/modules/model_evaluation.html#the-scoring-parameter-defining-model-evaluation-rules));
+3. train the final model on the whole training dataset and use the test "holdout" data to report the final model f1 performance; report also accuracy, precision and recall scores (`from sklearn.metrics import classification_report`);
+4. next, try to manually tune hyperparameters to minimize the train test gap and to squeeze out at least 90% cross validation f1 score performance out of each classifier; try using PCA preprocessing (`sklearn.pipeline.make_pipeline`, `sklearn.decomposition.PCA`); what about data scaling? which models are most effective and easiest to tune manually?
+5. optionally, once you get a feel of good preprocessing and classifier hyperparameters, define parameters ranges and apply a CV-based search to find optimal values for the hyperparameters (`from sklearn.model_selection import GridSearchCV, RandomizedSearchCV`).

 %% Cell type:code id: tags:

 ``` python
+import numpy as np
+from sklearn.datasets import load_digits
+from sklearn.decomposition import PCA
+
+# classifiers
 from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
 from sklearn.linear_model import LogisticRegression
-from sklearn.model_selection import cross_val_score
+from sklearn.metrics import classification_report
+from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
 from sklearn.pipeline import make_pipeline
-from sklearn.preprocessing import StandardScaler
 from sklearn.svm import SVC, LinearSVC
 from sklearn.tree import DecisionTreeClassifier


-df = pd.read_csv("data/beers.csv")
-features = df.iloc[:, :-1]
-labelv = df.iloc[:, -1]
+digits = load_digits()

-# ...
-# classifier = ...
-# pipeline = make_pipeline(StandardScaler(), classifier)
-# scores = cross_val_score(pipeline, features, labelv, scoring="f1", cv=5)
+labels = digits.target
+n_samples = len(labels)
+
+# flatten images of shape N_SAMPLES x 8 x 8 to N_SAMPLES x 64:
+print("digits data set shape:", digits.images.shape)
+features = digits.images.reshape((n_samples, -1))
+print("feature matrix shape:", features.shape)
+
+# (
+#     features_train,
+#     features_test,
+#     labels_train,
+#     labels_test,
+# ) = train_test_split(..., test_size=..., stratify=..., random_state=42)
+#
+# classifiers = [
+#     LogisticRegression(...),
+#     LinearSVC(...),
+#     SVC(...),
+#     DecisionTreeClassifier(..., random_state=42),  # rather won't do very well when not used in an ensemble method
+#     RandomForestClassifier(..., random_state=42),
+#     AdaBoostClassifier(..., random_state=42),
+# ]
+#
+# pipeline = make_pipeline(...)
+#
+# cross_validator = StratifiedKFold(...)
+# scores = cross_val_score(..., scoring=..., cv=...)
 # ...
 ```

 %% Cell type:code id: tags:solution

 ``` python
 # SOLUTION
+import numpy as np
+from sklearn.datasets import load_digits
+from sklearn.decomposition import PCA
+
+# classifiers
 from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
 from sklearn.linear_model import LogisticRegression
-from sklearn.model_selection import cross_val_score
+from sklearn.metrics import classification_report
+from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
 from sklearn.pipeline import make_pipeline
-from sklearn.preprocessing import StandardScaler
 from sklearn.svm import SVC, LinearSVC
 from sklearn.tree import DecisionTreeClassifier


+digits = load_digits()
+
+labels = digits.target
+n_samples = len(labels)
+
+# flatten images of shape N_SAMPLES x 8 x 8 to N_SAMPLES x 64:
+print("digits data set shape:", digits.images.shape)
+features = digits.images.reshape((n_samples, -1))
+print("feature matrix shape:", features.shape)
+
+(
+    features_train,
+    features_test,
+    labels_train,
+    labels_test,
+) = train_test_split(features, labels, test_size=0.2, stratify=labels, random_state=42)
+
 classifiers = [
-    LogisticRegression(C=100),
-    LinearSVC(C=10, max_iter=30000),
-    SVC(C=30, gamma=0.1),
-    DecisionTreeClassifier(max_depth=7, random_state=0),
+    LogisticRegression(C=10, penalty="l1", solver="liblinear", max_iter=10_000),
+    LinearSVC(C=1, penalty="l1", dual=False, max_iter=50_000),
+    SVC(C=10, gamma=0.001),
+    DecisionTreeClassifier(
+        max_depth=9, random_state=42
+    ),  # rather won't do very well when not used in an ensemble method
    RandomForestClassifier(
-        max_depth=4,
-        n_estimators=10,
-        max_features=2,
-        random_state=0,
+        max_depth=6,
+        n_estimators=50,
+        max_features="log2",
+        random_state=42,
+    ),
+    AdaBoostClassifier(
+        base_estimator=DecisionTreeClassifier(
+            max_depth=3
+        ),  # in CV search use: adaboostclassifier__base_estimator__max_depth
+        n_estimators=100,
+        algorithm="SAMME",  # works better than default for a multi-class problem
+        random_state=42,
    ),
-    AdaBoostClassifier(n_estimators=20, random_state=0),
 ]

-df = pd.read_csv("data/beers.csv")
-features = df.iloc[:, :-1]
-labelv = df.iloc[:, -1]
+# We do already account for class imbalances in train/test and CV splits via stratification
+# => take an unweighted f1 score per class mean ("f1_macro"), thus, keeping f1 score in between
+#    the precision and recall scores.
+scoring = "f1_macro"
+pca_n_components = features.shape[1] // 3  # accuracy-wise does not add much but speeds up training
+cv_n_splits = 10

 for classifier in classifiers:
-    print(classifier.__class__.__name__)
-    pipeline = make_pipeline(StandardScaler(), classifier)
-    for scoring in ["accuracy", "precision", "recall", "f1"]:
-        scores = cross_val_score(pipeline, features, labelv, scoring=scoring, cv=5)
-        print(f"\t5-fold CV mean {scoring}: {scores.mean():.2f} +/- {scores.std():.2f}")
    print()
+    print(f"#### {classifier}")
+    pipeline = make_pipeline(PCA(n_components=pca_n_components), classifier)
+    cross_validator = StratifiedKFold(
+        n_splits=cv_n_splits, shuffle=True, random_state=42
+    )
+    scores = cross_val_score(
+        pipeline, features_train, labels_train, scoring=scoring, cv=cross_validator
+    )
+    print()
+    print(f"train {cv_n_splits}-fold CV mean {scoring}:")
+    print(f"\t{scores.mean():.2f} +/- {scores.std():.2f}")
+    print()
+    # the final model - train the same pipeline (incl. PCA) on the whole training dataset
+    pipeline.fit(features_train, labels_train)
+    # the final "holdout" test
+    print("test score:")
+    predicted_test = pipeline.predict(features_test)
+    print(
+        classification_report(
+            labels_test,
+            predicted_test,
+        )
+    )
 ```

 %% Cell type:markdown id: tags:

 ## Summary

 %% Cell type:markdown id: tags:

 Below you will find a table with some guidelines, as well as pros and cons of different classification methods available in scikit-learn.

 <div class="alert alert-block alert-warning">
    <p><i class="fa fa-warning"></i>&nbsp;<strong>Summary table</strong></p>

 <p>
 <em>Disclaimer</em>: this table is neither a single source of truth nor complete - it's intended only to provide some first considerations when starting out. At the end of the day, you have to try and pick a method that works for your problem/data.
 </p>

 <table>
 <thead>
 <tr>
 <th style="text-align: center;">Classifier type</th>
 <th style="text-align: center;">When?</th>
 <th style="text-align: center;">Advantages</th>
 <th style="text-align: center;">Disadvantages</th>
 </tr>
 </thead>
 <tbody>
 <tr>
 <td style="text-align: left;">Nearest Neighbors<br><br><code>KNeighborsClassifier</code></td>
 <td style="text-align: left;">- numeric data<br> - when (fast) linear classifiers do not work</td>
 <td style="text-align: left;">- simple (not many parameters to tweak), hence, a good baseline classifier</td>
 <td style="text-align: left;">- known not to work well in many dimensions (already with 20 features or even fewer)</td>
 </tr>
 <tr>
 <td style="text-align: left;">Logistic Regression<br><br><code>LogisticRegression</code></td>
 <td style="text-align: left;">- high-dimensional data<br> - a lot of data</td>
 <td style="text-align: left;">- fast, also in high dimensions<br> - weights can be interpreted</td>
 <td style="text-align: left;">- data has to be (approximately) linearly separable (happens often in higher dimensions)<br> - not very efficient with a large number of samples</td>
 </tr>
 <tr>
 <td style="text-align: left;">Linear SVM<br><br><code>LinearSVC</code></td>
 <td style="text-align: left;">same as above but might be better for text analysis (many features)</td>
 <td style="text-align: left;">same as above but might be better with a very large number of features</td>
 <td style="text-align: left;">same as above but possibly a bit better with a large number of samples</td>
 </tr>
 <tr>
 <td style="text-align: left;">Kernel SVM<br><br><code>SVC</code></td>
 <td style="text-align: left;">same as above but when linear SVM does not work<br>- not too many data points</td>
 <td style="text-align: left;">same as above but learns non-linear boundaries</td>
 <td style="text-align: left;">same as above but much slower and requires data scaling<br>- model is not easily interpretable</td>
 </tr>
 <tr>
 <td style="text-align: left;">Decision Tree<br><br><code>DecisionTreeClassifier</code></td>
 <td style="text-align: left;">- for illustration/insight<br> - with multi-class problems <br> - with categorical or mixed categorical and numerical data</td>
 <td style="text-align: left;">- simple to interpret<br> - good classification speed and performance</td>
 <td style="text-align: left;">- prone to overfitting<br> - unstable: a small change in the training data can give a very different model</td>
 </tr>
 <tr>
 <td style="text-align: left;">Ensemble Averaging<br><br><code>RandomForestClassifier</code></td>
 <td style="text-align: left;">- when a decision tree would be used, but better performance is needed</td>
 <td style="text-align: left;">- fixes decision tree issues: does not overfit easily and is stable with respect to training data<br> - takes into account feature dependencies<br> - can estimate the predictive error while learning (out-of-bag score)<br> ...</td>
 <td style="text-align: left;">- harder to interpret than a single decision tree</td>
 </tr>
 <tr>
 <td style="text-align: left;">Boosting<br><br><code>AdaBoostClassifier</code> (<code>XGBClassifier</code>, <code>HistGradientBoostingClassifier</code>)</td>
 <td style="text-align: left;">same as above</td>
 <td style="text-align: left;">- works very well out-of-the-box<br>- better performance and more interpretable than random forest when using depth-1 trees (decision stumps)</td>
 <td style="text-align: left;">- more prone to overfitting than random forest</td>
 </tr>
 <tr>
 <td style="text-align: left;">Stacking<br><br><code>StackingClassifier</code></td>
 <td style="text-align: left;">- when you have several diverse learners (with different weaknesses)<br>- when you do not have enough data to use neural networks</td>
 <td style="text-align: left;">- works well out-of-the-box<br>- improves performance of even already good learners</td>
 <td style="text-align: left;">- complicates interpretability of results<br>- takes time to train and to build a multi-layer architecture (with enough data, it's easier to use neural networks)</td>
 </tr>
 <tr style="border-bottom:1px solid black">
    <td colspan="100%"></td>
 </tr>
 <tr>
 <td colspan="100%" style="text-align: center;"><em>[not shown here]</em></td>
 </tr>
 <tr>
 <td style="text-align: left;">Naive Bayes<br><br><code>ComplementNB</code>, ...</td>
 <td style="text-align: left;">- with text data</td>
 <td style="text-align: left;">...</td>
 <td style="text-align: left;">...</td>
 </tr>
 <tr>
 <td style="text-align: left;">Stochastic Gradient<br><br><code>SGDClassifier</code></td>
 <td style="text-align: left;">- with really big data</td>
 <td style="text-align: left;">...</td>
 <td style="text-align: left;">...</td>
 </tr>
 <tr>
 <td style="text-align: left;">Kernel Approximation<br><br>pipeline: <code>RBFSampler</code> or <code>Nystroem</code> + <code>LinearSVC</code></td>
 <td style="text-align: left;">- with really big data and on-line training</td>
 <td style="text-align: left;">...</td>
 <td style="text-align: left;">...</td>
 </tr>
 </tbody>
 </table>

 </div>
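
%% Cell type:markdown id: tags:

The random forest row mentions that a predictive error can be computed while learning. In scikit-learn this is available as the out-of-bag (OOB) score: each tree is scored on the training samples left out of its bootstrap sample. Below is a minimal sketch on synthetic data (the data and all parameter values are assumptions chosen only for illustration); it compares the OOB accuracy with the accuracy on a separate holdout set.

%% Cell type:code id: tags:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# synthetic stand-in data (assumption, not the chapter's dataset)
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# oob_score=True asks the forest to score each training sample using only
# the trees that did not see that sample in their bootstrap sample
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X_train, y_train)

oob_accuracy = forest.oob_score_
holdout_accuracy = forest.score(X_test, y_test)
print(f"out-of-bag accuracy: {oob_accuracy:.2f}")
print(f"holdout accuracy:    {holdout_accuracy:.2f}")
```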

 %% Cell type:markdown id: tags:

 You should now be able to better understand the classification part of the ["Choosing the right estimator" scikit-learn chart](https://scikit-learn.org/stable/tutorial/machine_learning_map/):


 <table>
    <tr><td><img src="./images/scikit-learn_ml_map-classification.png" width=800px></td></tr>
    <tr><td><center><sub>Source: <a href="https://scikit-learn.org/stable/tutorial/machine_learning_map/">https://scikit-learn.org/stable/tutorial/machine_learning_map/</a></sub></center></td></tr>
 </table>
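
%% Cell type:markdown id: tags:

The kernel approximation option, listed as not shown in the summary table, can be sketched as a pipeline that maps the features with `Nystroem` and then trains a fast linear model. The example below is only an illustration under assumptions: the two-moons toy data, `gamma=1.0` and `n_components=100` are arbitrary choices, not recommendations. It compares a plain `LinearSVC` with the kernel-approximation pipeline on data that is not linearly separable.

%% Cell type:code id: tags:

```python
from sklearn.datasets import make_moons
from sklearn.kernel_approximation import Nystroem
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# toy data with a non-linear class boundary (assumption for illustration)
X, y = make_moons(n_samples=400, noise=0.2, random_state=0)

models = {
    "LinearSVC": make_pipeline(StandardScaler(), LinearSVC()),
    # Nystroem builds an approximate RBF feature map, so that a linear
    # classifier on the mapped features can learn a non-linear boundary
    "Nystroem + LinearSVC": make_pipeline(
        StandardScaler(),
        Nystroem(kernel="rbf", gamma=1.0, n_components=100, random_state=0),
        LinearSVC(),
    ),
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, scoring="accuracy", cv=5)
    results[name] = scores.mean()
    print(f"{name}: 5-fold CV mean accuracy {scores.mean():.2f} +/- {scores.std():.2f}")
```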

 %% Cell type:markdown id: tags:

 Copyright (C) 2019-2021 ETH Zurich, SIS ID