Commit b7f9d212 authored by Mikolaj Rybinski's avatar Mikolaj Rybinski

Index notebook, notebooks re-ordering and some extra comments in layout MD.

%% Cell type:markdown id: tags:
# Chapter 5: An overview of classifiers
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
# Chapter 6: Preprocessing pipelines and hyperparameter optimization
%% Cell type:markdown id: tags:
# Draft
- Scikit-learn API: recall what we have seen up to now.
- pipelines, preprocessing (scaler, PCA)
- cross-validation on pipelines
- parameter tuning: grid search / random search.
%% Cell type:markdown id: tags:
Up to now, all the classifiers we have seen share the same interface: `fit` and `predict` methods.
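%% Cell type:markdown id: tags:
As a reminder, this shared interface in a minimal sketch (with a tiny made-up dataset, not the course data):
%% Cell type:code id: tags:
``` python
import numpy as np
from sklearn.svm import SVC

# tiny made-up dataset: four samples, two features, binary labels
X = np.array([[0.0, 0.1], [0.2, 0.9], [1.0, 0.8], [0.9, 0.2]])
y = np.array([0, 1, 1, 0])

clf = SVC()            # any classifier seen so far would work the same way
clf.fit(X, y)          # learn from the examples
print(clf.predict(X))  # predict labels (here on the training data itself)
```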
%% Cell type:markdown id: tags:
## Preprocessing
- Scaler: SVC works better when all columns of the feature matrix are in the same numerical range.
- PCA: can reduce redundancy / correlations (see the overfitting script) and thus help avoid overfitting.
- Polynomial features: extend the feature matrix by computing products between and within its columns (see the `PolynomialFeatures` example below).
- `FunctionTransformer` (sklearn): apply functions like `log` to the features.

Danger: PCA and scalers learn their parameters from the data they are fitted on. Make sure that test/validation datasets do not sneak in!

DON'T DO: scale the full dataset first and run cross-validation afterwards; the sketch below shows the correct order.
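%% Cell type:markdown id: tags:
To illustrate the correct order, a minimal sketch with random placeholder data (the data and variable names are ours, for illustration only): fit the scaler on the training split, then re-use it on the test split.
%% Cell type:code id: tags:
``` python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# placeholder data for illustration only
X = np.random.randn(100, 3)
y = np.random.randint(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std on the training split only
X_test_scaled = scaler.transform(X_test)        # re-use the learned parameters; never fit on test data
```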
%% Cell type:markdown id: tags:
Preprocessors in `sklearn` all have `fit`, `transform` and `fit_transform` methods.
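%% Cell type:markdown id: tags:
For example, `PolynomialFeatures` exposes exactly this interface; on a single sample with two features it produces all monomials up to the requested degree:
%% Cell type:code id: tags:
``` python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])    # one sample, two features: x1=2, x2=3
poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(X))  # columns: 1, x1, x2, x1^2, x1*x2, x2^2
# -> [[1. 2. 3. 4. 6. 9.]]
```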
%% Cell type:markdown id: tags:
## Pipelines
Pipelines combine the preprocessing steps and the final classifier into one object that itself behaves like a classifier.
A pipeline is a list of preprocessors followed by a classifier.
Thus, for a pipeline of length $n$, the first $n - 1$ objects have `fit` and `transform` methods, while the last element has `fit` and `predict` methods.
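%% Cell type:markdown id: tags:
Conceptually, fitting and applying such a pipeline chains these methods roughly as follows (a sketch with placeholder data, not sklearn's actual implementation):
%% Cell type:code id: tags:
``` python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# placeholder data for illustration only
X = np.random.randn(50, 2)
y = np.random.randint(0, 2, size=50)

scaler, clf = StandardScaler(), SVC()
Xt = scaler.fit_transform(X)  # steps 1..n-1: fit, then transform
clf.fit(Xt, y)                # last step: fit on the transformed data
# prediction re-uses the already fitted steps:
predictions = clf.predict(scaler.transform(X))
```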
%% Cell type:code id: tags:
``` python
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.decomposition import PCA
```
%% Cell type:code id: tags:
``` python
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
```
%% Cell type:markdown id: tags:
https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html#sklearn.pipeline.make_pipeline
%% Cell type:code id: tags:
``` python
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
```
%% Cell type:code id: tags:
``` python
import pandas as pd

beer_data = pd.read_csv("beers.csv")
features = beer_data.iloc[:, :-1]  # all columns but the last
labels = beer_data.iloc[:, -1]     # the last column holds the class label
```
%% Cell type:code id: tags:
``` python
# baseline: SVC on the raw, unscaled features
print(cross_val_score(SVC(), features, labels, scoring="accuracy", cv=5).mean())
# here we see the benefit of scaling for SVC:
p = make_pipeline(StandardScaler(), SVC())
print(cross_val_score(p, features, labels, scoring="accuracy", cv=5).mean())
print()
print("pipeline")
print()
for name, step in p.steps:
    print("{:20s} {}".format(name, step))
print()
# this is how we can set the parameter of a single step in the pipeline:
p.set_params(svc__C=3)
print(cross_val_score(p, features, labels, scoring="accuracy", cv=5).mean())
```
%% Output
0.8625911286780852
0.94655248133509
pipeline
standardscaler StandardScaler(copy=True, with_mean=True, with_std=True)
svc SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
0.9286736934563022
%% Cell type:code id: tags:
``` python
for p in [make_pipeline(SVC()),
          make_pipeline(StandardScaler(), SVC()),
          make_pipeline(LogisticRegression()),
          make_pipeline(StandardScaler(), PCA(), LogisticRegression()),
          make_pipeline(PolynomialFeatures(), SVC()),
          make_pipeline(PolynomialFeatures(), StandardScaler(), SVC()),
          make_pipeline(PolynomialFeatures(), LogisticRegression()),
          make_pipeline(PolynomialFeatures(), StandardScaler(), LogisticRegression()),
          ]:
    print("{:.3f}".format(cross_val_score(p, features, labels, scoring="accuracy", cv=5).mean()), end=" ")
    print([pi[0] for pi in p.steps])
```
%% Output
0.863 ['svc']
0.947 ['standardscaler', 'svc']
0.804 ['logisticregression']
0.920 ['standardscaler', 'pca', 'logisticregression']
0.840 ['polynomialfeatures', 'svc']
0.942 ['polynomialfeatures', 'standardscaler', 'svc']
0.925 ['polynomialfeatures', 'logisticregression']
0.964 ['polynomialfeatures', 'standardscaler', 'logisticregression']
%% Cell type:markdown id: tags:
# Exercises:
- Use the beer data frame with random / redundant features from the cross-validation script and demonstrate the benefit of PCA (a possible starting point is sketched below).
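%% Cell type:markdown id: tags:
A possible starting point for this exercise (a sketch: it fabricates redundant columns in place instead of loading the extended data set from the cross-validation script, and assumes the extended feature matrix has at least four columns):
%% Cell type:code id: tags:
``` python
import numpy as np

# fabricate extra columns: one correlated with an existing feature, one pure noise
rng = np.random.RandomState(42)
extended = features.copy()
extended["redundant"] = 2 * features.iloc[:, 0] + rng.normal(scale=0.01, size=len(features))
extended["noise"] = rng.normal(size=len(features))

# compare pipelines with and without PCA on the extended feature matrix
for p in [make_pipeline(StandardScaler(), LogisticRegression()),
          make_pipeline(StandardScaler(), PCA(n_components=4), LogisticRegression())]:
    print("{:.3f}".format(cross_val_score(p, extended, labels, scoring="accuracy", cv=5).mean()),
          [pi[0] for pi in p.steps])
```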
%% Cell type:code id: tags:
``` python
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
# optimize the parameters of one single classifier first
parameters = {'kernel': ('linear', 'rbf', 'poly'),
              'C': [1, 5, 10, 15]}
svc = SVC()
search = GridSearchCV(svc, parameters, cv=5)
search.fit(features, labels)
print(search.best_score_, search.best_params_)
```
%% Output
0.9822222222222222 {'C': 5, 'kernel': 'poly'}
%% Cell type:markdown id: tags:
Now we optimize the hyperparameters of a whole pipeline.
%% Cell type:code id: tags:
``` python
p = make_pipeline(PolynomialFeatures(), StandardScaler(), LogisticRegression())
```
%% Cell type:markdown id: tags:
Grid parameters of a pipeline are addressed as `<step_name>__<parameter_name>` (with a double underscore). `make_pipeline` names each step after its lowercased class name, so e.g. `polynomialfeatures__degree` tunes the `degree` parameter of the `PolynomialFeatures` step.
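%% Cell type:markdown id: tags:
The set of valid parameter names can be inspected on the pipeline itself:
%% Cell type:code id: tags:
``` python
# list all tunable parameter names of the pipeline p defined above
print(sorted(p.get_params().keys()))
```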
%% Cell type:code id: tags:
``` python
param_grid = {'polynomialfeatures__degree': [1, 2, 3],
              'standardscaler__with_mean': [True, False],
              'standardscaler__with_std': [True, False],
              'logisticregression__C': [1, 10, 15, 20, 25],
              }
```
%% Cell type:code id: tags:
``` python
search = GridSearchCV(p, param_grid, cv=5, scoring="f1", return_train_score=False, n_jobs=5)
search.fit(features, labels)
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)
```
%% Output
Best parameter (CV score=0.983):
{'logisticregression__C': 20, 'polynomialfeatures__degree': 2, 'standardscaler__with_mean': True, 'standardscaler__with_std': False}
%% Cell type:code id: tags:
``` python
from scipy.stats import uniform, randint

# sample hyperparameters from distributions instead of fixed grids:
# randint(1, 5) draws integers 1..4; uniform(0.1, 20) draws floats from [0.1, 20.1)
param_dist = {'polynomialfeatures__degree': randint(1, 5),
              'standardscaler__with_mean': [True, False],
              'standardscaler__with_std': [True, False],
              'logisticregression__C': uniform(0.1, 20)
              }
```
%% Cell type:code id: tags:
``` python
search = RandomizedSearchCV(p, param_dist, n_jobs=5, n_iter=100)  # evaluate 100 random parameter combinations
search.fit(features, labels)
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)
```
%% Output
Best parameter (CV score=0.982):
{'logisticregression__C': 15.053760390091858, 'polynomialfeatures__degree': 3, 'standardscaler__with_mean': False, 'standardscaler__with_std': True}
@@ -39,7 +39,7 @@ TBD: prepare coding session
## Part 1: Introduction (UWE)
- What is machine learning?
- learning from examples
- working with hard-to-understand data
- automation
@@ -47,39 +47,39 @@ TBD: prepare coding session
- What are features / samples / a feature matrix?
- always numerical / categorical vectors
- examples: beer, movies, images, text-to-numerical examples
- Learning problems:
- unsupervised:
- find structure in a set of features
- beers: find groups of beer types
- supervised:
- classification: do I like this beer?
example: draw a decision tree
## Part 2a: supervised learning: classification
Intention: demonstrate one / two simple examples of classifiers, also
introduce the concept of a decision boundary
- idea of a simple linear classifier: take features, produce a real value ("Uwe's beer score"), use a threshold to decide
-> simple linear classifier (e.g. a linear SVM)
-> beer example with some weights
- show a code example with logistic regression for the beer data, show weights, plot the decision function
### Coding session:
- change the given code to use a linear SVM classifier
- use a different data set (TBD) which cannot be classified well with a linear classifier
- tell them to transform the data and run again (TBD: how exactly?)
## Part 2b: supervised learning: regression (TBD: skip this?)
@@ -130,7 +130,7 @@ Intention: accuracy is useful but has pitfalls
- how to measure accuracy?
- (TBD: skip?) regression accuracy
-
- classifier accuracy:
- confusion matrix
- accuracy
@@ -138,7 +138,7 @@ Intention: accuracy is useful but has pitfalls
e.g. diagnosing HIV
- precision / recall
- ROC?
- exercise: do cross-validation with other metrics
### Coding session
@@ -152,29 +152,7 @@ Intention: accuracy is useful but has pitfalls
# Day 2
## Part 5: classifiers overview
Intention: quick walk-through of reliable classifiers, give some background
idea if suitable, let them play with some, incl. modification of parameters.
@@ -188,7 +166,14 @@ diagram.
- Random forests
- Gradient Tree Boosting
show decision surfaces of these classifiers on 2d examples.
topics to include:
- interpretability of results (in terms of feature importance, e.g. SVM w/ high-degree poly kernel)
- some rules of thumb: don't use KNN classifiers for 10 or more dimensions (why? paper link)
- show decision surfaces for different classifiers (extend the exercise in sec 3 using hyperparams)
### Coding session
@@ -197,10 +182,26 @@ show decision surfaces of these classifiers on 2d examples.
- MNIST example
## Part 6: pipelines / parameter tuning with scikit-learn
- Scikit-learn API: recall what we have seen up to now.
- pipelines, preprocessing (scaler, PCA)
- cross validation
- parameter tuning: grid search / random search.
### Coding session
- build SVM and LinearRegression cross-validation pipelines for the previous examples
- use PCA in a pipeline for (+) to improve performance
- find optimal SVM parameters
- find the optimal number of PCA components
## Part 7: Start with neural networks (0.5 day)
## Planning
Stop here, make time estimates.
%% Cell type:markdown id: tags:
<div class="alert alert-block alert-danger">
Course layout w/ local notebook links; anything in the scope of org/general comments also goes here.
</div>
%% Cell type:markdown id: tags:
<div class="alert alert-block alert-danger"><p>
<strong>TODOs</strong>
<ol>
<li>Write a script which removes the solution proposals (cells starting with <code>#SOLUTION</code>) and creates a new notebook (a first sketch follows below).</li>
</ol>
</p></div>
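%% Cell type:markdown id: tags:
A minimal sketch for TODO 1 (an untested draft: it uses the `nbformat` library and assumes solution cells start exactly with `#SOLUTION`):
%% Cell type:code id: tags:
``` python
import nbformat

def strip_solutions(src_path, dst_path):
    """Copy a notebook, dropping code cells that start with #SOLUTION."""
    nb = nbformat.read(src_path, as_version=4)
    nb.cells = [cell for cell in nb.cells
                if not (cell.cell_type == "code"
                        and cell.source.lstrip().startswith("#SOLUTION"))]
    nbformat.write(nb, dst_path)

# example call (hypothetical file names):
# strip_solutions("02_classification.ipynb", "02_classification_students.ipynb")
```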
%% Cell type:markdown id: tags:
# Course: Introduction to Machine Learning with Python
<div class="alert alert-block alert-warning">
<p><i class="fa fa-warning"></i>&nbsp;<strong>Goal</strong></p>
<p>Quickly get your hands dirty with Machine Learning and know what you're doing.</p>
</div>
%% Cell type:markdown id: tags:
## What will you learn?
* Basic concepts of Machine Learning (ML).
* General overview of supervised learning and related methods.
* How to quickly start with ML using the `scikit-learn` Python library.
%% Cell type:markdown id: tags:
## What will you NOT learn?
* How to program with Python.
* How exactly ML methods work.
* Unsupervised learning methods.
%% Cell type:markdown id: tags:
## Course scripts
<ol>
<li><a href="01_introduction.ipynb">Introduction</a></li>
<li><a href="02_classification.ipynb">Classification</a></li>
<li><a href="03_overfitting_and_cross_validation.ipynb">Overfitting and cross-validation</a></li>
<li><a href="04_measuring_quality_of_a_classifier.ipynb">Metrics for evaluating the performance</a></li>
<li><a href="05_classifiers_overview.ipynb">An overview of classifiers</a></li>
<li><a href="06_preprocessing_pipelines_and_hyperparameter_optimization.ipynb">Preprocessing pipelines and hyperparameters optmization</a></li>
<li>...</li>
</ol>
%% Cell type:code id: tags:
``` python
```