Commit b7f9d212 authored by Mikolaj Rybinski's avatar Mikolaj Rybinski

Index notebook, notebooks re-ordering and some extra comments in layout MD.

%% Cell type:markdown id: tags:
# Chapter 5: An overview of classifiers
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
# Chapter 6: Preprocessing pipelines and hyperparameter optimization
%% Cell type:markdown id: tags:
# Draft
- Scikit-learn API: recall what we have seen up to now.
- pipelines, preprocessing (scaler, PCA)
- cross-validation on pipelines
- parameter tuning: grid search / random search.
%% Cell type:markdown id: tags:
Up to now, all the classifiers we have seen share the same interface: `fit` and `predict` methods.
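%% Cell type:markdown id: tags:
As a reminder, this shared interface in a minimal sketch (with a tiny made-up dataset, not the course data):
%% Cell type:code id: tags:
``` python
import numpy as np
from sklearn.svm import SVC

# tiny made-up dataset: four samples, two features, binary labels
X = np.array([[0.0, 0.1], [0.2, 0.9], [1.0, 0.8], [0.9, 0.2]])
y = np.array([0, 1, 1, 0])

clf = SVC()            # any classifier seen so far would work the same way
clf.fit(X, y)          # learn from the examples
print(clf.predict(X))  # predict labels (here on the training data itself)
```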
%% Cell type:markdown id: tags:
## Preprocessing
- Scaler: SVC works better when all columns of the feature matrix are in the same numerical range.
- PCA: can reduce redundancy / correlations (see the overfitting script) and thus help avoid overfitting.
- Polynomial features: extend the feature matrix by computing products between and within its columns (see the `PolynomialFeatures` example below).
- `FunctionTransformer` (sklearn): apply functions like `log` to the features.

Danger: PCA and scalers learn their parameters from the data they are fitted on. Make sure that test/validation datasets do not sneak in!

DON'T DO: scale the full dataset first and run cross-validation afterwards; the sketch below shows the correct order.
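%% Cell type:markdown id: tags:
To illustrate the correct order, a minimal sketch with random placeholder data (the data and variable names are ours, for illustration only): fit the scaler on the training split, then re-use it on the test split.
%% Cell type:code id: tags:
``` python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# placeholder data for illustration only
X = np.random.randn(100, 3)
y = np.random.randint(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std on the training split only
X_test_scaled = scaler.transform(X_test)        # re-use the learned parameters; never fit on test data
```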
%% Cell type:markdown id: tags:
Preprocessors in `sklearn` all have `fit`, `transform` and `fit_transform` methods.
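%% Cell type:markdown id: tags:
For example, `PolynomialFeatures` exposes exactly this interface; on a single sample with two features it produces all monomials up to the requested degree:
%% Cell type:code id: tags:
``` python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])    # one sample, two features: x1=2, x2=3
poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(X))  # columns: 1, x1, x2, x1^2, x1*x2, x2^2
# -> [[1. 2. 3. 4. 6. 9.]]
```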
%% Cell type:markdown id: tags:
## Pipelines
Pipelines combine the preprocessing steps and the final classifier into one object that itself behaves like a classifier.
A pipeline is a list of preprocessors followed by a classifier.
Thus, for a pipeline of length $n$, the first $n - 1$ objects have `fit` and `transform` methods, while the last element has `fit` and `predict` methods.
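%% Cell type:markdown id: tags:
Conceptually, fitting and applying such a pipeline chains these methods roughly as follows (a sketch with placeholder data, not sklearn's actual implementation):
%% Cell type:code id: tags:
``` python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# placeholder data for illustration only
X = np.random.randn(50, 2)
y = np.random.randint(0, 2, size=50)

scaler, clf = StandardScaler(), SVC()
Xt = scaler.fit_transform(X)  # steps 1..n-1: fit, then transform
clf.fit(Xt, y)                # last step: fit on the transformed data
# prediction re-uses the already fitted steps:
predictions = clf.predict(scaler.transform(X))
```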
%% Cell type:code id: tags:
``` python
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.decomposition import PCA
```
%% Cell type:code id: tags:
``` python
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
```
%% Cell type:markdown id: tags:
https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html#sklearn.pipeline.make_pipeline
%% Cell type:code id: tags:
``` python
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
```
%% Cell type:code id: tags:
``` python
import pandas as pd

beer_data = pd.read_csv("beers.csv")
features = beer_data.iloc[:, :-1]  # all columns but the last
labels = beer_data.iloc[:, -1]     # the last column holds the class label
```
%% Cell type:code id: tags:
``` python
# baseline: SVC on the raw, unscaled features
print(cross_val_score(SVC(), features, labels, scoring="accuracy", cv=5).mean())
# here we see the benefit of scaling for SVC:
p = make_pipeline(StandardScaler(), SVC())
print(cross_val_score(p, features, labels, scoring="accuracy", cv=5).mean())
print()
print("pipeline")
print()
for name, step in p.steps:
    print("{:20s} {}".format(name, step))
print()
# this is how we can set the parameter of a single step in the pipeline:
p.set_params(svc__C=3)
print(cross_val_score(p, features, labels, scoring="accuracy", cv=5).mean())
```
%% Output
0.8625911286780852
0.94655248133509
pipeline
standardscaler StandardScaler(copy=True, with_mean=True, with_std=True)
svc SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
0.9286736934563022
%% Cell type:code id: tags:
``` python
for p in [make_pipeline(SVC()),
          make_pipeline(StandardScaler(), SVC()),
          make_pipeline(LogisticRegression()),
          make_pipeline(StandardScaler(), PCA(), LogisticRegression()),
          make_pipeline(PolynomialFeatures(), SVC()),
          make_pipeline(PolynomialFeatures(), StandardScaler(), SVC()),
          make_pipeline(PolynomialFeatures(), LogisticRegression()),
          make_pipeline(PolynomialFeatures(), StandardScaler(), LogisticRegression()),
          ]:
    print("{:.3f}".format(cross_val_score(p, features, labels, scoring="accuracy", cv=5).mean()), end=" ")
    print([pi[0] for pi in p.steps])
```
%% Output
0.863 ['svc']
0.947 ['standardscaler', 'svc']
0.804 ['logisticregression']
0.920 ['standardscaler', 'pca', 'logisticregression']
0.840 ['polynomialfeatures', 'svc']
0.942 ['polynomialfeatures', 'standardscaler', 'svc']
0.925 ['polynomialfeatures', 'logisticregression']
0.964 ['polynomialfeatures', 'standardscaler', 'logisticregression']
%% Cell type:markdown id: tags:
# Exercises:
- Use the beer data frame with random / redundant features from the cross-validation script and demonstrate the benefit of PCA (a possible starting point is sketched below).
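%% Cell type:markdown id: tags:
A possible starting point for this exercise (a sketch: it fabricates redundant columns in place instead of loading the extended data set from the cross-validation script, and assumes the extended feature matrix has at least four columns):
%% Cell type:code id: tags:
``` python
import numpy as np

# fabricate extra columns: one correlated with an existing feature, one pure noise
rng = np.random.RandomState(42)
extended = features.copy()
extended["redundant"] = 2 * features.iloc[:, 0] + rng.normal(scale=0.01, size=len(features))
extended["noise"] = rng.normal(size=len(features))

# compare pipelines with and without PCA on the extended feature matrix
for p in [make_pipeline(StandardScaler(), LogisticRegression()),
          make_pipeline(StandardScaler(), PCA(n_components=4), LogisticRegression())]:
    print("{:.3f}".format(cross_val_score(p, extended, labels, scoring="accuracy", cv=5).mean()),
          [pi[0] for pi in p.steps])
```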
%% Cell type:code id: tags:
``` python
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
# optimize the parameters of one single classifier first
parameters = {'kernel': ('linear', 'rbf', 'poly'),
              'C': [1, 5, 10, 15]}
svc = SVC()
search = GridSearchCV(svc, parameters, cv=5)
search.fit(features, labels)
print(search.best_score_, search.best_params_)
```
%% Output
0.9822222222222222 {'C': 5, 'kernel': 'poly'}
%% Cell type:markdown id: tags:
Now we optimize the hyperparameters of a whole pipeline.
%% Cell type:code id: tags:
``` python
p = make_pipeline(PolynomialFeatures(), StandardScaler(), LogisticRegression())
```
%% Cell type:markdown id: tags:
Grid parameters of a pipeline are addressed as `<step_name>__<parameter_name>` (with a double underscore). `make_pipeline` names each step after its lowercased class name, so e.g. `polynomialfeatures__degree` tunes the `degree` parameter of the `PolynomialFeatures` step.
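%% Cell type:markdown id: tags:
The set of valid parameter names can be inspected on the pipeline itself:
%% Cell type:code id: tags:
``` python
# list all tunable parameter names of the pipeline p defined above
print(sorted(p.get_params().keys()))
```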
%% Cell type:code id: tags:
``` python
param_grid = {'polynomialfeatures__degree': [1, 2, 3],
              'standardscaler__with_mean': [True, False],
              'standardscaler__with_std': [True, False],
              'logisticregression__C': [1, 10, 15, 20, 25],
              }
```
%% Cell type:code id: tags:
``` python
search = GridSearchCV(p, param_grid, cv=5, scoring="f1", return_train_score=False, n_jobs=5)
search.fit(features, labels)
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)
```
%% Output
Best parameter (CV score=0.983):
{'logisticregression__C': 20, 'polynomialfeatures__degree': 2, 'standardscaler__with_mean': True, 'standardscaler__with_std': False}
%% Cell type:code id: tags:
``` python
from scipy.stats import uniform, randint

# sample hyperparameters from distributions instead of fixed grids:
# randint(1, 5) draws integers 1..4; uniform(0.1, 20) draws floats from [0.1, 20.1)
param_dist = {'polynomialfeatures__degree': randint(1, 5),
              'standardscaler__with_mean': [True, False],
              'standardscaler__with_std': [True, False],
              'logisticregression__C': uniform(0.1, 20)
              }
```
%% Cell type:code id: tags:
``` python
search = RandomizedSearchCV(p, param_dist, n_jobs=5, n_iter=100)  # evaluate 100 random parameter combinations
search.fit(features, labels)
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)
```
%% Output
Best parameter (CV score=0.982):
{'logisticregression__C': 15.053760390091858, 'polynomialfeatures__degree': 3, 'standardscaler__with_mean': False, 'standardscaler__with_std': True}
@@ -39,7 +39,7 @@ TBD: prepare coding session
## Part 1: Introduction (UWE)
- What is machine learning?
- learning from examples
- working with hard-to-understand data
- automation
@@ -47,39 +47,39 @@ TBD: prepare coding session
- What are features / samples / a feature matrix?
- always numerical / categorical vectors
- examples: beer, movies, images, text-to-numerical examples
- Learning problems:
- unsupervised:
- find structure in a set of features
- beers: find groups of beer types
- supervised:
- classification: do I like this beer?
example: draw a decision tree
## Part 2a: supervised learning: classification
Intention: demonstrate one / two simple examples of classifiers, also
introduce the concept of a decision boundary
- idea of a simple linear classifier: take features, produce a real value ("Uwe's beer score"), use a threshold to decide
-> simple linear classifier (e.g. a linear SVM)
-> beer example with some weights
- show a code example with logistic regression for the beer data, show weights, plot the decision function
### Coding session:
- change the given code to use a linear SVM classifier
- use a different data set (TBD) which cannot be classified well with a linear classifier
- tell them to transform the data and run again (TBD: how exactly?)
## Part 2b: supervised learning: regression (TBD: skip this?)
@@ -130,7 +130,7 @@ Intention: accuracy is useful but has pitfalls
- how to measure accuracy?
- (TBD: skip?) regression accuracy
-
- classifier accuracy:
- confusion matrix
- accuracy
@@ -138,7 +138,7 @@ Intention: accuracy is useful but has pitfalls
e.g. diagnosing HIV
- precision / recall
- ROC?
- exercise: do cross-validation with other metrics
### Coding session
@@ -152,29 +152,7 @@ Intention: accuracy is useful but has pitfalls
# Day 2
## Part 5: classifiers overview
Intention: quick walk-through of reliable classifiers, give some background
idea if suitable, let them play with some, incl. modification of parameters.
@@ -188,7 +166,14 @@ diagram.
- Random forests
- Gradient Tree Boosting
show decision surfaces of these classifiers on 2d examples.
topics to include:
- interpretability of results (in terms of feature importance, e.g. SVM w/ high-degree poly kernel)
- some rules of thumb: don't use KNN classifiers for 10 or more dimensions (why? paper link)
- show decision surfaces for different classifiers (extend the exercise in sec 3 using hyperparams)
### Coding session
@@ -197,10 +182,26 @@ show decision surfaces of these classifiers on 2d examples.
- MNIST example
## Part 6: pipelines / parameter tuning with scikit-learn
- Scikit-learn API: recall what we have seen up to now.
- pipelines, preprocessing (scaler, PCA)
- cross validation
- parameter tuning: grid search / random search.
### Coding session
- build SVM and LinearRegression cross-validation pipelines for the previous examples
- use PCA in a pipeline for (+) to improve performance
- find optimal SVM parameters
- find the optimal number of PCA components
## Part 7: Start with neural networks (0.5 day)
## Planning
Stop here, make time estimates.
%% Cell type:markdown id: tags:
<div class="alert alert-block alert-danger">
Course layout w/ local notebook links; anything in the scope of org/general comments also goes here.
</div>
%% Cell type:markdown id: tags:
<div class="alert alert-block alert-danger"><p>
<strong>TODOs</strong>
<ol>
<li>Write a script which removes the solution proposals (cells starting with <code>#SOLUTION</code>) and creates a new notebook (a first sketch follows below).</li>
</ol>
</p></div>
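%% Cell type:markdown id: tags:
A minimal sketch for TODO 1 (an untested draft: it uses the `nbformat` library and assumes solution cells start exactly with `#SOLUTION`):
%% Cell type:code id: tags:
``` python
import nbformat

def strip_solutions(src_path, dst_path):
    """Copy a notebook, dropping code cells that start with #SOLUTION."""
    nb = nbformat.read(src_path, as_version=4)
    nb.cells = [cell for cell in nb.cells
                if not (cell.cell_type == "code"
                        and cell.source.lstrip().startswith("#SOLUTION"))]
    nbformat.write(nb, dst_path)

# example call (hypothetical file names):
# strip_solutions("02_classification.ipynb", "02_classification_students.ipynb")
```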
%% Cell type:markdown id: tags:
# Course: Introduction to Machine Learning with Python
<div class="alert alert-block alert-warning">
<p><i class="fa fa-warning"></i>&nbsp;<strong>Goal</strong></p>
<p>Quickly get your hands dirty with Machine Learning and know what you're doing.</p>
</div>
%% Cell type:markdown id: tags:
## What will you learn?
* Basic concepts of Machine Learning (ML).
* General overview of supervised learning and related methods.
* How to quickly start with ML using the `scikit-learn` Python library.
%% Cell type:markdown id: tags:
## What will you NOT learn?
* How to program with Python.
* How exactly ML methods work.
* Unsupervised learning methods.
%% Cell type:markdown id: tags:
## Course scripts
<ol>
<li><a href="01_introduction.ipynb">Introduction</a></li>
<li><a href="02_classification.ipynb">Classification</a></li>
<li><a href="03_overfitting_and_cross_validation.ipynb">Overfitting and cross-validation</a></li>
<li><a href="04_measuring_quality_of_a_classifier.ipynb">Metrics for evaluating the performance</a></li>
<li><a href="05_classifiers_overview.ipynb">An overview of classifiers</a></li>
<li><a href="06_preprocessing_pipelines_and_hyperparameter_optimization.ipynb">Preprocessing pipelines and hyperparameters optmization</a></li>
<li>...</li>
</ol>
%% Cell type:code id: tags:
``` python
```