Commit 6b70aaa7 authored by schmittu

rough draft for 05 about pipelines et al

parent 0c9e5344
%% Cell type:markdown id: tags:
# Draft
- scikit-learn API: recall what we have seen so far.
- pipelines, preprocessing (scaler, PCA)
- cross-validation on a pipeline
- parameter tuning: grid search / random search.
%% Cell type:markdown id: tags:
So far, all classifiers we used had the methods `fit` and `predict`.
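As a quick reminder, here is a minimal sketch of this common interface. The tiny data set is made up for illustration only and is not part of the course data.
%% Cell type:code id: tags:
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Hypothetical toy data: one feature, class 0 for small values,
# class 1 for large values.
X = np.array([[0.0], [0.5], [1.0], [4.0], [4.5], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

for clf in (LogisticRegression(), SVC()):
    clf.fit(X, y)               # learn from feature matrix and labels
    predicted = clf.predict(X)  # predict labels for a feature matrix
    print(type(clf).__name__, predicted)
```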
%% Cell type:markdown id: tags:
## Preprocessing
- Scaler: SVC works better when all columns of the feature matrix are in the same numerical range.
- PCA can reduce redundancy / correlations between features (see overfitting script) and thus help to avoid overfitting.
- Polynomial features: extend the feature matrix by computing products between and within feature matrix columns.
- FunctionTransformer (sklearn): apply functions like `log` to the feature matrix.

Danger: PCA and scalers learn from the data they are fitted on. Make sure that test / validation data sets do not sneak in!
DON'T DO: scale the full data set first and then run cross-validation etc. Fit the preprocessing on the training part of each split only.
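
To make the polynomial features and the function transformer more concrete, here is a small sketch on a made-up one-row feature matrix. The choice of `np.log1p` as the applied function is an assumption for illustration; any function that works on arrays could be used.
%% Cell type:code id: tags:
```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer, PolynomialFeatures

# Hypothetical feature matrix with one sample and two columns a=2, b=3.
X = np.array([[2.0, 3.0]])

# degree=2 yields the columns 1, a, b, a^2, a*b, b^2
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
print(X_poly)  # [[1. 2. 3. 4. 6. 9.]]

# Apply log(1 + x) to every entry of the feature matrix.
log_transformer = FunctionTransformer(np.log1p)
X_log = log_transformer.fit_transform(X)
print(X_log)
```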
%% Cell type:markdown id: tags:
Preprocessors in `sklearn` all have `fit`, `transform` and `fit_transform` methods.
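
A minimal sketch of how these methods are meant to be used so that test data does not leak into the preprocessing: `fit` (or `fit_transform`) sees the training data only, and the test data is passed through `transform` alone. The synthetic data and the split sizes are assumptions for illustration.
%% Cell type:code id: tags:
```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features, for illustration only.
rng = np.random.RandomState(0)
X = rng.normal(loc=10.0, scale=3.0, size=(100, 2))

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
# fit_transform is a shortcut for fit(X_train) followed by transform(X_train)
X_train_scaled = scaler.fit_transform(X_train)
# Only transform here: the scaler reuses mean and std learned on X_train.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0))  # close to 0 for every column
print(X_test_scaled.mean(axis=0))   # in general not exactly 0
```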
%% Cell type:markdown id: tags:
## Pipelines
Pipelines combine different preprocessing steps into one "classifier".
A pipeline is a list of preprocessors followed by a classifier.
Thus, for a pipeline of length $n$, the first $n - 1$ objects have `fit` and `transform` methods, and the last element has `fit` and `predict` methods.
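
The following sketch illustrates this structure: after fitting, calling `predict` on the pipeline should give the same result as applying `transform` of the first $n - 1$ steps one after another and then `predict` of the last step. The synthetic data from `make_classification` and the choice of `n_components=3` are assumptions for illustration.
%% Cell type:code id: tags:
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic classification data, for illustration only.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

pipeline = make_pipeline(StandardScaler(), PCA(n_components=3), SVC())
pipeline.fit(X, y)

# Manual equivalent of pipeline.predict(X):
# transform with the first n - 1 steps, predict with the last one.
X_current = X
for name, transformer in pipeline.steps[:-1]:
    X_current = transformer.transform(X_current)
final_name, classifier = pipeline.steps[-1]
manual_prediction = classifier.predict(X_current)

print(len(pipeline.steps))
print(np.array_equal(manual_prediction, pipeline.predict(X)))
```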
%% Cell type:code id: tags:
``` python
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.decomposition import PCA
```
%% Cell type:code id: tags:
``` python
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
```
%% Cell type:markdown id: tags:
https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html#sklearn.pipeline.make_pipeline
%% Cell type:code id: tags:
``` python
from sklearn.pipeline import make_pipeline
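# cross_val_score lives in sklearn.model_selection
# (the old sklearn.cross_validation module was removed from scikit-learn)
from sklearn.model_selection import cross_val_score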
```
%% Cell type:code id: tags:
``` python
import pandas as pd
beer_data = pd.read_csv("beers.csv")
features = beer_data.iloc[:, :-1]
labels = beer_data.iloc[:, -1]
```
%% Cell type:code id: tags:
``` python
print(cross_val_score(SVC(), features, labels, scoring="accuracy", cv=5).mean())
# here we see the benefit of scaling for SVC
p = make_pipeline(StandardScaler(), SVC())
print(cross_val_score(p, features, labels, scoring="accuracy", cv=5).mean())
print()
print("pipeline")
print()
# make_pipeline names every step after its lowercased class name
for name, step in p.steps:
    print("{:20s} {}".format(name, step))
print()
# this is how we can set parameters of a single step in the pipeline:
# <step name>__<parameter name>
p.set_params(svc__C=3)
print(cross_val_score(p, features, labels, scoring="accuracy", cv=5).mean())
```
%% Output
0.8625911286780852
0.94655248133509
pipeline
standardscaler StandardScaler(copy=True, with_mean=True, with_std=True)
svc SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
0.9286736934563022
%% Cell type:code id: tags:
``` python
for p in [
    make_pipeline(SVC()),
    make_pipeline(StandardScaler(), SVC()),
    make_pipeline(LogisticRegression()),
    make_pipeline(StandardScaler(), PCA(), LogisticRegression()),
    make_pipeline(PolynomialFeatures(), SVC()),
    make_pipeline(PolynomialFeatures(), StandardScaler(), SVC()),
    make_pipeline(PolynomialFeatures(), LogisticRegression()),
    make_pipeline(PolynomialFeatures(), StandardScaler(), LogisticRegression()),
]:
    # mean accuracy over 5 folds, followed by the names of the pipeline steps
    score = cross_val_score(p, features, labels, scoring="accuracy", cv=5).mean()
    print("{:.3f}".format(score), end=" ")
    print([name for name, _ in p.steps])
```
%% Output
0.863 ['svc']
0.947 ['standardscaler', 'svc']
0.804 ['logisticregression']
0.920 ['standardscaler', 'pca', 'logisticregression']
0.840 ['polynomialfeatures', 'svc']
0.942 ['polynomialfeatures', 'standardscaler', 'svc']
0.925 ['polynomialfeatures', 'logisticregression']
0.964 ['polynomialfeatures', 'standardscaler', 'logisticregression']
%% Cell type:markdown id: tags:
# Exercises:
- Use the beer data frame with random / redundant features from the cross-validation script and demonstrate the benefit of PCA.
%% Cell type:code id: tags:
``` python
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
# optimize the parameters of one single classifier
parameters = {
    'kernel': ('linear', 'rbf', 'poly'),
    'C': [1, 5, 10, 15],
}
svc = SVC()
search = GridSearchCV(svc, parameters, cv=5)
search.fit(features, labels)
print(search.best_score_, search.best_params_)
```
%% Output
0.9822222222222222 {'C': 5, 'kernel': 'poly'}
%% Cell type:markdown id: tags:
Now we optimize the parameters of a whole pipeline:
%% Cell type:code id: tags:
``` python
p = make_pipeline(PolynomialFeatures(), StandardScaler(), LogisticRegression())
```
%% Cell type:markdown id: tags:
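The keys of `param_grid` have the form `<step name>__<parameter name>` (two underscores). `make_pipeline` names each step after its lowercased class name, so `polynomialfeatures__degree` is the `degree` parameter of the `PolynomialFeatures` step. Each key maps to the list of values to try, and the grid search evaluates every combination.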
%% Cell type:code id: tags:
``` python
param_grid = {
    'polynomialfeatures__degree': [1, 2, 3],
    'standardscaler__with_mean': [True, False],
    'standardscaler__with_std': [True, False],
    'logisticregression__C': [1, 10, 15, 20, 25],
}
```
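%% Cell type:markdown id: tags:
A small sketch of how one might check such a grid before starting the search: `get_params` lists all parameter names the pipeline accepts, and `ParameterGrid` enumerates the combinations a grid search would try. The pipeline and grid are repeated here so that the cell runs on its own.
%% Cell type:code id: tags:
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

p = make_pipeline(PolynomialFeatures(), StandardScaler(), LogisticRegression())

# Step names are the lowercased class names.
step_names = [name for name, step in p.steps]
print(step_names)

param_grid = {
    'polynomialfeatures__degree': [1, 2, 3],
    'standardscaler__with_mean': [True, False],
    'standardscaler__with_std': [True, False],
    'logisticregression__C': [1, 10, 15, 20, 25],
}

# Every key of the grid must be a valid parameter name of the pipeline.
unknown_keys = [key for key in param_grid if key not in p.get_params()]
print(unknown_keys)

# Number of parameter combinations: 3 * 2 * 2 * 5
n_combinations = len(ParameterGrid(param_grid))
print(n_combinations)
```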
%% Cell type:code id: tags:
``` python
search = GridSearchCV(p, param_grid, cv=5, scoring="f1", return_train_score=False, n_jobs=5)
search.fit(features, labels)
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)
```
%% Output
Best parameter (CV score=0.983):
{'logisticregression__C': 20, 'polynomialfeatures__degree': 2, 'standardscaler__with_mean': True, 'standardscaler__with_std': False}
%% Cell type:code id: tags:
``` python
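from scipy.stats import uniform, randint

# For a random search, parameters can be given as lists or as distributions.
param_dist = {
    # randint(1, 5) draws integers 1, 2, 3, 4 (upper bound is excluded)
    'polynomialfeatures__degree': randint(1, 5),
    'standardscaler__with_mean': [True, False],
    'standardscaler__with_std': [True, False],
    # uniform(loc, scale) draws from the interval [loc, loc + scale]
    'logisticregression__C': uniform(0.1, 20),
}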
```
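%% Cell type:markdown id: tags:
To see which values such distributions produce, one can draw samples from them directly. This is only a sketch; the sample size and the `random_state` are arbitrary choices for illustration.
%% Cell type:code id: tags:
```python
from scipy.stats import randint, uniform

degree_dist = randint(1, 5)  # integers from 1 to 4, upper bound excluded
c_dist = uniform(0.1, 20)    # loc=0.1, scale=20: values in [0.1, 20.1]

degrees = degree_dist.rvs(size=1000, random_state=0)
c_values = c_dist.rvs(size=1000, random_state=0)

print(degrees.min(), degrees.max())
print(c_values.min(), c_values.max())
```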
%% Cell type:code id: tags:
``` python
search = RandomizedSearchCV(p, param_dist, n_jobs=5, n_iter=100)
search.fit(features, labels)
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)
```
%% Output
Best parameter (CV score=0.982):
{'logisticregression__C': 15.053760390091858, 'polynomialfeatures__degree': 3, 'standardscaler__with_mean': False, 'standardscaler__with_std': True}