sispub / courses / machinelearning-introduction-workshop / Commits

Commit 6b70aaa7, authored 6 years ago by schmittu

rough draft for 05 about pipelines et al

parent 0c9e5344
No related branches, tags or merge requests found.

Changes: 1 changed file with 349 additions and 0 deletions

05_preprocessing_pipelines_and_hyperparameter_optimization.ipynb (new file, 0 → 100644, +349 −0)
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Draft\n",
"\n",
"- Scicit learn api: recall what we have seen up to now.\n",
"- pipelines, preprocessing (scaler, PCA)\n",
"- cross validatioon on pipeline\n",
"- parameter tuning: grid search / random search."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Up to now all classifiers had methods `fit` and `predict`"
]
},
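{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of this API, assuming a generic classifier on toy data from `make_classification` (illustrative only):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Recap of the basic estimator API on assumed toy data.\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"X, y = make_classification(n_samples=100, n_features=4, random_state=0)\n",
"\n",
"clf = LogisticRegression()\n",
"clf.fit(X, y)              # learn from the training data\n",
"print(clf.predict(X[:5]))  # predict labels for (here: already seen) samples"
]
},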
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Preprocessing\n",
"\n",
"Scaler: SVC work better when all columns of feature matrix are in the same numerical range.\n",
"\n",
"PCA can reduce redundancy / correlations (see overfitting script) and thus avoid overfitting.\n",
"\n",
"Polynomial features: extend feature matrix by computing products between and within feature matrix columns\n",
"\n",
"FunctionTransformer (sklearn): Apply functions like log\n",
"\n",
"Danger: PCA and Scaler learn on data sets. Make sure that test/validation datasets do not sneak in !\n",
"\n",
"DONT DO: scale on full dataset first, then cross validation etc."
]
},
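{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of this pitfall, assuming synthetic data from `make_classification` (not the beer dataset): inside a pipeline the scaler is re-fitted on the training part of every fold, whereas scaling up front lets it see the validation folds too."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: why scaling must happen inside cross validation (assumed synthetic data).\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.model_selection import cross_val_score\n",
"from sklearn.pipeline import make_pipeline\n",
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn.svm import SVC\n",
"\n",
"X, y = make_classification(n_samples=200, n_features=10, random_state=0)\n",
"\n",
"# WRONG: the scaler sees the full dataset, including the validation folds.\n",
"X_scaled = StandardScaler().fit_transform(X)\n",
"print(cross_val_score(SVC(), X_scaled, y, scoring=\"accuracy\", cv=5).mean())\n",
"\n",
"# RIGHT: the scaler is re-fitted on the training part of every fold.\n",
"p = make_pipeline(StandardScaler(), SVC())\n",
"print(cross_val_score(p, X, y, scoring=\"accuracy\", cv=5).mean())"
]
},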
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Preprocessors in `sklearn` all have `fit`, `transform` and `fit_transform` methods."
]
},
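{
"cell_type": "markdown",
"metadata": {},
"source": [
"A short sketch of this transformer API on an assumed toy matrix:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch of the transformer API on an assumed toy matrix.\n",
"import numpy as np\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])\n",
"\n",
"scaler = StandardScaler()\n",
"scaler.fit(X)                    # learn per-column mean and std\n",
"print(scaler.transform(X))       # apply the learned transformation\n",
"print(scaler.fit_transform(X))   # both steps at once, same result here"
]
},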
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Pipelines\n",
"\n",
"Combine different preprocessing steps into one \"classifier\".\n",
"\n",
"Pipeline is a list for preprocessors followed by a classifier.\n",
"\n",
"Thus for a pipeline of len $n$ there are $n - 1$ objects having `fit` and `transform` methods, the last element has `fit` and `predict` methods.\n",
"\n"
]
},
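{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of what `Pipeline.fit` does internally, on assumed toy data (illustrative only): each transformer's `fit_transform` output feeds the next step, and the final estimator is fitted on the result."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: manual chaining vs. the equivalent pipeline (assumed toy data).\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.pipeline import make_pipeline\n",
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn.svm import SVC\n",
"\n",
"X, y = make_classification(n_samples=100, n_features=4, random_state=0)\n",
"\n",
"# manual chain: n - 1 transformers, then the final estimator\n",
"scaler = StandardScaler()\n",
"X_t = scaler.fit_transform(X)\n",
"clf = SVC().fit(X_t, y)\n",
"\n",
"# the pipeline does the same chaining for us\n",
"p = make_pipeline(StandardScaler(), SVC()).fit(X, y)"
]
},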
{
"cell_type": "code",
"execution_count": 61,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.preprocessing import PolynomialFeatures, StandardScaler\n",
"from sklearn.decomposition import PCA\n"
]
},
{
"cell_type": "code",
"execution_count": 116,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.svm import SVC\n",
"from sklearn.linear_model import LogisticRegression"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html#sklearn.pipeline.make_pipeline"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.pipeline import make_pipeline\n",
"from sklearn.cross_validation import cross_val_score"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"beer_data = pd.read_csv(\"beers.csv\")\n",
"\n",
"features = beer_data.iloc[:, :-1]\n",
"labels = beer_data.iloc[:, -1];"
]
},
{
"cell_type": "code",
"execution_count": 149,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.8625911286780852\n",
"0.94655248133509\n",
"\n",
"pipeline\n",
"\n",
"standardscaler StandardScaler(copy=True, with_mean=True, with_std=True)\n",
"svc SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,\n",
" decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',\n",
" max_iter=-1, probability=False, random_state=None, shrinking=True,\n",
" tol=0.001, verbose=False)\n",
"\n",
"0.9286736934563022\n"
]
}
],
"source": [
"print(cross_val_score(SVC(), features, labels, \"accuracy\", cv=5).mean())\n",
"\n",
"# here we see benefit of scaling for SVC\n",
"p = make_pipeline(StandardScaler(), SVC())\n",
"\n",
"print(cross_val_score(p, features, labels, \"accuracy\", cv=5).mean())\n",
"\n",
"print()\n",
"print(\"pipeline\")\n",
"print()\n",
"\n",
"\n",
"for name, step in p.steps:\n",
" print(\"{:20s} {}\".format(name, step))\n",
" \n",
"print()\n",
"\n",
"# this is how we can set parameters of a single step in the pipeline:\n",
"p.set_params(svc__C = 3)\n",
"\n",
"print(cross_val_score(p, features, labels, \"accuracy\", cv=5).mean())\n"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.863 ['svc']\n",
"0.947 ['standardscaler', 'svc']\n",
"0.804 ['logisticregression']\n",
"0.920 ['standardscaler', 'pca', 'logisticregression']\n",
"0.840 ['polynomialfeatures', 'svc']\n",
"0.942 ['polynomialfeatures', 'standardscaler', 'svc']\n",
"0.925 ['polynomialfeatures', 'logisticregression']\n",
"0.964 ['polynomialfeatures', 'standardscaler', 'logisticregression']\n"
]
}
],
"source": [
"for p in [make_pipeline(SVC()),\n",
" make_pipeline(StandardScaler(), SVC()),\n",
" make_pipeline(LogisticRegression()),\n",
" make_pipeline(StandardScaler(), PCA(), LogisticRegression()),\n",
"\n",
" make_pipeline(PolynomialFeatures(), SVC()),\n",
" make_pipeline(PolynomialFeatures(), StandardScaler(), SVC()),\n",
" make_pipeline(PolynomialFeatures(), LogisticRegression()),\n",
" make_pipeline(PolynomialFeatures(), StandardScaler(), LogisticRegression()),\n",
" ]:\n",
" \n",
" print(\"{:.3f}\".format(cross_val_score(p, features, labels, \"accuracy\", cv=5).mean()), end=\" \")\n",
" print([pi[0] for pi in p.steps])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercises:\n",
"- use the beer data frame with random / redundant features from cross val script and demo benefit of PCA"
]
},
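{
"cell_type": "markdown",
"metadata": {},
"source": [
"A possible starting sketch for the exercise, assuming we fabricate the random / redundant columns here (adapt to the augmented data frame from the cross-validation script):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: fabricate noisy / redundant columns, then compare pipelines\n",
"# with and without PCA (assumed construction, adapt as needed).\n",
"import numpy as np\n",
"\n",
"rng = np.random.RandomState(42)\n",
"noisy = features.copy()\n",
"for i in range(7):\n",
"    noisy[\"random_%d\" % i] = rng.normal(size=len(noisy))  # pure noise\n",
"noisy[\"redundant\"] = noisy.iloc[:, 0] + noisy.iloc[:, 1]  # correlated column\n",
"\n",
"for p in [make_pipeline(StandardScaler(), LogisticRegression()),\n",
"          make_pipeline(StandardScaler(), PCA(n_components=4), LogisticRegression())]:\n",
"    score = cross_val_score(p, noisy, labels, scoring=\"accuracy\", cv=5).mean()\n",
"    print(\"{:.3f}\".format(score), [pi[0] for pi in p.steps])"
]
},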
{
"cell_type": "code",
"execution_count": 154,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.9822222222222222 {'C': 5, 'kernel': 'poly'}\n"
]
}
],
"source": [
"from sklearn.model_selection import GridSearchCV, RandomizedSearchCV\n",
"\n",
"# optimize parameters of one single classifier\n",
"\n",
"parameters = {'kernel':('linear', 'rbf', 'poly'), \n",
" 'C':[1, 5, 10, 15]\n",
" }\n",
"\n",
"svc = SVC()\n",
"search = GridSearchCV(svc, parameters, cv=5)\n",
"search.fit(features, labels)\n",
"print(search.best_score_, search.best_params_)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we optimize a pipeline"
]
},
{
"cell_type": "code",
"execution_count": 155,
"metadata": {},
"outputs": [],
"source": [
"p = make_pipeline(PolynomialFeatures(), StandardScaler(), LogisticRegression())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"TODO: explain param_grid"
]
},
{
"cell_type": "code",
"execution_count": 157,
"metadata": {},
"outputs": [],
"source": [
"param_grid = {'polynomialfeatures__degree': [1, 2, 3],\n",
" 'standardscaler__with_mean': [True, False],\n",
" 'standardscaler__with_std': [True, False],\n",
" 'logisticregression__C': [1, 10, 15, 20, 25],\n",
" }"
]
},
{
"cell_type": "code",
"execution_count": 164,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Best parameter (CV score=0.983):\n",
"{'logisticregression__C': 20, 'polynomialfeatures__degree': 2, 'standardscaler__with_mean': True, 'standardscaler__with_std': False}\n"
]
}
],
"source": [
"search = GridSearchCV(p, param_grid, cv=5, scoring=\"f1\", return_train_score=False, n_jobs=5)\n",
"search.fit(features, labels)\n",
"print(\"Best parameter (CV score=%0.3f):\" % search.best_score_)\n",
"print(search.best_params_)"
]
},
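{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to inspect all tried combinations (a sketch, assuming the fitted `search` from the previous cell):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: every tried parameter combination with its mean CV score.\n",
"results = pd.DataFrame(search.cv_results_)\n",
"print(results[[\"params\", \"mean_test_score\", \"rank_test_score\"]].head())"
]
},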
{
"cell_type": "code",
"execution_count": 161,
"metadata": {},
"outputs": [],
"source": [
"from scipy.stats import uniform, randint\n",
"\n",
"param_dist = {'polynomialfeatures__degree': randint(1, 5),\n",
" 'standardscaler__with_mean': [True, False],\n",
" 'standardscaler__with_std': [True, False],\n",
" 'logisticregression__C': uniform(0.1, 20)\n",
" }"
]
},
{
"cell_type": "code",
"execution_count": 162,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Best parameter (CV score=0.982):\n",
"{'logisticregression__C': 15.053760390091858, 'polynomialfeatures__degree': 3, 'standardscaler__with_mean': False, 'standardscaler__with_std': True}\n"
]
}
],
"source": [
"search = RandomizedSearchCV(p, param_dist, n_jobs=5, n_iter=100)\n",
"\n",
"search.fit(features, labels)\n",
"print(\"Best parameter (CV score=%0.3f):\" % search.best_score_)\n",
"print(search.best_params_)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}