# Target audience
- Researchers from DBIOL, BSSE and DGESS with no machine learning experience yet.
- Basic Python knowledge.
- Almost no math knowledge.
# Concepts
- two-day workshop: 1.5 days of workshop material + 0.5 day working on own data / prepared data.
- explain fundamental concepts first; discuss exceptions, corner cases and pitfalls later.
- plotting / pandas? / numpy first. Otherwise participants might fight with these basics during coding sessions and be distracted from the actual learning goal of an exercise.
- jupyter notebooks / conda, extra notebooks with solutions.
- use prepared computers in computer room, setting up personal computer during last day if required.
- exercises: prepared code with empty holes to fill in
## Part 1: Introduction (UWE)
- What is machine learning?
    - learning from examples
    - working with hard-to-understand data
    - automation
- What are features / samples / the feature matrix? (see the sketch at the end of this list)
    - always numerical / categorical vectors
    - examples: converting beer, movies, images, text to numerical features
- Learning problems:
    - find structure in a set of features
        - beers: find groups of beer types
    - classification: do I like this beer?
        - example: draw a decision tree
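
A minimal sketch of what such a feature matrix could look like in code; the beer features, values and labels below are made up for illustration:

```python
# Hypothetical beer data: each row is a sample, each column a numerical
# feature; the label says whether Uwe likes the beer (1) or not (0).
import pandas as pd

features = pd.DataFrame({
    "alcohol_content": [4.8, 5.2, 7.0, 4.5],
    "bitterness":      [20,  45,  80,  15],
    "darkness":        [0.3, 0.7, 0.9, 0.2],
})
labels = pd.Series([1, 0, 0, 1], name="likes_beer")

print(features.shape)   # (4 samples, 3 features)
```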
## Part 2a: supervised learning: classification
Intention: demonstrate one or two simple examples of classifiers, and introduce the concept of a decision boundary.
- idea of a simple linear classifier: take the features, produce a real value ("Uwe's beer score"), use a threshold to decide
    -> simple linear classifier (e.g. a linear SVM)
    -> beer example with some weights
- show a code example with logistic regression for the beer data, show the weights, plot the decision function (see the sketch below)
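
A minimal sketch of such a code example, with made-up two-dimensional beer data (feature names, values and parameters are placeholders):

```python
# Fit a logistic regression on two made-up beer features, print the
# learned weights and plot the resulting linear decision boundary.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = np.vstack([rng.normal([5, 20], 2.0, size=(50, 2)),   # beers Uwe likes
               rng.normal([8, 60], 2.0, size=(50, 2))])  # beers Uwe dislikes
y = np.array([1] * 50 + [0] * 50)

clf = LogisticRegression().fit(X, y)
print("weights:", clf.coef_, "intercept:", clf.intercept_)

# decision boundary: points where coef_ . x + intercept_ == 0
xs = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
ys = -(clf.coef_[0, 0] * xs + clf.intercept_[0]) / clf.coef_[0, 1]
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.plot(xs, ys, "k--")
plt.xlabel("alcohol content")
plt.ylabel("bitterness")
plt.show()
```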
### Coding session:
- change given code to use a linear SVM classifier
- use a different data set (TBD) which cannot be classified well with a linear classifier
- ask participants to transform the data and run again (TBD: how exactly?)
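
A possible sketch for this exercise, assuming a circular toy data set and squaring the features as the transformation (both are placeholders for whatever data set is chosen):

```python
# A linear SVM fails on circular data; appending the squared features
# makes the classes linearly separable again.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import LinearSVC

X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=0)

print("raw features:        ", LinearSVC().fit(X, y).score(X, y))

X_ext = np.hstack([X, X ** 2])   # transform: add x1^2 and x2^2 as new columns
print("transformed features:", LinearSVC().fit(X_ext, y).score(X_ext, y))
```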
## Part 2b: supervised learning: regression (TBD: skip this?)
Intention: demonstrate one or two simple examples of regression
- regression: how would I rate this movie?
    - example: use a weighted sum, also an example of a linear regressor
    - example: fit a quadratic function
- learn a regressor for movie scores.
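
A minimal regression sketch for the "fit a quadratic function" example (the data is generated on the fly; the movie example would need its own features):

```python
# Fit a quadratic function to noisy points: plain linear regression
# on polynomial features of degree 2.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 2 * x.ravel() ** 2 - x.ravel() + 1 + rng.normal(0, 1.0, 50)

model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)
print("R^2 on training data:", model.score(x, y))
print("learned coefficients:", model.named_steps["linearregression"].coef_)
```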
## Part 3: underfitting/overfitting
needs: simple accuracy measure.
classifiers / regressors have parameters / degrees of freedom.
- underfitting:
    - linear classifier for points on a quadratic function
- overfitting:
    - features have actual noise, or not enough information
        - not enough information: orchid example in 2D; elevate to 3D using another feature.
    - polynomial of degree 5 to fit points on a line + noise
    - points in a circle: draw a very exact boundary line
- how to check for underfitting / overfitting?
    - cross-validation
### Coding session:
- how to do cross-validation with scikit-learn (see the sketch after this list)
- use a different beer feature set with a redundant feature (+)
- run cross-validation on the classifier
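
A minimal sketch of cross-validation with scikit-learn, assuming the beer feature matrix `X` and labels `y` are already loaded:

```python
# 5-fold cross-validation: each fold is held out once for testing
# while the classifier is trained on the remaining folds.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print("per-fold accuracy:", scores)
print("mean accuracy:    ", scores.mean())
```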
## Part 4: accuracy, F1, ROC, ...
Intention: accuracy is useful but has pitfalls
- how to measure accuracy?
- (TBD: skip?) regression accuracy
- classifier accuracy:
    - confusion matrix
    - accuracy
    - pitfalls for unbalanced data sets, e.g. diagnosing HIV
    - precision / recall
    - ROC?
- exercise: do cross-validation with other metrics (see the sketch below)
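
A sketch of the metrics above and of cross-validation with a non-default metric, again assuming `X` and `y` are already loaded:

```python
# Confusion matrix, precision and recall on a held-out test set, plus
# cross-validation with F1 instead of plain accuracy.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import cross_val_score, train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)
predicted = clf.predict(X_test)

print(confusion_matrix(y_test, predicted))
print("precision:", precision_score(y_test, predicted))
print("recall:   ", recall_score(y_test, predicted))

# cross-validation with another scoring metric:
print("F1 per fold:", cross_val_score(LogisticRegression(), X, y, cv=5, scoring="f1"))
```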
### Coding session
- evaluate the accuracy of the linear beer classifier from the previous section
- determine precision / recall
- fool them: give them another dataset where the classifier fails.
# Day 2
## Part 5: classifiers overview
Intention: quick walk through reliable classifiers, give some background idea where suitable, let them play with some, incl. modification of parameters.
To consider: decision graph from sklearn, come up with easy to understand
- Nearest neighbours
- SVM classifier (SVC)
    - demo of the Radial Basis Function (RBF) kernel trick: how different parameters influence the decision boundary
- Decision trees? Or only as part of random forests?
- Random forests (ensemble method - averaging)
- Gradient Tree Boosting (ensemble method - boosting)
- Naive Bayes for text classification
- mention only, for big data:
    - Stochastic Gradient Descent classifier
    - kernel approximation transformation (explicitly approximates the kernel trick)
        - compare SVC incl. RBF vs. Random Kitchen Sinks (RBFSampler) + linear SVC (https://scikit-learn.org/stable/auto_examples/plot_kernel_approximation.html#sphx-glr-auto-examples-plot-kernel-approximation-py), see the sketch below
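
A sketch of that comparison on a toy data set (data and parameter values are placeholders; the linked scikit-learn example does this more thoroughly):

```python
# Exact RBF-kernel SVC vs. approximate kernel trick: RBFSampler
# ("random kitchen sinks") followed by a linear SVC.
from sklearn.datasets import make_circles
from sklearn.kernel_approximation import RBFSampler
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC, LinearSVC

X, y = make_circles(n_samples=500, noise=0.1, factor=0.5, random_state=0)

exact = SVC(kernel="rbf", gamma=2.0)
approx = make_pipeline(RBFSampler(gamma=2.0, n_components=100, random_state=0),
                       LinearSVC())

for name, clf in [("exact RBF SVC", exact), ("RBFSampler + LinearSVC", approx)]:
    print(name, cross_val_score(clf, X, y, cv=5).mean())
```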
Topics to include:
- interpretability of results (in terms of feature importance, e.g. SVM with a high-degree polynomial kernel)
- some rules of thumb: don't use k-NN classifiers for 10 or more dimensions (why? paper link)
- show decision surfaces for different classifiers (extend the exercise in section 3 using hyperparameters)
### Coding session
- apply SVM, Random Forests and Gradient Boosting to the previous examples (see the sketch after this list)
- apply clustering to the previous examples
- MNIST example
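
A sketch for the first bullet, assuming a feature matrix `X` and labels `y` from one of the previous examples:

```python
# Apply several classifiers to the same data and compare their
# cross-validated accuracies.
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

classifiers = {
    "k-NN":              KNeighborsClassifier(n_neighbors=5),
    "SVC (RBF kernel)":  SVC(kernel="rbf"),
    "Random Forest":     RandomForestClassifier(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}
for name, clf in classifiers.items():
    print(name, cross_val_score(clf, X, y, cv=5).mean())
```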
## Part 6: pipelines / parameter tuning with scikit-learn
- Scikit-learn API: recall what we have seen up to now.
- pipelines, preprocessing (scaler, PCA)
- cross validation
- parameter tuning: grid search / random search.
### Coding session
- build SVM and LinearRegression cross-validation pipelines for the previous examples
- use PCA in the pipeline for (+) to improve performance
- find optimal SVM parameters
- find the optimal number of PCA components (see the sketch after this list)
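
A sketch of such a pipeline with a grid search over the number of PCA components and the SVM hyperparameters (`X`, `y` and the parameter ranges are placeholders):

```python
# Scaler + PCA + SVM pipeline; GridSearchCV tunes the number of PCA
# components and the SVM parameters via cross-validation.
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("svc", SVC()),
])
param_grid = {
    "pca__n_components": [2, 3, 5],
    "svc__C": [0.1, 1, 10],
    "svc__gamma": [0.01, 0.1, 1],
}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)
print("best parameters:", search.best_params_)
print("best CV score:  ", search.best_score_)
```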
## Part 7: Start with neural networks (0.5 day)
## Planning
Stop here, make time estimates.
- visualize features: pairwise scatter plots, t-SNE
- PCA to understand the data
- check the balance of the data set; what to do if it is unbalanced?
- start with a baseline classifier / regressor
- augment data to introduce variance
- overview, history
- perceptron
- multi-layer networks
- multi-layer demo with the Google online tool
- where neural networks work well
- Keras demo
### Coding Session
- reuse a prepared Keras network and play with it.
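
A minimal Keras sketch that could serve as the network participants reuse and modify (data, layer sizes and training settings are placeholders):

```python
# Tiny fully-connected network for a binary classification task.
import numpy as np
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 4))                # placeholder feature matrix
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # placeholder binary labels

model = Sequential([
    Dense(16, activation="relu", input_shape=(4,)),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=20, batch_size=32, verbose=0)
print(model.evaluate(X, y, verbose=0))       # [loss, accuracy] on the training data
```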