- Targeted audience
- Concepts
- Course structure
- Home prep
- Day 1
- Part 0: Preparation
- Part 1: General introduction
- Part 2: Supervised learning: concepts of classification
- Coding session:
- Part 3: Overfitting and cross-validation
- Coding session:
- Part 4: Accuracy, F1, ROC, ...
- Coding session
- Part 5: Pipelines and hyperparameter tuning w/ extended exercise
- Coding session
- Day 2
- Part 6 a+b: classifiers overview (NNs & regression-based + tree-based & ensembles)
- Part 6a
- Part 6b
- Topics to include
- Coding session
- Part 7: Supervised learning: regression
- Part 8: Supervised learning: neural networks
- Coding Session
- Day 3
- Misc
- Best practices
Targeted audience
- Researchers from DBIOL and DGESS with no prior machine learning experience.
- Basic Python knowledge.
- Almost no math knowledge.
Concepts
- 3-day workshop: 2 days of lectures with exercises + 0.5 day real-life example walk-through + 0.5 day working on own / prepared data.
- smooth learning curve:
  - explain fundamental concepts first; discuss exceptions, corner cases and pitfalls later.
  - plotting / pandas / numpy first, otherwise participants might fight with these basics during coding sessions and get distracted from the actual learning goal of an exercise.
- Jupyter notebooks / conda, extra notebooks with solutions.
- use prepared computers in the computer room; set up personal computers during the last day if required.
- exercises: code skeletons with blanks to fill in.
Course structure
Home prep
Introductions to NumPy, Pandas and Matplotlib (plus Python, if needed).
Prep materials to send out:
- Python, ca. 6h: https://siscourses.ethz.ch/python_one_day/script.html
- NumPy, ca. 3h: https://siscourses.ethz.ch/python-scientific/01_numpy.html
- WARN: a bit too advanced
- alt, ext: http://scipy-lectures.org/intro/numpy/index.html
- Pandas, ca. 1.5h: https://siscourses.ethz.ch/python-scientific/02_pandas.html
- Matplotlib + Seaborn
- ext:
- cheat sheets:
Day 1
Intro and high-level overview of classifiers, including quality assessment, pipelines and hyperparameter optimization.
Total time: 6h (8 x uni hour (uh))
Part 0: Preparation
Time: 15 min (1/3 uh)
- organizational announcements
- installation/machines preparation
Part 1: General introduction
Time: 75 min (5/3 uh)
- What is machine learning?
  - learning from examples
  - working with hard-to-understand data
  - automation
- What are features / samples / a feature matrix?
  - always numerical / categorical vectors
  - examples: turning beer, movies, images or text into numerical features
- Learning problems:
  - unsupervised:
    - find structure in a set of features
    - beers: find groups of beer types
  - supervised:
    - classification: do I like this beer? example: draw a decision tree or decision surface
Part 2: Supervised learning: concepts of classification
Time: 60 min (4/3 uh)
Intention: demonstrate one or two simple example classifiers; also introduce the concept of a decision boundary
- idea of a simple linear classifier: take features, produce a real value ("beer score"), use a threshold to decide
  - simple linear classifier (e.g. a linear SVM)
  - beer example with some weights
- show a code example with logistic regression for the beer data, show the weights, plot the decision surface (a sketch follows below)
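A possible minimal sketch of this demo, with a synthetic two-feature dataset standing in for the beer data (feature names and parameters are illustrative only):

```python
# Sketch: logistic regression on a toy two-feature dataset, inspect the
# learned weights and plot the decision surface (synthetic data stands in
# for the actual beer dataset used in the course).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# two features, think "bitterness" and "alcohol content"
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)

clf = LogisticRegression().fit(X, y)
print("weights:", clf.coef_, "intercept:", clf.intercept_)

# evaluate the classifier on a grid to draw the decision surface
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
zz = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, zz, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor="k")
plt.xlabel("feature 1")
plt.ylabel("feature 2")
plt.show()
```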
Coding session:
- change the given code to use a linear SVM classifier
- use a different data set which cannot be classified well with a linear classifier
- ask participants to transform the data and run again (see the sketch below)
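One way the solution could look, sketched with make_circles as the hard-to-separate dataset and a hand-picked squared-radius feature as the transform (both are assumptions, not the course material):

```python
# Sketch: a linear SVM fails on circular data, but a simple feature
# transform (adding the squared radius) makes the classes linearly separable.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import LinearSVC

X, y = make_circles(n_samples=200, noise=0.1, factor=0.5, random_state=0)

# linear SVM on the raw 2D features: poor training accuracy
print("raw features:", LinearSVC().fit(X, y).score(X, y))

# add x^2 + y^2 as a third feature: now a plane separates the classes
X_ext = np.hstack([X, (X ** 2).sum(axis=1, keepdims=True)])
print("with squared radius:", LinearSVC().fit(X_ext, y).score(X_ext, y))
```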
Part 3: Overfitting and cross-validation
Time: 60 min (4/3 uh)
Needs: simple accuracy measure.
Classifiers (regressors) have parameters / degrees of freedom.
- underfitting:
  - a linear classifier for points on a quadratic function
- overfitting:
  - features have actual noise, or not enough information: orchid example in 2D; elevate to 3D using another feature
  - a polynomial of degree 5 fitted to points on a line + noise
  - points in a circle: draw a very exact boundary line
- how to check for underfitting / overfitting?
  - measure accuracy or another metric on a test dataset (see the polynomial-fit sketch below)
  - cross-validation
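A minimal sketch of the overfitting demo above, assuming noisy points on a straight line and comparing a degree-1 and a degree-5 polynomial fit via train/test error:

```python
# Sketch: fit a degree-5 polynomial to noisy points on a straight line and
# compare train vs. test error with a plain linear fit (degree 1).
import numpy as np

rng = np.random.RandomState(0)
x_train = np.linspace(0, 1, 10)
x_test = rng.uniform(0, 1, 100)
line = lambda x: 2 * x + 1
y_train = line(x_train) + rng.normal(scale=0.2, size=x_train.shape)
y_test = line(x_test) + rng.normal(scale=0.2, size=x_test.shape)

for degree in (1, 5):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # the degree-5 fit typically chases the noise: lower train error, higher test error
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```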
Coding session:
- how to do cross-validation with scikit-learn (see the sketch below)
- use a different beer feature set with a redundant feature (+)
- run cross-validation on the classifier
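A minimal cross-validation sketch with scikit-learn; the iris dataset stands in for the beer feature set:

```python
# Sketch: 5-fold cross-validation of a linear classifier with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

scores = cross_val_score(clf, X, y, cv=5)  # accuracy per fold
print("fold accuracies:", scores)
print("mean +/- std: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```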
Part 4: Accuracy, F1, ROC, ...
Time: 60 min (4/3 uh)
Intention: pitfalls of simple accuracy
- how to measure accuracy?
  - classifier accuracy:
    - confusion matrix metrics
    - pitfalls for unbalanced data sets, e.g. diagnosing HIV
    - precision / recall
    - mention ROC?
- exercise (pen and paper): determine precision / recall (a scikit-learn check follows below)
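A scikit-learn check for the pen-and-paper exercise, on a small made-up imbalanced label vector (illustrative only):

```python
# Sketch: confusion-matrix-based metrics on an imbalanced toy example,
# showing why plain accuracy can be misleading.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]   # 1 = "positive" (rare class)
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]   # classifier misses one positive

print(confusion_matrix(y_true, y_pred))
print("accuracy :", accuracy_score(y_true, y_pred))    # looks great: 0.9
print("precision:", precision_score(y_true, y_pred))   # 1.0
print("recall   :", recall_score(y_true, y_pred))      # only 0.5
print("F1       :", f1_score(y_true, y_pred))
```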
Coding session
- do cross-validation with multiple metrics: evaluate the linear beer classifier from the previous section (see the sketch below)
- fool them: give them another dataset where the classifier fails.
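A sketch of how the multi-metric cross-validation could look, with a synthetic imbalanced dataset standing in for the course data:

```python
# Sketch: cross-validation with several metrics at once via cross_validate.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# imbalanced binary problem: ~90% negatives, ~10% positives
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

results = cross_validate(LogisticRegression(max_iter=1000), X, y, cv=5,
                         scoring=["accuracy", "precision", "recall", "f1"])
for metric in ("accuracy", "precision", "recall", "f1"):
    print(metric, results["test_" + metric].mean())
```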
Part 5: Pipelines and hyperparameter tuning w/ extended exercise
Time: 1.5h (2 uh)
- Scikit-Learn API: recall what we have seen up to now.
- preprocessing (scaler, PCA, function/column transformers)
- cross validation
- parameter tuning: grid search / random search (a pipeline + grid-search sketch follows below).
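A sketch combining these pieces; the preprocessing steps and parameter grids are illustrative assumptions, not the course solution:

```python
# Sketch: a scikit-learn pipeline (scaler -> PCA -> SVC) tuned with grid
# search over its hyperparameters, on synthetic stand-in data.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

pipeline = make_pipeline(StandardScaler(), PCA(), SVC())
param_grid = {
    "pca__n_components": [2, 5, 8],
    "svc__C": [0.1, 1, 10],
    "svc__gamma": [0.01, 0.1, 1],
}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)
print("best params:", search.best_params_)
print("best CV accuracy:", search.best_score_)
```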
Coding session
- build SVM and LinearRegression cross-validation pipelines for the previous examples
- use PCA in the pipeline for (+) to improve performance
- find optimal SVM parameters
- find the optimal number of PCA components
- extended: full process for best pipeline/model selection, incl. selection of preprocessing steps and hyperparameter tuning w/ cross-validation
Day 2
Total time: 6h (8 x uni hour (uh))
Part 6 a+b: classifiers overview (NNs & regression-based + tree-based & ensembles)
Intention: quick walk-through of reliable classifiers, give some background intuition where suitable, let participants play with some of them, incl. modifying parameters.
Summary: the decision graph (mind map) from scikit-learn, and come up with an easy-to-understand summary table.
Part 6a
Time: 1h (4/3 uh)
- Nearest neighbours
- Logistic regression
- Linear + kernel SVM classifier (SVC)
- demo for the Radial Basis Function (RBF) kernel trick: influence of different parameters on the decision line (see the sketch below)
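A sketch of the RBF demo, using make_moons as a stand-in dataset and a few illustrative gamma values:

```python
# Sketch: how gamma changes the decision line of an RBF-kernel SVC on data
# that is not linearly separable.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
xx, yy = np.meshgrid(np.linspace(-2, 3, 300), np.linspace(-1.5, 2, 300))

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, gamma in zip(axes, (0.1, 1, 100)):
    clf = SVC(kernel="rbf", gamma=gamma).fit(X, y)
    zz = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contourf(xx, yy, zz, alpha=0.3)
    ax.scatter(X[:, 0], X[:, 1], c=y, edgecolor="k", s=15)
    ax.set_title(f"gamma={gamma}")  # small gamma: smooth; large gamma: overfit
plt.show()
```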
Part 6b
Time: 1h (4/3 uh)
- Decision trees
- Averaging: Random forests
- Boosting: AdaBoost; mention Gradient Tree Boosting (hist; XGBoost)
- mentions:
  - text classification: Naive Bayes
  - big data:
    - Stochastic Gradient Descent classifier
    - kernel approximation transformation (explicitly approximates the kernel trick)
    - opt: compare SVC incl. RBF vs. Random Kitchen Sinks (RBFSampler) + linear SVC (a sketch follows below; https://scikit-learn.org/stable/auto_examples/plot_kernel_approximation.html#sphx-glr-auto-examples-plot-kernel-approximation-py)
- summary/overview
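A sketch of the optional comparison, following the linked scikit-learn example but trimmed down (gamma and n_components are illustrative choices):

```python
# Sketch: exact RBF SVC vs. RBFSampler ("random kitchen sinks") followed by
# a linear SVC, on the digits dataset.
from sklearn.datasets import load_digits
from sklearn.kernel_approximation import RBFSampler
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC, LinearSVC

X, y = load_digits(return_X_y=True)
X = X / 16.0  # scale pixel values to [0, 1]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

exact = SVC(kernel="rbf", gamma=0.2).fit(X_train, y_train)
approx = make_pipeline(RBFSampler(gamma=0.2, n_components=500, random_state=0),
                       LinearSVC()).fit(X_train, y_train)

print("exact RBF SVC          :", exact.score(X_test, y_test))
print("RBFSampler + linear SVC:", approx.score(X_test, y_test))
```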
Topics to include
- interpretability of results (in terms of feature importance, e.g. SVM w/ a high-degree polynomial kernel)
- some rules of thumb: don't use kNN classifiers for 10 or more dimensions (why? paper link)
- show decision surfaces for different classifiers (extend the exercise in sec. 3 using hyperparameters)
Coding session
- apply SVM, random forests and boosting to specific examples
- MNIST example (a digits-based sketch follows below)
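A sketch of the comparison, using scikit-learn's small digits dataset as a stand-in for MNIST:

```python
# Sketch: compare SVM, random forest and boosting on the same example via
# cross-validation (AdaBoost with its default decision stumps is weak here,
# which itself makes a useful talking point).
from sklearn.datasets import load_digits
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

for name, clf in [("SVC (RBF)", SVC()),
                  ("random forest", RandomForestClassifier(n_estimators=100, random_state=0)),
                  ("AdaBoost", AdaBoostClassifier(random_state=0))]:
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: mean CV accuracy {scores.mean():.3f}")
```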
Part 7: Supervised learning: regression
Time: 1h (4/3 uh)
Intention: demonstrate one or two simple examples of regression
- regression: how would I rate this movie? example: use a weighted sum; also a linear regressor example: fit a quadratic function (a sketch follows below)
- learn a regressor for movie scores / salmon weight.
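A minimal regression sketch on synthetic data: a plain weighted sum (linear regression) vs. a quadratic fit via polynomial features:

```python
# Sketch: linear regression as a weighted sum of features, and fitting a
# quadratic relationship by adding polynomial features.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 0.5 * x.ravel() ** 2 - x.ravel() + 1 + rng.normal(scale=0.3, size=50)

# plain linear regression underfits the quadratic relationship
linear = LinearRegression().fit(x, y)
print("linear R^2   :", linear.score(x, y))

# linear regression on polynomial features fits it well
quadratic = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(x, y)
print("quadratic R^2:", quadratic.score(x, y))
```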
Part 8: Supervised learning: neural networks
Time: 3h (4 uh)
Intention: Introduction to neural networks and deep learning with keras
- include a real-life tumor example (maybe in the day 3 walk-through)
- overview, history
- perceptron
- multi-layer networks
- multi-layer demo with the Google online tool
- where neural networks work well
- Keras demo (a minimal sketch follows below)
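A minimal Keras sketch for the demo, assuming TensorFlow/Keras is installed and using make_moons as a toy problem:

```python
# Sketch: a small fully-connected network for a binary classification toy
# problem with Keras.
from sklearn.datasets import make_moons
from tensorflow import keras

X, y = make_moons(n_samples=1000, noise=0.2, random_state=0)

model = keras.Sequential([
    keras.Input(shape=(2,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=20, batch_size=32, validation_split=0.2, verbose=0)

loss, acc = model.evaluate(X, y, verbose=0)
print("training accuracy:", acc)
```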
Coding Session
- reuse a Keras network and play with it.
Day 3
Total time: 6h (8 uh)
- Hands-on walk-through real life example.
- Assisted programming session where participants can start to work on their own machine learning application. Assist with setting up their own machines. Offer some example data sets from https://www.kaggle.com/datasets
Misc
Best practices
Rather, include/repeat these in the relevant workshop parts/examples:
- visualize features: pairwise scatter plots, UMAP / t-SNE
- PCA to simplify/understand data
- check the balance of the data set; what to do if it is unbalanced?
- start with a baseline classifier/regressor (see the sketch below)
- augment data to introduce variance
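A sketch of two of these practices (class-balance check and a baseline classifier), with iris as a stand-in dataset:

```python
# Sketch: check class balance, then compare a trivial baseline against a
# real model; any real model should clearly beat the baseline.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# class balance: how many samples per class?
print(dict(zip(*np.unique(y, return_counts=True))))

baseline = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=5)
model = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("baseline accuracy:", baseline.mean())
print("model accuracy   :", model.mean())
```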