diff --git a/course_layout.md b/course_layout.md
index e9b7624cd2c1023d96186a7f81fdf4c4d9ae8671..3419c95f855d3d57d0fbedaf41963aee6e754ef1 100644
--- a/course_layout.md
+++ b/course_layout.md
@@ -2,49 +2,72 @@
 # Targeted audience
 
-- Researchers from DBIOL, BSSE and DGESS having no machine learning experience yet.
+- Researchers from DBIOL and DGESS with no machine learning experience yet.
 - Basic Python knowledge.
 - Almost no math knowledge.
 
 # Concepts
 
-- two days workshop, 1.5 days workshop + .5 day working on own data / prepared data.
+- 3-day workshop: 2 days of lectures with exercises + 0.5 day real-life example
+  walk-through + 0.5 day working on own data / prepared data.
 - smooth learning curve
-- explain fundamental concepts first, discuss exceptions, corner cases,
-  pitfalls late.
+- explain fundamental concepts first, discuss exceptions, corner cases, pitfalls later.
-- plotting / pandas? / numpy first. Else participants might be fight with these
-  basics during coding sessions and will be disctracted from the actual
-  learning goal of an exercise.
+- plotting / pandas / numpy first. Otherwise participants might fight with these basics
+  during coding sessions and be distracted from the actual learning goal of an
+  exercise.
 - jupyter notebooks / conda, extra notebooks with solutions.
-- use prepared computers in computer room, setting up personal computer during last day if required.
+- use prepared computers in the computer room; set up personal computers during the
+  last day if required.
 - exercises: empty holes to fill
 
-TBD:
+# Course structure
+
+## Home prep
+
+Introductions to NumPy, Pandas and Matplotlib (plus Python, if needed).
 
-# Course structure
+Prep materials to send out:
+
+* Python, ca. 6h: https://siscourses.ethz.ch/python_one_day/script.html
+* NumPy, ca. 3h: https://siscourses.ethz.ch/python-scientific/01_numpy.html
+  * WARN: a bit too advanced
+  * alt, ext: http://scipy-lectures.org/intro/numpy/index.html
+* Pandas, ca. 1.5h: https://siscourses.ethz.ch/python-scientific/02_pandas.html
+  * alt, ext: http://www.scipy-lectures.org/packages/statistics/index.html#data-representation-and-interaction
+  * cheat sheet: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
+* Matplotlib + Seaborn
+  * ext:
+    * http://scipy-lectures.org/intro/matplotlib/index.html
+    * http://scipy-lectures.org/packages/statistics/index.html#more-visualization-seaborn-for-statistical-exploration
+  * cheat sheets:
+    * https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Python_Matplotlib_Cheat_Sheet.pdf
+    * https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Python_Seaborn_Cheat_Sheet.pdf
 
-## Part 0: Preparation (UWE)
+## Day 1
 
-- quick basics matplotlib, numpy, pandas?:
+Intro and high-level overview of classifiers, including quality assessment, pipelines
+and hyperparameter optimization.
 
-TBD: installation instructions preparation.
+Total time: 6h (8 x uni hour (uh))
 
-TBD: prepare coding session
+### Part 0: Preparation
+
+Time: 15 min (1/3 uh)
+
+- organizational announcements
+- installation/machines preparation
+
+### Part 1: General introduction
 
-## Part 1: Introduction (UWE)
+Time: 75 min (5/3 uh)
 
-- What is machine learning ?
+- What is machine learning?
   - learning from examples
   - working with hard to understand data.
-  - automatation
+  - automation
+
+- What are features / samples / feature matrix?
-- What are features / samples / feature matrix ?
   - always numerical / categorical vectors
   - examples: beer, movies, images, text to numerical examples
@@ -57,46 +80,37 @@ TBD: prepare coding session
   - supervised:
-    - classification: do I like this beer ?
-      example: draw decision tree
-
-
-
-## Part 2a: supervised learning: classification
-
-  Intention: demonstrate one / two simple examples of classifiers, also
-  introduce the concept of decision boundary
+    - classification: do I like this beer?
+      example: draw decision tree or surface
-  - idea of simple linear classifier: take features, produce real value ("uwes beer score"), use threshold to decide
-    -> simple linear classifier (linear SVM e.g.)
-    -> beer example with some weights
+### Part 2: Supervised learning: concepts of classification
-  - show code example with logistic regression for beer data, show weights, plot decision function
+Time: 60 min (4/3 uh)
-### Coding session:
+Intention: demonstrate one / two simple examples of classifiers, also introduce the
+concept of decision boundary
-  - change given code to use a linear SVM classifier
+- idea of simple linear classifier: take features, produce a real value ("beer score"),
+  use a threshold to decide
+  - simple linear classifier (e.g. a linear SVM)
+  - beer example with some weights
-  - use different data (TBD) set which can not be classified well with a linear classifier
-  - tell to transform data and run again (TBD: how exactly ?)
+- show code example with logistic regression for beer data, show weights, plot decision
+  surface
+
+#### Coding session:
-## Part 2b: supervised learning: regression (TBD: skip this ?)
+- change given code to use a linear SVM classifier
+- use a different data set which cannot be classified well with a linear classifier
+- tell to transform data and run again
-  Intention: demonstrate one / two simple examples of regression
+### Part 3: Overfitting and cross-validation
-  - regression: how would I rate this movie ?
-    example: use weighted sum, also example for linear regresor
-    example: fit a quadratic function
+Time: 60 min (4/3 uh)
-  - learn regressor for movie scores.
+Needs: simple accuracy measure.
-
-## Part 3: underfitting/overfitting
-
-needs: simple accuracy measure.
-
-classifiers / regressors have parameters / degrees of freedom.
+Classifiers (regressors) have parameters / degrees of freedom.
 
 - underfitting:
@@ -109,122 +123,132 @@ classifiers / regressors have parameters / degrees of freedom.
 - overfitting:
   - polynomial of degree 5 to fit points on a line + noise
   - points in a circle: draw very exact boundary line
 
-- how to check underfitting / overfitting ?
+- how to check underfitting / overfitting?
   - measure accuracy or other metric on test dataset
   - cross validation
 
-### Coding session:
+#### Coding session:
 
 - How to do cross validation with scikit-learn
 - use different beer feature set with redundant feature (+)
 - run cross-validation on classifier
-- ? run crossvalidation on movie regression problem
 
-## Part 4: accuracy, F1, ROC, ...
+### Part 4: accuracy, F1, ROC, ...
 
-Intention: accuracy is usefull but has pitfalls
+Time: 60 min (4/3 uh)
 
-- how to measure accuracy ?
+Intention: pitfalls of simple accuracy
+
+- how to measure accuracy?
-  - (TDB: skip ?) regression accuracy
-
-  - classifier accuracy:
-    - confusion matrix
-    - accurarcy
-    - pitfalls for unbalanced data sets~
+  - confusion matrix metrics
+  - pitfalls for unbalanced data sets, e.g. diagnosing HIV
   - precision / recall
-  - ROC ?
+  - mention ROC?
-- exercise: do cross val with other metrics
+- exercise (pen and paper): determine precision / recall
 
-### Coding session
+#### Coding session
 
-- evaluate accuracy of linear beer classifier from latest section
+- do cross val with multiple metrics:
+  evaluate the linear beer classifier from the previous section
+- fool them: give them another dataset where the classifier fails.
-- determine precision / recall
+### Part 5: Pipelines and hyperparameter tuning w/ extended exercise
-- fool them: give them other dataset where classifier fails.
+Time: 1.5h (2 uh)
+
+- Scikit-learn API: recall what we have seen up to now.
+- preprocessing (scaler, PCA, function/column transformers)
+- cross validation
+- parameter tuning: grid search / random search.
+
+#### Coding session
+
+- build SVM and LinearRegression crossval pipelines for previous examples
+- use PCA in pipeline for (+) to improve performance
+- find optimal SVM parameters
+- find optimal pca components number
+- **extended**: full process for best pipeline/model selection incl. preprocessing steps
+  selection, hyperparameter tuning w/ cross-validation
+
+A code sketch for this session is included under Misc at the end of this document.
 
-# Day 2
+## Day 2
 
-## Part 5: classifiers overview
+Total time: 6h (8 x uni hour (uh))
+
+### Part 6 a+b: classifiers overview (nearest neighbours & regression-based + tree-based & ensembles)
 
 Intention: quick walk through reliable classifiers, give some background idea if
 suitable, let them play with some, incl. modification of parameters.
 
-To consider: decision graph from sklearn, come up with easy to understand
-diagram.
+Summary: decision graph (mind-map) from scikit-learn, and come up with an easy to
+understand summary table.
+
+#### Part 6a
+
+Time: 1h (4/3 uh)
 
 - Nearest neighbours
-- SVM classifier (SVC)
+- Logistic regression
+- Linear + kernel SVM classifier (SVC)
   - demo for Radial Basis Function (RBF) kernel trick: different parameters influence
    on decision line
-- ?Decision trees or only in random forests?
-- Random forests (ensemble method - averaging)
-- Gradient Tree Boosting (ensemble method - boosting)
-- Naive Bayes for text classification
-- mentions
-  big data:
-  - Stochastic Gradient Descent classifier,
-  - kernel approximation transformation (explicitly approx. kernel trick)
-  - compare SVC incl. RBF vs. Random Kitchen Sinks (RBFSampler) + linear SVC (https://scikit-learn.org/stable/auto_examples/plot_kernel_approximation.html#sphx-glr-auto-examples-plot-kernel-approximation-py)
-
-Topics to include:
-
-- interoperability of results (in terms features importance, e.g. SVN w/ hig deg poly
-  kernel)
-
-- some rules of thumbs: don't use KNN classifiers for 10 or more dimensions (why? paper
-  link)
-- show decision surfaces for diff classifiers (extend exercise in sec 3 using
-  hyperparams)
 
-### Coding session
+#### Part 6b
 
-- apply SVM, Random Forests, Gradient boosting to previous examples
-- apply clustering to previous examples
-- MNIST example
+Time: 1h (4/3 uh)
+
+- Decision trees
+- Averaging: Random forests
+- Boosting: AdaBoost, and mention Gradient Tree Boosting (hist; xgboost)
+- mentions
+  - text classification: Naive Bayes
+  - big data:
+    - Stochastic Gradient Descent classifier,
+    - kernel approximation transformation (explicitly approx. kernel trick)
+    - opt, compare SVC incl. RBF vs. Random Kitchen Sinks (RBFSampler) + linear SVC
+      (https://scikit-learn.org/stable/auto_examples/plot_kernel_approximation.html#sphx-glr-auto-examples-plot-kernel-approximation-py)
+- summary/overview
 
-## Part 6: pipelines / parameter tuning with scikit-learn
-
-- Scikit-learn API: recall what we have seen up to now.
-- pipelines, preprocessing (scaler, PCA)
-- cross validation
-- parameter tuning: grid search / random search.
-
-### Coding session
-
-- build SVM and LinearRegression crossval pipelines for previous examples
-- use PCA in pipeline for (+) to improve performance
-- find optimal SVM parameters
-- find optimal pca components number
+#### Topics to include
+
+- interpretability of results (in terms of feature importance, e.g. SVM w/ high deg. poly.
+  kernel)
+- some rules of thumb: don't use kNN classifiers for 10 or more dimensions (why? paper
+  link)
+- show decision surfaces for diff classifiers (extend exercise in sec 3 using
+  hyperparams)
 
-## Part 7: Start with neural networks. .5 day
+#### Coding session
+
-## Planning
+- apply SVM, Random Forests, boosting to specific examples
+- MNIST example
 
-Stop here, make time estimates.
+### Part 7: Supervised learning: regression
+
+Time: 1h (4/3 uh)
+
+Intention: demonstrate one / two simple examples of regression
+
+- regression: how would I rate this movie?
+  example: use a weighted sum, also an example for a linear regressor
+  example: fit a quadratic function
+- learn a regressor for movie scores / salmon weight.
+
+### Part 8: Supervised learning: neural networks
 
-## Part 8: Best practices
+Time: 3h (4 uh)
 
-- visualize features: pairwise scatter, tSNE
-- PCA to undertand data
-- check balance of data set, what if not ?
-- start with baseline classifier / regressor
-- augment data to introduce variance
+Intention: Introduction to neural networks and deep learning with `keras`
 
-## Part 9: neural networks
+- include real-life tumor example (maybe in day 3 walk-through)
 
 - overview, history
 - perceptron
@@ -233,11 +257,28 @@ Stop here, make time estimates.
 - where neural networks work well
 - keras demo
 
-### Coding Session
+#### Coding Session
 
 - keras reuse network and play with it.
 
+## Day 3
+
+Total time: 6h (8 uh)
+
+1. Hands-on walk-through of a real-life example.
+2. Assisted programming session where participants can start to work on their own
+   machine learning application. Assist with setting up their own machines. Offer some
+   example data sets from https://www.kaggle.com/datasets
+
+## Misc
+
+### Best practices
+
+Rather include/repeat these in the relevant workshop parts/examples:
+
+- visualize features: pairwise scatter, UMAP/tSNE
+- PCA to simplify/understand data
+- check balance of the data set; what if it is not balanced?
+- start with a baseline classifier/regressor
+- augment data to introduce variance
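+
+### Code sketch: Part 5 coding session (pipeline + grid search)
+
+A minimal sketch of the pipeline / hyperparameter-tuning flow planned for Part 5. It
+uses a synthetic stand-in dataset from `make_classification` instead of the course's
+beer data, and the parameter grid values are placeholders to adapt to the actual
+exercise:
+
+```python
+from sklearn.datasets import make_classification
+from sklearn.decomposition import PCA
+from sklearn.model_selection import GridSearchCV
+from sklearn.pipeline import Pipeline
+from sklearn.preprocessing import StandardScaler
+from sklearn.svm import SVC
+
+# synthetic stand-in for the beer feature matrix used in the course examples
+X, y = make_classification(n_samples=200, n_features=8, random_state=0)
+
+# scaler -> PCA -> SVM, chained so that cross-validation refits every step per fold
+pipeline = Pipeline([
+    ("scale", StandardScaler()),
+    ("pca", PCA()),
+    ("clf", SVC()),
+])
+
+# placeholder grid over PCA dimensionality and SVM hyperparameters
+param_grid = {
+    "pca__n_components": [2, 4, 6],
+    "clf__C": [0.1, 1, 10],
+    "clf__gamma": ["scale", 0.1, 1.0],
+}
+
+search = GridSearchCV(pipeline, param_grid, cv=5)
+search.fit(X, y)
+print(search.best_params_)
+print(search.best_score_)
+```
+
+The **extended** exercise (best pipeline/model selection incl. preprocessing steps) can
+reuse the same `GridSearchCV` call with a larger grid and alternative pipeline steps.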