# Targeted audience

- Researchers from DBIOL, BSSE and DGESS with no machine learning experience yet.
- Basic Python knowledge.
- Almost no math knowledge.

# Concepts

- Two-day workshop: 1.5 days of workshop + 0.5 day working on own data / prepared data.
- Smooth learning curve: explain fundamental concepts first; discuss exceptions, corner cases and pitfalls late.
- Plotting / pandas? / numpy first. Otherwise participants might fight with these basics during the coding sessions and be distracted from the actual learning goal of an exercise.
- Jupyter notebooks / conda; extra notebooks with solutions.
- Use prepared computers in the computer room; set up personal computers during the last day if required.
- Exercises: empty holes to fill.

TBD:

# Course structure

## Part 0: Preparation (UWE)

- Quick basics: matplotlib, numpy, pandas?

TBD: installation instructions preparation.
TBD: prepare coding session.

## Part 1: Introduction (UWE)

- What is machine learning?
  - learning from examples
  - working with hard-to-understand data
  - automation
- What are features / samples / a feature matrix?
  - always numerical / categorical vectors
  - examples: turning beer, movies, images and text into numerical examples
- Learning problems:
  - unsupervised: find structure in a set of features
    - beers: find groups of beer types
  - supervised:
    - classification: do I like this beer? Example: draw a decision tree.

## Part 2a: Supervised learning: classification

Intention: demonstrate one or two simple examples of classifiers; also introduce the concept of a decision boundary.

- Idea of a simple linear classifier: take features, produce a real value ("Uwe's beer score"), use a threshold to decide
  -> simple linear classifier (e.g. a linear SVM)
  -> beer example with some weights
- Show a code example with logistic regression for the beer data, show the weights, plot the decision function.

### Coding session

- Change the given code to use a linear SVM classifier.
- Use a different dataset (TBD) which cannot be classified well with a linear classifier.
- Tell participants to transform the data and run again (TBD: how exactly?).

## Part 2b: Supervised learning: regression (TBD: skip this?)

Intention: demonstrate one or two simple examples of regression.

- Regression: how would I rate this movie?
  - example: use a weighted sum, also an example of a linear regressor
  - example: fit a quadratic function
- Learn a regressor for movie scores.

## Part 3: Underfitting / overfitting

Needs: a simple accuracy measure. Classifiers / regressors have parameters / degrees of freedom.

- Underfitting:
  - a linear classifier for points on a quadratic function
- Overfitting:
  - features have actual noise, or not enough information
    - not enough information: orchid example in 2D; elevate to 3D using another feature
  - a polynomial of degree 5 to fit points on a line + noise
  - points in a circle: draw a very exact boundary line
- How to check for underfitting / overfitting?
  - measure accuracy or another metric on a test dataset
  - cross-validation

### Coding session

- How to do cross-validation with scikit-learn.
- Use a different beer feature set with a redundant feature (+).
- Run cross-validation on the classifier.
- ? Run cross-validation on the movie regression problem.

## Part 4: Accuracy, F1, ROC, ...

Intention: accuracy is useful but has pitfalls.

- How to measure accuracy?
  - (TBD: skip?) regression accuracy
  - classifier accuracy:
    - confusion matrix
    - accuracy
    - pitfalls for unbalanced datasets, e.g. diagnosing HIV
    - precision / recall
    - ROC?
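A minimal sketch of how these metrics and cross-validation could be computed with scikit-learn; the synthetic dataset (via `make_classification`) is a hypothetical stand-in for the course's beer feature matrix:

```python
# Minimal sketch: cross-validation and classification metrics with scikit-learn.
# The synthetic dataset is a hypothetical stand-in for the beer feature matrix.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)
clf = LogisticRegression()

# Cross-validation with different scoring metrics (accuracy alone can mislead
# on unbalanced data, hence precision / recall / ROC AUC).
for scoring in ["accuracy", "precision", "recall", "roc_auc"]:
    scores = cross_val_score(clf, X, y, cv=5, scoring=scoring)
    print(f"{scoring:>10}: {scores.mean():.3f} +/- {scores.std():.3f}")

# Confusion matrix, precision and recall on a single held-out split.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
y_pred = clf.fit(X_train, y_train).predict(X_test)
print(confusion_matrix(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall:   ", recall_score(y_test, y_pred))
```

Swapping the `scoring` values is one way the "cross-validation with other metrics" exercise below could be set up.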
- Exercise: do cross-validation with other metrics.

### Coding session

- Evaluate the accuracy of the linear beer classifier from the latest section.
- Determine precision / recall.
- Fool them: give them another dataset on which the classifier fails.

# Day 2

## Part 5: Classifiers overview

Intention: quick walk through reliable classifiers, give some background idea of when each is suitable, let participants play with some of them, incl. modification of parameters.

To consider: the decision graph from sklearn; come up with an easy-to-understand diagram.

- Nearest neighbours
- SVM classifier (SVC)
  - demo for the Radial Basis Function (RBF) kernel trick: influence of different parameters on the decision line
- ? Decision trees, or only within random forests?
- Random forests (ensemble method - averaging)
- Gradient tree boosting (ensemble method - boosting)
- Naive Bayes for text classification
- Mentions - big data:
  - Stochastic Gradient Descent classifier
  - kernel approximation transformation (explicitly approximates the kernel trick)
  - compare SVC incl. RBF vs. Random Kitchen Sinks (RBFSampler) + linear SVC
    (https://scikit-learn.org/stable/auto_examples/plot_kernel_approximation.html#sphx-glr-auto-examples-plot-kernel-approximation-py)

Topics to include:

- interpretability of results (in terms of feature importance, e.g. an SVM with a high-degree polynomial kernel)
- some rules of thumb: don't use KNN classifiers for 10 or more dimensions (why? paper link)
- show decision surfaces for different classifiers (extend the exercise in sec 3 using hyperparameters)

### Coding session

- Apply SVM, random forests and gradient boosting to previous examples.
- Apply clustering to previous examples.
- MNIST example.

## Part 6: Pipelines / parameter tuning with scikit-learn

- Scikit-learn API: recall what we have seen up to now.
- Pipelines, preprocessing (scaler, PCA).
- Cross-validation.
- Parameter tuning: grid search / random search.

### Coding session

- Build SVM and LinearRegression cross-validation pipelines for previous examples.
- Use PCA in the pipeline for (+) to improve performance.
- Find optimal SVM parameters.
- Find the optimal number of PCA components.
- (A minimal pipeline / grid search sketch appears at the end of this outline.)

## Part 7: Start with neural networks (.5 day)

## Planning

Stop here, make time estimates.

## Part 8: Best practices

- Visualize features: pairwise scatter plots, t-SNE.
- PCA to understand the data.
- Check the balance of the dataset; what if it is unbalanced?
- Start with a baseline classifier / regressor.
- Augment data to introduce variance.

## Part 9: Neural networks

- Overview, history
- Perceptron
- Multi-layer networks
- Multi-layer demo with the Google online tool
- Where neural networks work well
- Keras demo

### Coding session

- Keras: reuse a network and play with it (see the sketches below).
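Referring back to the Part 6 coding session, a minimal sketch of a scaler + PCA + SVC pipeline tuned with grid search, assuming scikit-learn and a synthetic dataset as a hypothetical stand-in for the course data:

```python
# Minimal sketch: scaler + PCA + SVC pipeline tuned by grid search (Part 6).
# The synthetic dataset is a hypothetical stand-in for the course data.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           random_state=0)

pipeline = make_pipeline(StandardScaler(), PCA(), SVC())

# Parameter names follow the "<step>__<parameter>" convention of pipelines.
param_grid = {
    "pca__n_components": [2, 5, 8],
    "svc__C": [0.1, 1, 10],
    "svc__gamma": [0.01, 0.1, 1],
}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
print(search.best_score_)
```

`search.best_params_` then answers both "find optimal SVM parameters" and "find the optimal number of PCA components" from that coding session.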
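For the Part 9 coding session, a minimal sketch of a small multi-layer network in Keras, assuming TensorFlow's Keras API and a synthetic two-moons dataset as a hypothetical stand-in for the prepared network and data:

```python
# Minimal sketch: a small multi-layer perceptron in Keras (Part 9).
# Assumes TensorFlow/Keras; the two-moons data is a hypothetical stand-in
# for the prepared dataset used in the coding session.
from sklearn.datasets import make_moons
from tensorflow import keras

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)

model = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=(2,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=30, batch_size=32, validation_split=0.2, verbose=0)
print(model.evaluate(X, y, verbose=0))  # [loss, accuracy]
```

Participants could then change the number of layers or units, the activations, or the number of epochs and watch the effect on validation accuracy.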