
Target audience

  • Researchers from DBIOL, BSSE and DGESS with no machine learning experience yet.
  • Basic Python knowledge assumed.
  • Almost no math knowledge required.

Concepts

  • two-day workshop: 1.5 days of teaching + 0.5 day working on own data / prepared data.
  • smooth learning curve
  • explain fundamental concepts first, discuss exceptions, corner cases and pitfalls later.
  • plotting / pandas? / numpy first. Otherwise participants might struggle with these basics during coding sessions and be distracted from the actual learning goal of an exercise.
  • Jupyter notebooks / conda, extra notebooks with solutions.
  • use prepared computers in the computer room, set up personal computers during the last day if required.
  • exercises: code skeletons with gaps to fill in.

TBD:

Course structure

Part 0: Preparation (UWE)

  • quick basics of matplotlib, numpy, pandas?

TBD: prepare installation instructions.

TBD: prepare coding session

Part 1: Introduction (UWE)

  • What is machine learning?

    • learning from examples
    • working with hard-to-understand data
    • automation
  • What are features / samples / a feature matrix?

    • always numerical / categorical vectors
    • examples: converting beers, movies, images, text into numerical features (see the sketch after this list)
  • Learning problems:

    • unsupervised:

      • find structure in a set of features
      • beers: find groups of beer types
    • supervised:

      • classification: do I like this beer? example: draw a decision tree
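
A minimal sketch of the feature-matrix idea for the beer example; all column names and values are invented for illustration:

```python
# A feature matrix: each row is a sample (one beer), each column a
# numerical feature. Names and values are made up for illustration.
import pandas as pd

beers = pd.DataFrame({
    "alcohol_content": [4.8, 5.2, 7.0, 4.5],  # percent
    "bitterness":      [25, 40, 60, 20],       # hypothetical bitterness scores
    "darkness":        [10, 30, 80, 8],        # hypothetical color scores
    "is_liked":        [1, 1, 0, 1],           # label: 1 = I like this beer
})

X = beers[["alcohol_content", "bitterness", "darkness"]]  # feature matrix
y = beers["is_liked"]                                      # target vector
print(X.shape, y.shape)
```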

Part 2a: supervised learning: classification

Intention: demonstrate one or two simple examples of classifiers, and introduce the concept of a decision boundary

  • idea of a simple linear classifier: take features, produce a real value ("Uwe's beer score"), use a threshold to decide -> simple linear classifier (e.g. a linear SVM) -> beer example with some weights

  • show a code example with logistic regression for the beer data, show the weights, plot the decision function (see the sketch below)
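
A minimal sketch of what this demo could look like; synthetic 2d data stands in here for the actual beer feature set:

```python
# Logistic regression as a linear classifier: fit, inspect the learned
# weights, and plot the resulting decision line.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),   # "beers I like"
               rng.normal(3, 1, (50, 2))])  # "beers I don't like"
y = np.array([0] * 50 + [1] * 50)

clf = LogisticRegression().fit(X, y)
print("weights:", clf.coef_[0], "intercept:", clf.intercept_[0])

# the decision boundary is the line w0*x0 + w1*x1 + b = 0
w, b = clf.coef_[0], clf.intercept_[0]
xs = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.plot(xs, -(w[0] * xs + b) / w[1])
plt.show()
```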

Coding session:

  • change the given code to use a linear SVM classifier

  • use a different data set (TBD) which cannot be classified well with a linear classifier

  • tell them to transform the data and run again (TBD: how exactly?)

Part 2b: supervised learning: regression (TBD: skip this?)

Intention: demonstrate one or two simple examples of regression

  • regression: how would I rate this movie? example: use a weighted sum; also an example for a linear regressor: fit a quadratic function

  • learn a regressor for movie scores (see the sketch below).
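
One way the regression examples could look; a sketch with synthetic data in place of the movie scores:

```python
# Linear regression vs. a quadratic fit: a straight line underfits data
# that actually follows a quadratic function.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
x = np.linspace(-3, 3, 30).reshape(-1, 1)
y = 0.5 * x.ravel() ** 2 + rng.normal(0, 0.3, 30)  # noisy quadratic

linear = LinearRegression().fit(x, y)
quadratic = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(x, y)

print("linear fit R^2:   ", linear.score(x, y))
print("quadratic fit R^2:", quadratic.score(x, y))
```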

Part 3: underfitting/overfitting

needs: simple accuracy measure.

classifiers / regressors have parameters / degrees of freedom.

  • underfitting:

    • linear classifier for points on a quadratic function
  • overfitting:

    • features carry actual noise, or not enough information: orchid example in 2d, elevate to 3d using another feature.
    • polynomial of degree 5 fitted to points on a line + noise
    • points in a circle: draw a very exact boundary line
  • how to check for underfitting / overfitting? (see the sketch after this list)

    • measure accuracy or another metric on a test dataset
    • cross validation
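
A possible sketch for this check: cross validation reveals that a degree-5 polynomial fitted to noisy points on a line scores worse on held-out folds than a plain line (synthetic data):

```python
# Cross validation exposes overfitting: the degree-5 polynomial matches
# the training points more closely but generalizes worse.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
x = np.linspace(0, 1, 30).reshape(-1, 1)
y = 2 * x.ravel() + rng.normal(0, 0.1, 30)  # points on a line + noise

for degree in (1, 5):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, x, y, cv=5)  # R^2 on each held-out fold
    print(f"degree {degree}: mean cross-validation score {scores.mean():.3f}")
```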

Coding session:

  • how to do cross validation with scikit-learn
  • use a different beer feature set with a redundant feature (+)
  • run cross validation on the classifier
  • (?) run cross validation on the movie regression problem

Part 4: accuracy, F1, ROC, ...

Intention: accuracy is useful but has pitfalls

  • how to measure accuracy?

    • (TBD: skip?) regression accuracy
    • classifier accuracy:
      • confusion matrix
      • accuracy
      • pitfalls for unbalanced data sets, e.g. diagnosing HIV
      • precision / recall
      • ROC?
  • exercise: do cross validation with other metrics (see the sketch after this list)
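
A minimal sketch of the unbalanced-data pitfall: a classifier that always predicts "negative" reaches 95% accuracy but has zero recall (labels are made up for illustration):

```python
# Accuracy looks great on unbalanced data even when the classifier is useless.
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

y_true = [0] * 95 + [1] * 5  # 5% positives, e.g. a rare diagnosis
y_pred = [0] * 100           # classifier that always says "negative"

print("accuracy: ", accuracy_score(y_true, y_pred))  # 0.95, looks great
print("precision:", precision_score(y_true, y_pred, zero_division=0))
print("recall:   ", recall_score(y_true, y_pred))    # 0.0, finds no positives
print(confusion_matrix(y_true, y_pred))
```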

Coding session

  • evaluate the accuracy of the linear beer classifier from the previous section

  • determine precision / recall

  • fool them: give them another dataset on which the classifier fails.

Day 2

Part 5: pipelines / parameter tuning with scikit-learn

  • scikit-learn API: recall what we have seen up to now.
  • pipelines, preprocessing (scaler, PCA)
  • cross validation
  • parameter tuning: grid search / random search (see the sketch below).
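
A minimal sketch of such a pipeline with grid search; iris stands in for the workshop data and the parameter grid is only an example:

```python
# A scaler + PCA + SVM pipeline whose parameters are tuned by grid search.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("svm", SVC()),
])

param_grid = {
    "pca__n_components": [2, 3, 4],
    "svm__C": [0.1, 1, 10],
    "svm__gamma": [0.01, 0.1, 1],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```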

Coding session

  • build SVM and Random Forest cross-validation pipelines for the previous examples
  • use PCA in the pipeline for (+) to improve performance
  • find optimal SVM parameters
  • find the optimal number of PCA components


Planning: stop here, make time estimates.

Part 6: classifiers overview

Intention: quick walk through reliable classifiers, give some background intuition where suitable, let them play with some, incl. modification of parameters.

to consider: the estimator-selection flowchart from the scikit-learn docs; come up with an easy-to-understand diagram.

  • Nearest neighbours
  • SVMs
    • demo for the RBF kernel: influence of different parameters on the decision line
  • Random forests
  • Gradient Tree Boosting

show decision surfaces of these classifiers on 2d examples (see the sketch below).
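
A sketch of how the decision surfaces could be plotted, using the two-moons toy data; classifier parameters are arbitrary examples:

```python
# Decision surfaces of four classifiers on a 2d toy dataset.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_moons(noise=0.3, random_state=0)
xx, yy = np.meshgrid(np.linspace(-2, 3, 200), np.linspace(-2, 2, 200))

classifiers = {
    "nearest neighbours": KNeighborsClassifier(),
    "RBF SVM": SVC(gamma=2),
    "random forest": RandomForestClassifier(random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}

fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, (name, clf) in zip(axes, classifiers.items()):
    clf.fit(X, y)
    # predict on a dense grid and shade the predicted class regions
    zz = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contourf(xx, yy, zz, alpha=0.3)
    ax.scatter(X[:, 0], X[:, 1], c=y, edgecolors="k")
    ax.set_title(name)
plt.show()
```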

Coding session

  • apply SVM, Random Forests, Gradient Tree Boosting to the previous examples
  • apply clustering to previous examples
  • MNIST example

Part 7: Start with neural networks (0.5 day)

Part 8: Best practices

  • visualize features: pairwise scatter plots, t-SNE (see the sketch after this list)
  • PCA to understand the data
  • check the balance of the data set; what to do if it is unbalanced?
  • start with a baseline classifier / regressor
  • augment data to introduce variance
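
A minimal sketch of the first three practices, with iris standing in for the workshop data; sklearn.manifold.TSNE would plug in the same way as PCA for a t-SNE view:

```python
# Look at the data before modelling: pairwise scatter, a 2d PCA
# projection, and a class-balance check.
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris(as_frame=True)
df = iris.frame

scatter_matrix(df[iris.feature_names], figsize=(8, 8))  # pairwise scatter

proj = PCA(n_components=2).fit_transform(df[iris.feature_names])
plt.figure()
plt.scatter(proj[:, 0], proj[:, 1], c=df["target"])
plt.show()

print(df["target"].value_counts())  # check the balance of the classes
```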

Part 9: neural networks

  • overview, history
  • perceptron
  • multi-layer networks
  • multi-layer demo with the Google online tool
  • where neural networks work well
  • keras demo (see the sketch below)
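
One way the keras demo could look: a small dense network on MNIST (layer sizes and epoch count are arbitrary choices):

```python
# A small multi-layer network on MNIST with keras.
from tensorflow import keras

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784) / 255.0  # flatten images, scale to [0, 1]
x_test = x_test.reshape(-1, 784) / 255.0

model = keras.Sequential([
    keras.Input(shape=(784,)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=3, validation_split=0.1)
print(model.evaluate(x_test, y_test))
```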

Coding Session

  • keras: reuse the given network and play with it.