    # Targeted audience
    
    - Researchers from DBIOL, BSSE and DGESS having no machine learning experience yet.
    - Basic Python knowledge.
    - Almost no math knowledge.
    
    # Concepts
    
    - smooth learning curve
    
    - explain fundamental concepts first; discuss exceptions, corner cases and
      pitfalls later.
    - plotting / pandas / numpy first. Otherwise participants might fight with these
      basics during coding sessions and be distracted from the actual
      learning goal of an exercise.
    
    
    
    # Course structure
    
    ## Preparation
    
    - set up machines

    - quick basics: matplotlib, numpy, pandas
    
    
    ## Part 1: Introduction
    
    - Why machine learning?

    - What are features?
      - always numerical vectors
      - examples: beer, movies, images, text
    
    - unsupervised:
    
      - find structure in set of features
      - beers: find groups of beer types
    
    ### Coding session:
    
      - read dataframe from csv or excel sheet with beer features
      - do some feature-vs-feature scatter plots
      - use t-SNE to show clusters
      - scikit-learn example to find clusters
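
A minimal sketch of this session. Since the course's beer CSV is not included here, the sketch generates synthetic "beer" features; the column names `alcohol_content` and `bitterness` are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# three hypothetical "beer types" as Gaussian blobs in feature space
centers = np.array([[4.5, 10.0], [7.0, 30.0], [9.5, 60.0]])
X = np.vstack([c + rng.normal(scale=[0.3, 3.0], size=(50, 2)) for c in centers])
df = pd.DataFrame(X, columns=["alcohol_content", "bitterness"])

# feature-vs-feature scatter plot (commented out: opens a window)
# df.plot.scatter(x="alcohol_content", y="bitterness")

# t-SNE projects the feature vectors to 2d for visualisation
embedding = TSNE(n_components=2, random_state=0).fit_transform(df.values)

# k-means finds the three groups of beer types
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(df.values)
```

In the real session the `DataFrame` would come from `pd.read_csv` / `pd.read_excel` instead of being generated.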
    
    ## Part 2: supervised learning
    
    - supervised:
    
      - classification: do I like this beer?
        example: draw a decision tree

      - classification: points on both sides of a line, points in a circle, the xor problem
        - idea of a decision function: take features, produce a real value, use a threshold to decide
        - simple linear classifier
        - show some examples of feature engineering here to apply a linear classifier
    
      - regression: how would I rate this movie?
        example: use a weighted sum, also an example for a linear regressor
        example: fit a quadratic function
    
    ### Coding session:
    
      - show: read circle data, plot data, augment features, learn a linear classifier with scikit-learn,
        show weights and explain the classifier, plot the decision boundary,
        load the eval data set and evaluate accuracy.

      - adapt: read xor data, plot data, augment features, learn a linear classifier with scikit-learn,
        show weights and explain the classifier, plot the decision boundary,
        load the eval data set and evaluate accuracy.
    
      - learn regressor for movie scores.
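
A sketch of the "circle data" part of this session, assuming logistic regression as the linear classifier and `make_circles` as a stand-in for the course's prepared data file: a linear classifier cannot separate a circle in (x, y), but after augmenting with x² and y² it can.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_circles(n_samples=400, noise=0.05, factor=0.4, random_state=0)
# feature engineering: append the squared coordinates
X_aug = np.hstack([X, X**2])

X_train, X_eval, y_train, y_eval = train_test_split(X_aug, y, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)

# the weights on x^2 and y^2 dominate: a "line" in 4d is a circle in 2d
print(clf.coef_)
acc = clf.score(X_eval, y_eval)
```

The xor variant of the exercise works the same way, with the product x·y as the augmented feature.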
    
    
    ## Part 3: accuracy, F1, ROC, ...
    
    - how to measure accuracy?
      - regression accuracy
      - classifier accuracy:
        - confusion matrix
        - pitfalls for unbalanced data sets,
          e.g. diagnosing HIV
        - precision / recall
        - ROC?
    
    ### Coding session
    
    - evaluate accuracy of linear beer classifier
    
    - determine precision / recall
    
    - ROC curve based on threshold
    
    - provide predetermined weights, show ROC curve.
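
These evaluation steps can be sketched on a synthetic unbalanced data set; the 90/10 class split below stands in for the HIV-diagnosis pitfall, and the classifier is an illustrative logistic regression:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, roc_curve, roc_auc_score)
from sklearn.model_selection import train_test_split

# ~10% positives: plain accuracy alone would look deceptively good
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
y_pred = clf.predict(X_test)
scores = clf.predict_proba(X_test)[:, 1]

cm = confusion_matrix(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

# ROC: sweep the decision threshold over the predicted scores
fpr, tpr, thresholds = roc_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)
```

Plotting `fpr` against `tpr` gives the ROC curve from the session.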
    
    
    ## Part 4: underfitting/overfitting
    
    Classifiers / regressors have parameters / degrees of freedom.
    
    - underfitting:
    
      - linear classifier for points on a quadratic function
    
    - overfitting:
    
      - features have actual noise, or not enough information;
        not enough information: orchid example in 2d, elevated to 3d using another feature.
      - polynomial of degree 5 to fit points on a line + noise
      - points in a circle: draw a very exact boundary line
    
    - how to check for underfitting / overfitting?

      - measure accuracy
      - test data set
      - cross-validation
    
    
    ### Coding session:
    
    - how to do cross-validation with scikit-learn
    - use a different beer feature set with a redundant feature (+)
    - run cross-validation on the classifier
    - run cross-validation on the movie regression problem
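
One way the overfitting demo and cross-validation could fit together, as a sketch on synthetic data: fit points on a noisy line with a degree-1 and a degree-5 polynomial, then compare the training score against the cross-validated score.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30).reshape(-1, 1)
y = 2 * x.ravel() + 1 + rng.normal(scale=0.1, size=30)  # line + noise

results = {}
for degree in (1, 5):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    train_score = model.fit(x, y).score(x, y)      # R^2 on the training data
    cv_score = cross_val_score(model, x, y, cv=5).mean()
    results[degree] = (train_score, cv_score)
    print(degree, train_score, cv_score)
```

The degree-5 fit always scores at least as well on the training data; the cross-validated score is what reveals whether the extra flexibility generalises.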
    
    
    ## Part 5: Overview scikit-learn / algorithms
    
    - Linear regressors
    - Nearest neighbours
    - SVMs
      - demo for RBF: influence of different parameters on the decision line
    - Random forests
    - Gradient tree boosting
    - Clustering
    
    ### Coding session
    
    - apply SVM, Random Forests, Gradient boosting to previous examples
    - apply clustering to previous examples
    - MNIST example
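
A sketch of the algorithm comparison, using scikit-learn's small built-in digits data set as a stand-in for MNIST; the hyperparameters here are illustrative defaults, not tuned choices:

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

accuracies = {}
for name, clf in [
    ("SVM (RBF)", SVC()),
    ("Random forest", RandomForestClassifier(random_state=0)),
    ("Gradient boosting", GradientBoostingClassifier(n_estimators=50,
                                                     random_state=0)),
]:
    clf.fit(X_train, y_train)
    accuracies[name] = clf.score(X_test, y_test)
    print(name, accuracies[name])
```

The same loop can be pointed at the beer and circle data sets from the earlier parts.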
    
    ## Part 6: pipelines / cross-validation / parameter optimization with scikit-learn
    
    - scikit-learn API
    - pipelines, preprocessing (scaler, PCA)
    - cross-validation
    - parameter optimization
    
    ### Coding session
    
    - build SVM and random forest cross-validation pipelines for the previous examples
    - use PCA in the pipeline for (+) to improve performance
    - find optimal SVM parameters
    - find the optimal number of PCA components
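
A sketch of such a pipeline with a grid search over both the SVM parameters and the number of PCA components; the parameter values below are illustrative, not the course's choices:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# scale -> reduce dimensionality -> classify
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("svm", SVC()),
])

# step-name prefixes ("pca__", "svm__") address parameters inside the pipeline
param_grid = {
    "pca__n_components": [10, 20, 30],
    "svm__C": [1, 10],
    "svm__gamma": ["scale", 0.01],
}
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Cross-validating the whole pipeline (rather than scaling or projecting the full data first) keeps the evaluation honest: preprocessing is refit on each training fold.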
    
    ## Part 7: Best practices
    
    - visualize features: pairwise scatter plots, t-SNE
    - PCA to understand data
    - check the balance of the data set; what if it is unbalanced?
    - start with a baseline classifier / regressor
    - augment data to introduce variance
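
The "start with a baseline" practice can be sketched with scikit-learn's `DummyClassifier` on an unbalanced synthetic data set: the majority-class baseline already looks good on accuracy alone, and it is the score any real model has to beat.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# ~90% negatives: "always predict the majority class" scores ~0.9 accuracy
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)

baseline = cross_val_score(DummyClassifier(strategy="most_frequent"),
                           X, y, cv=5).mean()
model = cross_val_score(LogisticRegression(), X, y, cv=5).mean()
print(baseline, model)
```

A real model is only worth keeping if it clearly beats this baseline, ideally on precision/recall as well as accuracy.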
    
    ## Part 8: neural networks
    
    - overview, history
    - perceptron
    - multi-layer networks
    - multi-layer demo with the Google online tool
    - where neural networks work well
    - keras demo
    
    ### Coding session
    
    - reuse a keras network and play with it.
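
The demo itself uses keras; as a library-agnostic stand-in (keras is not assumed to be installed here), scikit-learn's `MLPClassifier` sketches the same multi-layer perceptron idea on the digits data set:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X = StandardScaler().fit_transform(X)  # neural nets want scaled inputs
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# two hidden layers, matching the "multi layer" part of the lecture
net = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
net.fit(X_train, y_train)
acc = net.score(X_test, y_test)
print(acc)
```

In the keras session, the equivalent would be a small `Sequential` model with two `Dense` hidden layers, which participants then modify and retrain.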