Resolve "Add images data classification w/ SVM and squeeze out a lot of performance"
Merged
requested to merge 26-add-images-data-classification-w-svm-and-squeeze-out-a-lot-of-performance into master
Compare changes
+ 114
− 30
```
```
Starting from the top, the decision tree is built by selecting the **best split of the dataset using a single feature**. The best feature and its split value are the ones that make the resulting **subsets more pure** in terms of the variety of classes they contain (i.e. that minimize the misclassification error, the Gini index/impurity, or the entropy; minimizing entropy is equivalent to maximizing the information gain).
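As a rough illustration of the split criterion described above, here is a minimal sketch that scans the candidate thresholds of one feature and keeps the one with the lowest weighted Gini impurity. The helper names `gini` and `best_split` and the toy data are hypothetical, and this is not how scikit-learn implements it internally.

```python
import numpy as np


def gini(labels):
    """Gini impurity of a label array: 1 - sum_k p_k^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)


def best_split(x, y):
    """Return (threshold, weighted impurity) of the best split on one feature."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_thr, best_imp = None, np.inf
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue  # no threshold can separate identical feature values
        thr = (x[i] + x[i - 1]) / 2  # midpoint between neighbouring values
        left, right = y[:i], y[i:]
        # impurity of the two subsets, weighted by subset size
        imp = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if imp < best_imp:
            best_thr, best_imp = thr, imp
    return best_thr, best_imp


# toy data (made up for illustration): one feature, two classes
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
threshold, impurity = best_split(x, y)
print(threshold, impurity)
```

A full tree would repeat this search over every feature and then recurse into the two resulting subsets.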
```
```
```
```
```
```
```
```
The OOB (out-of-bag) error is an estimate of the generalisation/predictive error that, together with <code>warm_start=True</code>, can be used to search efficiently for a good-enough number of trees, i.e. the <code>n_estimators</code> hyperparameter value (see: <a href=https://scikit-learn.org/stable/auto_examples/ensemble/plot_ensemble_oob.html>OOB Errors for Random Forests</a>).
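The search described above could look roughly like the following sketch. The synthetic dataset and the grid of tree counts are assumptions chosen only for illustration; with `warm_start=True` each `fit` call adds trees to the existing forest instead of retraining from scratch.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic stand-in data (an assumption, not the dataset used in the notebook)
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

forest = RandomForestClassifier(
    warm_start=True,  # keep already-fitted trees between fit() calls
    oob_score=True,   # compute the out-of-bag accuracy after each fit
    random_state=0,
)

oob_errors = {}
for n_trees in range(25, 201, 25):
    forest.set_params(n_estimators=n_trees)
    forest.fit(X, y)  # only the missing trees are trained
    oob_errors[n_trees] = 1.0 - forest.oob_score_

for n_trees, err in oob_errors.items():
    print(f"n_estimators={n_trees:3d}  OOB error={err:.3f}")
```

One would then pick the smallest `n_estimators` beyond which the OOB error stops improving noticeably.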
For presentation purposes, so that the individual classifiers get explicit weights, we will use the original discrete AdaBoost learning method (`algorithm="SAMME"`). Because the classifiers learn iteratively on differently weighted samples, to understand the weights we have to look at the internal (weighted) train errors and not at the final scores on the training data.
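A minimal sketch of how the classifier weights relate to the internal train errors, assuming a synthetic binary dataset. The `algorithm` keyword is passed only if the installed scikit-learn version still accepts it, because newer releases deprecate it and use the discrete method exclusively.

```python
import inspect

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# synthetic binary problem (an assumption, for illustration only)
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=0
)

# pass algorithm="SAMME" only where the parameter still exists
params = inspect.signature(AdaBoostClassifier).parameters
extra = {"algorithm": "SAMME"} if "algorithm" in params else {}

ada = AdaBoostClassifier(n_estimators=10, random_state=0, **extra).fit(X, y)

n_fitted = len(ada.estimators_)  # boosting may stop early
errors = ada.estimator_errors_[:n_fitted]    # weighted train error per round
weights = ada.estimator_weights_[:n_fitted]  # say of each weak classifier

for err, w in zip(errors, weights):
    # for two classes and learning_rate=1: weight = log((1 - err) / err)
    print(f"weighted error={err:.3f}  weight={w:.3f}  "
          f"log-odds={np.log((1 - err) / err):.3f}")
```

The lower the weighted error of a round, the larger the weight of that round's classifier in the final vote.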
```
In particular, try out [XGBoost](https://xgboost.readthedocs.io/en/latest/); it's a package that has won many competitions, cf. [XGBoost@Kaggle](https://www.kaggle.com/dansbecker/xgboost). It is not part of scikit-learn, but it offers a `scikit-learn` API (see https://www.kaggle.com/stuarthallows/using-xgboost-with-scikit-learn ); a `scikit-learn` equivalent is [`GradientBoostingClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html).
A newer `scikit-learn` implementation of boosting based on decision trees is [`HistGradientBoostingClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingClassifier.html). It is much faster than `GradientBoostingClassifier` for big datasets (`n_samples >= 10 000`).
2. compare the mean cross-validation F1 scores; use the stratified k-fold CV strategy (`from sklearn.model_selection import cross_val_score, StratifiedKFold`; note: this is a non-binary multi-class problem, so you will have to use an adjusted F1 score, e.g. the unweighted per-class mean via the `scoring="f1_macro"` keyword arg - see [the `scoring` parameter predefined values](https://scikit-learn.org/stable/modules/model_evaluation.html#the-scoring-parameter-defining-model-evaluation-rules));
4. next, try to manually tune the hyperparameters to minimize the train-test gap and to squeeze at least a 90% cross-validation F1 score out of each classifier; try using PCA preprocessing (`sklearn.pipeline.make_pipeline`, `sklearn.decomposition.PCA`); what about data scaling? Which models are most effective and easiest to tune manually?
```
```