diff --git a/03_overfitting_and_cross_validation.ipynb b/03_overfitting_and_cross_validation.ipynb index 290636a960d180eac4c128bf9ccffb190f82bd5c..dedacd89ffcc5bd97e6f5883f160a8ee30e8618b 100644 --- a/03_overfitting_and_cross_validation.ipynb +++ b/03_overfitting_and_cross_validation.ipynb @@ -612,8 +612,13 @@ "\n", "\n", "## 2. How can we do better?\n", - "\n", - "\n", + "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is no classifier that works out of the box in all situations. Depending on the \"geometry\" / \"shape\" of the data, classification algorithms and their settings can make a big difference.\n", "\n", "In our previous 2D examples we were able to visualize the data and classification results; this is not possible for higher-dimensional data.\n", @@ -621,8 +626,54 @@ "The general way to handle this situation is as follows: \n", "\n", "- split our data into a learning data set and a test data set\n", + "\n", + "\n", "- train the classifier on the learning data set\n", - "- assess performance of the classifier on the test data set." + "\n", + "\n", + "- assess performance of the classifier on the test data set.\n", + "\n", + "\n", + "### Cross-validation\n", + "\n", + "<img src=\"https://i.imgflip.com/305azk.jpg\" title=\"made at imgflip.com\" width=40%/>\n", + "\n", + "\n", + "The procedure called *cross-validation* goes a step further: the full dataset is split into learn/test sets in various ways, and statistics of the achieved metrics are computed to assess the classifier.\n", + "\n", + "A common approach is **K-fold cross-validation**.\n", + "\n", + "K-fold cross-validation has the advantage that no part of our data is permanently left out of training. This is useful when we do not have a lot of data. \n", + "\n", + "### Example: 4-fold cross-validation\n", + "\n", + "For 4-fold cross-validation we split our data set into four equal-sized partitions P1, P2, P3 and P4.\n", + "\n", + "We:\n", + "\n", + "- hold out `P1`: train the classifier on `P2 + P3 + P4`, compute accuracy `m1` on `P1`.\n", + "\n", + "<img src=\"cross_val_0.svg?2\" />\n", + "\n", + "- hold out `P2`: train the classifier on `P1 + P3 + P4`, compute accuracy `m2` on `P2`.\n", + "\n", + "<img src=\"cross_val_1.svg?2\" />\n", + "\n", + "- hold out `P3`: train the classifier on `P1 + P2 + P4`, compute accuracy `m3` on `P3`.\n", + "\n", + "<img src=\"cross_val_2.svg?2\" />\n", + "\n", + "- hold out `P4`: train the classifier on `P1 + P2 + P3`, compute accuracy `m4` on `P4`.\n", + "\n", + "<img src=\"cross_val_3.svg?2\" />\n", + "\n", + "Finally, we compute the average of `m1` .. `m4` as the overall measure of accuracy.\n", + "\n", + "Some advice:\n", + "\n", + "- This can be done on the original data or on randomly shuffled data. It is recommended to shuffle the data first, as there might be some unknown underlying ordering in your dataset.\n", + "\n", + "- Usually one uses 3- to 10-fold cross-validation, depending on the amount of data available." ] },
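+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "In code, this could look roughly as follows. This is a minimal sketch using scikit-learn's `cross_val_score`; here `features` and `labels` are placeholders for a feature matrix and a label vector of your choice:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# a minimal 4-fold cross-validation sketch; `features` and `labels`\n",
+ "# are placeholders for your own data\n",
+ "from sklearn.linear_model import LogisticRegression\n",
+ "from sklearn.model_selection import KFold, cross_val_score\n",
+ "\n",
+ "# shuffle before splitting, in case the rows have some hidden ordering\n",
+ "four_folds = KFold(n_splits=4, shuffle=True, random_state=0)\n",
+ "\n",
+ "scores = cross_val_score(LogisticRegression(), features, labels, cv=four_folds)\n",
+ "print(scores)         # m1 .. m4\n",
+ "print(scores.mean())  # the final averaged measure\n"
+ ]
+ },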
 { diff --git a/05_classifiers_overview.ipynb b/05_classifiers_overview.ipynb index 3d6e7b71a00a6e92b0f100bc3050ca17d4620080..799d876d70069dbb8ff2d6b31f4678b295688eed 100644 --- a/05_classifiers_overview.ipynb +++ b/05_classifiers_overview.ipynb @@ -140,20 +140,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "This script gives a quick hands-on overview of **how different types of classifiers work, their advantages and their disadvantages**. This should give you an idea of a concept behind each classifier type as well as when and which classifier type to use." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "<img src=\"https://i.imgflip.com/303zjr.jpg\" title=\"made at imgflip.com\" width=50%/>" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ + "This script gives a quick hands-on overview of **how different types of classifiers work, their advantages and their disadvantages**. This should give you an idea of the concept behind each classifier type, as well as when to use which classifier type.\n", + "\n", "For the sake of visualisation we continue with 2-dimensional data examples. For different classifiers we'll be looking at their decision surfaces. Let's start with some helper functions for that:" ] }, @@ -1309,15 +1297,18 @@ "execution_count": 19, "metadata": { "tags": [ - "solution", - "TODO" + "solution" ] }, "outputs": [], "source": [ "# SOLUTION\n", "\n", - "# TODO: text for 1." + "# Again, with C=1000 we've just tried too hard to get all training points correctly classified,\n", + "# but this time it meant essentially no points within the margin. Thus, by overfitting we\n", + "# lost the linear trend in the data, which is represented by the one test data sample that\n", + "# just did not make it over the separation line (and a bit over-represented by the other quite\n", + "# badly misclassified test sample).\n" ] }, { @@ -2884,7 +2875,9 @@ "source": [ "## Coding session\n", "\n", - "Compare mean cross validation accuracy, precision and recall scores for all classifiers shown in this script using the `\"beer.csv\"` data. Try to squeeze better than default performance out of the classifiers by tuning their hyperparameters. Which ones perform best?\n", + "Compare mean cross-validation accuracy, precision, recall and F1 scores for all classifiers shown in this script using the `\"beers.csv\"` data. Try to squeeze better-than-default performance out of the classifiers by tuning their hyperparameters. Which ones perform best?\n", + "\n", + "*Hint: we already did some hyperparameter fine-tuning for some of the methods on the beers dataset.*\n", "\n", "\n", "(*) Compare timing of both learning and CV scoring parts." 
@@ -2892,18 +2885,28 @@ }, { "cell_type": "code", - "execution_count": 43, + "execution_count": 118, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import cross_val_score\n", "\n", + "from sklearn.linear_model import LogisticRegression\n", + "from sklearn.svm import LinearSVC, SVC\n", + "from sklearn.tree import DecisionTreeClassifier\n", + "from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier\n", + "\n", + "df = pd.read_csv(\"beers.csv\")\n", + "features_4d = df.iloc[:, :-1]\n", + "labelv = df.iloc[:, -1]\n", + "\n", + "# classifier = ...\n", "# cross_val_score(classifier, features_4d, labelv, scoring=\"recall\", cv=5)" ] }, { "cell_type": "code", - "execution_count": 44, + "execution_count": 119, "metadata": { "tags": [ "solution" @@ -2915,21 +2918,60 @@ ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "LogisticRegression\n", - "\t5-fold CV mean accuracy: 0.89\n", - "\t5-fold CV mean precision: 0.88\n", - "\t5-fold CV mean recall: 0.93\n" + "\t5-fold CV mean accuracy: 0.91\n", + "\t5-fold CV mean precision: 0.91\n", + "\t5-fold CV mean recall: 0.93\n", + "\t5-fold CV mean f1: 0.92\n", + "\n", + "LinearSVC\n", + "\t5-fold CV mean accuracy: 0.91\n", + "\t5-fold CV mean precision: 0.90\n", + "\t5-fold CV mean recall: 0.94\n", + "\t5-fold CV mean f1: 0.92\n", + "\n", + "SVC\n", + "\t5-fold CV mean accuracy: 0.96\n", + "\t5-fold CV mean precision: 0.94\n", + "\t5-fold CV mean recall: 1.00\n", + "\t5-fold CV mean f1: 0.97\n", + "\n", + "DecisionTreeClassifier\n", + "\t5-fold CV mean accuracy: 0.90\n", + "\t5-fold CV mean precision: 0.92\n", + "\t5-fold CV mean recall: 0.89\n", + "\t5-fold CV mean f1: 0.91\n", + "\n", + "RandomForestClassifier\n", + "\t5-fold CV mean accuracy: 0.90\n", + "\t5-fold CV mean precision: 0.89\n", + "\t5-fold CV mean recall: 0.93\n", + "\t5-fold CV mean f1: 0.91\n", + "\n", + "AdaBoostClassifier\n", + "\t5-fold CV mean accuracy: 0.92\n", + "\t5-fold CV mean precision: 0.92\n", + "\t5-fold CV mean recall: 0.95\n", + "\t5-fold CV mean f1: 0.93\n", + "\n" ] } ], "source": [ "# SOLUTION\n", - "#\n", - "# TODO: add classifiers, tune params\n", - "\n", "from sklearn.model_selection import cross_val_score\n", "\n", + "from sklearn.linear_model import LogisticRegression\n", + "from sklearn.svm import LinearSVC, SVC\n", + "from sklearn.tree import DecisionTreeClassifier\n", + "from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier\n", + "\n", "classifiers = [\n", - "    LogisticRegression(C=10) \n", + "    LogisticRegression(C=100),\n", + "    LinearSVC(C=10, max_iter=25000),\n", + "    SVC(C=30, gamma=0.1), # but, we didn't scale the features, did we?\n", + "    DecisionTreeClassifier(max_depth=7, random_state=0),\n", + "    RandomForestClassifier(max_depth=4, n_estimators=10, max_features=2, random_state=0),\n", + "    AdaBoostClassifier(n_estimators=20, random_state=0),\n", "] \n", "\n", "df = pd.read_csv(\"beers.csv\")\n", @@ -2938,9 +2980,10 @@ "\n", "for classifier in classifiers:\n", "    print(classifier.__class__.__name__)\n", - "    for scoring in [\"accuracy\", \"precision\", \"recall\"]:\n", + "    for scoring in [\"accuracy\", \"precision\", \"recall\", \"f1\"]:\n", "        scores = cross_val_score(classifier, features_4d, labelv, scoring=scoring, cv=5)\n", "        print(\"\\t5-fold CV mean {}: {:.2f}\".format(scoring, scores.mean()))\n", + "    print()\n", "    \n" ] },
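+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "About the `# but, we didn't scale the features, did we?` remark above: kernel SVMs are sensitive to feature scales, so one thing worth trying is a pipeline that standardizes the features before they reach the `SVC`. A minimal sketch, keeping the untuned hyperparameters from above:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.pipeline import make_pipeline\n",
+ "from sklearn.preprocessing import StandardScaler\n",
+ "\n",
+ "# scale each feature to zero mean / unit variance before the SVC sees it;\n",
+ "# cross_val_score re-fits the whole pipeline on each fold, so the scaler\n",
+ "# never sees the held-out fold during fitting\n",
+ "scaled_svc = make_pipeline(StandardScaler(), SVC(C=30, gamma=0.1))\n",
+ "\n",
+ "scores = cross_val_score(scaled_svc, features_4d, labelv, scoring=\"f1\", cv=5)\n",
+ "print(\"5-fold CV mean f1: {:.2f}\".format(scores.mean()))\n"
+ ]
+ },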
@@ -2953,27 +2996,97 @@ }, { "cell_type": "markdown", - "metadata": { - "tags": [ - "TODO" - ] - }, + "metadata": {}, "source": [ - "**TODO**: finish summary table\n", + "Below you will find a table with some guidelines, as well as pros and cons of different classification methods available in scikit-learn.\n", + "\n", + "<div class=\"alert alert-block alert-warning\">\n", + "    <p><i class=\"fa fa-warning\"></i> <strong>Summary table</strong></p>\n", + "\n", + "<p>\n", + "<em>Disclaimer</em>: this table is neither a single source of truth nor complete - it is intended only to provide some first considerations when starting out. At the end of the day, you have to try and pick a method that works for your problem/data.\n", + "</p>\n", + "\n", + "<table>\n", + "<thead>\n", + "<tr>\n", + "<th style=\"text-align: center;\">Classifier type</th>\n", + "<th style=\"text-align: center;\">When?</th>\n", + "<th style=\"text-align: center;\">Advantages</th>\n", + "<th style=\"text-align: center;\">Disadvantages</th>\n", + "</tr>\n", + "</thead>\n", + "<tbody>\n", + "<tr>\n", + "<td style=\"text-align: left;\">Nearest Neighbors<br><br><code>KNeighborsClassifier</code></td>\n", + "<td style=\"text-align: left;\">- numeric data<br> - when (fast) linear classifiers do not work</td>\n", + "<td style=\"text-align: left;\">- simple (not many parameters to tweak), hence, a good baseline classifier</td>\n", + "<td style=\"text-align: left;\">- known not to work well for many dimensions (20 or even fewer features)</td>\n", + "</tr>\n", + "<tr>\n", + "<td style=\"text-align: left;\">Logistic Regression<br><br><code>LogisticRegression</code></td>\n", + "<td style=\"text-align: left;\">- high-dimensional data<br> - a lot of data</td>\n", + "<td style=\"text-align: left;\">- fast, also in high dimensions<br> - weights can be interpreted</td>\n", + "<td style=\"text-align: left;\">- data has to be linearly separable (which often happens in higher dimensions)<br> - not very efficient with a large number of samples</td>\n", + "</tr>\n", + "<tr>\n", + "<td style=\"text-align: left;\">Linear SVM<br><br><code>LinearSVC</code></td>\n", + "<td style=\"text-align: left;\">same as above but might be better for text analysis (many features)</td>\n", + "<td style=\"text-align: left;\">same as above but might be better with a very large number of features</td>\n", + "<td style=\"text-align: left;\">same as above but possibly a bit better with a large number of samples</td>\n", + "</tr>\n", + "<tr>\n", + "<td style=\"text-align: left;\">Kernel SVM<br><br><code>SVC</code></td>\n", + "<td style=\"text-align: left;\">same as above but when linear SVM does not work<br>- not too many data points</td>\n", + "<td style=\"text-align: left;\">same as above but learns non-linear boundaries</td>\n", + "<td style=\"text-align: left;\">same as above but much slower and requires data scaling<br>- model is not easily interpretable</td>\n", + "</tr>\n", + "<tr>\n", + "<td style=\"text-align: left;\">Decision Tree<br><br><code>DecisionTreeClassifier</code></td>\n", + "<td style=\"text-align: left;\">- for illustration/insight<br> - with multi-class problems <br> - with categorical or mixed categorical and numerical data</td>\n", + "<td style=\"text-align: left;\">- simple to interpret<br> - good classification speed and performance</td>\n", + "<td style=\"text-align: left;\">- prone to overfitting<br> - unstable: a small change in the training data can give a very different model</td>\n", + "</tr>\n", + "<tr>\n", + "<td style=\"text-align: left;\">Ensemble Averaging<br><br><code>RandomForestClassifier</code></td>\n", + "<td style=\"text-align: left;\">- when a decision tree would be used, but better performance is needed</td>\n", + "<td style=\"text-align: left;\">- fixes decision tree issues: does not overfit easily and is stable with respect to the training data<br> - takes feature dependencies into account<br> - can estimate predictive error while learning<br> ...</td>\n",
+ "<td style=\"text-align: left;\">- harder to interpret than a single decision tree</td>\n", + "</tr>\n", + "<tr>\n", + "<td style=\"text-align: left;\">Boosting<br><br><code>AdaBoostClassifier</code> (<code>XGBClassifier</code>)</td>\n", + "<td style=\"text-align: left;\">same as above</td>\n", + "<td style=\"text-align: left;\">- works very well out-of-the-box<br>- better performance and more interpretable than random forest when using depth-1 trees</td>\n", + "<td style=\"text-align: left;\">- more prone to overfitting than random forest</td>\n", + "</tr>\n", + "<tr style=\"border-bottom:1px solid black\">\n", + "    <td colspan=\"100%\"></td>\n", + "</tr>\n", + "<tr>\n", + "<td colspan=\"100%\" style=\"text-align: center;\"><em>[not shown here]</em></td>\n", + "</tr>\n", + "<tr>\n", + "<td style=\"text-align: left;\">Naive Bayes<br><br><code>ComplementNB</code>, ...</td>\n", + "<td style=\"text-align: left;\">- with text data</td>\n", + "<td style=\"text-align: left;\">...</td>\n", + "<td style=\"text-align: left;\">...</td>\n", + "</tr>\n", + "<tr>\n", + "<td style=\"text-align: left;\">Stochastic Gradient<br><br><code>SGDClassifier</code></td>\n", + "<td style=\"text-align: left;\">- with really big data</td>\n", + "<td style=\"text-align: left;\">...</td>\n", + "<td style=\"text-align: left;\">...</td>\n", + "</tr>\n", + "<tr>\n", + "<td style=\"text-align: left;\">Kernel Approximation<br><br>pipeline: <code>RBFSampler</code> or <code>Nystroem</code> + <code>LinearSVC</code></td>\n", + "<td style=\"text-align: left;\">- with really big data and on-line training</td>\n", + "<td style=\"text-align: left;\">...</td>\n", + "<td style=\"text-align: left;\">...</td>\n", + "</tr>\n", + "</tbody>\n", + "</table>\n", "\n", - "| Classifier type | When? | Advantages | Disadvantages |\n", - "| ----------------|-------|------------|---------------|\n", - "| Nearest Neighbors<br><br>`KNeighborsClassifier` | - numeric data<br> - when (fast) linear classifiers do not work | - simple (not many parameters to tweak), hence, a good baseline classifier | - known not to work well for many dimensions (20 or even less features) |\n", - "| Logistic Regression<br><br>`LogisticRegression` | - high-dimensional data<br> - a lot of data | - fast, also in high dimensions<br> - weights can be interpreted | - data has to be linearly separable (happens often in higher dimensions) |\n", - "| Linear SVM<br><br>... | ... same above? | ... same as above? | ... same as above? |\n", - "| Kernel SVM<br><br>... | - same as above but when linear SVM does not work<br>- not too many data points | ... | (all linear SVM ones)<br> - requires data scaling<br> - model is not easily interpretable <br>... |\n", - "| Decision Tree<br><br>`DecisionTreeClassifier` | - for illustration/insight<br> - with multi-class problems <br> - with categorical or mixed categorical and numerical data | - simple to interpret<br> - good classification speed and performance<br> - supports class weights (TODO: which other methods do?) | - prone to overfitting<br> - instable: small change in the training data can give very different tree<br> - requires dataset balancing to avoid bias |\n", - "| Averaging<br>Random Forests<br><br> ... | - fixes some decision tree issues: does not overfit easily and is stable with respect to training data,<br> ... 
| - takes into account features dependencies<br> - ?allows for assesing features importance?<br> ... | - harder to interpret than a decision tree ... |\n", - "| Boosting<br>...<br><br>...| ... | - works very well out-of-the-box | ... |\n", - "| Naive Bayes<br><br>... | - with text data | ... | ... |\n", - "| Stochastic Gradient<br><br>... | - with big data | ... | ... |\n", - "| Kernel Approximation<br><br>... | - with big data | ... | ... |\n", - "|" + "</div>" ] }, { @@ -2989,55 +3102,6 @@ "</table>" ] }, - { - "cell_type": "markdown", - "metadata": { - "tags": [ - "TODO" - ] - }, - "source": [ - "## (*) Further reading\n", - "\n", - "### Text classification: Naive Bayes\n", - "\n", - "**TODO**\n", - "\n", - "### Dealing with large datasets: linear classifiers to the rescue\n", - "\n", - "**TODO**\n", - "\n", - "#### Stochastic Gradient Descent training with linear classifiers\n", - "\n", - "**TODO**\n", - "\n", - "* Loss function of classifier weights: linear SVM or logisitic regression\n", - "* SGD: update weights based on gradient descent of the loss function but computed only from a small random subset of samples\n", - "\n", - "<table>\n", - " <tr><td><img src=\"stochastic-vs-batch-gradient-descent.png\" width=600px></td></tr>\n", - " <tr><td><center><sub>Source: <a href=\"https://wikidocs.net/3413\">https://wikidocs.net/3413</a></sub></center></td></tr>\n", - "</table>\n", - "\n", - "\n", - "\n", - "#### Explicit approximate kernel transformation with linear classifiers\n", - "\n", - "**TODO**\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "tags": [ - "TODO" - ] - }, - "source": [ - "## (*) Coding session\n", - "RBF SVC vs. approx. RBF + linear SVC" - ] - }, { "cell_type": "markdown", "metadata": {},