"There is no classifier which works out of the box in all situations. Depending on the \"geometry\" / \"shape\" of the data, classification algorithms and their settings can make a big difference.\n",
"There is no classifier which works out of the box in all situations. Depending on the \"geometry\" / \"shape\" of the data, classification algorithms and their settings can make a big difference.\n",
"\n",
"\n",
"In our previous 2D examples we were able to visualize the data and classification results, this is not possible for higher dimensional data.\n",
"In our previous 2D examples we were able to visualize the data and classification results, this is not possible for higher dimensional data.\n",
...
@@ -621,8 +626,54 @@
...
@@ -621,8 +626,54 @@
"The general way to handle this situation is as follows: \n",
"The general way to handle this situation is as follows: \n",
"\n",
"\n",
"- split our data into a learning data set and a test data set\n",
"- split our data into a learning data set and a test data set\n",
"\n",
"\n",
"- train the classifier on the learning data set\n",
"- train the classifier on the learning data set\n",
"- assess performance of the classifier on the test data set."
"\n",
"\n",
"- assess performance of the classifier on the test data set.\n",
"\n",
"\n",
"### Cross-validation\n",
"\n",
"<img src=\"https://i.imgflip.com/305azk.jpg\" title=\"made at imgflip.com\" width=40%/>\n",
"\n",
"\n",
"The procedure called *cross-validation* goes a step further: In this procedure the full dataset is split into learn-/test-set in various ways and statistics of the achieved metrics is computed to assess the classifier.\n",
"\n",
"A common approach is **K-fold cross-validation**:\n",
"\n",
"K-fold cross-validation has an advantage that we do not leave out part of our data from training. This is useful when we do not have a lot of data. \n",
"\n",
"### Example: 4-fold cross validation\n",
"\n",
"For 4-fold cross validation we split our data set into four equal sized partitions P1, P2, P3 and P4.\n",
"\n",
"We:\n",
"\n",
"- hold out `P1`: train the classifier on `P2 + P3 + P4`, compute accuracy `m1` on `P1`.\n",
"\n",
"<img src=\"cross_val_0.svg?2\" />\n",
"\n",
"- hold out `P2`: train the classifier on `P1 + P3 + P4`, compute accuracy `m2` on `P2`.\n",
"\n",
"<img src=\"cross_val_1.svg?2\" />\n",
"\n",
"- hold out `P3`: train the classifier on `P1 + P2 + P4`, compute accuray `m3` on `P3`.\n",
"\n",
"<img src=\"cross_val_2.svg?2\" />\n",
"\n",
"- hold out `P4`: train the classifier on `P1 + P2 + P3`, compute accuracy `m4` on `P4`.\n",
"\n",
"<img src=\"cross_val_3.svg?2\" />\n",
"\n",
"Finally we can compute the average of `m1` .. `m4` as the final measure for accuracy.\n",
"\n",
"Some advice:\n",
"\n",
"- This can be done on the original data or on randomly shuffled data. It is recommended to shuffle the data first, as there might be some unknown underlying ordering in your dataset\n",
"\n",
"- Usually one uses 3- to 10-fold cross validation, depending on the amount of data available."
]
]
},
},
{
{
...
...
%% Cell type:code id: tags:
``` python
# IGNORE THIS CELL WHICH CUSTOMIZES LAYOUT AND STYLING OF THE NOTEBOOK !
```
%% Cell type:markdown id: tags:
# Chapter 3: Overfitting, underfitting and cross-validation
%% Cell type:markdown id: tags:
## 1. What are overfitting and underfitting?
Let us repeat the `LogisticRegression`-based beer classifier we used in the first script. There we discovered that setting `C = 2` gave us good results (remember that the parameter `C` controls the regularization: a lower `C` means stronger regularization and vice versa):
%% Cell type:code id: tags:
``` python
import pandas as pd

# reading the beer dataset
beer_data = pd.read_csv("beers.csv")
print(beer_data.shape)

# all columns up to the last one:
input_features = beer_data.iloc[:, :-1]

# only the last column:
labels = beer_data.iloc[:, -1]

from sklearn.linear_model import LogisticRegression
```
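%% Cell type:markdown id: tags:
The exact training cell is not repeated here, but a minimal sketch of "train with `C = 2` and check accuracy on the very same data" could look as follows (the variable names follow the cell above; evaluating on the training data is deliberately naive, as discussed later in this chapter):
%% Cell type:code id: tags:
``` python
# Hedged sketch (not the original notebook cell): fit a LogisticRegression
# with C=2 on the beer data loaded above ...
classifier = LogisticRegression(C=2)
classifier.fit(input_features, labels)

# ... and report the mean accuracy on the SAME data it was trained on.
print("accuracy on the training data:", classifier.score(input_features, labels))
```
%% Cell type:markdown id: tags: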
The parameter `gamma` of `SVC` has an effect on the flexibility/complexity of the decision surface: a large value allows a very flexible / "irregular" decision surface, whereas for smaller values the surface gets smoother / "stiffer" / "more regular".
One also speaks of **simple** and **complex** models in this context.
Here we also see
- that the smallest `gamma` value produces a classifier which seems to get the idea of a "circle",
- whereas the largest `gamma` value adapts the classifier more closely to the given examples.
The plot above shows an extreme example of the previously mentioned effect of overfitting.
- If we evaluate the performance of this classifier on the training data set, we get an **accuracy of `~100%`**.
- But the classifier totally fails to learn the concept of a circle, and you can easily imagine how badly this classifier performs on new and unseen data.
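Since the plotting cells are not repeated here, the following sketch quantifies the same effect on hypothetical synthetic "circle" data (generated with scikit-learn's `make_circles`, which is only a stand-in for the data set used above): a very large `gamma` gives near-perfect accuracy on the training points but noticeably worse accuracy on held-out points.
%% Cell type:code id: tags:
``` python
# Hedged sketch: overfitting of an SVC with a large gamma on synthetic 2D data.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# two noisy concentric "circles" as a stand-in for the example above
X, y = make_circles(n_samples=200, noise=0.2, factor=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for gamma in (0.5, 1000):
    classifier = SVC(gamma=gamma).fit(X_train, y_train)
    print("gamma = {:>6}: accuracy on training data = {:.2f}, on unseen data = {:.2f}".format(
        gamma, classifier.score(X_train, y_train), classifier.score(X_test, y_test)))
```
%% Cell type:markdown id: tags: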
<li><strong>Overfitting:</strong> The classifier adapts/fits too closely to the sample data from a given model instead of learning the underlying concept. Thus the classifier does not generalize well and shows strongly degraded performance on previously unseen data.<br/><br/>
<li><strong>Generalization:</strong> A classifier "generalizes" well if we see similar performance on training and on new data.<br/><br/>
<li><strong>Robust classifier:</strong> A trained classifier which is not, or only barely, susceptible to overfitting.
</ul>
</p>
</div>
%% Cell type:markdown id: tags:
### More "probabilistic" definition:
### More "probabilistic" definition:
- Our data is generated by a (usually unknown) model.
- Our data is generated by a (usually unknown) model.
- We have only samples from this model.
- We have only samples from this model.
- A classifier tries to approximate the underlying model based on the given samples.
- A classifier tries to approximate the underlying model based on the given samples.
In this context the observed bad generalization performance of the classifier can be explained as computing a model which is too far away from the original model.
The following graphic depicts this:
- The more "complex" a model gets, the better it fits the training data. Thus accuracy on the training data improves.
- At a certain point the model is too adapted to the training data and gets worse and worse on evaluation data.
The other extreme is called **underfitting**: the classifier's decision boundary deviates too far from the sample data and produces a classifier which performs badly even on the training data.
We can demonstrate this by choosing a "too small" value of `gamma`.
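The corresponding demonstration cell is not included here; the sketch below (again on hypothetical `make_circles` data, not the original example) shows the idea: with a tiny `gamma` the decision surface is so "stiff" that accuracy is poor even on the training data itself.
%% Cell type:code id: tags:
``` python
# Hedged sketch: underfitting of an SVC with a very small gamma.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, noise=0.2, factor=0.5, random_state=0)

# gamma is far too small -> the decision surface is almost flat / featureless
classifier = SVC(gamma=1e-4).fit(X, y)
print("accuracy on the training data itself: {:.2f}".format(classifier.score(X, y)))
```
%% Cell type:markdown id: tags: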
<div class="alert alert-block alert-warning">
<h3>
<i class="fa fa-info-circle"></i>
<center>
Our fundamental mistake was to evaluate the performance <br/>of the classifier on the training data.
</center>
</h3>
</div>
Repeat:
<div class="alert alert-block alert-warning">
<h3>
<i class="fa fa-info-circle"></i>
<center>
Our fundamental mistake was to evaluate the performance <br/>of the classifier on the training data.
</center>
</h3>
</div>
## 2. How can we do better?
%% Cell type:markdown id: tags:
There is no classifier which works out of the box in all situations. Depending on the "geometry" / "shape" of the data, classification algorithms and their settings can make a big difference.
In our previous 2D examples we were able to visualize the data and classification results; this is not possible for higher-dimensional data.
The general way to handle this situation is as follows:
- split our data into a learning data set and a test data set
- train the classifier on the learning data set
- assess performance of the classifier on the test data set.
### Cross-validation
<img src="https://i.imgflip.com/305azk.jpg" title="made at imgflip.com" width=40%/>
The procedure called *cross-validation* goes a step further: the full data set is split into learning/test sets in several different ways, and statistics of the achieved metric values are computed to assess the classifier.
A common approach is **K-fold cross-validation**.
K-fold cross-validation has the advantage that we do not permanently leave out part of our data from training: every data point is used for training in all but one of the folds. This is useful when we do not have a lot of data.
### Example: 4-fold cross-validation
For 4-fold cross-validation we split our data set into four equally sized partitions P1, P2, P3 and P4.
We then:
- hold out `P1`: train the classifier on `P2 + P3 + P4`, compute accuracy `m1` on `P1`.
<img src="cross_val_0.svg?2" />
- hold out `P2`: train the classifier on `P1 + P3 + P4`, compute accuracy `m2` on `P2`.
<img src="cross_val_1.svg?2" />
- hold out `P3`: train the classifier on `P1 + P2 + P4`, compute accuracy `m3` on `P3`.
<img src="cross_val_2.svg?2" />
- hold out `P4`: train the classifier on `P1 + P2 + P3`, compute accuracy `m4` on `P4`.
<img src="cross_val_3.svg?2" />
Finally we can compute the average of `m1` .. `m4` as the final measure for accuracy (this procedure is sketched in code below).
Some advice:
- This can be done on the original data or on randomly shuffled data. It is recommended to shuffle the data first, as there might be some unknown underlying ordering in your data set.
- Usually one uses 3- to 10-fold cross-validation, depending on the amount of data available.
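A minimal sketch of this 4-fold procedure, done "by hand" with scikit-learn's `KFold` helper on the beer data (the classifier and the `shuffle`, `random_state` and `max_iter` settings are illustrative choices, not taken from the original notebook):
%% Cell type:code id: tags:
``` python
# Hedged sketch: 4-fold cross-validation written out explicitly.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

beer_data = pd.read_csv("beers.csv")
features = beer_data.iloc[:, :-1]
labels = beer_data.iloc[:, -1]

# shuffle=True follows the advice above; random_state is an arbitrary fixed number
folds = KFold(n_splits=4, shuffle=True, random_state=0)

accuracies = []
for train_indices, test_indices in folds.split(features):
    classifier = LogisticRegression(C=2, max_iter=1000)
    classifier.fit(features.iloc[train_indices], labels.iloc[train_indices])
    accuracies.append(classifier.score(features.iloc[test_indices], labels.iloc[test_indices]))

print("m1 .. m4:", ["{:.2f}".format(m) for m in accuracies])
print("average accuracy: {:.2f}".format(sum(accuracies) / len(accuracies)))
```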
%% Cell type:markdown id: tags:
### Variant: randomized cross-validation
A randomized variant works like this (see the sketch after the list):
- Perform $n$ iterations:
    - draw a fraction $p$ (e.g. 80%) from your full data set without replacement as the training data set.
    - use the remaining fraction $1 - p$ as the evaluation data set.
    - train the classifier and compute metric(s).
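One way to realize this scheme in scikit-learn is the `ShuffleSplit` splitter. The sketch below (with a tiny dummy feature matrix and illustrative settings) only shows which samples end up in which set per iteration; such a splitter object can also be passed as the `cv=` argument of the `cross_val_score` helper used in the next section.
%% Cell type:code id: tags:
``` python
# Hedged sketch: in each of n_splits iterations, ShuffleSplit draws 80% of the
# samples (without replacement) for training; the rest form the evaluation set.
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(20).reshape(10, 2)  # dummy feature matrix with 10 samples

splitter = ShuffleSplit(n_splits=3, train_size=0.8, random_state=0)
for train_indices, eval_indices in splitter.split(X):
    print("train on samples", train_indices, "evaluate on samples", eval_indices)
```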
%% Cell type:markdown id: tags:
## 3. Cross-validation with scikit-learn
%% Cell type:code id: tags:
``` python
#from sklearn.utils import shuffle
import pandas as pd
beer = pd.read_csv("beers.csv")
beer_eval = pd.read_csv("beers_eval.csv")
all_beer = pd.concat((beer, beer_eval))
all_beer.shape
```
%% Output
(300, 5)
%% Cell type:markdown id: tags:
Below we use _accuracy_ as a so-called _"metric"_: it is the percentage of correct classifications.
More about strategies for assessing the quality of a classifier follows in one of the later scripts.
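As a tiny, self-contained illustration (with made-up label vectors, not the beer data), accuracy is simply the fraction of predictions that match the true labels:
%% Cell type:code id: tags:
``` python
# Hedged sketch: accuracy = fraction of correct predictions.
import numpy as np
from sklearn.metrics import accuracy_score

true_labels = np.array([1, 0, 1, 1, 0])
predicted   = np.array([1, 0, 0, 1, 0])

print((true_labels == predicted).mean())       # 0.8, computed "by hand"
print(accuracy_score(true_labels, predicted))  # 0.8, via scikit-learn
```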
%% Cell type:code id: tags:
``` python
features = all_beer.iloc[:, :-1]
labels = all_beer.iloc[:, -1]

classifier = LogisticRegression(C=2)

from sklearn.model_selection import cross_val_score

# "accuracy" is the way we evaluated the classifier up to now,
# which is the percentage of correct classifications.
# more about so called "metrics" in the following chapter.
# (5 folds here; the exact settings of the original cell are an assumption)
scores = cross_val_score(classifier, features, labels, scoring="accuracy", cv=5)

m = scores.mean()
s = scores.std()
low = m - 2 * s
high = m + 2 * s

print("mean test score is {:.3f}".format(m))
print("std dev of test score is {:.3f}".format(s))
print("true test score is with 96% probability between {:.3f} and {:.3f}".format(low, high))
```
%% Output
mean test score is 0.837
std dev of test score is 0.067
true test score is with 96% probability between 0.703 and 0.970
%% Cell type:markdown id: tags:
## Exercise section
1. Play with the previous examples.
2. Optional exercise: implement a classifier plus cross-validation on the iris data set introduced in script 1.
%% Cell type:markdown id: tags:
## 4. Some reasons for overfitting and how you might fight it
### 1. Small / insufficient data sets
The classifier fails to "grab the concept" because the "concept" is not represented strongly enough in the data set.
Possible solutions:
- Get more data.
- Augment your data by creating artificial/synthetic data (e.g. for images: shift / scale / rotate images) if feasible.
### 2. Unsuitable classifier / classifier parameters used
This is what we observed in the example before.
Possible solutions:
- Optimize parameters using cross-validation.
- Evaluate other classification algorithms.
### 3. Noise / uninformative features
A classifier can in some situations use noisy or uninformative features to explain noise in the training data. In such cases the noise contributes to "artificially" good results on the training data.
Possible solutions:
- Inspect your data to detect noisy or uninformative features.
- Run experiments with excluded features. This can be automated, see [recursive feature elimination in scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html#sklearn.feature_selection.RFE) and the sketch below.
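A minimal sketch of the linked recursive feature elimination on the beer features (the wrapped estimator, `max_iter` and the number of features to keep are arbitrary choices for illustration):
%% Cell type:code id: tags:
``` python
# Hedged sketch: RFE repeatedly drops the least important feature(s) according
# to the wrapped estimator until only n_features_to_select remain.
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

beer_data = pd.read_csv("beers.csv")
features = beer_data.iloc[:, :-1]
labels = beer_data.iloc[:, -1]

selector = RFE(LogisticRegression(C=2, max_iter=1000), n_features_to_select=2)
selector.fit(features, labels)

# support_ is True for the features RFE decided to keep
print(dict(zip(features.columns, selector.support_)))
```
%% Cell type:markdown id: tags: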
### 4. Strongly correlated / redundant features
In case the data set contains strongly, but not 100%, correlated features, their (weighted) difference can essentially be random data. The effect is then similar to 3.
Possible solutions:
- Inspect data to detect noise and correlations.
- Use dimension reduction techniques like `PCA` (more about this later).
- Run experiments with excluded features. This can be automated, see [recursive feature elimination](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html#sklearn.feature_selection.RFE).
The following code demonstrates the effect of noise and redundant features:
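(The original demonstration cell is not part of this extract; the sketch below shows one way such an experiment could look. The added columns, seed and split are illustrative, and the exact numbers will differ from the original output.)
%% Cell type:code id: tags:
``` python
# Hedged sketch: append purely random ("noise") columns and an almost perfectly
# correlated ("redundant") copy of an existing column, then compare accuracy on
# training vs. evaluation data with and without the extra columns.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

all_beer = pd.concat((pd.read_csv("beers.csv"), pd.read_csv("beers_eval.csv")))
features = all_beer.iloc[:, :-1]
labels = all_beer.iloc[:, -1]

np.random.seed(42)  # arbitrary fixed seed
extended = features.copy()
for i in range(5):
    extended["noise_{}".format(i)] = np.random.randn(len(extended))
extended["redundant"] = features.iloc[:, 0].values + 0.01 * np.random.randn(len(extended))

for name, data in [("original", features), ("extended", extended)]:
    X_train, X_eval, y_train, y_eval = train_test_split(data, labels, random_state=0)
    classifier = LogisticRegression(C=2, max_iter=1000).fit(X_train, y_train)
    print("{:>8} features: accuracy on training data {:.2f}, on evaluation data {:.2f}".format(
        name, classifier.score(X_train, y_train), classifier.score(X_eval, y_eval)))
```
%% Cell type:markdown id: tags: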
In such an experiment the classifier typically yields better accuracy on the extended training data set, while the performance on the evaluation data set gets worse.
%% Cell type:markdown id: tags:
## 5. Training the final classifier
Cross-validation was helpful for determining and tuning a good classifier. But how do we eventually build the classifier we want to use later "in production"?
A common procedure is:
- Split your data 80% to 20% (or another fraction) from the beginning.
- Use the 80% fraction for determining and tuning a classifier.
- Train the final classifier on the 80% part.
- Finally use the 20% fraction for a final validation of the classifier's accuracy.
<img src="./cross_eval_and_test.svg?7">
Comment: The literature is not consistent in its terminology; sometimes the terms "validation data set" and "test data set" are interchanged.
%% Cell type:markdown id: tags:
### Demonstration
We introduce the `train_test_split` function from `sklearn.model_selection` in the following example.
It splits features and labels in a given proportion. Usually this is randomized, so that you get different results for every function invocation. To get the same result every time, we use `random_state=..` (with an arbitrary number) below:
%% Cell type:code id: tags:
``` python
# SPLIT DATASETS 80:20
import numpy as np

np.random.seed(5)  # to get same results every time

n = len(features)
indices = np.arange(n)
np.random.shuffle(indices)

features = features.iloc[indices]
labels = labels.iloc[indices]

from sklearn.model_selection import train_test_split

# one possible 80:20 split (illustrative variable names; random_state is an
# arbitrary but fixed number for reproducibility)
features_80, features_20, labels_80, labels_20 = train_test_split(
    features, labels, test_size=0.2, random_state=42)

print(features_80.shape, features_20.shape)
```