# Chapter 3: Overfitting, underfitting and cross-validation
%% Cell type:markdown id: tags:
## 1. What are overfitting and underfitting?
Let us repeat the `LogisticRegression`-based beer classifier we used in the first script. We discovered that setting `C = 2` (remember that the parameter `C` controls the regularization: a lower `C` means stronger regularization and vice versa) gave us good results:
%% Cell type:code id: tags:
``` python
import pandas as pd

# reading the beer dataset
beer_data = pd.read_csv("beers.csv")
print(beer_data.shape)

# all columns up to the last one:
input_features = beer_data.iloc[:, :-1]
# only the last column:
labels = beer_data.iloc[:, -1]

from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(C=2)
classifier.fit(input_features, labels)

# accuracy on the data the classifier was trained on:
print(classifier.score(input_features, labels))
```
%% Cell type:markdown id: tags:
The parameter `gamma` of `SVC` (the support vector classifier from `sklearn.svm`) has an effect on the flexibility/complexity of the decision surface: a large value allows a very flexible / "irregular" decision surface, whereas for smaller values the surface gets smoother / "stiffer" / "more regular".
One also speaks of **simple** resp. **complex** models.
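As an illustration, the following sketch trains `SVC` classifiers with a small and a large `gamma` on a synthetic "circle" data set (`sklearn.datasets.make_circles`, used here as a stand-in for the example data) and plots the resulting decision surfaces together with the training accuracies:
%% Cell type:code id: tags:
``` python
# Stand-in sketch: SVC with a small and a large gamma on a synthetic
# "circle" data set (not the original example data).
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# two classes: points on an inner resp. outer circle, with some noise
X, y = make_circles(n_samples=100, noise=0.15, factor=0.5, random_state=0)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, gamma in zip(axes, [0.5, 1000]):
    classifier = SVC(gamma=gamma)
    classifier.fit(X, y)

    # evaluate the decision surface on a grid covering the data
    xx, yy = np.meshgrid(np.linspace(-1.6, 1.6, 200), np.linspace(-1.6, 1.6, 200))
    zz = classifier.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contourf(xx, yy, zz, alpha=0.3)
    ax.scatter(X[:, 0], X[:, 1], c=y, edgecolors="k")
    ax.set_title("gamma = {}, training accuracy = {:.2f}".format(gamma, classifier.score(X, y)))

plt.show()
```
%% Cell type:markdown id: tags: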
Here we also see
- that the smallest `gamma` value produces a classifier which seems to capture the idea of a "circle",
- whereas the large `gamma` value adapts the classifier much more closely to the given examples.
The large-`gamma` example above is an extreme case of the previously mentioned effect of **overfitting**:
- If we evaluate the performance of this classifier on the training data set, we get an **accuracy of `~100%`**.
- But the classifier completely fails to learn the concept of a circle, and you can easily imagine how badly this classifier will perform on new and unseen data.
<div class="alert alert-block alert-warning">
<p>
<ul>
<li><strong>Overfitting:</strong> The classifier adapts/fits too closely to the sample data from a given model instead of learning the underlying concept. Thus the classifier does not generalize well and shows strongly degraded performance on previously unseen data.<br/><br/></li>
<li><strong>Generalization:</strong> A classifier "generalizes" well if we see similar performance on training and on new data.<br/><br/></li>
<li>A <strong>robust classifier</strong>: a trained classifier which is not or only very little susceptible to overfitting.</li>
</ul>
</p>
</div>
%% Cell type:markdown id: tags:
### More "probabilistic" definition:
- Our data is generated by a (usually unknown) model.
- We have only samples from this model.
- A classifier tries to approximate the underlying model based on the given samples.
In this context the observed bad generalization performance of the classifier can be explained by the classifier approximating a model which is too far away from the original model.
The typical behaviour can be summarized as follows:
- The more "complex" a model gets, the better it fits the training data; thus accuracy on the training data improves.
- At a certain point the model is too closely adapted to the training data and gets worse and worse on evaluation data.
The other extreme is called **underfitting**: the classifier's decision boundary deviates too much from the structure in the sample data, and the resulting classifier performs badly even on the training data.
We can demonstrate this by choosing a "too small" value of `gamma`:
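Below is a minimal sketch, again using `sklearn.datasets.make_circles` as a stand-in data set; with a very small `gamma` even the training accuracy is typically poor:
%% Cell type:code id: tags:
``` python
# Stand-in sketch: a "too small" gamma underfits even the training data.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=100, noise=0.15, factor=0.5, random_state=0)

classifier = SVC(gamma=0.01)   # very "stiff" / inflexible decision surface
classifier.fit(X, y)
print("training accuracy is {:.2f}".format(classifier.score(X, y)))
```
%% Cell type:markdown id: tags: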
<div class="alert alert-block alert-warning">
<h3>
<i class="fa fa-info-circle"></i>
<center>
Our fundamental mistake was to evaluate the performance <br/>of the classifier on the training data.
</center>
</h3>
</div>
## 2. How can we do better?
There is no classifier which works out of the box in all situations. Depending on the "geometry" / "shape" of the data, classification algorithms and their settings can make a big difference.
In our previous 2D examples we were able to visualize the data and the classification results; this is not possible for higher-dimensional data.
The general way to handle this situation is as follows:
- split our data into a learning data set and a test data set
- train the classifier on the learning data set
- assess performance of the classifier on the test data set.
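A minimal sketch of this procedure on the beer data, using the `train_test_split` helper discussed in more detail in section 5 below (the 80/20 split and the fixed `random_state` are example choices):
%% Cell type:code id: tags:
``` python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

beer_data = pd.read_csv("beers.csv")
input_features = beer_data.iloc[:, :-1]
labels = beer_data.iloc[:, -1]

# hold back 20% of the samples as test data
features_train, features_test, labels_train, labels_test = train_test_split(
    input_features, labels, test_size=0.2, random_state=42)

classifier = LogisticRegression(C=2)
classifier.fit(features_train, labels_train)      # train on the learning data set

# assess performance on the test data set only
print("accuracy on test data: {:.3f}".format(classifier.score(features_test, labels_test)))
```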
%% Cell type:markdown id: tags:
### Variant: randomized cross validation
A randomized variant works like this:
- Perform $n$ iterations:
    - draw a fraction $p$ (e.g. 80%) from your full data set without replacement as the training data set
    - use the remaining fraction $1 - p$ as the evaluation data set
    - train the classifier and compute the metric(s).
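This randomized scheme is available in scikit-learn as `ShuffleSplit`; a minimal sketch on the beer data (the number of iterations and the fraction $p$ are just example values):
%% Cell type:code id: tags:
``` python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

beer_data = pd.read_csv("beers.csv")
features = beer_data.iloc[:, :-1]
labels = beer_data.iloc[:, -1]

# 10 iterations, each drawing 80% of the samples (without replacement)
# for training and evaluating on the remaining 20%
splitter = ShuffleSplit(n_splits=10, train_size=0.8, random_state=0)
scores = cross_val_score(LogisticRegression(C=2), features, labels,
                         scoring="accuracy", cv=splitter)
print(scores)
print("mean accuracy over the iterations: {:.3f}".format(scores.mean()))
```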
%% Cell type:markdown id: tags:
## 3. Cross validation with scikit-learn
%% Cell type:code id: tags:
``` python
#from sklearn.utils import shuffle
import pandas as pd
beer = pd.read_csv("beers.csv")
beer_eval = pd.read_csv("beers_eval.csv")
all_beer = pd.concat((beer, beer_eval))
all_beer.shape
```
%% Output
(300, 5)
%% Cell type:markdown id: tags:
Below we use _accuracy_ as a so-called _"metric"_; this is the percentage of correct classifications.
More about strategies for assessing the quality of a classifier comes in one of the following scripts.
%% Cell type:code id: tags:
``` python
features = all_beer.iloc[:, :-1]
labels = all_beer.iloc[:, -1]

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

classifier = LogisticRegression(C=2)

# "accuracy" is the way we evaluated the classifier up to now,
# which is the percentage of correct classifications.
# more about so called "metrics" in the following chapter.
# (5 folds are used here; the original cell's exact settings are not shown)
scores = cross_val_score(classifier, features, labels, scoring="accuracy", cv=5)

m = scores.mean()
s = scores.std()
# m +/- 2 * s: ~75% coverage by Chebyshev's inequality,
# ~96% if the scores are approximately normally distributed
low, high = m - 2 * s, m + 2 * s

print("mean test score is {:.3f}".format(m))
print("std dev of test score is {:.3f}".format(s))
print("true test score is with 75% probability between {:.3f} and {:.3f}".format(low, high))
print("true test score is with 96% probability between {:.3f} and {:.3f}".format(low, high))
```
%% Output
mean test score is 0.837
std dev of test score is 0.067
true test score is with 75% probability between 0.703 and 0.970
true test score is with 96% probability between 0.703 and 0.970
%% Cell type:markdown id: tags:
## Exercise section
1. Play with the previous examples.
2. Optional exercise: implement a classifier plus cross-validation on the iris data set introduced in script 1.
%% Cell type:markdown id: tags:
## 4. Some reasons for overfitting and how you might fight it.
### 1. Small / insufficient data sets.
The classifier fails to "grab the concept" because the "concept" is not represented strongly enough in the data set.
Possible solutions:
- Get more data.
- Augment your data by creating artificial/synthetic data (e.g. for images: shift / scale / rotate images) if feasible.
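As a toy illustration of the augmentation idea (unrelated to the beer data), shifted copies of an image array can be generated with plain `numpy`:
%% Cell type:code id: tags:
``` python
import numpy as np

image = np.arange(25).reshape(5, 5)      # stand-in for a real image

# create shifted variants of the image to enlarge the training set
augmented = [np.roll(image, shift, axis=axis)
             for shift in (-1, 1)
             for axis in (0, 1)]
print(len(augmented), "shifted variants created from one image")
```
%% Cell type:markdown id: tags: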
### 2. Unsuitable classifier / classifier parameters used
This is what we observed in the example before.
Possible solutions:
- Optimize parameters using cross-validation.
- Evaluate other classification algorithms.
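A minimal sketch of such a parameter search with `GridSearchCV` on the beer data (the candidate values for `C` are just an illustration):
%% Cell type:code id: tags:
``` python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

beer_data = pd.read_csv("beers.csv")
features = beer_data.iloc[:, :-1]
labels = beer_data.iloc[:, -1]

# try several values of C, each evaluated with 5-fold cross-validation
search = GridSearchCV(LogisticRegression(),
                      param_grid={"C": [0.1, 0.5, 1, 2, 5, 10]},
                      scoring="accuracy", cv=5)
search.fit(features, labels)
print("best C:", search.best_params_["C"])
print("cross-validated accuracy: {:.3f}".format(search.best_score_))
```
%% Cell type:markdown id: tags: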
### 3. Noise / uninformative features
In some situations a classifier can use noisy or uninformative features to explain random variation in the training data. Such noise then contributes to "artificially" good results on the training data.
Possible solutions:
- Inspect your data to detect noisy or uninformative features.
- Run experiments with excluded features. This can be automated, see [recursive feature elimination in scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html#sklearn.feature_selection.RFE).
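A minimal sketch of such an automated experiment with `RFE` on the beer data (keeping 3 features is an arbitrary choice for illustration):
%% Cell type:code id: tags:
``` python
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

beer_data = pd.read_csv("beers.csv")
features = beer_data.iloc[:, :-1]
labels = beer_data.iloc[:, -1]

# recursively drop the least informative features until 3 remain
selector = RFE(LogisticRegression(C=2), n_features_to_select=3)
selector.fit(features, labels)

# True marks the features kept by the elimination procedure
print(dict(zip(features.columns, selector.support_)))
```
%% Cell type:markdown id: tags: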
### 4. Strongly correlated / redundant features
In case the data set contains strongly, but not 100%, correlated features, their (weighted) difference can behave like random data. The effect is then similar to point 3.
Possible solutions:
- Inspect data to detect noise and correlations.
- Use dimension reduction techniques like `PCA` (more about this later).
- Run experiments with excluded features. This can be automated, see [recursive feature elimination](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html#sklearn.feature_selection.RFE).
The following code demonstrates the effect of noise and redundant features:
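As a possible sketch of such an experiment, we append a few purely random noise columns plus one redundant (noisy copy of an existing) column to the beer features and compare training accuracy with cross-validated accuracy:
%% Cell type:code id: tags:
``` python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

beer_data = pd.read_csv("beers.csv")
features = beer_data.iloc[:, :-1]
labels = beer_data.iloc[:, -1]

np.random.seed(0)
extended = features.copy()
for i in range(5):
    # pure noise: carries no information about the labels
    extended["noise_{}".format(i)] = np.random.randn(len(features))
# redundant feature: a noisy copy of the first column
extended["redundant"] = features.iloc[:, 0] + 0.1 * np.random.randn(len(features))

for name, data in [("original", features), ("extended", extended)]:
    classifier = LogisticRegression(C=2)
    classifier.fit(data, labels)
    cv_score = cross_val_score(classifier, data, labels, scoring="accuracy", cv=5).mean()
    print("{:8s} features: training accuracy {:.3f}, cross-validated accuracy {:.3f}".format(
        name, classifier.score(data, labels), cv_score))
```
%% Cell type:markdown id: tags: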
Typically the classifier yields better accuracy on the training data once such features are added, whereas its performance on the evaluation data gets worse than before.
%% Cell type:markdown id: tags:
## 5. Training the final classifier
Cross-validation was helpful for determining and tuning a good classifier. But how do we eventually build the classifier we want to use later "in production"?
A common procedure is:
- Split your data 80% to 20% (or another fraction) from the beginning.
- Use the 80% fraction for determining and tuning a classifier.
- Train the final classifier on the 80% part.
- Finally, use the remaining 20% fraction for a validation of the classifier's accuracy.
<img src="./cross_eval_and_test.svg?7">
Comment: The literature is not consistent in its terminology; sometimes the terms "validation data set" and "test data set" are interchanged.
%% Cell type:markdown id: tags:
### Demonstration
We introduce the `train_test_split` function from `sklearn.model_selection` in the following example.
It splits features and labels into two parts of a given proportion. Usually the split is randomized, so you get a different result for every function invocation. To get the same result every time, we fix `random_state=..` (an arbitrary but fixed number) below:
%% Cell type:code id: tags:
``` python
import numpy as np

# SPLIT DATASETS 80:20
np.random.seed(5)  # to get same results every time
n = len(features)
indices = np.arange(n)
np.random.shuffle(indices)

# features / labels as defined in section 3 above
features = features.iloc[indices]
labels = labels.iloc[indices]

from sklearn.model_selection import train_test_split

# hold back 20% of the samples for the final validation;
# random_state is an arbitrary but fixed number (see text above)
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.2, random_state=42)
```