07_regression.ipynb

{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# IGNORE THIS CELL WHICH CUSTOMIZES LAYOUT AND STYLING OF THE NOTEBOOK !\n",
    "%matplotlib inline\n",
    "%config InlineBackend.figure_format = 'retina'\n",
    "import warnings\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "warnings.filterwarnings(\"ignore\", category=FutureWarning)\n",
    "warnings.filterwarnings(\"ignore\", message=\"X does not have valid feature names, but [a-zA-Z]+ was fitted with feature names\", category=UserWarning)\n",
    "                                  \n",
    "warnings.filterwarnings = lambda *a, **kw: None\n",
    "from IPython.core.display import HTML\n",
    "\n",
    "HTML(open(\"custom.html\", \"r\").read())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Chapter 7: Regression\n",
    "\n",
    "Regression belongs like classification to the field of supervised learning. \n",
    "\n",
    "<div class=\"alert alert-block alert-warning\">\n",
    "<i class=\"fa fa-info-circle\"></i>&nbsp; \n",
    "<strong>Regression predicts numerical values</strong> \n",
    "in contrast to classification which predicts categories.\n",
    "</div>\n",
    "\n",
    "<img src=\"./images/30416v.jpg\" title=\"made at imgflip.com\" width=35%/>\n",
    "\n",
    "<div class=\"alert alert-block alert-warning\">\n",
    "<i class=\"fa fa-info-circle\"></i>&nbsp; \n",
    "    Other main differences are:\n",
    "\n",
    "* Different quality measures\n",
    "* Other algorithms\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Example: Salmon weight\n",
    "\n",
    "The dataset `data/salmon.csv` holds measurements of `circumference`, `length` and `weight` for  `atlantic` and `sockeye` salmons.\n",
    "\n",
    "Our goal is to predict `weight` based on the other three features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "df = pd.read_csv(\"data/salmon.csv\")\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us inspect the features and their distributions:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import seaborn as sns\n",
    "\n",
    "sns.set(style=\"ticks\")\n",
    "\n",
    "sns.pairplot(df, hue=\"kind\", diag_kind=\"hist\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In contrast to our previous examples, our data set contains a non-numerical text column `kind`.\n",
    "\n",
    "As discussed before a common method to encode categorical features is **one-hot-encoding**, <code>sklearn.preprocessing.OneHotEncoder</code> is a preprocessor which transforms a categorical feature to this encoding:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import OneHotEncoder\n",
    "\n",
    "features = df.iloc[:, :-1]\n",
    "values = df.iloc[:, -1]\n",
    "\n",
    "# needs 2d data structure, features.iloc[2] has dimension 1\n",
    "encoder = OneHotEncoder(sparse=False)\n",
    "one_hot = encoder.fit_transform(features.iloc[:, 2:3])\n",
    "\n",
    "one_hot[:5, :]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So the one-hot encoder computes two columns with exclusive flags 0 and 1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "features[\"is_atlantic\"] = one_hot[:, 0]\n",
    "features[\"is_sockeye\"] = one_hot[:, 1]\n",
    "\n",
    "features.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# we remove the categorical column now:\n",
    "del features[\"kind\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we prepare the data for training and testing:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "(features_train, features_test, values_train, values_test) = train_test_split(\n",
    "    features, values, random_state=42\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Without further explanation we pick a regression algorithm, more about regrssion algorithms will be discussed later:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.kernel_ridge import KernelRidge\n",
    "\n",
    "kr = KernelRidge(alpha=0.001, kernel=\"rbf\", gamma=0.05)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-warning\">\n",
    "    <i class=\"fa fa-info-circle\"></i>&nbsp; Regression methods in <code>scikit-learn</code> also have <code>fit</code> and <code>predict</code> methods. Thus cross validation, pipelines and hyperparameter-optimization will be available.\n",
    "    \n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "kr.fit(features_train, values_train)\n",
    "predicted = kr.predict(features_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us plot how good given and predicted values match on the training data set (sic !)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "\n",
    "def plot_fit_quality(values_test, predicted):\n",
    "\n",
    "    plt.figure(figsize=(12, 4))\n",
    "    plt.subplot(1, 2, 1)\n",
    "\n",
    "    x = np.arange(len(predicted))\n",
    "    plt.scatter(x, predicted - values_test, color=\"steelblue\", marker=\"o\")\n",
    "\n",
    "    plt.plot([0, len(predicted)], [0, 0], \"k:\")\n",
    "\n",
    "    max_diff = np.max(np.abs(predicted - values_test))\n",
    "    plt.ylim([-max_diff, max_diff])\n",
    "\n",
    "    plt.ylabel(\"error\")\n",
    "    plt.xlabel(\"sample id\")\n",
    "\n",
    "    plt.subplot(1, 2, 2)\n",
    "\n",
    "    plt.scatter(\n",
    "        x, (predicted - values_test) / values_test, color=\"steelblue\", marker=\"o\"\n",
    "    )\n",
    "    plt.plot([0, len(predicted)], [0, 0], \"k:\")\n",
    "    plt.ylim([-0.5, 0.5])\n",
    "\n",
    "    plt.ylabel(\"relative error\")\n",
    "    plt.xlabel(\"sample id\")\n",
    "\n",
    "\n",
    "plot_fit_quality(values_test, predicted)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For assessing the quality of the predictions of a regression method, we can use multiple methods which we will discuss later in this script.\n",
    "\n",
    "For our current example we compute the **mean absolute error** which is the average absolute difference between given values $y_i$ and predicted values  $\\hat{y}_i$:\n",
    "\n",
    "$$\n",
    "\\frac{1}{n} \\left(\\, |y_1 - \\hat{y}_1| \\, + \\, |y_2 - \\hat{y}_2| \\, + \\, \\ldots \\,+ \\,|y_n - \\hat{y}_n| \\,\\right)\n",
    "$$\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "error = np.sum(np.abs(predicted - values_test)) / len(values_test)\n",
    "print(error)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Metrics / error measures"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For the classification metrics we introduced (like accuracy, precision, recall, F1) higher numbers indicated better classification performance. \n",
    "\n",
    "Most regression metrics turn this upside down. E.g. smaller values indicate a better regression model.\n",
    "\n",
    "|                |                                   |                        |   |   |\n",
    "|----------------|-----------------------------------|------------------------|---|---|\n",
    "| classification | accuracy, F1, ...                 | the larger the better  |   |   |\n",
    "| regression     |mean absolute error, ...  | the smaller the better |   |   |\n",
    "\n",
    "\n",
    "To harmonize this we can flip the sign of the regression scores, e.g."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "error = -np.sum(np.abs(predicted - values_test)) / len(values_test)\n",
    "print(error)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now the regression score 0.3 which is better then 0.4 also becomes larger when we flip the sign:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "-0.3 > -0.4"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The benefit is that we can use hyperparameter optimization functions from `scikit-learn` (which select configurations which yield a large score) without further modification. \n",
    "\n",
    "This is why the names of the scores we mention below are prefixed by `neg_`.\n",
    "\n",
    "Below some of the [metrics `scikit-learn` offers for measuring regression quality](https://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics):\n",
    "\n",
    "### 1. Mean absolute error\n",
    "\n",
    "This is the metric we used before. Taking absolute values before adding up the deviatons assures that deviations with different signs can not cancel out.\n",
    "\n",
    "<div class=\"alert alert-block alert-warning\">\n",
    "    <i class=\"fa fa-info-circle\"></i>&nbsp; <strong>mean absolute error</strong> is defined as \n",
    "\n",
    "\n",
    "$$\n",
    "\\frac{1}{n} \\left(\\, |y_1 - \\hat{y}_1| \\, + \\, |y_2 - \\hat{y}_2| \\, + \\, \\ldots \\,+ \\,|y_n - \\hat{y}_n| \\,\\right)\n",
    "$$\n",
    "\n",
    "\n",
    "</div>\n",
    "\n",
    "\n",
    "The name of the corresponding score in `scikit-learn` is `neg_mean_absolute_error`.\n",
    "\n",
    "\n",
    "### 2. Mean squared error\n",
    "\n",
    "Here we replace the absolute difference by its squared difference. Squaring also insures positive differences.\n",
    "\n",
    "<div class=\"alert alert-block alert-warning\">\n",
    "    <i class=\"fa fa-info-circle\"></i>&nbsp; <strong>mean squared error</strong> is defined as \n",
    "\n",
    "\n",
    "\n",
    "$$\n",
    "\\frac{1}{n} \\left(\\, (y_1 - \\hat{y}_1)^2 \\, + \\, (y_2 - \\hat{y}_2)^2 \\, \\, \\ldots \\,+ \\,(y_n - \\hat{y}_n)^2 \\,\\right)\n",
    "$$\n",
    "\n",
    "\n",
    "</div>\n",
    "\n",
    "\n",
    "\n",
    "This measure is more sensitive to outliers: A few larger differences contribute more significantly to a larger mean squared error. The name of the corresponding score in `scikit-learn` is `neg_mean_squared_error`.\n",
    "\n",
    "\n",
    "### 3. Median absolute error\n",
    "\n",
    "Here we replace mean calculation by median. \n",
    "\n",
    "<div class=\"alert alert-block alert-warning\">\n",
    "    <i class=\"fa fa-info-circle\"></i>&nbsp; <strong>median absolute error</strong> is defined as \n",
    "\n",
    "\n",
    "\n",
    "$$\n",
    "\\text{median}\\left(\\,|y_1 - \\hat{y}_1|, \\,|y_2 - \\hat{y}_2|, \\,\\ldots, \\,|y_n - \\hat{y}_n| \\, \\right)\n",
    "$$\n",
    "\n",
    "\n",
    "</div>\n",
    "\n",
    "\n",
    "This measure is less sensitive to outliers than the metrics we discussed before: A few larger differences will not contribute significantly to a larger error value. The name of the corresponding score in `scikit-learn` is `neg_median_absolute_error`.\n",
    "\n",
    "### 4. Mean squared log error\n",
    "\n",
    "The formula for this metric can be found [here](https://scikit-learn.org/stable/modules/model_evaluation.html#mean-squared-log-error). \n",
    "\n",
    "This metric is recommended when your target values are distributed over a huge range of values, like popoluation numbers. \n",
    "The previous error metrics would put a larger weight on large target values. One could consider relative deviations to compensate such effects but relative deviations come with other problems like division by zero.\n",
    "\n",
    "\n",
    "The name is `neg_mean_squared_log_error`\n",
    "\n",
    "\n",
    "### 5. Explained variance and $r^2$-score\n",
    "\n",
    "Two other scores to mention are *explained variance* and $r^2$-score. For both larger values indicate better regression results.\n",
    "\n",
    "The formula for [r2 can be found here](https://scikit-learn.org/stable/modules/model_evaluation.html#r2-score), the score takes values in the range $0 .. 1$. The name within `scikit-learn` is `r2`.\n",
    "\n",
    "The formula for [explained variance](https://scikit-learn.org/stable/modules/model_evaluation.html#explained-variance-score), the score takes values up to $1$. The name within `scikit-learn` is `explained_variance`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Some algorithms from `scikit-learn`\n",
    "\n",
    "- `sklearn.linear_model.LinearRegression` is a linear regression method, which only works well for target values which can be described as a linear combination of feature values. This is also known as linear least squares method.\n",
    "\n",
    "- `sklearn.linear_model.Ridge`, `sklearn.linear_model.Lasso`, and `sklearn.linear_model.ElasticNet` are linear regression methods with `L2`, `L1` resp `L2 + L1` regularization terms to avoid overfitting resp. improve generalization.\n",
    "\n",
    "- `sklearn.kernel_ridge.KernelRidge` is [documented here](https://scikit-learn.org/stable/modules/kernel_ridge.html#kernel-ridge). It combines the kernel trick from SVMs with ridge regression.\n",
    "\n",
    "\n",
    "- `sklearn.svm.SVR` is an extension of support vector classification concepts to regression, including the kernel trick for non-linear regression, [you find examples here](https://scikit-learn.org/stable/modules/svm.html#svm-regression).\n",
    "\n",
    "\n",
    "- `sklearn.neighbors.KNeighborsRegressor` extends the idea of nearest neighbour classification to regression: Search for similar data points in the learning data set and compute the predicted value from the values from the neighbourhood, e.g. by averaging or by linear interpolation. [Documentation is available here](https://scikit-learn.org/stable/modules/neighbors.html#regression)\n",
    "\n",
    "\n",
    "- `sklearn.tree.DecisionTreeRegressor` expands the concept of decision trees to regression [is documented here](https://scikit-learn.org/stable/modules/tree.html#regression).\n",
    "\n",
    "- `sklearn.linear_model.TweedieRegressor`, `sklearn.linear_model.PoissonRegressor`, `sklearn.linear_model.GammaRegressor` offer so-called *Generalized Linear Models* (**GLM**) \n",
    "   - These models are usually of interest when your target values are event-based discrete counts/frequencies, or continuous amounts, durations, costs/prices, or rates/probabilities.  The [scikit-learn GLM tutorial](https://scikit-learn.org/stable/modules/linear_model.html#generalized-linear-regression)  provides a formal insight as well as tips for choosing GLM with some use case examples. \n",
    "   - Beyond that the [wikipedia article about generalized linear models](https://en.wikipedia.org/wiki/Generalized_linear_model#Intuition) gives a nice intuition and [this discussion](https://stats.stackexchange.com/questions/190763/how-to-decide-which-glm-family-to-use) provides a guide which GLM should be used when.\n",
    "   - For assessing and hyperparameter-optimization of such General Linear Models you should also use [suitable metrics from scikit-learn](https://scikit-learn.org/stable/modules/model_evaluation.html#mean-poisson-gamma-and-tweedie-deviances)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## A full pipeline\n",
    "\n",
    "Let us now try to find a good regressor using `scikit-learn`s hyper-parameter tuning:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.decomposition import PCA\n",
    "from sklearn.kernel_ridge import KernelRidge\n",
    "from sklearn.linear_model import LinearRegression\n",
    "from sklearn.model_selection import cross_val_score\n",
    "from sklearn.pipeline import make_pipeline\n",
    "from sklearn.preprocessing import PolynomialFeatures, StandardScaler\n",
    "\n",
    "\n",
    "def eval_regression(p, features, values):\n",
    "    score = cross_val_score(\n",
    "        p, features, values, scoring=\"neg_median_absolute_error\", cv=4\n",
    "    ).mean()\n",
    "    print(\"cross val score:\", score)\n",
    "\n",
    "    predicted = p.fit(features_train, values_train).predict(features_test)\n",
    "    plot_fit_quality(values_test, predicted)\n",
    "\n",
    "\n",
    "p = make_pipeline(StandardScaler(), PolynomialFeatures(2), PCA(2), LinearRegression())\n",
    "eval_regression(p, features, values)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "p = make_pipeline(StandardScaler(), PolynomialFeatures(), PCA(), LinearRegression())\n",
    "\n",
    "param_grid = {\n",
    "    \"polynomialfeatures__degree\": range(3, 6),\n",
    "    \"pca__n_components\": range(3, 11),\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import GridSearchCV\n",
    "\n",
    "search = GridSearchCV(\n",
    "    p, param_grid, scoring=\"neg_median_absolute_error\", cv=4, n_jobs=4\n",
    ")\n",
    "\n",
    "search.fit(features, values)\n",
    "\n",
    "print(search.best_params_)\n",
    "eval_regression(search, features, values)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise section\n",
    "\n",
    "- Play with the examples above and try different algorithms, metrics and pipelines.\n",
    "\n",
    "\n",
    "### Optional exercise: Timeseries prediction\n",
    "\n",
    "The file  `data/sales.csv` holds sales data of a swiss sports shop selling skiing equipment. The time axis is in units of months, starting with January.\n",
    "\n",
    "* Load the data and plot sales value over months"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "sales_data = pd.read_csv(\"data/sales.csv\")\n",
    "sales_data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "outputs": [],
   "source": [
    "months = sales_data.iloc[:160, 0].values\n",
    "sales = sales_data.iloc[:160, 1].values\n",
    "\n",
    "months_validate = sales_data.iloc[160:, 0].values\n",
    "sales_validate = sales_data.iloc[160:, 1].values\n",
    "\n",
    "\n",
    "plt.figure(figsize=(12, 5))\n",
    "plt.plot(months, sales, color=\"chocolate\", label=\"train\")\n",
    "plt.plot(\n",
    "    months_validate, sales_validate, color=\"chocolate\", linestyle=\"--\", label=\"validate\"\n",
    ")\n",
    "plt.legend()\n",
    "plt.xlabel(\"month\")\n",
    "plt.ylabel(\"sales\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div style=\"font-size: 130%;\">Approach for time series forecasting</div>\n",
    "\n",
    "**Advice**: you might want to read the full instructions several times to understand the concept.\n",
    "\n",
    "We want to learn from our single time series of sales by looking at its smaller parts and take sales value that comes after each such part (window) as an expected prediction.\n",
    "\n",
    "More formally, we learn to predict n-th value in the series $s_{n}$ based on few previous values `W`: $s_{n - W}, s_{n - W + 1} \\ldots s_{n - 1}$. `W` is a parameter called *window size*.\n",
    "\n",
    "E.g. for window size $W$ = 3 we create a feature matrix `X` based on sales values:\n",
    "\n",
    "    s_0, s_1, s_2\n",
    "    s_1, s_2, s_3\n",
    "    s_2, s_3, s_4\n",
    "    ...\n",
    "    \n",
    "and a vector of target values `y` as follows:\n",
    "\n",
    "    s_3\n",
    "    s_4\n",
    "    s_5\n",
    "    ...\n",
    "    \n",
    "Here is a function which takes sales vector `s_0, s_1, ...` and window size `W` and returns features matrix `X` and the right hand side `y`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def create_feature_matrix_and_target_values(sales, window_size):\n",
    "    features = np.zeros((len(sales) - window_size, window_size))\n",
    "\n",
    "    for i in range(len(sales) - window_size):\n",
    "        features[i] = sales[i : i + window_size]\n",
    "\n",
    "    return features, sales[window_size:]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "outputs": [],
   "source": [
    "def test():\n",
    "    X, y = create_feature_matrix_and_target_values(sales, 4)\n",
    "    assert np.all(X[0] == sales[:4])\n",
    "    assert np.all(X[1] == sales[1:5])\n",
    "    assert np.all(X[2] == sales[2:6])\n",
    "    assert np.all(X[-1] == sales[-5:-1])\n",
    "\n",
    "    assert np.all(y[0] == sales[4])\n",
    "    assert np.all(y[-1] == sales[-1])\n",
    "\n",
    "\n",
    "test()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Find optimal configurations (use the `r2` metric) for the regressors `Lasso`, `SVR` and `KernelRidge(kernel=\"rbf\")` and several window sizes. \n",
    "- For every month $i$ we want to compare the known value of $s_i$ and the predicted value for this month. To that end, plot both sales and predicted time series on one plot, and plot sales against predicted points as a scatter plot."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.kernel_ridge import KernelRidge\n",
    "from sklearn.linear_model import Lasso\n",
    "from sklearn.metrics import r2_score\n",
    "from sklearn.model_selection import GridSearchCV\n",
    "from sklearn.svm import SVR\n",
    "\n",
    "# ..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "outputs": [],
   "source": [
    "lasso_grid = {\"alpha\": 10 ** np.linspace(-2, 3, 30)}\n",
    "svr_grid = {\"C\": 10 ** np.linspace(-4, 2, 30)}\n",
    "kernel_ridge_grid = {\"alpha\": 10 ** np.linspace(-4, 3, 30)}\n",
    "\n",
    "WINDOW_SIZE = 36\n",
    "X, y = create_feature_matrix_and_target_values(sales, WINDOW_SIZE)\n",
    "\n",
    "\n",
    "def main(X, y):\n",
    "\n",
    "    regressors = []\n",
    "\n",
    "    # Note: do not scale - the time series data (windows) are already scaled\n",
    "    for regressor, param_grid in [\n",
    "        (Lasso(), lasso_grid),\n",
    "        (SVR(), svr_grid),\n",
    "        (KernelRidge(kernel=\"rbf\"), kernel_ridge_grid),\n",
    "    ]:\n",
    "        search = GridSearchCV(regressor, param_grid, scoring=\"r2\", cv=5)\n",
    "        search.fit(X, y)\n",
    "\n",
    "        # we predict on the learning data set to get a general\n",
    "        # \"feeling\" how well the regressors work\n",
    "        predicted = search.predict(X)\n",
    "\n",
    "        plot_regression(regressor.__class__.__qualname__, predicted)\n",
    "\n",
    "        regressors.append(search)\n",
    "\n",
    "    return regressors\n",
    "\n",
    "\n",
    "def plot_regression(title, predicted):\n",
    "    f, (a0, a1) = plt.subplots(\n",
    "        1, 2, figsize=(18, 5), gridspec_kw={\"width_ratios\": [2, 1]}\n",
    "    )\n",
    "    f.suptitle(title)\n",
    "\n",
    "    a0.plot(months, sales, color=\"chocolate\", label=\"train\")\n",
    "    a0.plot(\n",
    "        months_validate,\n",
    "        sales_validate,\n",
    "        color=\"chocolate\",\n",
    "        linestyle=\"--\",\n",
    "        label=\"validate\",\n",
    "    )\n",
    "    a0.plot(\n",
    "        months[WINDOW_SIZE:], predicted, color=\"steelblue\", label=\"predicted (train)\"\n",
    "    )\n",
    "    a0.legend()\n",
    "    a0.set_xlabel(\"month\")\n",
    "    a0.set_ylabel(\"sales\")\n",
    "\n",
    "    a1.scatter(sales[WINDOW_SIZE:], predicted, color=\"steelblue\")\n",
    "    mi, ma = np.min(sales), np.max(sales)\n",
    "    a1.plot([mi, ma], [mi, ma], \"k:\")\n",
    "    r2 = r2_score(sales[WINDOW_SIZE:], predicted)\n",
    "    a1.set_title(\"r2 = {:.3f}\".format(r2))\n",
    "    a1.set_xlabel(\"sales\")\n",
    "    a1.set_ylabel(\"predicted (train)\")\n",
    "\n",
    "\n",
    "regressors = main(X, y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we want to use our model to get long term predictions starting at month `N`, with a given window size `W`.\n",
    "\n",
    "We will be iteratively predicting next value after the window and shifting the window by one. We execute the following steps:\n",
    "\n",
    "1. predict first value after given time window.\n",
    "2. update data window: shift by one time step, thus: discard the first entry and append the newly predicted value\n",
    "3. continue with step 1.\n",
    "\n",
    "\n",
    "We demonstrate this for `W=4` and `N=100` to predict values `z_101`, `z_102`, ...:\n",
    "\n",
    "<div style=\"margin-left: 1em;\">\n",
    "<code>\n",
    "predict <span style=\"color: green; \">z_101</span> from s_097, s_098, s_099, s_100\n",
    "predict <span style=\"color: green; ;\">z_102</span> from s_098, s_099, s_100, <span style=\"color: green; \">z_101</span>\n",
    "predict <span style=\"color: green; \">z_103</span> from s_099, s_100, <span style=\"color: green; \">z_101, z_102</span>\n",
    "predict <span style=\"color: green; \">z_104</span> from s_100, <span style=\"color: green; \">z_101, z_102, z_103</span>\n",
    "predict <span style=\"color: green; ;\">z_105</span> from <span style=\"color: green; \">z_101, z_102, z_103, z_104</span>\n",
    "...                                          \n",
    "</code> \n",
    "</div>\n",
    "\n",
    "\n",
    "- Implement this procedure and plot forecasts vs real data for different regression algorithms and window sizes.\n",
    "     \n",
    "     "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "outputs": [],
   "source": [
    "# start predicting from a first validation time point\n",
    "# => add the last training window for the first WINDOW_SIZE predictions\n",
    "X_validate, y_validate = create_feature_matrix_and_target_values(\n",
    "    np.r_[sales[-WINDOW_SIZE:], sales_validate], WINDOW_SIZE\n",
    ")\n",
    "\n",
    "\n",
    "def forecast(X, y, regressor):\n",
    "\n",
    "    # we crate a copy because we change below content\n",
    "    # current_window in place. Without copy we would\n",
    "    # also change X!\n",
    "    n_last = len(y)  # make sure to skip the last training window\n",
    "    current_window = X[-n_last, :].copy()\n",
    "\n",
    "    predicted = []\n",
    "\n",
    "    for k in range(n_last):\n",
    "        new = regressor.predict(current_window[None, :])\n",
    "        predicted.append(new)\n",
    "        # this relates to the comment above! here we\n",
    "        # modify the window data in place:\n",
    "        current_window[:-1] = current_window[1:]\n",
    "        current_window[-1] = new\n",
    "\n",
    "    return np.array(predicted).flatten(), y[-n_last:]\n",
    "\n",
    "\n",
    "def forecast_and_plot(regressor):\n",
    "    predicted, correct = forecast(X_validate, y_validate, regressor)\n",
    "\n",
    "    f, (a0, a1) = plt.subplots(\n",
    "        1, 2, figsize=(18, 5), gridspec_kw={\"width_ratios\": [2, 1]}\n",
    "    )\n",
    "    f.suptitle(regressor.estimator.__class__.__qualname__)\n",
    "\n",
    "    a0.plot(months, sales, color=\"chocolate\", linestyle=\"--\", label=\"train\")\n",
    "    a0.plot(months_validate, correct, color=\"chocolate\", label=\"validate\")\n",
    "    a0.plot(months_validate, predicted, color=\"steelblue\", label=\"predicted (validate)\")\n",
    "    a0.legend()\n",
    "    a0.set_xlabel(\"month\")\n",
    "    a0.set_ylabel(\"sales\")\n",
    "\n",
    "    a1.scatter(correct, predicted, color=\"steelblue\")\n",
    "    mi, ma = np.min(correct), np.max(correct)\n",
    "    a1.plot([mi, ma], [mi, ma], \"k:\")\n",
    "    r2 = r2_score(correct, predicted)\n",
    "    a1.set_title(\"r2 = {:.3f}\".format(r2))\n",
    "    a1.set_xlabel(\"sales\")\n",
    "    a1.set_ylabel(\"predicted (validate)\")\n",
    "\n",
    "\n",
    "for regressor in regressors:\n",
    "    forecast_and_plot(regressor)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright (C) 2019-2022 ETH Zurich, SIS ID"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "machine-learning-ws-2023",
   "language": "python",
   "name": "machine-learning-ws-2023"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}