diff --git a/01_introduction.json b/01_introduction.json deleted file mode 100644 index decba45ad37fe7ab857b4db618c9534038a10583..0000000000000000000000000000000000000000 --- a/01_introduction.json +++ /dev/null @@ -1,707 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Chapter 1: General Introduction to machine learning (ML)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## ML = \"learning models from data\"\n", - "\n", - "\n", - "### About models\n", - "\n", - "A \"model\" allows us to explain observations and to answer questions. For example:\n", - "\n", - " 1. Where will my car at a given velocity stop if I apply the brakes now?\n", - " 2. Where in the night sky will I see the moon tonight?\n", - " 3. Is the email I received spam?\n", - " 4. Which article \"X\" should I recommend to a customer \"Y\"?\n", - " \n", - "- The first two questions can be answered based on existing physical models (formulas). \n", - "\n", - "- For questions 3 and 4 it is difficult to develop explicitly formulated models. \n", - "\n", - "### What is needed to apply ML?\n", - "\n", - "Problems 3 and 4 have the following in common:\n", - "\n", - "- No exact model is known or implementable, because we have only a vague understanding of the problem domain.\n", - "- But enough data with sufficient (implicit) information is available.\n", - "\n", - "\n", - "\n", - "E.g. 
for the spam email example:\n", - "\n", - "- We have no explicit formula for such a task (and devising one would boil down to lots of trial and error with different statistics or scores, and possibly weightings of them).\n", - "- We have a vague understanding of the problem domain, because we know that some words are specific to spam emails and others are specific to my personal and work-related emails.\n", - "- My mailbox is full of examples of both spam and non-spam emails.\n", - "\n", - "**In such cases machine learning offers approaches to build models based on example data.**\n", - "\n", - "<div class=\"alert alert-block alert-info\">\n", - "<i class=\"fa fa-info-circle\"></i>\n", - "The closely related concept of <strong>data mining</strong> usually means the use of predictive machine learning models to explicitly discover previously unknown knowledge in a specific data set, such as association rules between customer and article types in Problem 4 above.\n", - "</div>\n", - "\n", - "\n", - "\n", - "## ML: what is \"learning\"?\n", - "\n", - "To create a predictive model, we must first **train** such a model on given data. \n", - "\n", - "<div class=\"alert alert-block alert-info\">\n", - "<i class=\"fa fa-info-circle\"></i>\n", - "Alternative names for \"to train\" a model are \"to <strong>fit</strong>\" or \"to <strong>learn</strong>\" a model.\n", - "</div>\n", - "\n", - "\n", - "All ML algorithms have in common that they rely on internal data structures and/or parameters. Learning then builds up such data structures or adjusts such parameters based on the given data. After that, the model can be used to explain observations or to answer questions.\n", - "\n", - "The important difference between explicit models and models learned from data is:\n", - "\n", - "- Explicit models usually offer exact answers to questions;\n", - "- Models we learn from data usually come with inherent uncertainty."
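To make "learning = adjusting parameters based on data" concrete, here is a minimal sketch (illustrative only; it uses synthetic data, not one of the examples above): a straight line is fitted to noisy observations by least squares, and the two fitted numbers (slope and intercept) are the model's learned parameters. Because the observations are noisy, the true parameters are recovered only approximately, which is exactly the inherent uncertainty mentioned above.

```python
import numpy as np

# synthetic observations from a "true" model y = 2x + 1, plus noise
rng = np.random.RandomState(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, size=x.size)

# "learning" here means estimating the two parameters (slope, intercept)
# of the model from the example data by least squares
slope, intercept = np.polyfit(x, y, deg=1)
print("learned model: y = {:.2f} * x + {:.2f}".format(slope, intercept))

# the learned model answers questions only approximately:
prediction_at_5 = slope * 5 + intercept
print("prediction at x=5:", prediction_at_5)
```

The learned slope and intercept land close to, but not exactly at, the true values 2 and 1; with a different noise sample they would come out slightly differently.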
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "## Some history\n", - "\n", - "Some parts of ML are older than you might think. This is a rough timeline with a few selected achievements from this field:\n", - "\n", - " 1805: Least squares regression\n", - " 1812: Bayes' rule\n", - " 1913: Markov Chains\n", - "\n", - " 1951: First neural network\n", - " 1957-65: \"k-means\" clustering algorithm\n", - " 1959: Term \"machine learning\" is coined by Arthur Samuel, an AI pioneer\n", - " 1969: Book \"Perceptrons\": Limitations of Neural Networks\n", - " 1974-86: Neural networks learning breakthrough: backpropagation method\n", - " 1984: Book \"Classification And Regression Trees\"\n", - " 1995: Randomized Forests and Support Vector Machines methods\n", - " 1998: Public appearance: first ML implementations of spam filtering methods; naive Bayes Classifier method\n", - " 2006-12: Neural networks learning breakthrough: deep learning\n", - " \n", - "So the field is not as new as one might think, but due to\n", - "\n", - "- more available data,\n", - "- more processing power, and\n", - "- the development of better algorithms,\n", - "\n", - "many more applications of machine learning have appeared during the last 15 years." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Machine learning with Python\n", - "\n", - "Currently (2018) `Python` is the dominant programming language for ML. The advent of deep learning in particular pushed this development forward. 
Frameworks such as `TensorFlow` or `PyTorch` offered `Python` interfaces from their very first releases.\n", - "\n", - "The prevalent packages in the Python ecosystem used for ML include:\n", - "\n", - "- `pandas` for handling tabular data\n", - "- `matplotlib` and `seaborn` for plotting\n", - "- `scikit-learn` for classical (non-deep-learning) ML\n", - "- `TensorFlow`, `PyTorch` and `Keras` for deep learning.\n", - "\n", - "`scikit-learn` is very comprehensive, and its online documentation itself provides a good introduction to ML." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## ML lingo: What are \"features\"?\n", - "\n", - "A typical and very common situation is that our data is presented as a table, as in the following example:" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "<div>\n", - "<style scoped>\n", - " .dataframe tbody tr th:only-of-type {\n", - " vertical-align: middle;\n", - " }\n", - "\n", - " .dataframe tbody tr th {\n", - " vertical-align: top;\n", - " }\n", - "\n", - " .dataframe thead th {\n", - " text-align: right;\n", - " }\n", - "</style>\n", - "<table border=\"1\" class=\"dataframe\">\n", - " <thead>\n", - " <tr style=\"text-align: right;\">\n", - " <th></th>\n", - " <th>alcohol_content</th>\n", - " <th>bitterness</th>\n", - " <th>darkness</th>\n", - " <th>fruitiness</th>\n", - " <th>is_yummy</th>\n", - " </tr>\n", - " </thead>\n", - " <tbody>\n", - " <tr>\n", - " <th>0</th>\n", - " <td>3.739295</td>\n", - " <td>0.422503</td>\n", - " <td>0.989463</td>\n", - " <td>0.215791</td>\n", - " <td>0</td>\n", - " </tr>\n", - " <tr>\n", - " <th>1</th>\n", - " <td>4.207849</td>\n", - " <td>0.841668</td>\n", - " <td>0.928626</td>\n", - " <td>0.380420</td>\n", - " <td>0</td>\n", - " </tr>\n", - " <tr>\n", - " <th>2</th>\n", - " <td>4.709494</td>\n", - " <td>0.322037</td>\n", - " <td>5.374682</td>\n", - " <td>0.145231</td>\n", - " <td>1</td>\n", - " 
</tr>\n", - " <tr>\n", - " <th>3</th>\n", - " <td>4.684743</td>\n", - " <td>0.434315</td>\n", - " <td>4.072805</td>\n", - " <td>0.191321</td>\n", - " <td>1</td>\n", - " </tr>\n", - " <tr>\n", - " <th>4</th>\n", - " <td>4.148710</td>\n", - " <td>0.570586</td>\n", - " <td>1.461568</td>\n", - " <td>0.260218</td>\n", - " <td>0</td>\n", - " </tr>\n", - " </tbody>\n", - "</table>\n", - "</div>" - ], - "text/plain": [ - " alcohol_content bitterness darkness fruitiness is_yummy\n", - "0 3.739295 0.422503 0.989463 0.215791 0\n", - "1 4.207849 0.841668 0.928626 0.380420 0\n", - "2 4.709494 0.322037 5.374682 0.145231 1\n", - "3 4.684743 0.434315 4.072805 0.191321 1\n", - "4 4.148710 0.570586 1.461568 0.260218 0" - ] - }, - "execution_count": 1, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "import pandas as pd\n", - "\n", - "features = pd.read_csv(\"beers.csv\")\n", - "features.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "<div class=\"alert alert-block alert-warning\">\n", - "<i class=\"fa fa-warning\"></i> <strong>Definitions</strong>\n", - "<ul>\n", - " <li>every row of such a matrix is called a <strong>sample</strong> or <strong>feature vector</strong>;</li>\n", - " <li>the cells in a row are <strong>feature values</strong>;</li>\n", - " <li>every column name is called a <strong>feature name</strong> or <strong>attribute</strong>.</li>\n", - "</ul>\n", - "\n", - "Features are also commonly called <strong>variables</strong>.\n", - "</div>" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The table shown above holds five samples.\n", - "\n", - "The feature names are `alcohol_content`, `bitterness`, `darkness`, `fruitiness` and `is_yummy`.\n", - "\n", - "<div class=\"alert alert-block alert-warning\">\n", - "<i class=\"fa fa-warning\"></i> <strong>More definitions</strong>\n", - "<ul>\n", - " <li>The first four features have continuous numerical values within some ranges - these are 
called <strong>numerical features</strong>,</li>\n", - " <li>the <code>is_yummy</code> feature has only a finite set of values (\"categories\"): <code>0</code> (\"no\") and <code>1</code> (\"yes\") - this is called a <strong>categorical feature</strong>.</li>\n", - "</ul>\n", - "</div>" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "A straightforward application of machine learning to the beer dataset above is: **\"Can we predict `is_yummy` from the other features?\"**\n", - "\n", - "<div class=\"alert alert-block alert-warning\">\n", - "<i class=\"fa fa-warning\"></i> <strong>Even more definitions</strong>\n", - "\n", - "In the context of the question above we call:\n", - "<ul>\n", - " <li>the <code>alcohol_content</code>, <code>bitterness</code>, <code>darkness</code>, <code>fruitiness</code> features our <strong>input features</strong>, and</li>\n", - " <li>the <code>is_yummy</code> feature our <strong>target/output feature</strong> or a <strong>label</strong> of our data samples.\n", - " <ul>\n", - " <li>Values of categorical labels, such as <code>0</code> (\"no\") and <code>1</code> (\"yes\") here, are often called <strong>classes</strong>.</li>\n", - " </ul>\n", - " </li>\n", - "</ul>\n", - "</div>" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Most machine learning algorithms require that every sample is represented as a vector containing numbers. Let's now look at two examples of how one can create feature vectors from data that is not naturally given as vectors:\n", - "\n", - "1. Feature vectors from images\n", - "2. Feature vectors from text.\n", - "\n", - "### 1st Example: How to represent images as feature vectors?\n", - "\n", - "In order to simplify our explanations we only consider grayscale images in this section. \n", - "Computers represent images as matrices. 
Every cell in the matrix represents one pixel, and the numerical value in the cell is the gray value of that pixel.\n", - "\n", - "So how can we represent images as vectors?\n", - "\n", - "To demonstrate this we will now load a sample dataset that is included in `scikit-learn`:" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [], - "source": [ - "from sklearn.datasets import load_digits\n", - "import matplotlib.pyplot as plt\n", - "%matplotlib inline" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "['DESCR', 'data', 'images', 'target', 'target_names']\n" - ] - } - ], - "source": [ - "dd = load_digits()\n", - "print(dir(dd))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's plot the first ten digits from this data set:" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [ - { - "data": { - "image/png": 
"iVBORw0KGgoAAAANSUhEUgAABHsAAACNCAYAAAAn1Xb5AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvIxREBQAAGWdJREFUeJzt3X+QXXV5x/HPY4IVCGSXKtAGyhIEq+00S5NxprWaRYn4ozXbIg6iNMtMB0YHJ2mxJZ2xQ6J2DDPVLOOvJiOyabF1jMUNtQw2W1ksztSSmI0UAgysS0kKA9HdBUSJ4NM/7kVSmpjzLPfs2e+T92tmh+zm4bvP2c8959x9cu655u4CAAAAAABADi9rugEAAAAAAAB0DsMeAAAAAACARBj2AAAAAAAAJMKwBwAAAAAAIBGGPQAAAAAAAIkw7AEAAAAAAEiEYQ8AAAAAAEAiDHvazOwkM/uamf3IzB4ys0ua7gkxZnalme0ws2fMbKjpfhBnZr9kZte398EnzWzMzN7edF+IMbMbzewRM3vCzO43sz9puifMnJmdbWY/MbMbm+4FMWY22s7uqfbHfU33hJkxs4vNbE/7eeqDZvbGpntCNQftf89/PGdmn266L8SZWY+Z3WJmk2b2qJl9xszmN90XqjOz15rZN81s2sweMLM/bLqnOjHsecFnJR2QdIqk90n6vJn9RrMtIeh/JH1c0hebbgQzNl/Sw5KWS1oo6SOSvmJmPQ32hLhPSOpx9xMlvUvSx81sacM9YeY+K+nOppvAjF3p7gvaH69puhnEmdkKSddKukzSCZLeJGm80aZQ2UH73wJJp0r6saStDbeFmfmcpMck/YqkXrWer36w0Y5QWXswt03S1yWdJOlySTea2TmNNlYjhj2SzOx4SRdK+it3f8rd75B0s6RLm+0MEe5+k7sPS/pB071gZtz9R+6+zt0n3P1n7v51Sd+XxKCgIO5+t7s/8/yn7Y+zGmwJM2RmF0uakvRvTfcCHMXWS/qou/9H+9y4z933Nd0UZuRCtYYF/950I5iRMyV9xd1/4u6PSrpVEhcHlOPXJf2qpI3u/py7f1PSt5X4d36GPS3nSHrW3e8/6Gu7xc4LNMrMTlFr/7y76V4QY2afM7OnJd0r6RFJtzTcEoLM7ERJH5X0Z033gpfkE2a238y+bWZ9TTeDGDObJ2mZpFe1X3Kwt/3SkWOb7g0zskrS37m7N90IZmRQ0sVmdpyZLZL0drUGPiiXSfrNppuoC8OelgWSnnjR16bVulQWQAPM7BhJX5K0xd3vbbofxLj7B9U6hr5R0k2SnvnF/wfmoI9Jut7d9zbdCGbsakmLJS2StFnSP5sZV9mV5RRJx0h6t1rH015J56r1MmcUxMzOUOtlP1ua7gUz9i21LgZ4QtJeSTskDTfaESLuU+vKuj83s2PM7K1q7ZPHNdtWfRj2tDwl6cQXfe1ESU820Atw1DOzl0n6e7Xuo3Vlw+1ghtqXyN4h6TRJH2i6H1RnZr2Szpe0seleMHPu/h13f9Ldn3H3LWpdrv6OpvtCyI/b//20uz/i7vslfUrkWKJLJd3h7t9vuhHEtZ+b3qrWP2AdL+mVkrrVup8WCuDuP5XUL+mdkh6VdJWkr6g1uEuJYU/L/ZLmm9nZB31tiXjpCDDrzMwkXa/Wv2Ze2D4wo2zzxT17StMnqUfSf5vZo5I+LOlCM/tuk03hJXO1LllHIdx9Uq1fRA5+2Q8vASrTH4urekp2kqRfk/SZ9gD9B5JuEIPXorj799x9ubv/srtfoNbVr//ZdF91Ydij1k1h1ZrSftTMjjezN0haqdaVBSiEmc03s1dImidpnpm9grdDLNLnJb1W0h+4+4+PVIy5xcxObr9F8AIzm2dmF0h6r7jBb2k2qzWg621//K2kf5F0QZNNoToz6zKzC54/F5rZ+9R6FyfuL1GeGyR9qH187Zb0p2q9mwwKYWa/q9bLKXkXrkK1r6r
7vqQPtI+pXWrdg+l7zXaGCDP7rfZ58Tgz+7Ba76w21HBbtWHY84IPSjpWrdfx/aOkD7g7V/aU5SNqXe68VtL723/mNe0Fab+e/Qq1frl81Myean+8r+HWUJ2r9ZKtvZImJf2NpDXufnOjXSHE3Z9290ef/1Dr5c4/cffHm+4NlR0j6eOSHpe0X9KHJPW/6M0oUIaPSbpTrSvR90jaJemvG+0IUask3eTu3CKibH8k6W1qHVcfkPRTtYavKMelar1xyGOS3iJpxUHvIJuOcTN4AAAAAACAPLiyBwAAAAAAIBGGPQAAAAAAAIkw7AEAAAAAAEiEYQ8AAAAAAEAitbwttZnVetfn7u7uUP2iRYsq1z7xxBOhtfft2xeqf+6550L1Ue5unVin7gyjzjnnnMq18+fHHtbRDKenp0P1M7Df3V/ViYXmWo4LFiyoXPvqV786tPbTTz8dqr///nrfkKaUffHUU08N1UeOp888E3tzgz179oTq6z6eKvG+OG/evMq1PT09obUffPDBYDf1KmVfjJznJOnAgQOVaycmJoLdzDlp98U6n9/cc8890XZqVcq+ePLJJ4fqI8fT6O8wxx57bKg+el686667ousXsy+efvrpofqurq7Ktfv37w+t/dhjj4Xq+X2x5ayzzgrVR/bFun8PmAWV9sVahj11O//880P1GzZsqFw7MjISWnvt2rWh+snJyVA9WjZv3ly5NnKwlqRrrrkmVL9t27ZQ/Qw8VPc3aMqyZcsq1w4PD4fWHhsbC9X39fWF6rNatWpVqD5yPB0fHw+tHXl8SLNyPE27L55wwgmVaz/5yU+G1u7v74+2A8XOc1JsgDMwMBBrZu5Juy/W+fymt7c32g4kXXLJJaH6SC7R4+OSJUtC9dF/kIwO86emporZF6+66qpQfSSboaGh0NqDg4Oh+qmpqVB9VtHnH5F9McHvAZX2RV7GBQAAAAAAkEilYY+Zvc3M7jOzB8wsdikL5gQyzIEcy0eGOZBj+cgwB3IsHxnmQI7lI8N8jjjsMbN5kj4r6e2SXifpvWb2urobQ+eQYQ7kWD4yzIEcy0eGOZBj+cgwB3IsHxnmVOXKntdLesDdx939gKQvS1pZb1voMDLMgRzLR4Y5kGP5yDAHciwfGeZAjuUjw4SqDHsWSXr4oM/3tr/2f5jZ5Wa2w8x2dKo5dAwZ5kCO5SPDHMixfGSYAzmWjwxzIMfykWFCHXs3LnffLGmzNPfe1hLVkGEO5Fg+MsyBHMtHhjmQY/nIMAdyLB8ZlqXKlT37JJ1+0Oentb+GcpBhDuRYPjLMgRzLR4Y5kGP5yDAHciwfGSZUZdhzp6SzzexMM3u5pIsl3VxvW+gwMsyBHMtHhjmQY/nIMAdyLB8Z5kCO5SPDhI74Mi53f9bMrpT0DUnzJH3R3e+uvTN0DBnmQI7lI8McyLF8ZJgDOZaPDHMgx/KRYU6V7tnj7rdIuqXmXlAjMsyBHMtHhjmQY/nIMAdyLB8Z5kCO5SPDfDp2g+bZtGHDhlD94sWLK9d2d3eH1v7hD38Yqn/Pe94Tqt+6dWuoPqupqanKtcuXLw+tfd5554Xqt23bFqrPrLe3N1R/2223Va6dnp4Ord3T0xOqzyp6fLzoootC9VdccUXl2k2bNoXWXrp0aah+ZGQkVI8XDAwMVK4dGxurrxH8XPQYFjnXrVq1KrT2Qw89FKrn+PuClStj71QcyXH9+vXRdjALIs9R16xZE1o7Wt/V1RWqj/Remuhz1IjIOVSS+vr6aq0vRfRcET2eRrjH7i29e/fuUH2dj7+IKvfsAQAAAAAAQCEY9gAAAAAAACTCsAcAAAAAACARhj0AAAAAAACJMOwBAAAAAABIhGEPAAAAAABAIgx7AAAAAAAAEmHYAwAAAAAAkAjDHgAAAAAAgEQY9gAAAAAAACTCsAcAAAAAACCR+U03IElLly4N1S9evDhUf9ZZZ1WuHR8fD629ffv2UH10W7du3RqqL0Vvb2+ovq+
vr55GJI2NjdW2dnb9/f2h+t27d1euHR4eDq19zTXXhOqz2rx5c6j+2muvDdXv2LGjcm30eDoyMhKqxwu6urpC9QMDA5VrBwcHQ2v39PSE6qMmJiZqXb8pU1NTofozzjijcu309HRo7dHR0VB99PEX3daSrF+/vra1o+dFzEz0mBexbt26UH30eFrn8+XSRJ/fR84tkXOoFD/mRXOMHrObEj1XRN1+++2Va6PPJUrdt7iyBwAAAAAAIBGGPQAAAAAAAIkccdhjZqeb2W1mdo+Z3W1mq2ejMXQOGeZAjuUjwxzIsXxkmAM5lo8McyDH8pFhTlXu2fOspKvc/btmdoKknWa23d3vqbk3dA4Z5kCO5SPDHMixfGSYAzmWjwxzIMfykWFCR7yyx90fcffvtv/8pKQ9khbV3Rg6hwxzIMfykWEO5Fg+MsyBHMtHhjmQY/nIMKfQu3GZWY+kcyV95xB/d7mkyzvSFWpDhjmQY/nIMAdyLB8Z5kCO5SPDHMixfGSYR+Vhj5ktkPRPkta4+xMv/nt33yxpc7vWO9YhOoYMcyDH8pFhDuRYPjLMgRzLR4Y5kGP5yDCXSu/GZWbHqBX6l9z9pnpbQh3IMAdyLB8Z5kCO5SPDHMixfGSYAzmWjwzzqfJuXCbpekl73P1T9beETiPDHMixfGSYAzmWjwxzIMfykWEO5Fg+MsypypU9b5B0qaQ3m9lY++MdNfeFziLDHMixfGSYAzmWjwxzIMfykWEO5Fg+MkzoiPfscfc7JNks9IKakGEO5Fg+MsyBHMtHhjmQY/nIMAdyLB8Z5hR6N666dHd3h+p37twZqh8fHw/VR0R7yWrNmjWh+nXr1oXqFy5cGKqPGB0drW3t7AYHB0P1ExMTta29bdu2UH1W0ePd4sWLa6sfGRkJrR09F0xOTobqMxsYGAjV9/T0VK4dGhoKrR3dd6empkL10fNHKSLHR0lasmRJ5droOXRsbCxUH80ws66urlD97t27K9dGc0FLX19frfUR0efLUf39/aH66PG9JNFt27VrV+XayDlUih8jo+eDUtS9XZHH//DwcGjt6LF9rqh0g2YAAAAAAACUgWEPAAAAAABAIgx7AAAAAAAAEmHYAwAAAAAAkAjDHgAAAAAAgEQY9gAAAAAAACTCsAcAAAAAACARhj0AAAAAAACJMOwBAAAAAABIhGEPAAAAAABAIvObbkCSuru7Q/UjIyM1dRIX7X1ycrKmTpo1ODgYqh8aGgrV1/lz6+rqqm3t0kR/FmvWrAnV9/f3h+ojBgYGals7s/Hx8VD9SSedVLl2+/btobWj9StWrAjVl3T8XblyZah+48aNofotW7aE6iNWr14dqr/ssstq6qQs0eNjX19f5dre3t7Q2tHHU1T0OUNJoufRiYmJyrXRc+7w8HBtvZQkul3R/SWyL0ZFjwujo6P1NFKgOp/fL1++PFR/5plnhuqz7otTU1Oh+t27d4fqI8/zrrvuutDa0eNCT09PqL6uzLmyBwAAAAAAIBGGPQAAAAAAAIlUHvaY2Twz22VmX6+zIdSHDHMgx/KRYQ7kWD4yzIEcy0eGOZBj+cgwl8iVPasl7amrEcwKMsyBHMtHhjmQY/nIMAdyLB8Z5kCO5SPDRCoNe8zsNEnvlPSFettBXcgwB3IsHxnmQI7lI8McyLF8ZJgDOZaPDPOpemXPoKS/kPSzwxWY2eVmtsPMdnSkM3QaGeZAjuUjwxzIsXxkmAM5lo8McyDH8pFhMkcc9pjZ70t6zN13/qI6d9/s7svcfVnHukNHkGEO5Fg+MsyBHMtHhjmQY/nIMAdyLB8Z5lTlyp43SHqXmU1I+rKkN5vZjbV2hU4jwxzIsXxkmAM5lo8McyDH8pFhDuRYPjJM6IjDHnf/S3c/zd17JF0s6Zvu/v7aO0PHkGEO5Fg+MsyBHMtHhjmQY/nIMAdyLB8Z5hR5Ny4AAAAAAADMcfMjxe4+Kmm0lk4wK8gwB3IsHxnmQI7lI8McyLF
8ZJgDOZaPDPMIDXvqMjk5GapfunRpTZ1I3d3dofpoL1u3bg3Vo369vb2h+rGxsZo6ad66detC9atXr66nEUn9/f2h+qmpqZo6wcEix+sVK1aE1t60aVOo/uqrrw7Vr127NlTfpOnp6VrrV61aVbk2eoyMGh4ernX9rEZHR5tu4ed6enqabmHOmJiYCNUvX768cm1XV1do7Y0bN4bqzz333FB9Kc+HoplEn3+4e21rz6X9vGnRc9Ftt90Wql+/fn3l2ugxL3qeiz5Ooo/xUkQzj9TXffwaHBwM1Uczr4qXcQEAAAAAACTCsAcAAAAAACARhj0AAAAAAACJMOwBAAAAAABIhGEPAAAAAABAIgx7AAAAAAAAEmHYAwAAAAAAkAjDHgAAAAAAgEQY9gAAAAAAACTCsAcAAAAAACARhj0AAAAAAACJzG+6AUkaHx8P1S9dujRUf9FFF9VSOxPXXnttresDL8XQ0FCovq+vL1S/ZMmSyrXDw8Ohtbdt2xaqv+GGG2pdvxQbNmwI1Y+MjFSu7e7uDq19/vnnh+q3bt0aqi/J6OhoqL6rqytU39vbW1svW7ZsCdVPTU2F6rNauXJlqH56erpy7bp164LdxESP15lFz6MbN26sXDsxMRFau6enJ1Tf398fqh8bGwvVl2JwcDBUH9kXb7/99mg7aIs+/iO5SLHco/vWrl27QvUDAwOh+rqP8aWIHJOi+3k0k+jxtC5c2QMAAAAAAJAIwx4AAAAAAIBEKg17zKzLzL5qZvea2R4z+526G0NnkWEO5Fg+MsyBHMtHhjmQY/nIMAdyLB8Z5lP1nj3XSbrV3d9tZi+XdFyNPaEeZJgDOZaPDHMgx/KRYQ7kWD4yzIEcy0eGyRxx2GNmCyW9SdKAJLn7AUkH6m0LnUSGOZBj+cgwB3IsHxnmQI7lI8McyLF8ZJhTlZdxnSnpcUk3mNkuM/uCmR3/4iIzu9zMdpjZjo53iZeKDHMgx/KRYQ7kWD4yzIEcy0eGOZBj+cgwoSrDnvmSflvS5939XEk/krT2xUXuvtndl7n7sg73iJeODHMgx/KRYQ7kWD4yzIEcy0eGOZBj+cgwoSrDnr2S9rr7d9qff1WtBwLKQYY5kGP5yDAHciwfGeZAjuUjwxzIsXxkmNARhz3u/qikh83sNe0vvUXSPbV2hY4iwxzIsXxkmAM5lo8McyDH8pFhDuRYPjLMqeq7cX1I0pfad+Uel3RZfS2hJmSYAzmWjwxzIMfykWEO5Fg+MsyBHMtHhslUGva4+5gkXpdXMDLMgRzLR4Y5kGP5yDAHciwfGeZAjuUjw3yqXtlTq/Hx8VD92rX/715Rv9CGDRsq1+7cuTO09rJl7A8zMTU1Farftm1b5dqVK1eG1u7r6wvVDw0NhepLMjY2Fqrv7e2trX7dunWhtaO5T0xMhOojj8GSTE5Ohuo3bdpUUyfS1q1bQ/VXXHFFTZ3kFzkGL1y4MLR25mNknc4777xQ/erVq2vqRNqyZUuofnR0tJ5GChR9/Pf09FSuHRgYCK0dzWV4eDhUn1X0eeGqVasq10af/+IF0Z9d9PEfeT40PT0dWjv6HHJwcDBUn1X05xD5PaOrqyu0dvS4EP2dqi5VbtAMAAAAAACAQjDsAQAAAAAASIRhDwAAAAAAQCIMewAAAAAAABJh2AMAAAAAAJAIwx4AAAAAAIBEGPYAAAAAAAAkwrAHAAAAAAAgEYY9AAAAAAAAiTDsAQAAAAAASIRhDwAAAAAAQCLm7p1f1OxxSQ+96MuvlLS/499s7mpie89w91d1YqHDZCgdXTk2ta1153g0ZSixL2bAvpgD+2L52BdzYF8sH/tiDuyL5ZvT+2Itw55DfiOzHe6+bFa+2RyQdXuzbtehZN3WrNt1OFm3N+t2HUrWbc26XYeTdXuzbtehZN3WrNt1OFm3N+t2HUrWbc26XYeTdXuzbtehzPVt5WVcAAAAAAAAiTDsAQAAAAAASGQ2hz2bZ/F7zQVZtzf
rdh1K1m3Nul2Hk3V7s27XoWTd1qzbdThZtzfrdh1K1m3Nul2Hk3V7s27XoWTd1qzbdThZtzfrdh3KnN7WWbtnDwAAAAAAAOrHy7gAAAAAAAASYdgDAAAAAACQyKwMe8zsbWZ2n5k9YGZrZ+N7NsXMJszsLjMbM7MdTffTKUdThhI5ZkCGOZBj+cgwB3IsHxnmQI7lI8McSsix9nv2mNk8SfdLWiFpr6Q7Jb3X3e+p9Rs3xMwmJC1z9/1N99IpR1uGEjlmQIY5kGP5yDAHciwfGeZAjuUjwxxKyHE2rux5vaQH3H3c3Q9I+rKklbPwfdE5ZJgDOZaPDHMgx/KRYQ7kWD4yzIEcy0eGc9BsDHsWSXr4oM/3tr+WlUv6VzPbaWaXN91MhxxtGUrkmAEZ5kCO5SPDHMixfGSYAzmWjwxzmPM5zm+6gYR+z933mdnJkrab2b3u/q2mm0IYOZaPDHMgx/KRYQ7kWD4yzIEcy0eGOcz5HGfjyp59kk4/6PPT2l9Lyd33tf/7mKSvqXVJW+mOqgwlcsyADHMgx/KRYQ7kWD4yzIEcy0eGOZSQ42wMe+6UdLaZnWlmL5d0saSbZ+H7zjozO97MTnj+z5LeKum/mu2qI46aDCVyzIAMcyDH8pFhDuRYPjLMgRzLR4Y5lJJj7S/jcvdnzexKSd+QNE/SF9397rq/b0NOkfQ1M5NaP9t/cPdbm23ppTvKMpTIMQMyzIEcy0eGOZBj+cgwB3IsHxnmUESOtb/1OgAAAAAAAGbPbLyMCwAAAAAAALOEYQ8AAAAAAEAiDHsAAAAAAAASYdgDAAAAAACQCMMeAAAAAACARBj2AAAAAAAAJMKwBwAAAAAAIJH/BbKiUL0lvDQ5AAAAAElFTkSuQmCC\n", - "text/plain": [ - "<Figure size 1440x360 with 10 Axes>" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "N = 10\n", - "\n", - "plt.figure(figsize=(2 * N, 5))\n", - "\n", - "for i, image in enumerate(dd.images[:N]):\n", - " plt.subplot(1, N, i + 1).set_title(dd.target[i])\n", - " plt.imshow(image, cmap=\"gray\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The data is a set of 8 x 8 matrices with values 0 to 15 (black to white). The range 0 to 15 is fixed for this specific data set. Other formats allow e.g. values 0..255 or floating point values in the range 0 to 1." - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "images.ndim: 3\n", - "images[0].shape: (8, 8)\n", - "images[0]:\n", - " [[ 0. 0. 5. 13. 9. 1. 0. 0.]\n", - " [ 0. 0. 13. 15. 10. 15. 5. 0.]\n", - " [ 0. 3. 15. 2. 0. 11. 8. 0.]\n", - " [ 0. 4. 12. 0. 0. 8. 8. 0.]\n", - " [ 0. 5. 8. 0. 0. 9. 8. 0.]\n", - " [ 0. 4. 11. 0. 
1. 12. 7. 0.]\n", - " [ 0. 2. 14. 5. 10. 12. 0. 0.]\n", - " [ 0. 0. 6. 13. 10. 0. 0. 0.]]\n", - "images.shape: (1797, 8, 8)\n", - "images.size: 115008\n", - "images.dtype: float64\n", - "images.itemsize: 8\n", - "target.size: 1797\n", - "target_names: [0 1 2 3 4 5 6 7 8 9]\n", - "DESCR:\n", - " Optical Recognition of Handwritten Digits Data Set\n", - "===================================================\n", - "\n", - "Notes\n", - "-----\n", - "Data Set Characteristics:\n", - " :Number of Instances: 5620\n", - " :Number of Attributes: 64\n", - " :Attribute Information: 8x8 image of integer pixels in the range 0..16.\n", - " :Missing Attribute Values: None\n", - " :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)\n", - " :Date: July; 1998\n", - "\n", - "This is a copy of the test set of the UCI ML hand-written digits datasets\n", - "http://archive.ics.uci.edu/ml/datas \n", - "[...]\n" - ] - } - ], - "source": [ - "print(\"images.ndim:\", dd.images.ndim) # number of dimensions of the array\n", - "print(\"images[0].shape:\", dd.images[0].shape) # dimensions of a first sample array\n", - "print(\"images[0]:\\n\", dd.images[0]) # first sample array\n", - "print(\"images.shape:\", dd.images.shape) # dimensions of the array of all samples\n", - "print(\"images.size:\", dd.images.size) # total number of elements of the array\n", - "print(\"images.dtype:\", dd.images.dtype) # type of the elements in the array\n", - "print(\"images.itemsize:\", dd.images.itemsize) # size in bytes of each element of the array\n", - "print(\"target.size:\", dd.target.size) # size of the target feature vector (labels of samples)\n", - "print(\"target_names:\", dd.target_names) # classes vector\n", - "print(\"DESCR:\\n\", dd.DESCR[:500], \"\\n[...]\") # description of the dataset" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "To transform such an image to a feature vector we just have to flatten the matrix by concatenating the rows to one single vector of size 64:" - ] 
- }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "image_vector.shape: (64,)\n", - "image_vector: [ 0. 0. 5. 13. 9. 1. 0. 0. 0. 0. 13. 15. 10. 15. 5. 0. 0. 3.\n", - " 15. 2. 0. 11. 8. 0. 0. 4. 12. 0. 0. 8. 8. 0. 0. 5. 8. 0.\n", - " 0. 9. 8. 0. 0. 4. 11. 0. 1. 12. 7. 0. 0. 2. 14. 5. 10. 12.\n", - " 0. 0. 0. 0. 6. 13. 10. 0. 0. 0.]\n" - ] - } - ], - "source": [ - "image_vector = dd.images[0].flatten()\n", - "print(\"image_vector.shape:\", image_vector.shape)\n", - "print(\"image_vector:\", image_vector)" - ] - }, - { - "cell_type": "code", - "execution_count": 40, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "(1797, 8, 8)\n", - "(1797, 64)\n", - "[ 0. 0. 5. 13. 9. 1. 0. 0. 0. 0. 13. 15. 10. 15. 5. 0. 0. 3.\n", - " 15. 2. 0. 11. 8. 0. 0. 4. 12. 0. 0. 8. 8. 0. 0. 5. 8. 0.\n", - " 0. 9. 8. 0. 0. 4. 11. 0. 1. 12. 7. 0. 0. 2. 14. 5. 10. 12.\n", - " 0. 0. 0. 0. 6. 13. 10. 0. 0. 0.]\n" - ] - } - ], - "source": [ - "print(dd.images.shape)\n", - "\n", - "# reshape to (1797, 64):\n", - "images_flat = dd.images.reshape(-1, 64)\n", - "print(images_flat.shape)\n", - "print(images_flat[0])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 2nd Example: How to represent textual data as feature vectors?" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "If we start a machine learning project for texts, we first have to choose a dictionary - a set of words - for this project. 
The final representation of a text as a feature vector depends on this dictionary.\n", - "\n", - "Such a dictionary can be very large, but for the sake of simplicity we use a very small enumerated dictionary to explain the overall procedure:\n", - "\n", - "\n", - "| Word | Index |\n", - "|----------|-------|\n", - "| like | 0 |\n", - "| dislike | 1 |\n", - "| american | 2 |\n", - "| italian | 3 |\n", - "| beer | 4 |\n", - "| pizza | 5 |\n", - "\n", - "To \"vectorize\" a given text we count the occurrences of those words in the text that also exist in the vocabulary and put each count at the corresponding `Index`.\n", - "\n", - "E.g. `\"I dislike american pizza, but american beer is nice\"`:\n", - "\n", - "| Word | Index | Count |\n", - "|----------|-------|-------|\n", - "| like | 0 | 0 |\n", - "| dislike | 1 | 1 |\n", - "| american | 2 | 2 |\n", - "| italian | 3 | 0 |\n", - "| beer | 4 | 1 |\n", - "| pizza | 5 | 1 |\n", - "\n", - "The resulting feature vector is the `Count` column, which is:\n", - "\n", - "`[0, 1, 2, 0, 1, 1]`\n", - "\n", - "In real-world scenarios the dictionary is much bigger, which often results in vectors with only a few non-zero entries (so-called **sparse vectors**)." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Below is a short code example to demonstrate how text feature vectors can be created with `scikit-learn`.\n", - "<div class=\"alert alert-block alert-info\">\n", - "<i class=\"fa fa-info-circle\"></i>\n", - "Such vectorization is usually not done manually. 
There are also improved, but more complicated, procedures which compute multiplicative weights for the vector entries to emphasize informative words, such as the <a href=\"https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html\">\"term frequency-inverse document frequency\" vectorizer</a>.\n", - "</div>" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[0 1 2 0 1 1]\n" - ] - } - ], - "source": [ - "from sklearn.feature_extraction.text import CountVectorizer\n", - "\n", - "vocabulary = {\n", - " \"like\": 0,\n", - " \"dislike\": 1,\n", - " \"american\": 2,\n", - " \"italian\": 3,\n", - " \"beer\": 4,\n", - " \"pizza\": 5,\n", - "}\n", - "\n", - "vectorizer = CountVectorizer(vocabulary=vocabulary)\n", - "\n", - "# this is how one can create a count vector for a given piece of text:\n", - "vector = vectorizer.fit_transform([\n", - " \"I dislike american pizza. 
But american beer is nice\"\n", - "]).toarray().flatten()\n", - "print(vector)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## ML lingo: What are the different types of datasets?\n", - "\n", - "<div class=\"alert alert-block alert-warning\">\n", - "<i class=\"fa fa-warning\"></i> <strong>Definitions</strong>\n", - "\n", - "The subset of the data used for:\n", - "<ul>\n", - " <li>learning (training) a model is called the <strong>training set</strong>;</li>\n", - " <li>improving the ML method's performance by adjusting its parameters is called the <strong>validation set</strong>;</li>\n", - " <li>assessing the final performance is called the <strong>test set</strong>.</li>\n", - "</ul>\n", - "</div>\n", - "\n", - "<table>\n", - " <tr>\n", - " <td><img src=\"./data_split.png\" width=300px></td>\n", - " </tr>\n", - " <tr>\n", - " <td style=\"font-size:75%\"><center>Img source: https://dziganto.github.io</center></td>\n", - " </tr>\n", - "</table>\n", - "\n", - "\n", - "You will learn more about how to wisely select subsets of your data, and about related issues, later in the course. For now just remember that:\n", - "1. the training and validation datasets must be disjoint during each iteration of the method improvement, and\n", - "2. the test dataset must be independent of the model (hence, of the other datasets), i.e. it is indeed used only for the final assessment of the method's performance (think: locked in the safe until you're done with model tweaking).\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Taxonomy of machine learning\n", - "\n", - "Most applications of ML belong to one of two categories: **supervised** and **unsupervised** learning.\n", - "\n", - "### Supervised learning \n", - "\n", - "In supervised learning the data comes with an additional target/label value that we want to predict. 
Such a problem can be either \n", - "\n", - "- **classification**: we want to predict a categorical value.\n", - " \n", - "- **regression**: we want to predict a continuous numerical value.\n", - " \n", - " \n", - "\n", - "Examples of supervised learning:\n", - "\n", - "- Classification: predict the class `is_yummy` based on the attributes `alcohol_content`, `bitterness`, `darkness` and `fruitiness` (a standard two-class problem).\n", - "\n", - "- Classification: predict the digit shown based on an 8 x 8 pixel image (a multi-class problem).\n", - "\n", - "- Regression: predict the temperature based on how long the sun was shining in the last 10 minutes.\n", - "\n", - "\n", - "\n", - "<table>\n", - " <tr>\n", - " <td><img src=\"./classification-svc-2d-poly.png\" width=400px></td>\n", - " <td><img src=\"./regression-lin-1d.png\" width=400px></td>\n", - " </tr>\n", - " <tr>\n", - " <td><center>Classification</center></td>\n", - " <td><center>Linear regression</center></td>\n", - " </tr>\n", - "</table>\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Unsupervised learning \n", - "\n", - "In unsupervised learning the training data consists of samples without any corresponding target/label values, and the aim is to find structure in the data. Some common applications are:\n", - "\n", - "- Clustering: find groups in data.\n", - "- Density estimation, novelty detection: find a probability distribution that describes your data.\n", - "- Dimension reduction (e.g. 
PCA): find latent structures in your data.\n", - "\n", - "Examples of unsupervised learning:\n", - "\n", - "- Can we split up our beer data set into sub-groups of similar beers?\n", - "- Can we reduce our data set because some groups of features are correlated?\n", - "\n", - "<table>\n", - " <tr>\n", - " <td><img src=\"./cluster-image.png\" width=400px></td>\n", - " <td><img src=\"./nonlin-pca.png\" width=400px></td>\n", - " </tr>\n", - " <tr>\n", - " <td><center>Clustering</center></td>\n", - " <td><center>Dimension reduction: detecting 2D structure in 3D data</center></td>\n", - " </tr>\n", - "</table>\n", - "\n", - "\n", - "\n", - "This course will only introduce concepts and methods from **supervised learning**." - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.7.1" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -}
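As an outlook, the supervised-learning question posed earlier in the chapter ("Can we predict `is_yummy` from the other features?") can be sketched end to end: split the data, train a classifier, and assess it on the held-out test set. This is an illustrative sketch only: since `beers.csv` belongs to the course material, a small synthetic table with the same column names (and an invented labeling rule) is fabricated here, and `LogisticRegression` is just one possible choice of classifier.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# fabricate a synthetic stand-in for beers.csv (same column names)
rng = np.random.RandomState(42)
n = 300
features = pd.DataFrame({
    "alcohol_content": rng.normal(4.4, 0.5, n),
    "bitterness": rng.uniform(0.0, 1.0, n),
    "darkness": rng.uniform(0.0, 6.0, n),
    "fruitiness": rng.uniform(0.0, 0.5, n),
})
# invented rule for the labels: dark, strong beers are yummy
features["is_yummy"] = (
    features["darkness"] + features["alcohol_content"] > 8.0
).astype(int)

input_features = features[["alcohol_content", "bitterness", "darkness", "fruitiness"]]
labels = features["is_yummy"]

# keep a test set locked away for the final assessment
X_train, X_test, y_train, y_test = train_test_split(
    input_features, labels, test_size=0.25, random_state=0
)

classifier = LogisticRegression()
classifier.fit(X_train, y_train)  # "training" = adjusting internal parameters

accuracy = classifier.score(X_test, y_test)
print("accuracy on held-out test set:", round(accuracy, 2))
```

On the real beer data the accuracy would of course depend on how informative the features actually are; the point of the sketch is the fit/score workflow and the strict separation of training and test data.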