{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Chapter 1: General Introduction to machine learning (ML)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## ML = \"learning models from data\"\n",
    "\n",
    "\n",
    "### About models\n",
    "\n",
    "A \"model\" allows us to explain observations and to answer questions. For example:\n",
    "\n",
    "   1. Where will my car stop if I apply the brakes now at a given velocity?\n",
    "   2. Where on the night sky will I see the moon tonight?\n",
    "   3. Is the email I received spam?\n",
    "   4. Which article \"X\" should I recommend to a customer \"Y\"?\n",
    "   \n",
    "- The first two questions can be answered based on existing physical models (formulas). \n",
    "\n",
    "- For questions 3 and 4 it is difficult to develop explicitly formulated models. \n",
    "\n",
    "### What is needed to apply ML?\n",
    "\n",
    "Problems 3 and 4 have the following in common:\n",
    "\n",
    "- No exact model is known or implementable, because we only have a vague understanding of the problem domain.\n",
    "- But enough data is available, and it implicitly contains sufficient information.\n",
    "\n",
    "\n",
    "\n",
    "E.g. for the spam email example:\n",
    "\n",
    "- We have no explicit formula for such a task (and devising one would boil down to a lot of trial and error with different statistics or scores, and possibly with different weightings of them).\n",
    "- We have a vague understanding of the problem domain: we know that some words are specific to spam emails and others are specific to my personal and work-related emails.\n",
    "- My mailbox is full of examples of both spam and non-spam emails.\n",
    "\n",
    "**In such cases machine learning offers approaches to build models based on example data.**\n",
    "\n",
    "<div class=\"alert alert-block alert-info\">\n",
    "<i class=\"fa fa-info-circle\"></i>\n",
    "The closely related concept of <strong>data mining</strong> usually refers to the use of predictive machine learning models to explicitly discover previously unknown knowledge in a specific data set, for instance association rules between customer and article types in Problem 4 above.\n",
    "</div>\n",
    "\n",
    "\n",
    "\n",
    "## ML: What is \"learning\"?\n",
    "\n",
    "To create a predictive model, we must first **train** such a model on given data. \n",
    "<div class=\"alert alert-block alert-info\">\n",
    "<i class=\"fa fa-info-circle\"></i>\n",
    "Alternative names for \"to train\" a model are \"to <strong>fit</strong>\" or \"to <strong>learn</strong>\" a model.\n",
    "</div>\n",
    "\n",
    "All ML algorithms have in common that they rely on internal data structures and/or parameters. Learning builds up these data structures or adjusts these parameters based on the given data. After training, the model can be used to explain observations or to answer questions.\n",
    "\n",
    "An important difference between explicit models and models learned from data:\n",
    "\n",
    "- Explicit models usually offer exact answers to questions.\n",
    "- Models we learn from data usually come with inherent uncertainty."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "## Some history\n",
    "\n",
    "Some parts of ML are older than you might think. This is a rough timeline with a few selected achievements from this field:\n",
    "\n",
    "    1805: Least squares regression\n",
    "    1812: Bayes' rule\n",
    "    1913: Markov Chains\n",
    "    1951: First neural network\n",
    "    1957-65: \"k-means\" clustering algorithm\n",
    "    1959: Term \"machine learning\" is coined by Arthur Samuel, an AI pioneer\n",
    "    1969: Book \"Perceptrons\": Limitations of Neural Networks\n",
    "    1984: Book \"Classification And Regression Trees\"\n",
    "    1974-86: Neural networks learning breakthrough: backpropagation method\n",
    "    1995: Random Forests and Support Vector Machines methods\n",
    "    1998: Public appearance: first ML implementations of spam filtering methods; naive Bayes Classifier method\n",
    "    2006-12: Neural networks learning breakthrough: deep learning\n",
    "    \n",
    "So the field is not as new as one might think, but due to \n",
    "\n",
    "- more available data\n",
    "- more processing power \n",
    "- development of better algorithms \n",
    "\n",
    "more applications of machine learning appeared during the last 15 years."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Machine learning with Python\n",
    "\n",
    "As of 2018, `Python` is the dominant programming language for ML. The advent of deep learning in particular pushed this forward: frameworks such as `TensorFlow` and `PyTorch` offered `Python` releases from their first versions.\n",
    "\n",
    "The prevalent packages in the Python ecosystem used for ML include:\n",
    "\n",
    "- `pandas` for handling tabular data\n",
    "- `matplotlib` and `seaborn` for plotting\n",
    "- `scikit-learn` for classical (non-deep-learning) ML\n",
    "- `TensorFlow`, `PyTorch` and `Keras` for deep learning.\n",
    "\n",
    "`scikit-learn` is very comprehensive, and its online documentation itself provides a good introduction to ML."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## ML lingo: What are \"features\"?\n",
    "\n",
    "A typical and very common situation is that our data is presented as a table, as in the following example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>alcohol_content</th>\n",
       "      <th>bitterness</th>\n",
       "      <th>darkness</th>\n",
       "      <th>fruitiness</th>\n",
       "      <th>is_yummy</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>3.739295</td>\n",
       "      <td>0.422503</td>\n",
       "      <td>0.989463</td>\n",
       "      <td>0.215791</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4.207849</td>\n",
       "      <td>0.841668</td>\n",
       "      <td>0.928626</td>\n",
       "      <td>0.380420</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4.709494</td>\n",
       "      <td>0.322037</td>\n",
       "      <td>5.374682</td>\n",
       "      <td>0.145231</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4.684743</td>\n",
       "      <td>0.434315</td>\n",
       "      <td>4.072805</td>\n",
       "      <td>0.191321</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4.148710</td>\n",
       "      <td>0.570586</td>\n",