{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Chapter 2: Classification"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we learned in the previous chapter, *classification* belongs to the field of *supervised learning*. In such problems the aim is to predict a category. Examples of such categories are\n",
"\n",
"- ok / not ok\n",
"- good / bad / don't know\n",
"- digit 0 / ... / digit 9\n",
"- etc.\n",
"\n",
"In this chapter we introduce the core concepts of classification."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## How could we build a simple classifier?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we look at the beer example, we can assume that the person who rated the beers has preferences such as \"I don't like high alcohol content\" or \"I like fruity beer\"."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This means we could construct a score where high numbers relate to \"favorable beer\". One simple way to implement such a score is to use a weighted sum like\n",
"\n",
"\n",
"    score = -0.1 * alcohol_content + 3 * bitterness + 0.8 * darkness + 1.8 * fruitiness\n",
"\n",
"Positive weights contribute to a higher score and negative weights to a lower one.\n",
"\n",
"The actual weights here are guessed and serve as an example.\n",
"\n",
"The size of the weights also reflects the numerical ranges of the features: alcohol content ranges from about 3 to 6, whereas bitterness lies between 0 and 1.08:"
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>alcohol_content</th>\n",
" <th>bitterness</th>\n",
" <th>darkness</th>\n",
" <th>fruitiness</th>\n",
" <th>is_yummy</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>225.000000</td>\n",
" <td>225.000000</td>\n",
" <td>225.000000</td>\n",
" <td>225.000000</td>\n",
" <td>225.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>4.711873</td>\n",
" <td>0.463945</td>\n",
" <td>2.574963</td>\n",
" <td>0.223111</td>\n",
" <td>0.528889</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>0.437040</td>\n",
" <td>0.227366</td>\n",
" <td>1.725916</td>\n",
" <td>0.117272</td>\n",
" <td>0.500278</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>3.073993</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>4.429183</td>\n",
" <td>0.281291</td>\n",
" <td>1.197640</td>\n",
" <td>0.135783</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>4.740846</td>\n",
" <td>0.488249</td>\n",
" <td>2.026548</td>\n",
" <td>0.242396</td>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>5.005170</td>\n",
" <td>0.631056</td>\n",
" <td>4.043995</td>\n",
" <td>0.311874</td>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>5.955272</td>\n",
" <td>1.080170</td>\n",
" <td>7.221285</td>\n",
" <td>0.535315</td>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" alcohol_content bitterness darkness fruitiness is_yummy\n",
"count 225.000000 225.000000 225.000000 225.000000 225.000000\n",
"mean 4.711873 0.463945 2.574963 0.223111 0.528889\n",
"std 0.437040 0.227366 1.725916 0.117272 0.500278\n",
"min 3.073993 0.000000 0.000000 0.000000 0.000000\n",
"25% 4.429183 0.281291 1.197640 0.135783 0.000000\n",
"50% 4.740846 0.488249 2.026548 0.242396 1.000000\n",
"75% 5.005170 0.631056 4.043995 0.311874 1.000000\n",
"max 5.955272 1.080170 7.221285 0.535315 1.000000"
]
},
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"%matplotlib inline\n",
"\n",
"# read some data\n",
"beer_data = pd.read_csv(\"beers.csv\")\n",
"\n",
"beer_data.describe()"
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": [
"# weighted sum of the features; the weights are guessed for illustration\n",
"scores = (-0.1 * beer_data[\"alcohol_content\"] + 3 * beer_data[\"bitterness\"]\n",
"          + 0.8 * beer_data[\"darkness\"] + 1.8 * beer_data[\"fruitiness\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can plot a histogram of the scores for each class:"
]