Mikolaj Rybinski
committed
{
"cell_type": "code",
"execution_count": null,
"source": [
"# IGNORE THIS CELL WHICH CUSTOMIZES LAYOUT AND STYLING OF THE NOTEBOOK !\n",
"import matplotlib.pyplot as plt\n",
"%matplotlib inline\n",
"%config InlineBackend.figure_format = 'retina'\n",
"import warnings\n",
"warnings.filterwarnings('ignore', category=FutureWarning)\n",
"from IPython.core.display import HTML; HTML(open(\"custom.html\", \"r\").read())"
],
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"# Chapter 2: Classification"
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"As we have learned in the previous chapter, *classification* is a machine learning problem belonging to the group of *supervised learning* problems. In classification the aim is to learn how to predict the class of a categorical label, based on a set of already labelled training examples (hence, supervised). Such labels (categories) and corresponding classes can be:\n",
"- sick: yes / no,\n",
"- character: good / bad / don't know,\n",
"- digit: 0 / ... / 9.\n",
"\n",
"In this chapter we introduce the core concepts of classification."
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"## How to build a simple classifier?"
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"Let's quickly look again at the beer example:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"import pandas as pd\n",
"import seaborn as sns\n",
"sns.set(style=\"ticks\")\n",
"beer_data = pd.read_csv(\"data/beers.csv\")\n",
"for_plot = beer_data.copy()\n",
"# fixes seaborn labels issue\n",
"def translate_label(value):\n",
" return \"no\" if value == 0 else \"yes\"\n",
"for_plot[\"is_yummy\"] = for_plot[\"is_yummy\"].apply(translate_label)\n",
"\n",
"sns.pairplot(for_plot, hue=\"is_yummy\", diag_kind=\"hist\", diag_kws=dict(alpha=.7));\n",
],
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"We can assume that the person who rated these beers has preferences such as:\n",
"* \"I don't like too low alcohol content\",\n",
"* \"I like more fruity beers\", etc."
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"This means we could construct a score where high numbers relate to \"favorable beer\". One simple way to implement such a score is to use a weighted sum like:\n",
"\n",
" score = 1.1 * alcohol_content + 4 * bitterness + 1.5 * darkness + 1.8 * fruitiness\n",
"\n",
"The actual weights here are guessed and serve as an example.\n",
"\n",
"The sizes of the weights reflect the numerical ranges of the features: alcohol content is in the range 3 to 5.9, whereas bitterness is between 0 and 1.08:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"scores = (1.1 * beer_data[\"alcohol_content\"] + 4 * beer_data[\"bitterness\"]\n",
" + 1.5 * beer_data[\"darkness\"] + 1.8 * beer_data[\"fruitiness\"])\n",
"scores.shape"
],
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"Now we can plot histograms of the scores, separately for each class:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"scores_bad = scores[beer_data[\"is_yummy\"] == 0]\n",
"scores_good = scores[beer_data[\"is_yummy\"] == 1]\n",
"plt.hist(scores_bad, bins=25, color=\"steelblue\", alpha=.7) # alpha makes bars translucent\n",
"plt.hist(scores_good, bins=25, color=\"chocolate\", alpha=.7);"
],
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"Consequence: a simple classifier could compute this score and apply a threshold of around 10.5 to assign a class label."
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"def classify(beer_feature):\n",
"    scores = (1.1 * beer_feature[\"alcohol_content\"] + 4 * beer_feature[\"bitterness\"]\n",
"              + 1.5 * beer_feature[\"darkness\"] + 1.8 * beer_feature[\"fruitiness\"])\n",
"    if scores > 10.5:\n",
"        return \"yummy\"\n",
"    return \"not yummy\"\n",
"\n",
"# check this for samples 5 .. 14:\n",
"for i in range(5, 15):\n",
"    is_yummy = translate_label(beer_data[\"is_yummy\"][i])\n",
"    classified_as = classify(beer_data.iloc[i, :])\n",
"    print(i,\n",
"          \"is yummy?\", \"{:3s}\".format(is_yummy),\n",
"          \"classified as\", classified_as)"
],
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"**This is how \"linear\" classifiers work. The magic is in computing the weights and the final threshold so that the classifier performs well.**\n",
"\n",
"<div class=\"alert alert-block alert-info\">\n",
"<i class=\"fa fa-info-circle\"></i>\n",
"Although this seems to be a simplistic concept, linear classifiers can actually work very well, especially for problems with many features (high-dimensional problems).\n",
"</div>\n"
],
"metadata": {}
},
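To make the linear decision rule concrete, here is a minimal standalone sketch in plain Python/NumPy. The weights and the 10.5 threshold are the guessed example values from this notebook, not trained parameters, and the two feature vectors are made up for illustration:

```python
import numpy as np

# Guessed example weights from the text, in feature order:
# [alcohol_content, bitterness, darkness, fruitiness]
weights = np.array([1.1, 4.0, 1.5, 1.8])
threshold = 10.5  # cut-off read off the score histogram above

def linear_classify(features):
    """Classify one beer: weighted sum of its features compared to a threshold."""
    score = float(np.dot(weights, features))
    return "yummy" if score > threshold else "not yummy"

# Two made-up feature vectors for illustration:
print(linear_classify([4.5, 0.8, 2.0, 1.0]))  # score 12.95 -> "yummy"
print(linear_classify([3.0, 0.1, 0.5, 0.2]))  # score 4.81  -> "not yummy"
```

A trained linear classifier does exactly this at prediction time; training only determines `weights` and `threshold` from the labelled examples.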
{
"cell_type": "markdown",
"source": [
"## Exercise section 1\n",
"\n",
"- Modify the weights in the beer classifier and check if you can improve the separation in the histogram.\n",
"- In `scikit-learn` the weights of a trained linear classifier are available via the `coef_` attribute as a 2-dimensional `numpy` array. Extract the weights from the `LogisticRegression` classifier example from the last script and try them out in your weighted-sum scoring function."
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"from sklearn.linear_model import LogisticRegression\n",
"classifier = LogisticRegression()\n",
"input_features = beer_data.iloc[:, :-1]\n",
"labels = beer_data.iloc[:, -1]\n",
"classifier.fit(input_features, labels)\n",
"w = classifier.coef_[0]\n",
"\n",
"scores = (w[0] * beer_data[\"alcohol_content\"] + w[1] * beer_data[\"bitterness\"]\n",
" + w[2] * beer_data[\"darkness\"] + w[3] * beer_data[\"fruitiness\"])\n",
],
"outputs": [],
"metadata": {}
}