Mikolaj Rybinski
committed
"execution_count": null,
"source": [
"# IGNORE THIS CELL WHICH CUSTOMIZES LAYOUT AND STYLING OF THE NOTEBOOK !\n",
"%matplotlib inline\n",
"%config InlineBackend.figure_format = 'retina'\n",
"import warnings\n",
"\n",
"import matplotlib.pyplot as plt\n",
"\n",
"warnings.filterwarnings(\"ignore\", category=FutureWarning)\n",
"warnings.filterwarnings(\"ignore\", message=\"X does not have valid feature names, but [a-zA-Z]+ was fitted with feature names\", category=UserWarning)\n",
" \n",
"from IPython.core.display import HTML\n",
"\n",
"HTML(open(\"custom.html\", \"r\").read())"
"As we have learned in the previous chapter, *classification* is a machine learning problem belonging to the group of *supervised learning* problems. In classification the aim is to learn how to predict the class of a categorical label, based on a set of already labelled training examples (hence, supervised). Such labels (categories) and corresponding classes can be:\n",
"- sick: yes / no,\n",
"- character: good / bad / don't know,\n",
"- digit: 0 / ... / 9.\n",
"\n",
"In this chapter we introduce the core concepts of classification."
"## How to build a simple classifier?"
"Let's quickly look again at the beer example:"
"execution_count": null,
"import pandas as pd\n",
"import seaborn as sns\n",
"\n",
"sns.set(style=\"ticks\")\n",
"\n",
"beer_data = pd.read_csv(\"data/beers.csv\")\n",
"for_plot = beer_data.copy()\n",
"# fixes seaborn labels issue\n",
"def translate_label(value):\n",
" return \"no\" if value == 0 else \"yes\"\n",
"for_plot[\"is_yummy\"] = for_plot[\"is_yummy\"].apply(translate_label)\n",
"\n",
"sns.pairplot(for_plot, hue=\"is_yummy\", diag_kind=\"hist\", diag_kws=dict(alpha=.7));\n",
"source": [
"We can assume that the person who rated these beers has preferences such as:\n",
"* \"I don't like too low alcohol content\",\n",
"* \"I like more fruity beers\", etc."
},
{
"cell_type": "markdown",
"source": [
"This means we could construct a score where high numbers relate to \"favorable beer\". One simple way to implement such a score is to use a weighted sum like:\n",
"\n",
" score = 1.1 * alcohol_content + 4 * bitterness + 1.5 * darkness + 1.8 * fruitiness\n",
"\n",
"The actual weights here are guessed and serve as an example.\n",
"\n",
"The sizes of the weights reflect the numerical ranges of the features: alcohol content is in the range 3 to 5.9, whereas bitterness is between 0 and 1.08:"
"execution_count": null,
"scores = (1.1 * beer_data[\"alcohol_content\"] + 4 * beer_data[\"bitterness\"]\n",
"          + 1.5 * beer_data[\"darkness\"] + 1.8 * beer_data[\"fruitiness\"])\n",
"scores.shape"
"source": [
"Now we can plot the histogram of the scores by classes:"
"execution_count": null,
"source": [
"scores_bad = scores[beer_data[\"is_yummy\"] == 0]\n",
"scores_good = scores[beer_data[\"is_yummy\"] == 1]\n",
"plt.hist(scores_bad, bins=25, color=\"steelblue\", alpha=.7) # alpha makes bars translucent\n",
"plt.hist(scores_good, bins=25, color=\"chocolate\", alpha=.7);"
"Consequence: a simple classifier could compute this score and apply a threshold of around 10.5 to assign a class label."
"execution_count": null,
"def classify(beer_feature):\n",
"    score = (1.1 * beer_feature[\"alcohol_content\"] + 4 * beer_feature[\"bitterness\"]\n",
"             + 1.5 * beer_feature[\"darkness\"] + 1.8 * beer_feature[\"fruitiness\"])\n",
"    # threshold chosen from the histogram above\n",
"    if score > 10.5:\n",
"        return \"yummy\"\n",
"    return \"not yummy\"\n",
"\n",
"# check this for samples 5 .. 14:\n",
"for i in range(5, 15):\n",
"    is_yummy = translate_label(beer_data[\"is_yummy\"][i])\n",
"    classified_as = classify(beer_data.iloc[i, :])\n",
"    print(i,\n",
"          \"is yummy?\", \"{:3s}\".format(is_yummy),\n",
"          \"classified as:\", classified_as)"
"**This is how \"linear\" classifiers work. The crucial part is computing the weights and the final threshold so that the predictions are good.**\n",
"<div class=\"alert alert-block alert-info\">\n",
"<i class=\"fa fa-info-circle\"></i>\n",
"Although this seems to be a simplistic concept, linear classifiers can actually work very well, especially for problems with many features (high-dimensional problems).\n",
"</div>\n"
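The weighted-sum-plus-threshold rule above is exactly what a trained linear classifier encodes. As a minimal sketch (using synthetic data and `scikit-learn`'s `LogisticRegression`, not the beer dataset; the synthetic weights are made up for illustration), the learned `coef_` and `intercept_` reproduce the classifier's predictions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# synthetic, linearly separable data: labels come from a known weighted sum
rng = np.random.RandomState(42)
X = rng.normal(size=(100, 4))            # 100 samples, 4 features
true_w = np.array([1.1, 4.0, 1.5, 1.8])  # illustrative weights, like those guessed above
y = (X @ true_w > 0).astype(int)

clf = LogisticRegression(max_iter=1000).fit(X, y)

w = clf.coef_[0]       # learned weights, one per feature
b = clf.intercept_[0]  # learned offset

# the classifier predicts class 1 exactly when the weighted sum exceeds -b
manual = (X @ w + b > 0).astype(int)
print("matches clf.predict:", (manual == clf.predict(X)).all())
```

So "training" a linear classifier amounts to choosing the weights `w` and the offset `b`; prediction is then just the scoring-plus-threshold rule we built by hand.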
"- Modify the weights in the beer classifiers and check if you can improve separation in the histogram.\n",
"- In `scikit-learn` the weights of a trained linear classifier are available via the `coef_` attribute as a two-dimensional `numpy` array. Extract the weights from the `LogisticRegression` classifier example from the last script and try them out in your weighted sum scoring function."
},
{
"cell_type": "code",
"execution_count": null,