{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-block alert-warning\">\n",
"<i class=\"fa fa-warning\"></i> <strong>Built-in documentation</strong>\n",
"\n",
"If you want to learn more about <code>LogisticRegression</code> you can use <code>help(LogisticRegression)</code> or <code>?LogisticRegression</code> to see the related documentation. The latter version works only in Jupyter Notebooks (or in the IPython shell).\n",
"</div>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-block alert-warning\">\n",
" <i class=\"fa fa-warning\"></i> <strong><code>scikit-learn</code> API</strong>\n",
"In <code>scikit-learn</code> all classifiers have:\n",
"<ul>\n",
" <li>a <strong><code>fit()</code></strong> method to learn from data, and</li>\n",
"  <li>a subsequent <strong><code>predict()</code></strong> method for predicting classes from input features.</li>\n",
"</ul>\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sanity check: can't predict if not fitted (trained)\n",
"classifier.predict(input_features)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Fit\n",
"classifier.fit(input_features, labels)\n",
"\n",
"# Predict\n",
"predicted_labels = classifier.predict(input_features)\n",
"print(predicted_labels.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we've just re-classified our training data. Let's check our result with a few examples:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for i in range(3):\n",
"    print(labels[i], \"predicted as\", predicted_labels[i])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What, \"0 predicted as 1\"? This looks suspicious! Let's count how many labels were predicted correctly:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(len(labels), \"examples\")\n",
"print(sum(predicted_labels == labels), \"labeled correctly\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-block alert-info\">\n",
"<i class=\"fa fa-info-circle\"></i>\n",
"<code>predicted_labels == labels</code> evaluates to a vector of Boolean values (<code>True</code> or <code>False</code>). When used as numbers, Python treats <code>True</code> as <code>1</code> and <code>False</code> as <code>0</code>, so <code>sum(...)</code> simply counts the correctly predicted labels.\n",
"</div>"
]
},
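{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of this counting trick, using small hypothetical example arrays (not the beer data), assuming `numpy` is available:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"predicted = np.array([0, 1, 1, 0])\n",
"actual = np.array([0, 1, 0, 0])\n",
"\n",
"# elementwise comparison yields a Boolean vector:\n",
"print(predicted == actual)\n",
"\n",
"# True counts as 1, False as 0, so this counts the matches (here: 3):\n",
"print(sum(predicted == actual))"
]
},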
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div style=\"font-weight: bold; font-size: 200%;\">What happened?</div>\n",
"Why weren't all labels predicted correctly?\n",
"\n",
"Neither `Python` nor `scikit-learn` is broken. What we observed above is very typical for machine-learning applications.\n",
"\n",
"Possible reasons:\n",
"- we have incomplete information: other features of beer which also contribute to the rating (like \"maltiness\") were not measured or cannot be measured.\n",
"- the classifier we used might not be suitable for the given problem.\n",
"- noise in the data, such as incorrectly assigned labels, also affects results.\n",
"\n",
"**Finding sufficient features and clean data is crucial for the performance of ML algorithms!**\n",
"\n",
"Another important requirement is clean data: input features might be corrupted by flawed entries, and feeding such data into a ML algorithm will usually reduce performance."
]
},
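{
"cell_type": "markdown",
"metadata": {},
"source": [
"To illustrate the point about noisy labels, here is a small sketch on synthetic data (not the beer data; `make_classification` only generates a hypothetical example): flipping a fraction of the labels before training typically lowers the \"re-classification\" score."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"X, y = make_classification(n_samples=200, random_state=0)\n",
"\n",
"# flip 20% of the labels to simulate incorrectly assigned labels\n",
"rng = np.random.RandomState(0)\n",
"y_noisy = y.copy()\n",
"flipped = rng.choice(len(y), size=len(y) // 5, replace=False)\n",
"y_noisy[flipped] = 1 - y_noisy[flipped]\n",
"\n",
"for name, labels_ in [(\"clean\", y), (\"noisy\", y_noisy)]:\n",
"    clf = LogisticRegression().fit(X, labels_)\n",
"    print(name, sum(clf.predict(X) == labels_), \"of\", len(labels_), \"labeled correctly\")"
]
},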
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Compare with alternative machine learning method from `scikit-learn`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, using the previously loaded and prepared beer data, train a different `scikit-learn` classifier, the so-called **Support Vector Classifier** `SVC`, and evaluate its \"re-classification\" performance again.\n",
"\n",
"<div class=\"alert alert-block alert-info\">\n",
"<i class=\"fa fa-info-circle\"></i>\n",
"<code>SVC</code> belongs to a class of algorithms named \"Support Vector Machines\" (SVMs). Again, it will be discussed in more detail in the following scripts.\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.svm import SVC\n",
"classifier = SVC()\n",
"# ..."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"solution"
]
},
"outputs": [],
"source": [
"classifier = SVC()\n",
"classifier.fit(input_features, labels)\n",
"\n",
"predicted_labels = classifier.predict(input_features)\n",
"\n",
"print(len(labels), \"examples\")\n",
"print(sum(predicted_labels == labels), \"labeled correctly\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-block alert-info\">\n",
"<i class=\"fa fa-info-circle\"></i>\n",
"Better re-classification in our example does not mean that <code>SVC</code> is better than <code>LogisticRegression</code> in all cases. The performance of a classifier strongly depends on the data set.\n",
"</div>\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Experiment with hyperparameters of ML methods"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Both `LogisticRegression` and `SVC` classifiers have a hyperparameter `C` which allows enforcing a \"simplification\" (often called **regularization**) of the resulting model. Test the beers data \"re-classification\" with different values of this parameter."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Recall: ?LogisticRegression"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"solution"
]
},
"outputs": [],
"source": [
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"# try different values of C, e.g.:\n",
"classifier = LogisticRegression(C=100)\n",
"classifier.fit(input_features, labels)\n",
"predicted_labels = classifier.predict(input_features)\n",
"print(len(labels), \"examples\")\n",
"print(sum(predicted_labels == labels), \"labeled correctly\")\n",
"print(sum(predicted_labels == labels) / len(labels) * 100, \"% labeled correctly\")"
]
},
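{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible way to compare several values systematically, assuming `input_features` and `labels` from the beer-data cells above, is a simple loop over `C`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"for C in (0.01, 0.1, 1.0, 10.0, 100.0):\n",
"    classifier = LogisticRegression(C=C)\n",
"    classifier.fit(input_features, labels)\n",
"    n_correct = sum(classifier.predict(input_features) == labels)\n",
"    print(\"C =\", C, \":\", n_correct, \"of\", len(labels), \"labeled correctly\")"
]
},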
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-block alert-warning\">\n",
"<i class=\"fa fa-warning\"></i> <strong>Classifiers have hyper-parameters</strong>\n",
"All classifiers have hyper-parameters, e.g. the <code>C</code> we have seen before. It is a coincidence that both <code>LogisticRegression</code> and <code>SVC</code> have a parameter named <code>C</code>. Some classifiers have more than one hyper-parameter, e.g. <code>SVC</code> also has a parameter <code>gamma</code>. But more about these details later.\n",
"</div>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Load and inspect the canonical Fisher's \"Iris\" data set, which is included in `scikit-learn`: see [docs for `sklearn.datasets.load_iris`](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html). What's conceptually different?\n",
"\n",
"Apply `LogisticRegression` or `SVC` classifiers. Is it easier or more difficult than classification of the beers data?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import load_iris\n",
"\n",
"data = load_iris()\n",
"\n",
"# labels as text\n",
"print(data.target_names)\n",
"\n",
"# (rows, columns) of the feature matrix:\n",
"print(data.data.shape)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# transform the scikit-learn data structure into a data frame:\n",
"df = pd.DataFrame(data.data, columns=data.feature_names)\n",
"df[\"class\"] = data.target\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import seaborn as sns\n",
"sns.set(style=\"ticks\")\n",
"\n",
"for_plot = df.copy()\n",
"\n",
"def transform_label(class_):\n",
" return data.target_names[class_]\n",
"\n",
"# seaborn does not work here if we use numeric values in the class\n",
"# column, or strings which represent numbers. To fix this we\n",
"# create textual class labels\n",
"for_plot[\"class\"] = for_plot[\"class\"].apply(transform_label)\n",
"sns.pairplot(for_plot, hue=\"class\", diag_kind=\"hist\");"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"solution"
]
},
"outputs": [],
"source": [
"features = df.iloc[:, :-1]\n",
"labels = df.iloc[:, -1]\n",
"\n",
"# use one of the classifiers from above, e.g.:\n",
"classifier = SVC()\n",
"classifier.fit(features, labels)\n",
"predicted_labels = classifier.predict(features)\n",
"\n",
"print(len(labels), \"examples\")\n",
"print(sum(predicted_labels == labels), \"labeled correctly\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (C) 2019-2021 ETH Zurich, SIS ID"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autoclose": false,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": true,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {
"height": "calc(100% - 180px)",
"left": "10px",
"top": "150px"
},
"toc_section_display": true,
"toc_window_display": true
}
},
"nbformat": 4,
"nbformat_minor": 4
}