From ce6f0a47b7f056079da9c1f537a8bb833d2fc11b Mon Sep 17 00:00:00 2001
From: Uwe Schmitt <uwe.schmitt@id.ethz.ch>
Date: Tue, 11 Dec 2018 16:28:30 +0100
Subject: [PATCH] small modifications after review session

---
 01_introduction.ipynb   | 36 ++++++++++++++++++++++++++++++------
 02_classification.ipynb |  5 -----
 course_layout.md        |  8 +++++---
 3 files changed, 35 insertions(+), 14 deletions(-)

diff --git a/01_introduction.ipynb b/01_introduction.ipynb
index 6c7a651..d1811b0 100644
--- a/01_introduction.ipynb
+++ b/01_introduction.ipynb
@@ -438,12 +438,12 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "And this is how we can compute such a word vector using Python:"
+    "**Note**: Such vectorization is usually not done manually. This is a quick code example showing how to automate it with `scikit-learn`:\n"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 77,
+   "execution_count": 4,
    "metadata": {},
    "outputs": [
     {
@@ -462,11 +462,18 @@
     "\n",
     "vectorizer = CountVectorizer(vocabulary=vocabulary)\n",
     "\n",
-    "# create count vector for a pice of text:\n",
+    "# this is how one can create a count vector for a given piece of text:\n",
     "vector = vectorizer.fit_transform([\"I dislike american pizza. But american beer is nice\"]).toarray().flatten()\n",
     "print(vector)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# TODO: exercises here to load another example data set and do some stats / plots."
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -680,13 +687,20 @@
     "for_plot = beer_data.copy()\n",
     "\n",
     "def translate_label(value):\n",
-    "    return \"yummy\" if value == 1 else \"not yummy\"\n",
+    "    return \"not yummy\" if value == 0 else \"yummy\"\n",
     "\n",
     "for_plot[\"is_yummy\"] = for_plot[\"is_yummy\"].apply(translate_label)\n",
     "\n",
     "sns.pairplot(for_plot, hue=\"is_yummy\", diag_kind=\"hist\");"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# TODO: some comments on the plots"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -740,7 +754,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 81,
+   "execution_count": 1,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -786,6 +800,13 @@
     "classifier.fit(input_features, labels)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "TODO: Type `LogisticRegression?` in a code cell and read the documentation shown."
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -878,7 +899,10 @@
     "\n",
     "The reason here is that we have incomplete information: other features of beer which also contribute to the rating (like \"maltiness\") where not measured or can not be measured. So even the best algorithm can not predict the target values reliably.\n",
     "\n",
-    "Another reason might be mistakes in the input data, e.g. some labels are assigned incorrectly.\n",
+    "Another explanation is that the classifiers used might not have been suitable for the given problem.\n",
+    "\n",
+    "Noise in the data, such as incorrectly assigned labels, can also produce weak results.\n",
+    "\n",
     "\n",
     "* Finding good features is crucial for the performance of ML algorithms !\n",
     "\n",
diff --git a/02_classification.ipynb b/02_classification.ipynb
index 4017488..8e5bef6 100644
--- a/02_classification.ipynb
+++ b/02_classification.ipynb
@@ -55,11 +55,6 @@
     "In `scikit-learn` many classifiers support such multi-class problems out of the box and also offers functionalities to implement `one-vs-all` or `one-vs-one` for specific cases. See https://scikit-learn.org/stable/modules/multiclass.html"
    ]
   },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": []
-  },
   {
    "cell_type": "markdown",
    "metadata": {},
diff --git a/course_layout.md b/course_layout.md
index 88080b6..e27a0b4 100644
--- a/course_layout.md
+++ b/course_layout.md
@@ -93,7 +93,7 @@ TBD: prepare coding session
 
 - learn regressor for movie scores.
 
-## Part 3: accuracy, F1, ROC, ...
+## Part 4: accuracy, F1, ROC, ...
 
 Intention: accuracy is usefull but has pitfalls
 
@@ -104,7 +104,7 @@ Intention: accuracy is usefull but has pitfalls
 - classifier accuracy:
   - confusion matrix
   - accurarcy
-  - pitfalls for unbalanced data sets
+  - pitfalls for unbalanced data sets, e.g. diagnosing HIV
   - precision / recall
   - ROC ?
 
@@ -117,7 +117,9 @@ Intention: accuracy is usefull but has pitfalls
 
 - fool them: give them other dataset where classifier fails.
 
-## Part 4: underfitting/overfitting
+## Part 3: underfitting/overfitting
+
+needs: simple accuracy measure.
 
 classifiers / regressors have parameters / degrees of freedom.
 
-- 
GitLab
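
The `@@ -878` hunk in `01_introduction.ipynb` adds the remark that noisy (incorrectly assigned) labels can also produce weak results. A quick way to demonstrate this effect, on synthetic data rather than the beer data set used in the notebook, is to flip a fraction of the labels and compare cross-validated scores; this is only a sketch, not part of the course material:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# synthetic stand-in data with 4 features and binary labels
X, y = make_classification(n_samples=500, n_features=4, random_state=0)

# flip 20% of the labels at random to simulate annotation mistakes
rng = np.random.RandomState(0)
flip = rng.rand(len(y)) < 0.2
y_noisy = np.where(flip, 1 - y, y)

clf = LogisticRegression(max_iter=1000)
print("clean labels:", cross_val_score(clf, X, y, cv=5).mean())
print("noisy labels:", cross_val_score(clf, X, y_noisy, cv=5).mean())
```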
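The `02_classification.ipynb` hunk points readers to scikit-learn's multiclass strategies (`one-vs-all` / `one-vs-one`) via the linked documentation without showing code. A minimal sketch of both strategies, using the iris data set purely as a stand-in example, might look like this:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# iris has three classes, so it is a genuine multi-class problem
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "one-vs-all" (one-vs-rest): one binary classifier per class
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_train, y_train)
print("one-vs-rest accuracy:", ovr.score(X_test, y_test))

# "one-vs-one": one binary classifier per pair of classes
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X_train, y_train)
print("one-vs-one accuracy:", ovo.score(X_test, y_test))
```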
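`course_layout.md` now lists "pitfalls for unbalanced data sets, e.g. diagnosing HIV" under classifier accuracy. A small illustration of that pitfall, using made-up label counts rather than any real screening data, could be: a predictor that always answers "negative" reaches high accuracy but misses every positive case, which the confusion matrix and recall reveal.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# hypothetical screening scenario: only 1% of the samples are positive
y_true = np.array([1] * 10 + [0] * 990)

# a useless "classifier" that always predicts the negative class
y_pred = np.zeros_like(y_true)

print("accuracy:", accuracy_score(y_true, y_pred))  # 0.99 although nothing was learned
print("recall:", recall_score(y_true, y_pred))      # 0.0, every positive case is missed
print(confusion_matrix(y_true, y_pred))
```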