diff --git a/01_introduction.ipynb b/01_introduction.ipynb
index 6c7a65188ca264b94ba1bfd9a66bed16b94b7334..d1811b02270b9526810e5a30fe0b55163b262c9a 100644
--- a/01_introduction.ipynb
+++ b/01_introduction.ipynb
@@ -438,12 +438,12 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "And this is how we can compute such a word vector using Python:"
+    "**Note**: Such vectorization is usually not done manually. Here is a quick code example showing how to automate this with `scikit-learn`:\n"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 77,
+   "execution_count": 4,
    "metadata": {},
    "outputs": [
     {
@@ -462,11 +462,18 @@
    "\n",
    "vectorizer = CountVectorizer(vocabulary=vocabulary)\n",
    "\n",
-    "# create count vector for a pice of text:\n",
+    "# this is how one can create a count vector for a given piece of text:\n",
    "vector = vectorizer.fit_transform([\"I dislike american pizza. But american beer is nice\"]).toarray().flatten()\n",
    "print(vector)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# TODO: exercises here to load another example data set and do some stats / plots."
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -680,13 +687,20 @@
    "for_plot = beer_data.copy()\n",
    "\n",
    "def translate_label(value):\n",
-    "    return \"yummy\" if value == 1 else \"not yummy\"\n",
+    "    return \"not yummy\" if value == 0 else \"yummy\"\n",
    "\n",
    "for_plot[\"is_yummy\"] = for_plot[\"is_yummy\"].apply(translate_label)\n",
    "\n",
    "sns.pairplot(for_plot, hue=\"is_yummy\", diag_kind=\"hist\");"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# TODO: some comments on the plots"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -740,7 +754,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 81,
+   "execution_count": 1,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -786,6 +800,13 @@
    "classifier.fit(input_features, labels)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "TODO: Type `LogisticRegression?` in a code cell and read the documentation it displays."
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -878,7 +899,10 @@
    "\n",
    "The reason here is that we have incomplete information: other features of beer which also contribute to the rating (like \"maltiness\") where not measured or can not be measured. So even the best algorithm can not predict the target values reliably.\n",
    "\n",
-    "Another reason might be mistakes in the input data, e.g. some labels are assigned incorrectly.\n",
+    "Another explanation is that the classifiers used might not have been suitable for the given problem.\n",
+    "\n",
+    "Noise in the data, such as incorrectly assigned labels, can also produce weak results.\n",
+    "\n",
    "\n",
    "* Finding good features is crucial for the performance of ML algorithms !\n",
    "\n",
diff --git a/02_classification.ipynb b/02_classification.ipynb
index 40174882337c2e34d641b6cf0c66ad9f2c474e35..8e5bef6907ae0040fee1314ad2a6341f57fdd6ad 100644
--- a/02_classification.ipynb
+++ b/02_classification.ipynb
@@ -55,11 +55,6 @@
    "In `scikit-learn` many classifiers support such multi-class problems out of the box and also offers functionalities to implement `one-vs-all` or `one-vs-one` for specific cases. See https://scikit-learn.org/stable/modules/multiclass.html"
    ]
   },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": []
-  },
   {
    "cell_type": "markdown",
    "metadata": {},
diff --git a/course_layout.md b/course_layout.md
index 88080b641fe472ce3a809a02a448b5aeea69ba8b..e27a0b48d8dcf57ee008a973fcfaa6b515eddbb4 100644
--- a/course_layout.md
+++ b/course_layout.md
@@ -93,7 +93,7 @@ TBD: prepare coding session
 
 - learn regressor for movie scores.
 
-## Part 3: accuracy, F1, ROC, ...
+## Part 4: accuracy, F1, ROC, ...
 
 Intention: accuracy is usefull but has pitfalls
 
@@ -104,7 +104,7 @@ Intention: accuracy is usefull but has pitfalls
 - classifier accuracy:
   - confusion matrix
   - accurarcy
-  - pitfalls for unbalanced data sets
+  - pitfalls for unbalanced data sets, e.g. diagnosing HIV
   - precision / recall
   - ROC ?
 
@@ -117,7 +117,9 @@ Intention: accuracy is usefull but has pitfalls
 
 - fool them: give them other dataset where classifier fails.
 
-## Part 4: underfitting/overfitting
+## Part 3: underfitting/overfitting
+
+needs: simple accuracy measure.
 
 classifiers / regressors have parameters / degrees of freedom.
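The multi-class cell kept in `02_classification.ipynb` names `one-vs-all` and `one-vs-one` and links to scikit-learn's multiclass module. A minimal sketch of both strategies, using scikit-learn's bundled iris data set as a stand-in (the course's own data sets are not assumed here):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# iris has three classes, so a binary classifier needs a multi-class strategy:
features, labels = load_iris(return_X_y=True)

# "one-vs-all" (called one-vs-rest in scikit-learn): one model per class
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(features, labels)

# "one-vs-one": one model per pair of classes
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(features, labels)

print(ovr.score(features, labels), ovo.score(features, labels))
```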
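For the "pitfalls for unbalanced data sets" bullet in `course_layout.md`, a minimal sketch with made-up numbers showing why plain accuracy misleads in a rare-disease screening setting:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# made-up screening data: 1 positive case among 100 samples
true_labels = np.array([1] + [0] * 99)

# a "classifier" that always predicts negative never finds the positive case ...
predictions = np.zeros_like(true_labels)

print(accuracy_score(true_labels, predictions))    # 0.99 -- looks great
print(recall_score(true_labels, predictions))      # 0.0  -- reveals the failure
print(confusion_matrix(true_labels, predictions))  # the single positive is missed
```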
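And for the underfitting/overfitting part, which "needs: simple accuracy measure", a sketch on a synthetic data set (again, not the course data) of how extra degrees of freedom push training accuracy up while held-out accuracy stalls or drops:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic, slightly noisy classification problem
features, labels = make_classification(n_samples=300, n_features=8,
                                       flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(features, labels,
                                                    random_state=0)

for depth in (1, 3, 6, None):  # None lets the tree grow without limit
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(depth, tree.score(X_train, y_train), tree.score(X_test, y_test))
```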