{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Chapter 5: An overview of classifiers"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What classifiers ?\n",
"\n",
"- Neighrest neighbours\n",
"- Logistic Regression\n",
"- Linear SVM\n",
"\n",
"- Kernel SVM\n",
"- Decision trees\n",
"- Random forests\n",
"\n",
"- XGboost (https://xgboost.readthedocs.io/en/latest/) (not part of scikit-learn, won many kaggle competitions https://www.kaggle.com/dansbecker/xgboost, offers scikit-learn API https://www.kaggle.com/stuarthallows/using-xgboost-with-scikit-learn)\n",
"\n",
"\n",
"For every classifier: some examples for decision surfaces.\n",
"\n",
"Historical information ?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Neighrest neighbours\n",
"\n",
"- For a new feature $x$ look for $N$ closests examples from learning data (usually using the euclidean distance).\n",
"- Classify $x$ as the majority of labels among these closest examples.\n",
"\n",
"Parameter: $N$. the larger $N$ the smoother the decision surface.\n",
"\n",
"Benefit: simple\n",
"\n",
"Disadvanages: needs lots of data, does not work well for dimesions > 8(ish) (source !?)\n",
"\n",
"TODO: Commentary about course of dimensionality"
]
},
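{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch, assuming scikit-learn and its bundled iris dataset (dataset and parameter values are illustrative, not from the text): `n_neighbors` plays the role of $N$ above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical example: k-nearest neighbours on the iris dataset.\n",
"from sklearn.datasets import load_iris\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"\n",
"X, y = load_iris(return_X_y=True)\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)\n",
"\n",
"# n_neighbors is the N parameter: the larger it is, the smoother the decision surface.\n",
"clf = KNeighborsClassifier(n_neighbors=5)\n",
"clf.fit(X_train, y_train)\n",
"print(\"test accuracy:\", clf.score(X_test, y_test))"
]
},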
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Logistic regression\n",
"\n",
"$\\sigma (t)={\\frac {e^{t}}{e^{t}+1}}={\\frac {1}{1+e^{-t}}}$\n",
"\n",
"plot !\n",
"\n",
"linear classifier, sigma shrinks result of linear combinations to interval 0, 1 which are interpreted as class probabilities.\n",
"\n",
"works better in high dimensions\n",
"\n",
"weights can be interpreted\n",
"\n",
"Parameters: C (https://stackoverflow.com/questions/22851316/what-is-the-inverse-of-regularization-strength-in-logistic-regression-how-shoul)\n",
"\n",
"Penelaty to avoid overfitting\n",
"\n",
"Plot logistig regression diagram as very simple neural network ?"
]
},
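{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch, assuming numpy, matplotlib and scikit-learn (the dataset and the value of `C` are illustrative): plot the sigmoid $\\sigma(t)$, then fit a `LogisticRegression`, where `C` is the inverse of the regularization strength."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical example: plot the sigmoid, then fit a regularized logistic regression.\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.datasets import load_breast_cancer\n",
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"# The sigmoid squashes any real t into the interval (0, 1).\n",
"t = np.linspace(-6, 6, 200)\n",
"plt.plot(t, 1 / (1 + np.exp(-t)))\n",
"plt.xlabel(\"t\")\n",
"plt.ylabel(\"sigma(t)\")\n",
"plt.show()\n",
"\n",
"X, y = load_breast_cancer(return_X_y=True)\n",
"# Smaller C means a stronger regularization penalty (less overfitting).\n",
"clf = LogisticRegression(C=1.0, solver=\"lbfgs\", max_iter=5000).fit(X, y)\n",
"print(\"interpretable weights (first 5):\", clf.coef_[0][:5])"
]
},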
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Linear SVM\n",
"\n",
"- linear classifier such that margin is maximised (show example)\n",
"- based on \"empirical risk minization\" (vapnik)\n",
"\n",
"the final weight vector is a linear combination of a subset of the features from the learning set. These are called \"support vectors\".\n",
"\n",
"weights can be interpreted\n",
"\n",
"C: how much weight to we put on examples within the \"margin strip\"\n"
]
},
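{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch, assuming scikit-learn (the toy dataset and `C` value are illustrative): `SVC` with a linear kernel exposes the support vectors and the interpretable weight vector directly."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical example: linear SVM on a toy two-class dataset.\n",
"from sklearn.datasets import make_blobs\n",
"from sklearn.svm import SVC\n",
"\n",
"X, y = make_blobs(n_samples=100, centers=2, random_state=0)\n",
"# Smaller C tolerates more examples inside the margin strip.\n",
"clf = SVC(kernel=\"linear\", C=1.0).fit(X, y)\n",
"print(\"support vectors per class:\", clf.n_support_)\n",
"print(\"interpretable weight vector:\", clf.coef_)"
]
},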
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Kernel based SVM\n",
"\n",
"So called kernels are used to build the classifiation surface. Default kernel is rbf.\n",
"\n",
"Hard to intepret the internals.\n",
"\n",
"for rbf: gamma parameter is \"decline rate\" of rbf functions, controls smoothness of decision surface.\n",
"\n",
"feature scaling is crucial for good performance !"
]
},
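{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch, assuming scikit-learn (dataset and parameter values are illustrative): an RBF-kernel SVM with the feature scaling stressed above, combined in a pipeline. A larger `gamma` means a faster decline of the RBF bumps and a wigglier decision surface."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical example: RBF-kernel SVM with feature scaling in a pipeline.\n",
"from sklearn.datasets import load_wine\n",
"from sklearn.pipeline import make_pipeline\n",
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn.svm import SVC\n",
"\n",
"X, y = load_wine(return_X_y=True)\n",
"# Scale first: the RBF kernel is distance based, so feature scaling is crucial.\n",
"clf = make_pipeline(StandardScaler(), SVC(kernel=\"rbf\", gamma=0.1, C=1.0))\n",
"clf.fit(X, y)\n",
"print(\"training accuracy:\", clf.score(X, y))"
]
},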
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Decision trees\n",
"\n",
"- simple example incl. plot\n",
"- basic idea: \"optimal\" splits...\n",
"\n",
"- benefit: interpretability\n",
"\n",
"Parameter: depth, the deeper the higher the risk for overfitting."
]
},
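{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch, assuming scikit-learn (dataset and depth are illustrative): a shallow decision tree; `max_depth` is the depth parameter, and `export_text` prints the interpretable splits."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical example: a shallow, interpretable decision tree.\n",
"from sklearn.datasets import load_iris\n",
"from sklearn.tree import DecisionTreeClassifier, export_text\n",
"\n",
"data = load_iris()\n",
"# Limiting max_depth guards against overfitting.\n",
"clf = DecisionTreeClassifier(max_depth=2, random_state=0)\n",
"clf.fit(data.data, data.target)\n",
"print(export_text(clf, feature_names=data.feature_names))"
]
},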
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Random forests\n",
"\n",
"- generate many week classifiers by creating shallow trees with random splittings\n",
"- use so call bagging to implement a good overall classifier\n",
"\n",
"- benefits: allows also estimates about feature importance\n",
"\n",
"- more robust to overfitting than decision trees\n"
]
},
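{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch, assuming scikit-learn (dataset and parameter values are illustrative): a bagged ensemble of shallow randomized trees, with the feature importance estimates mentioned above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical example: a random forest of shallow trees.\n",
"from sklearn.datasets import load_wine\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"\n",
"X, y = load_wine(return_X_y=True)\n",
"# Many shallow trees, each grown on a bootstrap sample with randomized splits.\n",
"clf = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=0)\n",
"clf.fit(X, y)\n",
"print(\"feature importances:\", clf.feature_importances_.round(3))"
]
},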
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}