Models are actually families of models: every combination of parameter values specifies a different candidate model.
To fit a model means to find the specific parameter values that best match the data, i.e., that minimize the loss function.
Figure adapted from explainxkcd.com.
The loss function defines some summary of the errors a model commits, i.e., of the discrepancies between model predictions and the data:
Loss=f(Error)
Two purposes
| Purpose | Description |
| --- | --- |
| Fitting | Find the parameters that minimize the loss function. |
| Evaluation | Calculate the loss function for the fitted model. |
In regression, the criterion Y is modeled as the weighted sum of the features X_1, X_2, ...:
\hat{Y} = \beta_0 + \beta_1 \times X_1 + \beta_2 \times X_2 + ...
The weight \beta_i indicates the direction and strength of the relationship between feature X_i and the criterion Y.
Ceteris paribus, the larger \beta_i (in absolute value), the more a change in X_i changes the prediction \hat{Y}.
If \beta_i = 0, then X_i has no influence on the prediction.
Mean Squared Error (MSE)
MSE = \frac{1}{n}\sum_{i \in 1,...,n}(Y_i-\hat{Y}_i)^2
Mean Absolute Error (MAE)
MAE = \frac{1}{n}\sum_{i \in 1,...,n}\lvert Y_i-\hat{Y}_i \rvert
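As a minimal sketch, the two loss functions could be computed in R as follows; the vectors y (observed criterion values) and y_hat (model predictions) are made-up examples:

```r
# Made-up observed values and model predictions
y     <- c(180, 210, 195, 240)
y_hat <- c(185, 200, 190, 250)

# Mean Squared Error: average squared prediction error
mse <- mean((y - y_hat)^2)

# Mean Absolute Error: average absolute prediction error
mae <- mean(abs(y - y_hat))
```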
There are two fundamentally different ways to find the set of parameters that minimizes loss.
Analytically
In rare cases, the parameters can be derived analytically, in closed form. For ordinary least squares regression:
\boldsymbol \beta = (\boldsymbol X^T\boldsymbol X)^{-1}\boldsymbol X^T\boldsymbol y
Numerically
In most cases, parameters need to be found using an iterative, numerical optimization method, such as gradient descent:
\boldsymbol \beta_{n+1} = \boldsymbol \beta_{n}-\gamma \nabla F(\boldsymbol \beta_{n})
Figure adapted from me.me.
Figure adapted from dunglai.github.io.
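As an illustration of the numerical route, here is a minimal gradient descent sketch for simple linear regression in R; the data, the learning rate gamma, and the number of iterations are assumptions made for this example:

```r
# Made-up data
set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)

beta  <- c(0, 0)   # beta[1] = intercept, beta[2] = slope (arbitrary starting values)
gamma <- 0.1       # learning rate

for (i in 1:200) {
  y_hat <- beta[1] + beta[2] * x
  # Gradient of the MSE loss with respect to the two parameters
  grad <- c(mean(-2 * (y - y_hat)),
            mean(-2 * (y - y_hat) * x))
  # Step against the gradient to decrease the loss
  beta <- beta - gamma * grad
}

beta              # should be close to c(2, 3)
coef(lm(y ~ x))   # compare to the analytic least-squares solution
```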
There are two types of supervised learning problems that can often be approached using the same model.
Regression
Regression problems involve the prediction of a continuous, numeric criterion.
E.g., predicting the cholesterol level as a function of age.
Classification
Classification problems involve the prediction of a categorical criterion (a class).
E.g., predicting the type of chest pain as a function of age.
In logistic regression, the class criterion Y \in (0,1) is also modeled as a weighted sum of the features, passed through the logistic function:
\large \hat{Y} = Logistic(\beta_{0} + \beta_{1} \times X_1 + ...)
The logistic function maps any value into the range between 0 and 1, so that \hat{Y} can be interpreted as a probability:
Logistic(x) = \frac{1}{1+exp(-x)}
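A minimal sketch of fitting a logistic regression in R with glm(); the data set and the variable names (chest_pain, age) are assumptions chosen to mirror the example above:

```r
# Made-up data: binary criterion chest_pain and numeric feature age
set.seed(1)
dat <- data.frame(age = rnorm(200, mean = 50, sd = 10))
dat$chest_pain <- rbinom(200, size = 1,
                         prob = 1 / (1 + exp(-(-5 + 0.1 * dat$age))))

# Fit the logistic regression
mod <- glm(chest_pain ~ age, data = dat, family = "binomial")

# Predicted probabilities: Logistic(beta_0 + beta_1 * age)
dat$y_hat <- predict(mod, type = "response")
```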
Logloss is the loss function typically used to fit and evaluate models that predict class probabilities:
\small LogLoss = -\frac{1}{n}\sum_{i}^{n}(log(\hat{y})y+log(1-\hat{y})(1-y))
\small MSE = \frac{1}{n}\sum_{i}^{n}(y-\hat{y})^2, \: MAE = \frac{1}{n}\sum_{i}^{n} \lvert y-\hat{y} \rvert
Overlap
Does the predicted class match the true class? The 0-1 loss counts the proportion of mismatches:
\small Loss_{01}=\frac{1}{n}\sum_i^n I(y \neq \lfloor \hat{y} \rceil)
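A sketch of how these classification losses could be computed in R, continuing the hypothetical glm() example above (y = observed classes, y_hat = predicted probabilities):

```r
y     <- dat$chest_pain   # observed classes (0/1)
y_hat <- dat$y_hat        # predicted probabilities

# LogLoss: strongly penalizes confident but wrong probability predictions
logloss <- -mean(y * log(y_hat) + (1 - y) * log(1 - y_hat))

# 0-1 loss: proportion of misclassifications after rounding to 0/1
loss_01 <- mean(y != round(y_hat))
```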
The confusion matrix
The confusion matrix permits specification of a number of performance metrics:
Confusion matrix

| | Prediction: Positive | Prediction: Negative |
| --- | --- | --- |
| Truth: Positive | True positive (TP) | False negative (FN) |
| Truth: Negative | False positive (FP) | True negative (TN) |
Accuracy: Of all cases, what percent of predictions are correct?
\small Acc. = \frac{TP + TN}{ TP + TN + FN + FP} = 1-Loss_{01}
Sensitivity: Of the truly Positive cases, what percent of predictions are correct?
\small Sensitivity = \frac{TP}{ TP +FN }
Specificity: Of the truly Negative cases, what percent of predictions are correct?
\small Specificity = \frac{TN}{ TN + FP }
Example confusion matrix

| | Prediction: Positive | Prediction: Negative |
| --- | --- | --- |
| Truth: Positive | TP = 3 | FN = 1 |
| Truth: Negative | FP = 1 | TN = 2 |
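A worked sketch in R of the three metrics for the example counts above:

```r
tp <- 3; fn <- 1; fp <- 1; tn <- 2

accuracy    <- (tp + tn) / (tp + tn + fn + fp)   # 5 / 7 = .71
sensitivity <- tp / (tp + fn)                    # 3 / 4 = .75
specificity <- tn / (tn + fp)                    # 2 / 3 = .67
```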
Model performance must be evaluated as true prediction on an unseen data set, i.e., data that was not used to fit the model.
The unseen data set can be genuinely new data, collected only after the model has been fitted.
More commonly, unseen data is created by splitting the available data into a training set and a test set.
Training a model means to fit the model's parameters to the training set.
To test a model means to evaluate the fitted model's predictions on the test set.
Overfitting occurs when a model fits the training data too closely, capturing noise rather than the underlying signal.
In other words, overfitting occurs when a model predicts the training data well, but new, unseen data poorly.
More flexible (complex) models are more prone to overfitting.
Regularization penalizes the regression loss for having large \beta values, using a penalty function f applied to the weights:
Regularized \;loss = \sum_i^n (y_i-\hat{y}_i)^2+\lambda \sum_j^p f(\beta_j)
| Name | Function | Description |
| --- | --- | --- |
| Lasso | \lvert \beta_j \rvert | Penalize by the absolute value of the weights. |
| Ridge | \beta_j^2 | Penalize by the squared value of the weights. |
| Elastic net | \lvert \beta_j \rvert + \beta_j^2 | Penalize by both the Lasso and Ridge penalties. |
Figure from mallorcazeitung.es.
Despite their similarity, the two penalties have quite different effects on the fitted weights.
Ridge
By penalizing the most extreme \betas most strongly, Ridge leads to (relatively) more even weights that are shrunken toward, but rarely exactly to, zero.
Lasso
By penalizing all \betas equally, irrespective of magnitude, Lasso drives some \betas to 0, resulting effectively in automatic feature selection.
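A minimal sketch of fitting regularized regressions in R; the glmnet package is an assumption about the tooling, and the data are made up. alpha switches between Ridge (0), Lasso (1), and Elastic net (values in between); lambda controls the strength of the penalty:

```r
library(glmnet)

# Made-up feature matrix X and criterion y
set.seed(1)
X <- matrix(rnorm(100 * 5), ncol = 5)
y <- as.numeric(X %*% c(3, 0, 0, 1, 0)) + rnorm(100)

ridge <- glmnet(X, y, alpha = 0)   # Ridge: squared penalty
lasso <- glmnet(X, y, alpha = 1)   # Lasso: absolute-value penalty

# Lasso drives some weights exactly to 0 (feature selection)
coef(lasso, s = 0.5)               # s = value of lambda at which to extract weights
```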
CART is short for Classification And Regression Trees.
In decision trees, the criterion is modeled as a sequence of binary decisions (splits) on the features.
Classification trees (and regression trees) are created using a relatively simple algorithm.
Algorithm
1 - Split the data into nodes so that the purity gain is maximized.
2 - Repeat until a stopping criterion (e.g., minsplit, the minimum node size required for a split) is reached and splits are no longer possible.
3 - Prune the tree back to an appropriate size.
Classification trees attempt to minimize the impurity of the resulting nodes, measured, e.g., by the Gini coefficient:
\large Gini(S) = 1 - \sum_j^kp_j^2
Nodes are split using the feature and split value that maximize the Gini gain:
Gini \; gain = Gini(S) - Gini(A,S)
with
Gini(A, S) = \sum \frac{n_i}{n}Gini(S_i)
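A small worked sketch in R of Gini impurity and Gini gain for a made-up binary split:

```r
# Gini impurity of a set of class labels
gini <- function(classes) {
  p <- table(classes) / length(classes)
  1 - sum(p^2)
}

# Made-up parent node and a candidate split into two child nodes
parent <- c(1, 1, 1, 1, 0, 0, 0, 0)
left   <- c(1, 1, 1, 0)
right  <- c(1, 0, 0, 0)

# Weighted impurity of the children, Gini(A, S)
gini_split <- length(left)  / length(parent) * gini(left) +
              length(right) / length(parent) * gini(right)

# Gini gain: reduction in impurity achieved by the split
gini_gain <- gini(parent) - gini_split
```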
Classification trees are pruned using the complexity parameter cp (typically set to .01), which penalizes the number of terminal nodes.
Minimize:
\large \begin{split} Loss = & Impurity\,+\\ &cp*(n\:terminal\:nodes)\\ \end{split}
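A minimal sketch of growing and pruning a classification tree in R with the rpart package (an assumption about the tooling; the data and parameter values are illustrative):

```r
library(rpart)

# Classification tree on the built-in iris data, with explicit
# minsplit (minimum node size to attempt a split) and cp (complexity penalty)
tree <- rpart(Species ~ .,
              data    = iris,
              method  = "class",
              control = rpart.control(minsplit = 20, cp = 0.01))

# Prune the tree back further with a stricter complexity parameter
tree_pruned <- prune(tree, cp = 0.05)
```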
Trees can also be used to perform regression tasks. Instead of impurity, regression trees attempt to minimize the within-node sum of squared errors (SSE):
\large SSE = \sum_{i \in S_1}(y_i - \bar{y}_1)^2+\sum_{i \in S_2}(y_i - \bar{y}_2)^2
Algorithm
1 - Split the data into nodes so that the SSE is minimized.
2 - Repeat until a stopping criterion (e.g., minsplit) is reached and splits are no longer possible.
3 - Prune the tree back to an appropriate size.
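The same rpart interface sketched above can grow regression trees; again, the data are illustrative:

```r
library(rpart)

# Regression tree: predict miles per gallon (mpg) from the other mtcars variables
reg_tree <- rpart(mpg ~ ., data = mtcars, method = "anova")

# Predictions are the mean criterion value of the terminal node a case falls into
predict(reg_tree, newdata = mtcars[1:3, ])
```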
In Random Forest, the criterion is modeled as the aggregate prediction of many decision trees, each fitted to a bootstrapped sample of the data.
Algorithm
1 - Repeat many times:
  1 - Draw a bootstrap sample from the data.
  2 - Fit a decision tree to the bootstrap sample. Each split considers only a random subset of the features.
2 - Aggregate the predictions of all trees (majority vote for classification, average for regression).
Random forests make use of important machine learning elements, namely resampling and averaging:

| Element | Description |
| --- | --- |
| Resampling | Creates new data sets that vary in their composition, thereby de-emphasizing idiosyncrasies (noise) of the available data. |
| Averaging | Combining predictions typically evens out the errors of individual models, improving overall prediction. |
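A minimal sketch with the randomForest package in R (an assumption about the tooling; data and settings are illustrative):

```r
library(randomForest)

# Random forest classifier: 500 trees, each grown on a bootstrap sample;
# mtry = number of randomly chosen features considered at each split
rf <- randomForest(Species ~ .,
                   data  = iris,
                   ntree = 500,
                   mtry  = 2)

# Aggregated (majority-vote) predictions
predict(rf, newdata = iris[1:3, ])
```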
All machine learning models are equipped with tuning parameters that control the model's complexity (flexibility) and must be set before fitting.
These tuning parameters can be identified using a validation set held out from the training data.
Logic
1 - Create separate test set.
2 - Fit model using various tuning parameters.
3 - Select tuning leading to best prediction on validation set.
4 - Refit model to entire training set (training + validation).
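A sketch of this logic in R, tuning the cp parameter of a classification tree with a hand-made validation split (tooling and values are illustrative assumptions):

```r
library(rpart)

set.seed(1)
n        <- nrow(iris)
train_id <- sample(n, size = 120)            # 1 - create separate validation set
train    <- iris[train_id, ]
valid    <- iris[-train_id, ]

cps  <- c(0.001, 0.01, 0.1)                  # 2 - fit model using various tuning parameters
accs <- sapply(cps, function(cp) {
  fit  <- rpart(Species ~ ., data = train, cp = cp)
  pred <- predict(fit, newdata = valid, type = "class")
  mean(pred == valid$Species)                # validation accuracy
})

best_cp <- cps[which.max(accs)]              # 3 - select tuning leading to best prediction
final   <- rpart(Species ~ ., data = iris, cp = best_cp)   # 4 - refit to entire training set
```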
Resampling methods automate and generalize model tuning.

| Method | Description |
| --- | --- |
| k-fold cross-validation | Splits the data into k pieces, uses each piece once as the validation set while training on the remaining k-1 pieces, and averages performance across the k rounds. |
| Bootstrap | For B bootstrap rounds, trains on a bootstrap sample of the data and validates on the cases not included in that sample. |
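In R, both resampling schemes can be requested, e.g., through the caret package's trainControl() (an assumption about the tooling; the numbers are illustrative):

```r
library(caret)

ctrl_cv   <- trainControl(method = "cv",   number = 10)   # 10-fold cross-validation
ctrl_boot <- trainControl(method = "boot", number = 25)   # bootstrap with B = 25 rounds
```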
Goal
Use 10-fold cross-validation to identify the best-performing combination of the tuning parameters \alpha and \lambda for a regularized (elastic net) regression.
Consider
\alpha \in 0, .5, 1 and \lambda \in 1, 2, ..., 100
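One way to approach this goal, sketched with the caret and glmnet packages in R (the package choice and the data are assumptions; the tuning grid mirrors the values above):

```r
library(caret)
library(glmnet)

# Made-up data: criterion y and five numeric features X1-X5
set.seed(1)
dat   <- data.frame(matrix(rnorm(200 * 5), ncol = 5))
dat$y <- 2 * dat$X1 - 1 * dat$X3 + rnorm(200)

# Candidate tuning parameters
grid <- expand.grid(alpha  = c(0, .5, 1),
                    lambda = 1:100)

# 10-fold cross-validation over the tuning grid
fit <- train(y ~ .,
             data      = dat,
             method    = "glmnet",
             trControl = trainControl(method = "cv", number = 10),
             tuneGrid  = grid)

fit$bestTune   # alpha and lambda with the best cross-validated performance
```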