
Fitting

Models are actually families of models, with every parameter combination specifying a different model.

To fit a model means to identify from the family of models the specific model that fits the data best.




adapted from explainxkcd.com


Loss function

Possibly the most important concept in statistics and machine learning.

The loss function defines some summary of the errors committed by the model.

Loss=f(Error)

Two purposes

| Purpose | Description |
|---|---|
| Fitting | Find parameters that minimize the loss function. |
| Evaluation | Calculate the loss function for the fitted model. |


Regression

In regression, the criterion Y is modeled as the sum of features X₁, X₂, ... times weights β₁, β₂, ... plus β₀, the so-called intercept.

\hat{Y} = \beta_{0} + \beta_{1} \times X_1 + \beta_{2} \times X_2 + ...

The weight βᵢ indicates the amount of change in Ŷ for a change of 1 in Xᵢ.

Ceteris paribus, the more extreme βᵢ, the more important Xᵢ is for the prediction of Y (Note: the scale of Xᵢ matters too!).

If βᵢ = 0, then Xᵢ does not help predict Y.


Regression loss

Mean Squared Error (MSE)

Average squared distance between predictions and true values.

MSE = \frac{1}{n}\sum_{i \in 1,...,n}(Y_{i}-\hat{Y}_{i})^{2}

Mean Absolute Error (MAE)

Average absolute distance between predictions and true values.

MAE = \frac{1}{n}\sum_{i \in 1,...,n} \lvert Y_{i}-\hat{Y}_{i} \rvert
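The two definitions above can be computed directly. The following is a minimal sketch in plain Python; the function names `mse` and `mae` and the example values are made up for illustration.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of the squared prediction errors."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error: average of the absolute prediction errors."""
    return sum(abs(y - y_hat) for y, y_hat in zip(y_true, y_pred)) / len(y_true)


# Made-up true values and predictions (illustration only).
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 13.0]

# The errors are 1, 0, -1, -4.
print(mse(y_true, y_pred))  # (1 + 0 + 1 + 16) / 4 = 4.5
print(mae(y_true, y_pred))  # (1 + 0 + 1 + 4) / 4 = 1.5
```

Note how the single large error of 4 dominates the MSE but not the MAE: squaring weights large errors more heavily.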


Fitting

There are two fundamentally different ways to find the set of parameters that minimizes loss.

Analytically

In rare cases, the parameters can be directly calculated, e.g., using the normal equation:

\boldsymbol \beta = (\boldsymbol X^T\boldsymbol X)^{-1}\boldsymbol X^T\boldsymbol y

Numerically

In most cases, parameters need to be found using directed trial and error, e.g., gradient descent:

\boldsymbol \beta_{n+1} = \boldsymbol \beta_{n}-\gamma \nabla F(\boldsymbol \beta_{n})
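To illustrate both routes, here is a minimal sketch for a regression with a single feature, assuming F is the MSE. For one feature, the normal equation reduces to the familiar slope and intercept formulas. The data, the step size `gamma = 0.05`, and the number of iterations are arbitrary choices made for this example.

```python
# Made-up data lying exactly on the line y = 1 + 2x (illustration only).
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]
n = len(x)

# Analytic route: with one feature the normal equation reduces to
# slope = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2).
x_mean = sum(x) / n
y_mean = sum(y) / n
slope = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
         / sum((xi - x_mean) ** 2 for xi in x))
intercept = y_mean - slope * x_mean

# Numeric route: gradient descent on the MSE, starting from zero weights.
b0, b1 = 0.0, 0.0
gamma = 0.05  # step size (assumed; too large a value would diverge)
for _ in range(5000):
    errors = [b0 + b1 * xi - yi for xi, yi in zip(x, y)]
    grad_b0 = 2 / n * sum(errors)
    grad_b1 = 2 / n * sum(e * xi for e, xi in zip(errors, x))
    # Step against the gradient to reduce the loss.
    b0 -= gamma * grad_b0
    b1 -= gamma * grad_b1

print(intercept, slope)  # analytic solution
print(b0, b1)            # numeric solution, close to the analytic one
```

Both routes should arrive at (approximately) the same parameters; the numeric route only needs the gradient of the loss, which is why it also works for models without a closed-form solution.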


adapted from me.me

adapted from dunglai.github.io


2 types of supervised problems

There are two types of supervised learning problems that can often be approached using the same model.

Regression

Regression problems involve the prediction of a quantitative criterion.

E.g., predicting the cholesterol level as a function of age.

Classification

Classification problems involve the prediction of a categorical criterion.

E.g., predicting the type of chest pain as a function of age.



Logistic regression

In logistic regression, the class criterion Y ∈ {0, 1} is also modeled as the sum of features times weights, but with the prediction being transformed using a logistic link function:

\hat{Y} = Logistic(\beta_{0} + \beta_{1} \times X_1 + ...)

The logistic function maps predictions to the range between 0 and 1, the two class values.

Logistic(x) = \frac{1}{1+\exp(-x)}
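A minimal sketch of the logistic function and of a logistic regression prediction follows. The helper `predict_probability` and the weights used in the example are hypothetical and chosen only for illustration.

```python
import math


def logistic(x):
    """Map any real number to the interval between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_probability(features, weights, intercept):
    """Weighted sum of the features, passed through the logistic function."""
    linear = intercept + sum(w * f for w, f in zip(weights, features))
    return logistic(linear)


print(logistic(0.0))   # 0.5, the midpoint between the two class values
print(logistic(-5.0))  # close to 0
print(logistic(5.0))   # close to 1

# Hypothetical weights: intercept -1, one feature with weight 0.5.
print(predict_probability([4.0], [0.5], -1.0))  # logistic(1.0), about 0.73
```

For very large negative inputs `math.exp(-x)` overflows, so a production implementation would need a numerically safer formulation.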


Classification loss - two ways

Distance

Logloss is used to fit the parameters; alternative distance measures are MSE and MAE.

LogLoss = -\frac{1}{n}\sum_{i}^{n}(\log(\hat{y}_i)y_i+\log(1-\hat{y}_i)(1-y_i))

MSE = \frac{1}{n}\sum_{i}^{n}(y_i-\hat{y}_i)^2, \: MAE = \frac{1}{n}\sum_{i}^{n} \lvert y_i-\hat{y}_i \rvert

Overlap

Does the predicted class match the actual class? Often preferred for ease of interpretation.

Loss_{01}=\frac{1}{n}\sum_i^n I(y_i \neq \lfloor \hat{y}_i \rceil)
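The two kinds of classification loss can be contrasted on the same predictions. This is a minimal sketch with made-up labels and predicted probabilities; rounding a probability of exactly 0.5 up to class 1 is an arbitrary choice made here.

```python
import math


def log_loss(y_true, y_prob):
    """Distance-based loss: penalizes confident wrong probabilities heavily."""
    total = sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, y_prob))
    return -total / len(y_true)


def zero_one_loss(y_true, y_prob):
    """Overlap-based loss: share of cases where the rounded prediction is wrong."""
    rounded = [1 if p >= 0.5 else 0 for p in y_prob]
    return sum(y != r for y, r in zip(y_true, rounded)) / len(y_true)


# Made-up labels and predicted probabilities (illustration only).
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.6]

print(log_loss(y_true, y_prob))       # about 0.54
print(zero_one_loss(y_true, y_prob))  # 0.5: the last two cases are misclassified
```

The 0-1 loss ignores how far a probability is from the true class, which makes it easy to interpret but uninformative as a fitting criterion.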


Confusion matrix

The confusion matrix tabulates prediction matches and mismatches as a function of the true class.

The confusion matrix permits specification of a number of helpful performance metrics.


Confusion matrix

| | ŷ = 1 | ŷ = 0 |
|---|---|---|
| y = 1 | True positive (TP) | False negative (FN) |
| y = 0 | False positive (FP) | True negative (TN) |

Accuracy: Of all cases, what percent of predictions are correct?

Acc. = \frac{TP + TN}{ TP + TN + FN + FP} = 1-Loss_{01}

Sensitivity: Of the truly Positive cases, what percent of predictions are correct?

Sensitivity = \frac{TP}{ TP +FN }

Specificity: Of the truly Negative cases, what percent of predictions are correct?

Specificity = \frac{TN}{ TN + FP }

Confusion matrix (example)

| | "Default" | "Repay" |
|---|---|---|
| Default | TP = 3 | FN = 1 |
| Repay | FP = 1 | TN = 2 |
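Applying the accuracy, sensitivity, and specificity formulas to the counts in this example gives the following values. This is a small worked sketch; only the four counts are taken from the table.

```python
# Counts from the example confusion matrix.
tp, fn = 3, 1  # truly "Default" cases
fp, tn = 1, 2  # truly "Repay" cases

accuracy = (tp + tn) / (tp + tn + fn + fp)  # 5 / 7, about 0.71
sensitivity = tp / (tp + fn)                # 3 / 4 = 0.75
specificity = tn / (tn + fp)                # 2 / 3, about 0.67

print(accuracy, sensitivity, specificity)
```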


Hold-out data

Model performance must be evaluated as true prediction on an unseen data set.

The unseen data set can be naturally occurring, e.g., using 2019 stock prices to evaluate a model fit using 2018 stock prices.

More commonly, unseen data is created by splitting the available data into a training set and a test set.
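A random split into training and test set can be sketched as follows. The function name `train_test_split`, the default test fraction, and the seed are assumptions made for this example, not part of the slides.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=1):
    """Shuffle the rows and set aside a fraction of them as the test set."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(rows)      # copy, so the input is left untouched
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


data = list(range(10))  # stand-in for ten observations
train, test = train_test_split(data, test_fraction=0.3)

print(len(train), len(test))  # 7 3
```

Every observation ends up in exactly one of the two sets, so the test set stays unseen during fitting.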


Training

Training a model means to fit the model to data by finding the parameter combination that minimizes some error function, e.g., mean squared error (MSE).


Test

To test a model means to evaluate the prediction error for a fitted model, i.e., for a fixed parameter combination.




Overfitting

Occurs when a model fits data too closely and therefore fails to reliably predict future observations.

In other words, overfitting occurs when a model 'mistakes' random noise for a predictable signal.

More complex models are more prone to overfitting.





Regularized regression

Penalizes regression loss for having large β values using the tuning parameter λ and one of several penalty functions.

Regularized \;loss = \sum_i^n (y_i-\hat{y}_i)^2+\lambda \sum_j^p f(\beta_j)

| Name | Function | Description |
|---|---|---|
| Lasso | \lvert \beta_j \rvert | Penalize by the absolute regression weights. |
| Ridge | \beta_j^2 | Penalize by the squared regression weights. |
| Elastic net | \lvert \beta_j \rvert + \beta_j^2 | Penalize by Lasso and Ridge penalties. |
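The regularized loss can be written out for the three penalty functions. This is a minimal sketch; the function names and all example values are made up, and `beta` is assumed to contain the weights β₁, ..., βₚ without the intercept.

```python
def penalty(beta, kind):
    """Sum of the penalty function f over all regression weights."""
    if kind == "lasso":
        return sum(abs(b) for b in beta)
    if kind == "ridge":
        return sum(b ** 2 for b in beta)
    if kind == "elastic_net":
        return sum(abs(b) + b ** 2 for b in beta)
    raise ValueError(f"unknown penalty: {kind}")


def regularized_loss(y_true, y_pred, beta, lam, kind):
    """Sum of squared errors plus lambda times the penalty."""
    sse = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))
    return sse + lam * penalty(beta, kind)


# Made-up values: the sum of squared errors is 0.25 + 0 + 0.25 = 0.5.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.5, 2.0, 2.5]
beta = [0.5, -2.0]

for kind in ("lasso", "ridge", "elastic_net"):
    print(kind, regularized_loss(y_true, y_pred, beta, lam=0.1, kind=kind))
```

With these values the penalties are 2.5 (Lasso), 4.25 (Ridge), and 6.75 (Elastic net), so the same predictions receive a different loss depending on the penalty function.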

Regularized regression

Despite superficial similarities, Lasso and Ridge show very different behavior.

Ridge

By penalizing the most extreme βs most strongly, Ridge leads to (relatively) more uniform βs.

Lasso

By penalizing all βs equally, irrespective of magnitude, Lasso drives some βs to 0, resulting effectively in automatic feature selection.
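The different behavior can be made concrete in a deliberately simplified setting. Assume each weight can be treated separately and the loss for one weight is (b - β)² + λ f(β), where b is the unregularized estimate (this holds, for instance, for uncorrelated, standardized features). Under that assumption, Ridge rescales b, whereas Lasso subtracts a constant and cuts off at 0. The function names and values below are made up for illustration.

```python
def ridge_shrink(b, lam):
    """Minimizer of (b - beta)^2 + lam * beta^2: proportional shrinkage."""
    return b / (1.0 + lam)


def lasso_shrink(b, lam):
    """Minimizer of (b - beta)^2 + lam * |beta|: shift by lam / 2, floor at 0."""
    magnitude = max(abs(b) - lam / 2.0, 0.0)
    return magnitude if b >= 0 else -magnitude


# Made-up unregularized estimates: one large, one medium, one small.
estimates = [4.0, 1.0, 0.2]
lam = 1.0

print([ridge_shrink(b, lam) for b in estimates])  # [2.0, 0.5, 0.1]
print([lasso_shrink(b, lam) for b in estimates])  # [3.5, 0.5, 0.0]
```

Ridge removes the most from the largest weight (4.0 to 2.0) but never reaches exactly 0, while Lasso removes the same amount from every weight and sets the smallest one to exactly 0.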


CART

CART is short for Classification and Regression Trees, which are often just called Decision trees.

In decision trees, the criterion is modeled as a sequence of logical TRUE or FALSE questions.


Classification trees

Classification trees (and regression trees) are created using a relatively simple three-step algorithm.

Algorithm

1 - Split nodes to maximize purity gain (e.g., Gini gain).

2 - Repeat until splits are no longer possible given a pre-defined threshold (e.g., minsplit).

3 - Prune tree to reasonable size.


Node splitting

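Classification trees attempt to minimize node impurity using, e.g., the Gini index.

Gini(S) = 1 - \sum_j^k p_j^2

where p_j is the proportion of cases in node S that belong to class j.

Nodes are split using the variable and split value that maximize Gini gain.

Gini \; gain = Gini(S) - Gini(A,S)

with

Gini(A, S) = \sum_i \frac{n_i}{n}Gini(S_i)

where S_i are the child nodes created by splitting S on variable A, and n_i is the number of cases in S_i.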
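The Gini formulas can be checked on a small example. This is a minimal sketch with made-up class labels; the function names `gini` and `gini_gain` are chosen for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_gain(parent, children):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * gini(child) for child in children)
    return gini(parent) - weighted


# Made-up node with four cases of each class.
parent = ["a"] * 4 + ["b"] * 4

# An imperfect split: each child still contains one case of the other class.
imperfect = [["a", "a", "a", "b"], ["a", "b", "b", "b"]]
# A perfect split: each child contains a single class.
perfect = [["a"] * 4, ["b"] * 4]

print(gini(parent))                  # 0.5
print(gini_gain(parent, imperfect))  # 0.5 - 0.375 = 0.125
print(gini_gain(parent, perfect))    # 0.5 - 0.0 = 0.5
```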


Pruning trees

Classification trees are pruned back such that every split has a purity gain of at least cp, with cp typically set to .01.

Minimize:


Loss = Impurity + cp \times (n \: terminal \: nodes)


Regression trees

Trees can also be used to perform regression tasks. Instead of impurity, regression trees attempt to minimize within-node variance (or maximize node homogeneity):

SSE = \sum_{i \in S_1}(y_i - \bar{y}_1)^2+\sum_{i \in S_2}(y_i - \bar{y}_2)^2

Algorithm

1 - Split nodes to maximize homogeneity gain.

2 - Repeat until splits are no longer possible given a pre-defined threshold (e.g., minsplit).

3 - Prune tree to reasonable size.
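The first step, finding the split that minimizes the summed within-node squared error, can be sketched for a single feature. The data are made up, and using the observed feature values as candidate thresholds is a simplification assumed here (implementations often use midpoints between neighboring values).

```python
def sse(values):
    """Sum of squared deviations from the node mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, total SSE) of the best split 'x < threshold'."""
    best = None
    # Skip the smallest value so that both child nodes are non-empty.
    for threshold in sorted(set(x))[1:]:
        left = [yi for xi, yi in zip(x, y) if xi < threshold]
        right = [yi for xi, yi in zip(x, y) if xi >= threshold]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best


# Made-up data with a jump in y between x = 3 and x = 4.
x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]

threshold, total = best_split(x, y)
print(threshold, total)  # splits at 4, where both child nodes are most homogeneous
```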


Random Forest

In Random Forest, the criterion is modeled as the aggregate prediction of a large number of decision trees each based on different features.

Algorithm

1 - Repeat n times

      1 - Resample data

      2 - Grow non-pruned decision tree

            At each split, consider only m features

2 - Average fitted values
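The algorithm above can be sketched in a heavily simplified form. As an assumption made to keep the example short, each "tree" below is a regression tree with a single split (a stump) rather than a fully grown tree; all function names, the data, and the defaults `n_trees=50` and `m=1` are made up for illustration.

```python
import random


def fit_stump(rows, targets, feature_ids):
    """Fit a one-split regression tree using only the given features."""
    best = None  # (sse, feature, threshold, left_mean, right_mean)
    for j in feature_ids:
        for threshold in sorted({row[j] for row in rows})[1:]:
            left = [t for row, t in zip(rows, targets) if row[j] < threshold]
            right = [t for row, t in zip(rows, targets) if row[j] >= threshold]
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            sse = (sum((t - left_mean) ** 2 for t in left)
                   + sum((t - right_mean) ** 2 for t in right))
            if best is None or sse < best[0]:
                best = (sse, j, threshold, left_mean, right_mean)
    if best is None:  # no split possible: always predict the mean
        mean = sum(targets) / len(targets)
        return (0, float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, row):
    j, threshold, left_mean, right_mean = stump
    return left_mean if row[j] < threshold else right_mean


def fit_forest(rows, targets, n_trees=50, m=1, seed=1):
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        feature_ids = rng.sample(range(p), m)       # consider only m features
        forest.append(fit_stump([rows[i] for i in idx],
                                [targets[i] for i in idx], feature_ids))
    return forest


def predict_forest(forest, row):
    """Average the predictions of all trees."""
    return sum(predict_stump(s, row) for s in forest) / len(forest)


# Made-up data: the target depends on feature 0 only; feature 1 is noise.
rows = [[i, (i * 7) % 5] for i in range(20)]
targets = [0.0 if i < 10 else 10.0 for i in range(20)]

forest = fit_forest(rows, targets)
print(predict_forest(forest, [2, 0]), predict_forest(forest, [17, 0]))
```

Because a stump has only one split, drawing m features per tree is the same as drawing them per split; with deeper trees the features would be re-drawn at every split.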



Random Forest

Random forests make use of two important machine learning elements, resampling and averaging, which together are also referred to as bagging.

| Element | Description |
|---|---|
| Resampling | Creates new data sets that vary in their composition, thereby deemphasizing idiosyncrasies of the available data. |
| Averaging | Combining predictions typically evens out idiosyncrasies of the models created from single data sets. |


Tuning

Most machine learning models are equipped with tuning parameters that control model complexity.

These tuning parameters can be identified using a validation set created from the training data.

Logic

1 - Create separate test set.

2 - Fit model using various tuning parameters.

3 - Select tuning leading to best prediction on validation set.

4 - Refit model to entire training set (training + validation).
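Steps 2 to 4 can be sketched with a deliberately small model: a ridge regression with one feature and no intercept, for which the minimizer of the regularized loss is β = Σxy / (Σx² + λ). The data, the candidate values for λ, and the function names are made up for illustration, and a separate test set is assumed to have been set aside already.

```python
def fit_ridge(x, y, lam):
    """One-feature ridge regression without intercept (closed form)."""
    return sum(xi * yi for xi, yi in zip(x, y)) / (sum(xi ** 2 for xi in x) + lam)


def validation_mse(x, y, beta):
    """Mean squared prediction error for a fixed weight beta."""
    return sum((yi - beta * xi) ** 2 for xi, yi in zip(x, y)) / len(x)


# Made-up training data, already divided into training and validation parts.
x_train, y_train = [1.0, 2.0, 3.0], [2.5, 3.5, 6.5]
x_val, y_val = [4.0, 5.0], [8.0, 10.0]

# Step 2: fit the model once per candidate tuning parameter.
candidates = [0.0, 1.0, 10.0]
scores = {lam: validation_mse(x_val, y_val, fit_ridge(x_train, y_train, lam))
          for lam in candidates}

# Step 3: select the value with the best prediction on the validation set.
best_lam = min(scores, key=scores.get)

# Step 4: refit on training plus validation data using the selected value.
final_beta = fit_ridge(x_train + x_val, y_train + y_val, best_lam)

print(best_lam, final_beta)
```

In this example a small amount of regularization (λ = 1) predicts the validation data slightly better than no regularization, while λ = 10 shrinks the weight too much.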


Resampling methods

Resampling methods automate and generalize model tuning.

| Method | Description |
|---|---|
| k-fold cross-validation | Splits the data into k pieces and uses each piece once as the validation set, while using the remaining pieces for training. |
| Bootstrap | For B bootstrap rounds, sample from the data with replacement and split the data into a training and a validation set. |
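The splitting logic of k-fold cross-validation can be sketched as follows. The function name `k_fold_indices` is made up, and assigning observations to folds in a fixed, interleaved order is a simplification; in practice the data would typically be shuffled first.

```python
def k_fold_indices(n, k):
    """Yield (training, validation) index lists for each of the k folds."""
    # Observation i goes into fold i mod k.
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


# Ten observations, five folds: each fold serves once as the validation set.
for training, validation in k_fold_indices(10, 5):
    print(validation, training)
```

Each observation is used for validation exactly once and for training in the other k - 1 rounds.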


k-fold cross-validation for Ridge and Lasso

Goal

Use 10-fold cross-validation to identify optimal regularization parameters for a regression model.

Consider

α ∈ {0, .5, 1} and λ ∈ {1, 2, ..., 100}



