Supervised learning

# Supervised learning
### <a href='https://cdsbasel.github.io/dataanalytics/'> Data Analytics for Psychology and Business </a> <a href='https://cdsbasel.github.io/dataanalytics/menu/materials.html'> </a>  <a href='https://cdsbasel.github.io/dataanalytics/'> </a>  <a href='mailto:rui.mata@unibas.ch'> 
### April 2019

---

<div class="my-footer">
 
 
 <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/by-sa.png" height=14 style="vertical-align: middle"/>
 
 <a href="https://cdsbasel.github.io/dataanalytics/">
 
 
 cdsbasel.github.io/dataanalytics/
 
 
 </a>
 <a href="https://cdsbasel.github.io/dataanalytics/">
 
 Data Analytics for Psychology and Business | April 2019
 
 </a>
 
 </div>

---

<a><h1>Fitting</h1></a>

<h1>Evaluation</h1>

<h1>Tuning</h1>

---

# Fitting

Models are actually <high>families of models</high>, with every parameter combination specifying a different model.

To fit a model means to <high>identify</high> from the family of models <high>the specific model that fits the data best</high>.

]

<img src="image/curvefits.png" height=480px> 
adapted from <a href="https://www.explainxkcd.com/wiki/index.php/2048:_Curve-Fitting">explainxkcd.com</a>

]

---

# Which of these models is better? Why?

---

# Which of these models is better? Why?

---

# Loss function

Possible <high>the most important concept</high> in statistics and machine learning.

The loss function defines some <high>summary of the errors committed by the model</high>.

`$$\Large Loss = f(Error)$$`

Two purposes

<table style="cellspacing:0; cellpadding:0; border:none;">
<tr>
 <td>
 Purpose
 </td>
 <td>
 Description
 </td>
</tr>
<tr>
 <td bgcolor="white">
 Fitting
 </td>
 <td bgcolor="white">
 Find parameters that minimize loss function.
 </td>
</tr>
<tr>
 <td>
 Evaluation
 </td>
 <td>
 Calculate loss function for fitted model.
 </td>
</tr>
</table>

]

]

---

<high><h1>Regression</h1></high>

<h1>Decision Trees</h1>

<h1>Random Forests</h1>

---

# Regression

In [regression](https://en.wikipedia.org/wiki/Regression_analysis), the criterion `$Y$` is modeled as the <high>sum</high> of <high>features</high> `$X_1, X_2, ...$` <high>times weights</high> `$\beta_1, \beta_2, ...$` plus `$\beta_0$` the so-called the intercept.

`$$\large \hat{Y} =  \beta_{0} + \beta_{1} \times X_1 + \beta_{2} \times X2 + ...$$`

The weight `$\beta_{i}$` indiciates the <high>amount of change</high> in `$\hat{Y}$` for a change of 1 in `$X_{i}$`.

Ceteris paribus, the <high>more extreme</high> `$\beta_{i}$`, the <high>more important</high> `$X_{i}$` for the prediction of `$Y$` (Note: the scale of `$X_{i}$` matters too!).

If `$\beta_{i} = 0$`, then `$X_{i}$` <high>does not help</high> predicting `$Y$`

]

]

---

# Regression loss

Mean Squared Error (MSE) <high>Average squared distance</high> between predictions and true values?

$$ MSE = \frac{1}{n}\sum_{i \in 1,...,n}(Y_{i} - \hat{Y}_{i})^{2}$$

Mean Absolute Error (MAE) <high>Average absolute distance</high> between predictions and true values?

$$ MAE = \frac{1}{n}\sum_{i \in 1,...,n} \lvert Y_{i} - \hat{Y}_{i} \rvert$$

]

]

---

# Fitting

There are two fundamentally different ways to find the set of parameters that minimizes loss.

Analytically

In rare cases, the parameters can be <high>directly calculated</high>, e.g., using the normal equation:

`$$\boldsymbol \beta = (\boldsymbol X^T\boldsymbol X)^{-1}\boldsymbol X^T\boldsymbol y$$`

Numerically

In most cases, parameters need to be found using a <high>directed trial and error</high>, e.g., gradient descent:

`$$\boldsymbol \beta_{n+1} = \boldsymbol \beta_{n}+\gamma \nabla F(\boldsymbol \beta_{n})$$`

]

<img src="image/gradient.png" height=420px> 
adapted from <a href="https://me.me/i/machine-learning-gradient-descent-machine-learning-machine-learning-behind-the-ea8fe9fc64054eda89232d7ffc9ba60e">me.me</a>

]

---

# Fitting

There are two fundamentally different ways to find the set of parameters that minimizes loss.

Analytically

In rare cases, the parameters can be <high>directly calculated</high>, e.g., using the normal equation:

`$$\boldsymbol \beta = (\boldsymbol X^T\boldsymbol X)^{-1}\boldsymbol X^T\boldsymbol y$$`

Numerically

In most cases, parameters need to be found using a <high>directed trial and error</high>, e.g., gradient descent:

`$$\boldsymbol \beta_{n+1} = \boldsymbol \beta_{n}+\gamma \nabla F(\boldsymbol \beta_{n})$$`

]

<br2>

<img src="image/gradient1.gif" height=250px> 
adapted from <a href="https://dunglai.github.io/2017/12/21/gradient-descent/
">dunglai.github.io</a> 
<img src="image/gradient2.gif" height=250px> 
adapted from <a href="https://dunglai.github.io/2017/12/21/gradient-descent/
">dunglai.github.io</a>

]

---

# 2 types of supervised problems

There are two types of supervised learning problems that can often be approached using the same model.

Regression

Regression problems involve the <high>prediction of a quantitative feature</high>.

E.g., predicting the cholesterol level as a function of age.

Classification

Classification problems involve the <high>prediction of a categorical feature</high>.

E.g., predicting the type of chest pain as a function of age.

]

]

---

# Logistic regression

In [logistic regression](https://en.wikipedia.org/wiki/Logistic_regression), the class criterion `$Y \in (0,1)$` is modeled also as the <high>sum of feature times weights</high>, but with the prediction being transformed using a <high>logistic link function</high>:

`$$\large \hat{Y} =  Logistic(\beta_{0} + \beta_{1} \times X_1 + ...)$$`

The logistic function <high>maps predictions to the range of 0 and 1</high>, the two class values.

$$ Logistic(x) = \frac{1}{1+exp(-x)}$$

]

]

---

# Logistic regression

`$$\large \hat{Y} =  Logistic(\beta_{0} + \beta_{1} \times X_1 + ...)$$`

The logistic function <high>maps predictions to the range of 0 and 1</high>, the two class values.

$$ Logistic(x) = \frac{1}{1+exp(-x)}$$

]

]

---

# Classification loss - two ways

Distance

Logloss is <high>used to fit the parameters</high>, alternative distance measures are MSE and MAE.

`$$\small LogLoss = -\frac{1}{n}\sum_{i}^{n}(log(\hat{y})y+log(1-\hat{y})(1-y))$$`
`$$\small MSE = \frac{1}{n}\sum_{i}^{n}(y-\hat{y})^2, \: MAE = \frac{1}{n}\sum_{i}^{n} \lvert y-\hat{y} \rvert$$`

Overlap

Does the <high>predicted class match the actual class</high>. Often preferred for <high>ease of interpretation</high>.

`$$\small Loss_{01}=\frac{1}{n}\sum_i^n I(y \neq \lfloor \hat{y} \rceil)$$`

]

]

---

# Confusion matrix

The confusion matrix <high>tabulates prediction matches and mismatches</high> as a function of the true class.

The confusion matrix permits specification of a number of <high>helpful performance metrics</high>.

Confusion matrix

<table style="cellspacing:0; cellpadding:0; border:none;">
<tr>
 <td>
 </td>
 <td>
 <eq>y&#770; = 1</eq>
 </td>
 <td>
 <eq>y&#770; = 0</eq>
 </td>
</tr>
<tr>
 <td bgcolor="white">
 <eq>y = 1</eq>
 </td>
 <td bgcolor="white">
 True positive (TP)
 </td>
 <td bgcolor="white">
 False negative (FN)
 </td>
</tr>
<tr>
 <td>
 <eq>y = 0</eq>
 </td>
 <td>
 False positive (FP)
 </td>
 <td>
 True negative (TN)
 </td>
</tr>
</table>

]

Accuracy: Of all cases, what percent of predictions are correct?

`$$\small Acc. = \frac{TP + TN}{ TP + TN + FN + FP} = 1-Loss_{01}$$`

Sensitivity: Of the truly Positive cases, what percent of predictions are correct?

`$$\small Sensitivity = \frac{TP}{ TP +FN }$$`

Specificity: Of the truly Negative cases, what percent of predictions are correct?

`$$\small Specificity = \frac{TN}{ TN + FP }$$`

]

---

# Confusion matrix

The confusion matrix <high>tabulates prediction matches and mismatches</high> as a function of the true class.

The confusion matrix permits specification of a number of <high>helpful performance metrics</high>.

Confusion matrix

<table style="cellspacing:0; cellpadding:0; border:none;">
<tr>
 <td>
 </td>
 <td>
 <eq>"Default"</eq>
 </td>
 <td>
 <eq>"Repay"</eq>
 </td>
</tr>
<tr>
 <td bgcolor="white">
 <eq>Default</eq>
 </td>
 <td bgcolor="white">
 TP = 3
 </td>
 <td bgcolor="white">
 FN = 1
 </td>
</tr>
<tr>
 <td>
 <eq>Repay</eq>
 </td>
 <td>
 FP = 1
 </td>
 <td>
 TN = 2
 </td>
</tr>
</table>

]

Accuracy: Of all cases, what percent of predictions are correct?

`$$\small Acc. = \frac{TP + TN}{ TP + TN + FN + FP} = 1-Loss_{01}$$`

Sensitivity: Of the truly Positive cases, what percent of predictions are correct?

`$$\small Sensitivity = \frac{TP}{ TP +FN }$$`

Specificity: Of the truly Negative cases, what percent of predictions are correct?

`$$\small Specificity = \frac{TN}{ TN + FP }$$`

]

---

<h1>Fitting</h1>

<a><h1>Evaluation</h1></a>

<h1>Tuning</h1>

---

# Hold-out data

Model performance must be evaluated as true prediction on an <high>unseen data set</high>.

The unseen data set can be <high>naturally</high> occurring, e.g., using 2019 stock prizes to evaluate a model fit using 2018 stock prizes.

More commonly unseen data is created by <high>splitting the available data</high> into a training set and a test set.

]

]

---

# Training

Training a model means to <high>fit the model</high> to data by finding the parameter combination that <high>minizes some error function</high>, e.g., mean squared error (MSE).

---

# Test

To test a model means to <high>evaluate the prediction error</high> for a fitted model, i.e., for a <high>fixed parameter combination</high>.

---

# Overfitting

Occurs when a model <high>fits data too closely</high> and therefore <high>fails to reliably predict</high> future observations.

In other words, overfitting occurs when a model <high>'mistakes' random noise for a predictable signal</high>.

More <high>complex models</high> are more <high>prone to overfitting</high>.

]

]

---

# Overfitting

---

<high><h1>Regression</h1></high>

<h1>Decision Trees</h1>

<h1>Random Forests</h1>

---

# Regularized regression

Penalizes regression loss for having large `$\beta$` values using the <high>lambda &lambda; tuning parameter</high> and one of several penalty functions.

$$Regularized \;loss = \sum_i^n (y_i-\hat{y}_i)^2+\lambda \sum_j^p f(\beta_j)) $$
<table style="cellspacing:0; cellpadding:0; border:none;">
<tr>
 <td bgcolor="white">
 Name
 </td>
 <td bgcolor="white">
 Function
 </td> 
 <td bgcolor="white">
 Description
 </td> 
</tr>
<tr>
 <td bgcolor="white">
 Lasso
 </td>
 <td bgcolor="white">
 |&beta;j|
 </td> 
 <td bgcolor="white">
 Penalize by the <high>absolute</high> regression weights.
 </td> 
</tr>
<tr>
 <td bgcolor="white">
 Ridge 
 </td>
 <td bgcolor="white">
 &beta;j2
 </td> 
 <td bgcolor="white">
 Penalize by the <high>squared</high> regression weights.
 </td> 
</tr>
<tr>
 <td bgcolor="white">
 Elastic net
 </td>
 <td bgcolor="white">
 |&beta;j| + &beta;j2
 </td> 
 <td bgcolor="white">
 Penalize by Lasso and Ridge penalties.
 </td> 
</tr>
</table>

]

<img src="image/bonsai.png"> 
from <a href="https://www.mallorcazeitung.es/leben/2018/05/02/bonsai-liebhaber-mallorca-kunst-lebenden/59437.html">mallorcazeitung.es</a>

]

---

# Regularized regression

Despite <high>superficial similarities</high>, Lasso and Ridge show very different behavior.

Ridge

By penalizing the most extreme &beta;s most strongly, Ridge leads to (relatively) more <high>uniform &beta;s</high>.

Lasso

By penalizing all &beta;s equally, irrespective of magnitude, Lasso drives some &beta;s to 0 resulting effectively in <high>automatic feature selection</high>.

]

Ridge 
 <img src="image/ridge.png" height=210px> 
 from <a href="https://www-bcf.usc.edu/~gareth/ISL/ISLR%20First%20Printing.pdf">James et al. (2013) ISLR</a>

Lasso 
 <img src="image/lasso.png" height=210px> 
 from <a href="https://www-bcf.usc.edu/~gareth/ISL/ISLR%20First%20Printing.pdf">James et al. (2013) ISLR</a>

]

---

<h1>Regression</h1>

<high><h1>Decision Trees</h1></high>

<h1>Random Forests</h1>

---

# CART

CART is short for <high>Classification and Regression Trees</high>, which are often just called <high>Decision trees</high>.

In [decision trees](https://en.wikipedia.org/wiki/Decision_tree), the criterion is modeled as a <high>sequence of logical TRUE or FALSE questions</high>.

]

<img src="image/tree.png">

]

---

# Classificiation trees

Classification trees (and regression trees) are created using a relatively simple <high>three-step algorithm</high>.

Algorithm

1 - <high>Split</high> nodes to maximize purity gain (e.g., Gini gain).

2 - <high>Repeat</high> until pre-defined threshold (e.g., `minsplit`) splits are no longer possible.

3 - <high>Prune</high> tree to reasonable size.

]

<img src="image/tree.png">

]

---

# Node splitting

Classification trees attempt to <high>minize node impurity</high> using, e.g., the <high>Gini coefficient</high>.

`$$\large Gini(S) = 1 - \sum_j^kp_j^2$$`

Nodes are <high>split</high> using the variable and split value that <high>maximizes Gini gain</high>.

`$$Gini \; gain = Gini(S) - Gini(A,S)$$`

with

`$$Gini(A, S) = \sum \frac{n_i}{n}Gini(S_i)$$`

]

]

---

# Pruning trees

Classification trees are <high>pruned</high> back such that every split has a purity gain of at least <high><mono>cp</mono></high>, with `cp` typically set to `.01`.

Minimize:

$$
\large
`\begin{split}
Loss = & Impurity\,+\\
&cp*(n\:terminal\:nodes)\\
\end{split}`
$$

]

]

---

# Pruning trees

Classification trees are <high>pruned</high> back such that every split has a purity gain of at least <high><mono>cp</mono></high>, with `cp` typically set to `.01`.

Minimize:

$$
\large
`\begin{split}
Loss = & Impurity\,+\\
&cp*(n\:terminal\:nodes)\\
\end{split}`
$$

]

]

---

# Regression trees

Trees can also be used to perform regression tasks. Instead of impurity, regression trees attempt to <high>minimize within-node variance</high> (or maximize node homogeneity):

`$$\large SSE = \sum_{i \in S_1}(y_i - \bar{y}_1)^2+\sum_{i \in S_2}(y_i - \bar{y}_2)^2$$`
Algorithm

1 - <high>Split</high> nodes to maximize homogeneity gain.

2 - <high>Repeat</high> until pre-defined threshold (e.g., `minsplit`) splits are no longe possible.

3 - <high>Prune</high> tree to reasonable size.

]

]

---
class: center, middle

<h1>Regression</h1>

<h1>Decision Trees</h1>

<high><h1>Random Forests</h1></high>

---

# Random Forest

In [Random Forest](https://en.wikipedia.org/wiki/Random_forest), the criterion is modeled as the <high>aggregate prediction of a large number of decision trees</high> each based on different features.

Algorithm

1 - <high>Repeat</high> *n* times

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;1 - <high>Resample</high> data

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;2 - <high>Grow</high> non-pruned decision tree

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Each split <high>consider only m features</high>

2 - <high>Average</high> fitted values

]

]

---

# Random Forest

Random forests make use of important machine learning elements, <high>resampling</high> and <high>averaging</high> that together are also referred to as <high>bagging</high>.

<table style="cellspacing:0; cellpadding:0; border:none;">
 <col width="30%">
 <col width="70%">
<tr>
 <td bgcolor="white">
 Element
 </td>
 <td bgcolor="white">
 Description
 </td> 
</tr>
<tr>
 <td bgcolor="white">
 Resampling
 </td>
 <td bgcolor="white">
 Creates new data sets that vary in their composition thereby <high>deemphasizing idiosyncracies</high> of the available data. 
 </td> 
</tr>
<tr>
 <td bgcolor="white">
 Averaging
 </td>
 <td bgcolor="white">
 Combining predictions typically <high>evens out idiosyncracies</high> of the models created from single data sets. 
 </td> 
</tr>
</table>
]

]

---

# Random Forest

Random forests make use of important machine learning elements, <high>resampling</high> and <high>averaging</high> that together are also referred to as <high>bagging</high>.

]

---

<h1>Fitting</h1>

<h1>Evaluation</h1>

<a><h1>Tuning</h1></a>

---

# Tuning

All machine learning models are equipped with tuning parameters that <high> control model complexity<high>.

These tuning parameters can be identified using a <high>validation set</high> created from the traning data.

Logic

1 - Create separate test set.

2 - Fit model using various tuning parameters.

3 - Select tuning leading to best prediction on
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;validation set.

4 - Refit model to entire training set (training + 
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;validation).

]

]

---

# Resampling methods

Resampling methods automatize and generalize model tuning.

<table style="cellspacing:0; cellpadding:0; border:none;">
 <col width="30%">
 <col width="70%">
<tr>
 <td bgcolor="white">
 Method
 </td>
 <td bgcolor="white">
 Description
 </td> 
</tr>
<tr>
 <td bgcolor="white">
 k-fold cross-validation
 </td>
 <td bgcolor="white">
 Splits the data in k-pieces, use <high>each piece once</high> as the validation set, while using the other one for training. 
 </td> 
</tr>
<tr>
 <td bgcolor="white">
 Bootstrap
 </td>
 <td bgcolor="white">
 For B bootstrap rounds <high>sample</high> from the data <high>with replacement</high> and split the data in training and validation set. 
 </td> 
</tr>
</table>
]

]

---

# Resampling methods

Resampling methods automatize and generalize model tuning.

]

---

# Resampling methods

Resampling methods automatize and generalize model tuning.

]

---

# k-fold cross validation for Ridge and Lasso

Goal

Use 10-fold cross-validation to identify <high>optimal regularization parameters</high> for a regression model.

Consider

`$\alpha \in 0, .5, 1$` and `$\lambda \in 1, 2, ..., 100$`

]

]

---

<h1><a href=https://cdsbasel.github.io/dataanalytics/menu/materials.html>Materials</a></h1>