# ML with R
### <a href='https://therbootcamp.github.io'> Data analytics for Psychology and Business </a> <a href='https://cdsbasel.github.io/dataanalytics_2021/menu/materials.html'> </a>  <a href='https://cdsbasel.github.io/dataanalytics_2021/'> </a>  <a href='mailto:dirk.wulff@unibas.ch'> </a> 
### April 2021

---

<div class="my-footer">
 
 
 <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/by-sa.png" height=14 style="vertical-align: middle"/>
 
 <a href="https://cdsbasel.github.io/dataanalytics_2021">
 
 
 https://cdsbasel.github.io/dataanalytics_2021
 
 
 </a>
 <a href="https://therbootcamp.github.io/">
 
 Data analytics for Psychology and Business | April 2021
 
 </a>
 
 </div>

---

# Fitting

<ul>
<li class="m1">Models are actually <high>families of models</high>, with every parameter combination specifying a different model.</li>
<li class="m2">To fit a model means to <high>identify</high> from the family of models <high>the specific model that fits the data best</high>.</li>
</ul>

]

<img src="image/curvefits.png" height=480px> 
adapted from <a href="https://www.explainxkcd.com/wiki/index.php/2048:_Curve-Fitting">explainxkcd.com</a>

]

---

# Which is better?

---

# Loss function

<ul>
<li class="m1">Possibly <high>the most important concept</high> in statistics and machine learning.</li>
<li class="m2">The loss function defines some <high>summary of the errors committed by the model</high>.</li>
</ul>

`$$\Large Loss = f(Error)$$`

Two purposes

<table style="cellspacing:0; cellpadding:0; border:none;">
<tr>
 <td>
 Purpose
 </td>
 <td>
 Description
 </td>
</tr>
<tr>
 <td bgcolor="white">
 Fitting
 </td>
 <td bgcolor="white">
 Find the parameters that minimize the loss function.
 </td>
</tr>
<tr>
 <td>
 Evaluation
 </td>
 <td>
 Calculate loss function for fitted model.
 </td>
</tr>
</table>
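
Both purposes can be illustrated with a minimal base-R sketch. The simulated data, the helper name `mse_loss`, and the use of `optim()` with BFGS are assumptions chosen for illustration, not part of the course material.

```r
# Simulated data (assumption: a linear relationship plus noise)
set.seed(42)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)

# Loss = f(Error): here the mean of the squared errors
mse_loss <- function(beta) {
  error <- y - (beta[1] + beta[2] * x)
  mean(error^2)
}

# Purpose 1 - Fitting: search for the parameters that minimize the loss
fit <- optim(par = c(0, 0), fn = mse_loss, method = "BFGS")
beta_hat <- fit$par

# Purpose 2 - Evaluation: compute the loss of the fitted model
loss_fitted <- mse_loss(beta_hat)
```

The same function is used twice: once as the target of the search, once as the score of the result.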

]

]

---

# 2 types of supervised problems

<ul>
 <li class="m1">Regression</li>
 
 <ul class="level">
 <li>Prediction of a <high>quantitative criterion</high>.</li> 
 <li>Example: predict cholesterol level from age.</li>
 </ul> 
 <li class="m2">Classification</li>
 
 <ul class="level">
 <li>Prediction of a <high>categorical criterion</high>.</li> 
 <li>Example: predict heart attack, yes or no.</li>
 </ul> 
</ul>

]

]

---

<high><h1>Regression</h1></high>

<h1>Decision Trees</h1>

<h1>Random Forests</h1>

---

# Regression

In [regression](https://en.wikipedia.org/wiki/Regression_analysis), the criterion `$Y$` is modeled as the <high>sum</high> of <high>features</high> `$X_1, X_2, ...$` <high>times weights</high> `$\beta_1, \beta_2, ...$` plus `$\beta_0$`, the so-called intercept.

`$$\large \hat{Y} =  \beta_{0} + \beta_{1} \times X_1 + \beta_{2} \times X_2 + ...$$`

The weight `$\beta_{i}$` indicates the <high>amount of change</high> in `$\hat{Y}$` for a change of 1 in `$X_{i}$`.

Ceteris paribus, the <high>more extreme</high> `$\beta_{i}$`, the <high>more important</high> `$X_{i}$` for the prediction of `$Y$` (Note: the scale of `$X_{i}$` matters too!).

If `$\beta_{i} = 0$`, then `$X_{i}$` <high>does not help</high> predict `$Y$`.
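
The weighted-sum reading of the regression equation can be checked by hand. This is a minimal sketch using base R's `lm()` and the built-in `mtcars` data. The choice of `mpg` as criterion and of `wt` and `hp` as features is an assumption for illustration only.

```r
# mtcars ships with R; mpg is the criterion, wt and hp are the features
mod  <- lm(mpg ~ wt + hp, data = mtcars)
beta <- coef(mod)  # beta_0 (intercept), beta_1 (wt), beta_2 (hp)

# Prediction as intercept plus the weighted sum of the features
y_hat_manual <- beta["(Intercept)"] +
  beta["wt"] * mtcars$wt +
  beta["hp"] * mtcars$hp

# The built-in prediction gives the same values
y_hat_predict <- predict(mod)
all.equal(unname(y_hat_manual), unname(y_hat_predict))
```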

]

]

---

# Regression loss

<ul style="margin-bottom:-20px">
 <li class="m1">Mean Squared Error (MSE)
 
 <ul class="level">
 <li><high>Average squared distance</high> between predictions and true values.</li>
 </ul>
 </li>
</ul>

$$ MSE = \frac{1}{n}\sum_{i \in 1,...,n}(Y_{i} - \hat{Y}_{i})^{2}$$

<ul>
 <li class="m2">Mean Absolute Error (MAE)
 
 <ul class="level">
 <li><high>Average absolute distance</high> between predictions and true values.</li>
 </ul>
 </li>
</ul>

$$ MAE = \frac{1}{n}\sum_{i \in 1,...,n} \lvert Y_{i} - \hat{Y}_{i} \rvert$$
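
Both losses take one line each in base R. The values of `y` and `y_hat` below are made up for illustration.

```r
# Hypothetical true values and predictions
y     <- c(3, 5, 2, 8)
y_hat <- c(2, 5, 4, 7)

mse <- mean((y - y_hat)^2)   # average squared distance: 1.5
mae <- mean(abs(y - y_hat))  # average absolute distance: 1
```

Because errors are squared, the single error of size 2 contributes four times as much to the MSE as each error of size 1, whereas it contributes only twice as much to the MAE.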

]

]

---

# Fitting

<ul style="margin-bottom:-20px">
 <li class="m1">Analytically
 
 <ul class="level">
 <li>In rare cases, the parameters can be <high>directly calculated</high>, e.g., using the normal equation</li>
 </ul>
 </li>
</ul>

`$$\boldsymbol \beta = (\boldsymbol X^T\boldsymbol X)^{-1}\boldsymbol X^T\boldsymbol y$$`

<ul>
 <li class="m2">Numerically
 
 <ul class="level">
 <li>In most cases, parameters need to be found using a <high>directed trial and error</high>, e.g., gradient descent:</li>
 </ul>
 </li>
</ul>

`$$\boldsymbol \beta_{n+1} = \boldsymbol \beta_{n}-\gamma \nabla F(\boldsymbol \beta_{n})$$`
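
A minimal base-R sketch comparing the two routes for a linear regression with MSE loss. The simulated data, the step size `gamma = 0.1`, and the 2000 iterations are assumptions chosen so that the example converges. Each update moves the parameters against the gradient of the loss.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 3 * x1 - 1 * x2 + rnorm(n, sd = 0.5)
X  <- cbind(1, x1, x2)  # design matrix with an intercept column

# Analytically: normal equation
beta_ne <- solve(t(X) %*% X, t(X) %*% y)

# Numerically: gradient descent on the MSE
beta_gd <- matrix(0, nrow = 3, ncol = 1)  # start at zero
gamma   <- 0.1                            # step size (assumed)
for (i in 1:2000) {
  gradient <- -2 / n * t(X) %*% (y - X %*% beta_gd)
  beta_gd  <- beta_gd - gamma * gradient  # step against the gradient
}

# Largest difference between the two solutions
max(abs(beta_gd - beta_ne))
```

With a step size that is too large the updates overshoot and diverge. With one that is too small, many more iterations are needed.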

]

<img src="image/gradient.png" height=420px> 
adapted from <a href="https://me.me/i/machine-learning-gradient-descent-machine-learning-machine-learning-behind-the-ea8fe9fc64054eda89232d7ffc9ba60e">me.me</a>

]

---

# Fitting

<ul style="margin-bottom:-20px">
 <li class="m1">Analytically
 
 <ul class="level">
 <li>In rare cases, the parameters can be <high>directly calculated</high>, e.g., using the normal equation</li>
 </ul>
 </li>
</ul>

`$$\boldsymbol \beta = (\boldsymbol X^T\boldsymbol X)^{-1}\boldsymbol X^T\boldsymbol y$$`

`$$\boldsymbol \beta_{n+1} = \boldsymbol \beta_{n}-\gamma \nabla F(\boldsymbol \beta_{n})$$`

]

<br2>

<img src="image/gradient1.gif" height=250px> 
adapted from <a href="https://dunglai.github.io/2017/12/21/gradient-descent/">dunglai.github.io</a> 
<img src="image/gradient2.gif" height=250px> 
adapted from <a href="https://dunglai.github.io/2017/12/21/gradient-descent/">dunglai.github.io</a>

]

---
class: center, middle

<h1>Regression</h1>

<high><h1>Decision Trees</h1></high>

<h1>Random Forests</h1>

---

# CART

<ul>
 <li class="m1">CART is short for <high>Classification and Regression Trees</high>, which are often simply called Decision trees.</li> 
 <li class="m2">The criterion is modeled as a sequence of <high>logical TRUE or FALSE questions</high>.</li>
</ul>
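
A decision tree can be read as nested `if`/`else` statements. The function below is a hand-written, hypothetical tree. The features, thresholds, and predicted labels are invented for illustration and are not fitted to any data.

```r
# Hypothetical tree: every node asks one TRUE or FALSE question
predict_risk <- function(age, smoker) {
  if (age < 50) {                      # Question 1: is age below 50?
    if (smoker) "medium" else "low"    # Question 2: smoker?
  } else {
    if (smoker) "high" else "medium"   # Question 2: smoker?
  }
}

predict_risk(age = 62, smoker = TRUE)
```

Fitting a tree amounts to choosing which questions to ask and in which order.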

]

<img src="image/tree.png">

]

---

# Regression trees

<ul>
 <li class="m1">Classification and regression trees are created using a relatively simple <high>three-step algorithm</high>:</li> 
 <ul>
 <li>1 - <high>Split</high> nodes to maximize homogeneity (regression) or purity (classification) within nodes.</li> 
 <li>2 - <high>Repeat</high> until splits are no longer possible.</li> 
 <li>3 - <high>Prune</high> tree to reasonable size.</li>
 </ul>
</ul>

]

<img src="image/tree.png">

]

---

# Homogeneity

<ul>
 <li class="m1">Regression trees attempt to maximize homogeneity, which in turn means to <high>minimize within-node variance</high>.</li>
</ul>

`$$\large SSE = \sum_{i \in S_1}(y_i - \bar{y}_1)^2+\sum_{i \in S_2}(y_i - \bar{y}_2)^2$$`
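
A minimal base-R sketch of how a single split could be chosen with this criterion. The helper `split_sse` and the toy data are assumptions for illustration. Real implementations search over all features, not just one.

```r
# SSE of a split at `threshold` on a single feature x
split_sse <- function(x, y, threshold) {
  left  <- y[x <  threshold]   # cases in S_1
  right <- y[x >= threshold]   # cases in S_2
  sum((left - mean(left))^2) + sum((right - mean(right))^2)
}

# Toy data (assumption): y jumps between x = 5 and x = 6
x <- 1:10
y <- c(1, 1, 1, 1, 1, 5, 5, 5, 5, 5)

# Candidate thresholds: midpoints between neighbouring x values
candidates <- (head(x, -1) + tail(x, -1)) / 2
sse        <- sapply(candidates, split_sse, x = x, y = y)

# The split with the smallest SSE yields the most homogeneous nodes
best_threshold <- candidates[which.min(sse)]
```

Here the best threshold is 5.5, where both nodes contain identical values and the SSE is 0.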

]

]

---

# Pruning trees

<ul>
 <li class="m1">Regression trees are <high>pruned</high> back such that every split has a homogeneity gain of at least <high><mono>cp</mono></high>.</li>
</ul>

$$
\large
`\begin{split}
Loss = & Homogeneity\,+\\
&cp*(n\:terminal\:nodes)\\
\end{split}`
$$
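
The trade-off in this loss can be made concrete with a small sketch. The candidate trees and their SSE values are invented, and within-node SSE stands in for the homogeneity term. Note that the <mono>rpart</mono> package expresses <mono>cp</mono> relative to the error of the root node, so the absolute numbers here are purely illustrative.

```r
# Hypothetical candidate trees: number of terminal nodes and within-node SSE
trees <- data.frame(n_terminal = c(1, 2, 3, 4),
                    sse        = c(100, 40, 30, 28))

# Penalized loss: fit term plus cp times the number of terminal nodes
penalized_loss <- function(sse, n_terminal, cp) sse + cp * n_terminal

# Small penalty: the tree with 3 terminal nodes has the lowest loss
loss_small <- penalized_loss(trees$sse, trees$n_terminal, cp = 5)
trees$n_terminal[which.min(loss_small)]

# Larger penalty: the tree with 2 terminal nodes has the lowest loss
loss_large <- penalized_loss(trees$sse, trees$n_terminal, cp = 15)
trees$n_terminal[which.min(loss_large)]
```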

]

]

---
class: center, middle

<h1>Regression</h1>

<h1>Decision Trees</h1>

<high><h1>Random Forests</h1></high>

---

# Random Forest

<ul>
 <li class="m1">In Random Forests the criterion is modeled as the <high>aggregate prediction</high> of <high>many decision trees</high>, each based on different features.</li> 
 <li class="m2">Algorithm:</li>
 <ul>
 <li>1 - <high>Repeat</high> <mono>n</mono> times.</li> 
 <ul>
 <li>1 - <high>Resample</high> data.</li> 
 <li>2 - At each split, <high>consider</high> only <high>m</high> randomly chosen <high>features</high>.</li> 
 </ul>
 <li>2 - <high>Average</high> predictions.</li> 
 </ul>
</ul>

]

]

---

# Random Forest

<ul>
 <li class="m1">Random forests make use of <high>bagging</high>, which consists of <high>resampling</high> and <high>averaging</high>.</li>
</ul>

<table style="cellspacing:0; cellpadding:0; border:none;">
 <col width="30%">
 <col width="70%">
<tr>
 <td bgcolor="white">
 Element
 </td>
 <td bgcolor="white">
 Description
 </td> 
</tr>
<tr>
 <td bgcolor="white">
 Resampling
 </td>
 <td bgcolor="white">
 Creates new data sets that vary in their composition, thereby <high>deemphasizing idiosyncrasies</high> of the available data. 
 </td> 
</tr>
<tr>
 <td bgcolor="white">
 Averaging
 </td>
 <td bgcolor="white">
 Combining predictions typically <high>evens out idiosyncrasies</high> of the models created from single data sets. 
 </td> 
</tr>
</table>
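
Resampling and averaging can be sketched in a few lines of base R. To stay self-contained, the sketch uses `lm()` on the built-in `mtcars` data as a stand-in for a decision tree and omits the random selection of features at each split, so it illustrates bagging only, not a full random forest. The number of models is an arbitrary assumption.

```r
set.seed(123)
n_models <- 100  # number of resampled data sets (assumed)

# One column of predictions per resampled data set
preds <- sapply(seq_len(n_models), function(i) {
  rows <- sample(nrow(mtcars), replace = TRUE)      # Resampling: bootstrap rows
  mod  <- lm(mpg ~ wt + hp, data = mtcars[rows, ])  # fit on the resampled data
  predict(mod, newdata = mtcars)                    # predict the original cases
})

# Averaging: combine the predictions across all models
bagged_pred <- rowMeans(preds)
```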
]

]

---

<h1>Fitting in <mono>caret</mono></h1>

---

# Key functions

<table style="cellspacing:0; cellpadding:0; border:none;">
<tr>
 <td>
 Function
 </td>
 <td>
 Description
 </td>
</tr>
<tr>
 <td bgcolor="white">
 <mono>trainControl()</mono>
 </td>
 <td bgcolor="white">
 Choose settings for how fitting should be carried out.
 </td>
</tr>
<tr>
 <td>
 <mono>train()</mono>
 </td>
 <td>
 Specify the model and find best parameters.
 </td>
</tr>
<tr>
 <td bgcolor="white">
 <mono>postResample()</mono>
 </td>
 <td bgcolor="white">
 Evaluate model performance (fitting or prediction) for regression.
 </td>
</tr>
<tr>
 <td>
 <mono>confusionMatrix()</mono>
 </td>
 <td>
 Evaluate model performance (fitting or prediction) for classification.
 </td>
</tr>
</table>

]

```r
# Step 1: Define control parameters
#   trainControl()

ctrl <- trainControl(...)

# Step 2: Train and explore model
#   train()

mod <- train(...)
summary(mod)
mod$finalModel # see final model

# Step 3: Assess fit
#   predict(), postResample(), confusionMatrix()

fit <- predict(mod)
postResample(fit, truth)    # regression
confusionMatrix(fit, truth) # classification
```

]

---

# `trainControl()`

<ul>
 <li class="m1">Controls how <mono>caret</mono> fits an ML model.</li> 
 <li class="m2">Until the Optimization session, we use <highm>method = "none"</highm>.</li>
</ul>

```r
# Fit the model without any 
#  advanced parameter tuning methods

ctrl <- trainControl(method = "none")

# show help

?trainControl
```

]

]

---

# `train()`

<ul>
 <li class="m1"><high>Workhorse</high> for fitting models, offering <high>200+ models</high> via the <high>method</high> argument!</li>
</ul>

<table style="cellspacing:0; cellpadding:0; border:none;">
<tr>
 <td>
 Argument
 </td>
 <td>
 Description
 </td>
</tr>
<tr>
 <td bgcolor="white">
 <mono>form</mono>
 </td>
 <td bgcolor="white">
 Formula specifying features and criterion.
 </td>
</tr>
<tr>
 <td>
 <mono>data</mono>
 </td>
 <td>
 Training data.
 </td>
</tr>
<tr>
 <td bgcolor="white">
 <mono>method</mono>
 </td>
 <td bgcolor="white">
 The model (algorithm). 
 </td>
</tr>
<tr>
 <td>
 <mono>trControl</mono>
 </td>
 <td>
 Control parameters for fitting.
 </td>
</tr>
<tr>
 <td bgcolor="white">
 <mono>tuneGrid</mono>, <mono>preProcess</mono>
 </td>
 <td bgcolor="white">
 Cool stuff for later.
 </td>
</tr>
</table>

]

```r
# Fit a regression model predicting income

income_mod <- 
 train(form = income ~ ., # Formula
 data = basel, # Training data
 method = "glm", # Regression
 trControl = ctrl) # Control Param's
income_mod
```

```
Generalized Linear Model

1000 samples
  19 predictor

No pre-processing
Resampling: None 
```

]

---

# `train()`

<ul>
 <li class="m1"><high>Workhorse</high> for fitting models, offering <high>200+ models</high> via the <high>method</high> argument!</li>
</ul>

]

```r
# Fit a decision tree predicting income

income_mod <- 
 train(form = income ~ .,# Formula
 data = basel, # Training data
 method = "rpart", # Decision tree
 trControl = ctrl) # Control Param's
income_mod
```

```
CART

1000 samples
  19 predictor

No pre-processing
Resampling: None 
```

]

---

# `train()`

<ul>
 <li class="m1"><high>Workhorse</high> for fitting models, offering <high>200+ models</high> via the <high>method</high> argument!</li>
</ul>

]

```r
# Fit a random forest predicting income

income_mod <- 
 train(form = income ~ .,# Formula
 data = basel, # Training data
 method = "rf", # Random Forest
 trControl = ctrl) # Control Param's
income_mod
```

```
Random Forest

1000 samples
  19 predictor

No pre-processing
Resampling: None 
```

]

---

# `train()`

<ul>
 <li class="m1"><high>Workhorse</high> for fitting models, offering <high>200+ models</high> via the <high>method</high> argument!</li>
 <li class="m2">Find all 200+ models <a href="http://topepo.github.io/caret/available-models.html">here</a>.</li>
</ul>

]

]

---

# `train()`

<ul style="margin-bottom:-20px">
 <li class="m1">The criterion must be the right type:
 
 <ul class="level">
 <li><high><mono>numeric</mono></high> criterion &rarr; <high>Regression</high> </li>
 <li><high><mono>factor</mono></high> criterion &rarr; <high>Classification</high> </li>
 </ul>
 </li>
</ul>

]

```r
# Will be a regression task
#  (assuming Default is stored as numeric)

loan_mod <- train(form = Default ~ .,
 data = Loans,
 method = "glm",
 trControl = ctrl)

# Will be a classification task
#  (criterion converted to factor beforehand)

Loans$Default <- factor(Loans$Default)
loan_mod <- train(form = Default ~ .,
 data = Loans,
 method = "glm",
 trControl = ctrl)
```
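
Whether a criterion is numeric or a factor can be checked and changed in base R before calling <mono>train()</mono>. The vector below is a hypothetical stand-in for a 0/1 column such as `Default`. The labels `"no"` and `"yes"` are assumptions.

```r
# Hypothetical 0/1 criterion
default_num <- c(0, 1, 0, 0, 1)
is.numeric(default_num)  # TRUE: would be treated as a regression criterion

# Converting to a factor marks the criterion as categorical
default_fct <- factor(default_num, levels = c(0, 1), labels = c("no", "yes"))
is.factor(default_fct)   # TRUE: would be treated as a classification criterion
levels(default_fct)
```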

]

---

# <mono>.$finalModel</mono>

<ul>
 <li class="m1"><mono>train()</mono> returns a <mono>list</mono> with an element called <mono>finalModel</mono> containing the best fitting model.</li>
 <li class="m2"><mono>.$finalModel</mono> <high>extracts the model</high>, which can then be explored using:</li>
</ul>

<table style="cellspacing:0; cellpadding:0; border:none;">
<tr>
 <td>
 Function
 </td>
 <td>
 Description
 </td>
</tr>
<tr>
 <td bgcolor="white">
 <mono>summary()</mono>
 </td>
 <td bgcolor="white">
 Overview of the most important results.
 </td>
</tr>
<tr>
 <td bgcolor="white">
 <mono>names()</mono>
 </td>
 <td bgcolor="white">
 See all named elements you can access with $.
 </td>
</tr>
</table>

]

```r
# Fit regression model
eink_mod <-
 train(form = income ~ age + height,
 data = basel, # Data
 method = "glm", # Regression
 trControl = ctrl) # Train control

# Show names of final model
names(eink_mod$finalModel)
```

```
[1] "coefficients"  "residuals"     "fitted.values"
[4] "effects"       "R"             "rank"         
 [ reached getOption("max.print") -- omitted 28 entries ]
```

]

---

# <mono>.$finalModel</mono>

]

```r
# Fit regression model
eink_mod <-
 train(form = income ~ age + height,
 data = basel, # Data
 method = "glm", # Regression
 trControl = ctrl) # Train control

# Show coefficients
eink_mod$finalModel$coefficients
```

```
(Intercept)         age      height 
    177.084     151.786       3.466 
```

]

---

# <mono>.$finalModel</mono>

]

```r
# Show model output
summary(eink_mod)
```

```

Call:
NULL

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
 -3567    -773      30     793    4186

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   177.08     498.81    0.36     0.72    
 [ reached getOption("max.print") -- omitted 2 rows ]
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 1380255)

Null deviance: 7687291073  on 999  degrees of freedom
Residual deviance: 1376114147  on 997  degrees of freedom
AIC: 16981

Number of Fisher Scoring iterations: 2
```

]

---

# `predict()`

<ul>
 <li class="m1"><high>Produces predictions</high> from a model. Simply pass the model object as the first argument.</li>
</ul>

```r
# Extract fitted values
glm_fits <- predict(object = eink_mod)
glm_fits[1:8]
```

```
    1     2     3     4     5     6     7     8 
 5508  6960  6982  8645  5325 10648  8663  4592 
```

]

]

---

# `postResample()`

<ul>
 <li class="m1"><high>Gives a summary</high> of a model's performance in a <high>regression task</high>.</li>
 <li class="m2">Specify the predicted values and the true values inside the function.</li>
</ul>

```r
# evaluate
postResample(glm_fits,
             basel$income)
```

```
    RMSE Rsquared      MAE 
1173.079    0.821  937.113 
```
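
The three numbers can be reproduced by hand. The sketch below assumes that they are the root mean squared error, the squared correlation between predictions and true values, and the mean absolute error. `summarise_fit` is a hypothetical helper written for illustration, not a <mono>caret</mono> function, and the values are made up.

```r
# Hypothetical re-implementation of the three summary statistics
summarise_fit <- function(pred, obs) {
  c(RMSE     = sqrt(mean((pred - obs)^2)),  # root mean squared error
    Rsquared = cor(pred, obs)^2,            # squared correlation
    MAE      = mean(abs(pred - obs)))       # mean absolute error
}

# Made-up true values and predictions
obs  <- c(3, 5, 2, 8)
pred <- c(2, 5, 4, 7)

res <- summarise_fit(pred, obs)
res
```

RMSE and MAE are on the scale of the criterion, so lower is better. The squared correlation lies between 0 and 1, and higher is better.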

]

]

---

<h1><a href=https://therbootcamp.github.io/ML-DHLab/_sessions/Fitting/Fitting_practical.html>Practical</a></h1>