# Prediction
### Machine Learning with R <a href='https://therbootcamp.github.io'> The R Bootcamp @ DHLab </a>
### November 2020

---

<div class="my-footer">
 
 
 <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/by-sa.png" height=14 style="vertical-align: middle"/>
 
 <a href="https://therbootcamp.github.io/">
 
 
 www.therbootcamp.com
 
 
 </a>
 <a href="https://therbootcamp.github.io/">
 
 Machine Learning with R | November 2020
 
 </a>
 
 </div>

---

# Prediction is...

Prediction is very difficult, especially if it's about the future.
 
Niels Bohr, Nobel Laureate in Physics
 
An economist is an expert who will know tomorrow why the things he predicted yesterday didn't happen today.

Evan Esar, Humorist


<img src="image/bohr.jpg"> 
from <a href="https://futurism.com/know-your-scientist-niels-bohr-the-father-of-the-atom">futurism.com</a>


---

# Hold-out data

<ul>
 <li class="m1">Model performance must be evaluated as true prediction on an <high>unseen data set</high>.</li> 
 <li class="m2">The unseen data set can be <high>naturally</high> occurring.</li>
 <ul class="level">
 <li>e.g., using 2019 stock prices to evaluate a model fitted to 2018 stock prices</li>
 </ul> 
 <li class="m3">More commonly, unseen data is created by <high>splitting the available data</high> into a training set and a test set.</li>
</ul>
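
A random split can be sketched in base R as follows. The data frame `dat` and its columns are simulated purely for illustration and are not part of the course data; `caret` offers `createDataPartition()` for the same purpose, which additionally balances the split on the criterion.

```r
# Hypothetical data: 100 cases with one feature and a numeric criterion
set.seed(100)
dat <- data.frame(age = round(runif(100, 20, 70)))
dat$income <- 2000 + 50 * dat$age + rnorm(100, sd = 500)

# Randomly draw 80% of the row indices for the training set
train_index <- sample(nrow(dat), size = round(0.8 * nrow(dat)))

# Training data: the sampled rows
dat_train <- dat[train_index, ]

# Test data: all remaining rows, never used for fitting
dat_test <- dat[-train_index, ]

# Check the sizes of both sets
nrow(dat_train)
nrow(dat_test)
```

Because the test rows are removed before any model is fit, performance on `dat_test` reflects prediction rather than fitting.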


---

# Training

---

# Test

---

# Overfitting

<ul>
 <li class="m1">Occurs when a model <high>fits data too closely</high> and therefore fails to reliably predict future observations.</li> 
 <li class="m2">Overfitting occurs when a model 'mistakes' random <high>noise</high> for a predictable <high>signal</high>.</li> 
 <li class="m3">More <high>complex models</high> are more prone to overfitting.</li>
</ul>
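
The points above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the data are simulated from a linear relationship plus noise, and the names `sim`, `simple_mod`, `complex_mod`, and `rmse()` are made up for this example. A straight line is compared with a 10th-degree polynomial on training and test data.

```r
# Simulate data where the true signal is linear and the rest is noise
set.seed(1)
n <- 60
sim <- data.frame(x = runif(n, 0, 1))
sim$y <- 2 + 3 * sim$x + rnorm(n, sd = 1)

# Split into 30 training and 30 test cases
train_idx <- sample(n, size = 30)
sim_train <- sim[train_idx, ]
sim_test <- sim[-train_idx, ]

# Helper: root mean squared error
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# A simple model and a much more complex model
simple_mod <- lm(y ~ x, data = sim_train)
complex_mod <- lm(y ~ poly(x, 10), data = sim_train)

# Compare fitting error (training) with prediction error (test)
errors <- data.frame(
  model = c("simple", "complex"),
  train_rmse = c(rmse(sim_train$y, fitted(simple_mod)),
                 rmse(sim_train$y, fitted(complex_mod))),
  test_rmse = c(rmse(sim_test$y, predict(simple_mod, newdata = sim_test)),
                rmse(sim_test$y, predict(complex_mod, newdata = sim_test)))
)
errors
```

The complex model cannot have a larger training error than the simple one, because it contains the straight line as a special case. Whether its test error is larger depends on the sample, but it typically is when the extra flexibility is only fitting noise.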


---

# Overfitting

---

<h1><a>Evaluating model predictions with <mono>caret</mono></a></h1>

<!---

# <mono>createDataPartition()</mono>

Use `createDataPartition()` to <high>split a dataset</high> into separate training and test datasets.

<table style="cellspacing:0; cellpadding:0; border:none;">
 <col width="30%">
 <col width="70%">
<tr>
 <td bgcolor="white">
 Argument
 </td>
 <td bgcolor="white">
 Description
 </td> 
</tr>
<tr>
 <td bgcolor="white">
 <mono>y</mono>
 </td>
 <td bgcolor="white">
 The criterion. Used to create a <high>balanced split</high>. 
 </td> 
</tr>
<tr>
 <td bgcolor="white">
 <mono>p</mono>
 </td>
 <td bgcolor="white">
 The <high>proportion of data</high> going into the training set. Often <mono>.8</mono> or <mono>.5</mono>. 
 </td> 
</tr>
</table>

]

```r
# Set the randomisation seed to get the 
#  same results each time
set.seed(100)

# Get indices for training
index <- 
 createDataPartition(y = basel$income,
 p = .8,
 list = FALSE)

# Create training data
basel_train <- basel %>% 
 slice(index)

# Create test data
basel_test <- basel %>% 
 slice(-index)
```

]

--->

---

# <mono>predict(, newdata)</mono>

<ul>
 <li class="m1">To <high>test model predictions</high>, you need to compute a vector of predictions from the test data (<mono>newdata</mono>) using the <mono>predict()</mono> function:</li>
</ul>

<table style="cellspacing:0; cellpadding:0; border:none;">
 <col width="30%">
 <col width="70%">
<tr>
 <td bgcolor="white">
 Argument
 </td>
 <td bgcolor="white">
 Description
 </td> 
</tr>
<tr>
 <td bgcolor="white">
 <mono>object</mono>
 </td>
 <td bgcolor="white">
 A <mono>caret</mono> fit object, as returned by <mono>train()</mono>. 
 </td> 
</tr>
<tr>
 <td bgcolor="white">
 <mono>newdata</mono>
 </td>
 <td bgcolor="white">
 Test data set. Must contain the same features as those used to fit <mono>object</mono>. 
 </td> 
</tr>
</table>


```r
# Load caret (provides train() and postResample())
library(caret)

# Fit model to training data
mod <- train(form = income ~ .,
             method = "glm",
             data = basel_train)

# Get fitted values (for training data)
mod_fit <- predict(mod)

# Get predictions for NEW data (the test data)
mod_pred <- predict(mod, 
                    newdata = basel_test)

# Evaluate prediction results
postResample(pred = mod_pred, 
             obs = basel_test$income)
```
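
For a numeric criterion, `postResample()` reports the RMSE, R-squared, and MAE. The following base-R sketch shows how such metrics can be computed by hand. The vectors `obs` and `pred` are hypothetical values, not taken from the `basel` data, and R-squared is assumed here to be the squared correlation between predictions and observations.

```r
# Hypothetical observed and predicted values (not from the basel data)
obs  <- c(10, 20, 30, 40)
pred <- c(12, 18, 33, 41)

# Compute the three metrics by hand
metrics <- c(
  RMSE     = sqrt(mean((pred - obs)^2)),  # root mean squared error
  Rsquared = cor(pred, obs)^2,            # squared correlation
  MAE      = mean(abs(pred - obs))        # mean absolute error
)
metrics
```

RMSE penalises large errors more strongly than MAE, so comparing the two gives a rough sense of whether a few large misses dominate the prediction error.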


---
class: middle, center

<h1><a href="https://therbootcamp.github.io/ML-DHLab/_sessions/Prediction/Prediction_practical.html">Practical</a></h1>