class: center, middle, inverse, title-slide

# Prediction
### Machine Learning with R
The R Bootcamp @ DHLab
### November 2020

---
layout: true

<div class="my-footer">
<span style="text-align:center">
<span>
<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/by-sa.png" height=14 style="vertical-align: middle"/>
</span>
<a href="https://therbootcamp.github.io/">
<span style="padding-left:82px">
<font color="#7E7E7E">
www.therbootcamp.com
</font>
</span>
</a>
<a href="https://therbootcamp.github.io/">
<font color="#7E7E7E">
Machine Learning with R | November 2020
</font>
</a>
</span>
</div>

---

# Prediction is...

.pull-left45[

<p>
<font style="font-size:32px"><i>Prediction is very difficult, especially if it's about the future.</i></font>
<br><br>
Niels Bohr, Nobel Laureate in Physics
<br><br>
<font style="font-size:32px"><i>An economist is an expert who will know tomorrow why the things he predicted yesterday didn't happen today.</i></font>
<br><br>
Evan Esar, Humorist
</p>

]

.pull-right45[

<p align = "center">
<img src="image/bohr.jpg"><br>
<font style="font-size:10px">from <a href="https://futurism.com/know-your-scientist-niels-bohr-the-father-of-the-atom">futurism.com</a></font>
</p>

]

---

# Hold-out data

.pull-left45[

<ul>
<li class="m1"><span>Model performance must be evaluated as true prediction on an <high>unseen data set</high>.</span></li><br>
<li class="m2"><span>The unseen data set can be <high>naturally</high> occurring.</span></li>
<ul class="level">
<li><span>e.g.
using 2019 stock prices to evaluate a model fitted to 2018 stock prices</span></li>
</ul><br>
<li class="m3"><span>More commonly, unseen data is created by <high>splitting the available data</high> into a training set and a test set.</span></li>
</ul>

]

.pull-right45[

<p align = "center">
<img src="image/testdata.png" height=430px>
</p>

]

---

# Training

<p align = "center" style="padding-top:30px;padding-left:40px">
<img src="image/training_flow.png" height=400px>
</p>

---

# Test

<p align = "center" style="padding-top:30px;padding-left:40px">
<img src="image/testing_flow.png" height=400px>
</p>

---

.pull-left4[

<br><br>

# Overfitting

<ul>
<li class="m1"><span>Occurs when a model <high>fits data too closely</high> and therefore fails to reliably predict future observations.</span></li><br>
<li class="m2"><span>Overfitting occurs when a model 'mistakes' random <high>noise</high> for a predictable <high>signal</high>.</span></li><br>
<li class="m3"><span>More <high>complex models</high> are more prone to overfitting.</span></li>
</ul>

]

.pull-right5[

<br><br><br>

<p align = "center" style="padding-top:0px">
<img src="image/overfitting.png">
</p>

]

---

# Overfitting

<img src="Prediction_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

---

class: center, middle

<h1><a>Evaluating model predictions with <mono>caret</mono></a></h1>

<!---

# <mono>createDataPartition()</mono>

.pull-left4[

Use `createDataPartition()` to <high>split a dataset</high> into separate training and test datasets.

<table style="cellspacing:0; cellpadding:0; border:none;">
<col width="30%">
<col width="70%">
<tr>
<td bgcolor="white">
<b>Argument</b>
</td>
<td bgcolor="white">
<b>Description</b>
</td>
</tr>
<tr>
<td bgcolor="white">
<mono>y</mono>
</td>
<td bgcolor="white">
The criterion. Used to create a <high>balanced split</high>.
</td>
</tr>
<tr>
<td bgcolor="white">
<mono>p</mono>
</td>
<td bgcolor="white">
The <high>proportion of data</high> going into the training set. Often <mono>.8</mono> or <mono>.5</mono>.
</td>
</tr>
</table>

]

.pull-right5[

```r
# Set the randomisation seed to get the
# same results each time
set.seed(100)

# Get indices for training
index <- createDataPartition(y = basel$income,
                             p = .8,
                             list = FALSE)

# Create training data
basel_train <- basel %>%
  slice(index)

# Create test data
basel_test <- basel %>%
  slice(-index)
```

]

--->

---

# <mono>predict(, newdata)</mono>

.pull-left4[

<ul>
<li class="m1"><span>To <high>test model predictions</high>, you need to compute a vector of predictions from the test data (<mono>newdata</mono>) using the <mono>predict()</mono> function:</span></li>
</ul>

<table style="cellspacing:0; cellpadding:0; border:none;">
<col width="30%">
<col width="70%">
<tr>
<td bgcolor="white">
<b>Argument</b>
</td>
<td bgcolor="white">
<b>Description</b>
</td>
</tr>
<tr>
<td bgcolor="white">
<mono>object</mono>
</td>
<td bgcolor="white">
<mono>caret</mono> fit object.
</td>
</tr>
<tr>
<td bgcolor="white">
<mono>newdata</mono>
</td>
<td bgcolor="white">
Test data set. Must contain the same features as provided in <mono>object</mono>.
</td>
</tr>
</table>

]

.pull-right5[

```r
# Load caret (provides train() and postResample())
library(caret)

# Fit model to training data
mod <- train(form = income ~ .,
             method = "glm",
             data = basel_train)

# Get fitted values (for training data)
mod_fit <- predict(mod)

# Predictions for NEW basel_test data!
mod_pred <- predict(mod,
                    newdata = basel_test)

# Evaluate prediction results
postResample(pred = mod_pred,
             obs = basel_test$income)
```

]

---

class: middle, center

<h1><a href=https://therbootcamp.github.io/ML-DHLab/_sessions/Prediction/Prediction_practical.html>Practical</a></h1>
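
---

# Sketch: hold-out evaluation

The ideas from the hold-out and overfitting slides can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions, not the course's own code: the data are simulated (the `basel` data are not used), and the names `dat`, `rmse()`, `mod_simple`, and `mod_complex` are hypothetical. A simple and a deliberately complex model are fit to the training set, and both are scored on the training set and on the held-out test set.

```r
# Simulated data: a linear signal plus random noise
# (hypothetical stand-in for a real data set)
set.seed(100)
n <- 200
dat <- data.frame(x = runif(n, min = -2, max = 2))
dat$y <- 1 + 2 * dat$x + rnorm(n, sd = 1)

# Hold out 20% of the rows as the test set
train_idx <- sample(seq_len(n), size = round(0.8 * n))
dat_train <- dat[train_idx, ]
dat_test  <- dat[-train_idx, ]

# Root mean squared error between predictions and observations
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))

# A simple model and a deliberately complex one
mod_simple  <- lm(y ~ x, data = dat_train)
mod_complex <- lm(y ~ poly(x, 15), data = dat_train)

# Score both models on the training data and on the unseen test data
results <- data.frame(
  model      = c("simple", "complex"),
  train_rmse = c(rmse(predict(mod_simple), dat_train$y),
                 rmse(predict(mod_complex), dat_train$y)),
  test_rmse  = c(rmse(predict(mod_simple, newdata = dat_test), dat_test$y),
                 rmse(predict(mod_complex, newdata = dat_test), dat_test$y))
)
print(results)
```

Because the linear model is nested in the degree-15 polynomial, the complex model cannot have a higher training error. Its test error carries no such guarantee, so comparing the two columns shows how well each model generalises to unseen data.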