class: center, middle, inverse, title-slide # ML with R ###
Data analytics for Psychology and Business
### April 2021 --- layout: true <div class="my-footer"> <span style="text-align:center"> <span> <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/by-sa.png" height=14 style="vertical-align: middle"/> </span> <a href="https://cdsbasel.github.io/dataanalytics_2021"> <span style="padding-left:82px"> <font color="#7E7E7E"> https://cdsbasel.github.io/dataanalytics_2021 </font> </span> </a> <a href="https://therbootcamp.github.io/"> <font color="#7E7E7E"> Data analytics for Psychology and Business | April 2021 </font> </a> </span> </div> --- .pull-left45[ # Fitting <p style="padding-top:1px"></p> <ul> <li class="m1"><span>Models are actually <high>families of models</high>, with every parameter combination specifying a different model.</span></li> <li class="m2"><span>To fit a model means to <high>identify</high> from the family of models <high>the specific model that fits the data best</high>.</span></li> </ul> ] .pull-right45[ <br><br> <p align = "center"> <img src="image/curvefits.png" height=480px><br> <font style="font-size:10px">adapted from <a href="https://www.explainxkcd.com/wiki/index.php/2048:_Curve-Fitting">explainxkcd.com</a></font> </p> ] --- # Which is better? <img src="Fitting_files/figure-html/unnamed-chunk-2-1.png" width="90%" style="display: block; margin: auto;" /> --- # Which is better? <img src="Fitting_files/figure-html/unnamed-chunk-3-1.png" width="90%" style="display: block; margin: auto;" /> --- # Loss function .pull-left45[ <ul> <li class="m1"><span><Possible <high>the most important concept</high> in statistics and machine learning.</span></li> <li class="m2"><span>The loss function defines some <high>summary of the errors committed by the model</high>.</span></li> </ul> <p style="padding-top:7px"> `$$\Large Loss = f(Error)$$` <p style="padding-top:7px"> <u>Two purposes</u> <table style="cellspacing:0; cellpadding:0; border:none;"> <tr> <td> <b>Purpose</b> </td> <td> <b>Description</b> </td> </tr> <tr> <td bgcolor="white"> Fitting </td> <td bgcolor="white"> Find parameters that optimize loss function. </td> </tr> <tr> <td> Evaluation </td> <td> Calculate loss function for fitted model. </td> </tr> </table> ] .pull-right45[ <img src="Fitting_files/figure-html/unnamed-chunk-4-1.png" width="90%" style="display: block; margin: auto;" /> ] --- # 2 types of supervised problems .pull-left5[ <ul> <li class="m1"><span><b>Regression</b></span></li> <br> <ul class="level"> <li><span>Prediction of a <high>quantitative criterion</high>.</span></li><br> <li><span><i>Predict level cholesterol with age</i></span></li> </ul><br> <li class="m2"><span><b>Classification</b></span></li> <br> <ul class="level"> <li><span>Prediction of a <high>categorical criterion</high>.</span></li><br> <li><span><i>Predict heart attack yes or no</i></span></li> </ul><br> </ul> ] .pull-right4[ <p align = "center"> <img src="image/twotypes.png" height=440px><br> </p> ] --- class: center, middle <high><h1>Regression</h1></high> <font color = "gray"><h1>Decision Trees</h1></font> <font color = "gray"><h1>Random Forests</h1></font> --- # Regression .pull-left45[ In [regression](https://en.wikipedia.org/wiki/Regression_analysis), the criterion `\(Y\)` is modeled as the <high>sum</high> of <high>features</high> `\(X_1, X_2, ...\)` <high>times weights</high> `\(\beta_1, \beta_2, ...\)` plus `\(\beta_0\)` the so-called the intercept. <p style="padding-top:10px"></p> `$$\large \hat{Y} = \beta_{0} + \beta_{1} \times X_1 + \beta_{2} \times X2 + ...$$` <p style="padding-top:10px"></p> The weight `\(\beta_{i}\)` indiciates the <high>amount of change</high> in `\(\hat{Y}\)` for a change of 1 in `\(X_{i}\)`. Ceteris paribus, the <high>more extreme</high> `\(\beta_{i}\)`, the <high>more important</high> `\(X_{i}\)` for the prediction of `\(Y\)` <font style="font-size:12px">(Note: the scale of `\(X_{i}\)` matters too!).</font> If `\(\beta_{i} = 0\)`, then `\(X_{i}\)` <high>does not help</high> predicting `\(Y\)` ] .pull-right45[ <img src="Fitting_files/figure-html/unnamed-chunk-5-1.png" width="90%" style="display: block; margin: auto;" /> ] --- # Regression loss .pull-left45[ <p> <ul style="margin-bottom:-20px"> <li class="m1"><span><b>Mean Squared Error (MSE)</b> <br><br> <ul class="level"> <li><span><high>Average squared distance</high> between predictions and true values.</span></li> </ul> </span></li> </ul> $$ MSE = \frac{1}{n}\sum_{i \in 1,...,n}(Y_{i} - \hat{Y}_{i})^{2}$$ <ul> <li class="m2"><span><b>Mean Absolute Error (MAE)</b> <br><br> <ul class="level"> <li><span><high>Average absolute distance</high> between predictions and true values.</span></li> </ul> </span></li> </ul> $$ MAE = \frac{1}{n}\sum_{i \in 1,...,n} \lvert Y_{i} - \hat{Y}_{i} \rvert$$ </p> ] .pull-right45[ <img src="Fitting_files/figure-html/unnamed-chunk-6-1.png" width="90%" style="display: block; margin: auto;" /> ] --- # Fitting .pull-left45[ <p style="margin-top:20px"> <ul style="margin-bottom:-20px"> <li class="m1"><span><b>Analytically</b> <br><br> <ul class="level"> <li><span>In rare cases, the parameters can be <high>directly calculated</high>, e.g., using the <i>normal equation</i></span></li> </ul> </span></li> </ul> <br> `$$\boldsymbol \beta = (\boldsymbol X^T\boldsymbol X)^{-1}\boldsymbol X^T\boldsymbol y$$` <ul> <li class="m2"><span><b>Numerically</b> <br><br> <ul class="level"> <li><span>In most cases, parameters need to be found using a <high>directed trial and error</high>, e.g., <i>gradient descent</i>:</span></li> </ul> </span></li> </ul> <br> `$$\boldsymbol \beta_{n+1} = \boldsymbol \beta_{n}+\gamma \nabla F(\boldsymbol \beta_{n})$$` ] .pull-right45[ <p align = "center"> <img src="image/gradient.png" height=420px><br> <font style="font-size:10px">adapted from <a href="https://me.me/i/machine-learning-gradient-descent-machine-learning-machine-learning-behind-the-ea8fe9fc64054eda89232d7ffc9ba60e">me.me</a></font> </p> ] --- .pull-left45[ # Fitting <p style="padding-top:1px"></p> <p style="margin-top:20px"> <ul style="margin-bottom:-20px"> <li class="m1"><span><b>Analytical</b> <br><br> <ul class="level"> <li><span>In rare cases, the parameters can be <high>directly calculated</high>, e.g., using the <i>normal equation</i></span></li> </ul> </span></li> </ul> <br> `$$\boldsymbol \beta = (\boldsymbol X^T\boldsymbol X)^{-1}\boldsymbol X^T\boldsymbol y$$` <ul> <li class="m2"><span><b>Numerically</b> <br><br> <ul class="level"> <li><span>In most cases, parameters need to be found using a <high>directed trial and error</high>, e.g., <i>gradient descent</i>:</span></li> </ul> </span></li> </ul> <br> `$$\boldsymbol \beta_{n+1} = \boldsymbol \beta_{n}+\gamma \nabla F(\boldsymbol \beta_{n})$$` ] .pull-right45[ <br><br2> <p align = "center"> <img src="image/gradient1.gif" height=250px><br> <font style="font-size:10px">adapted from <a href="https://dunglai.github.io/2017/12/21/gradient-descent/ ">dunglai.github.io</a></font><br> <img src="image/gradient2.gif" height=250px><br> <font style="font-size:10px">adapted from <a href="https://dunglai.github.io/2017/12/21/gradient-descent/ ">dunglai.github.io</a></font> </p> ] --- class: center, middle <font color = "gray"><h1>Regression</h1></font> <high><h1>Decision Trees</h1></high> <font color = "gray"><h1>Random Forests</h1></font> --- # CART .pull-left45[ <ul> <li class="m1"><span>CART is short for <high>Classification and Regression Trees</high>, which are often simply called Decision trees.</span></li><br> <li class="m2"><span>Models criterion is modeled as a sequence of <high>logical TRUE or FALSE questions</high>.</span></li> </ul> ] .pull-right45[ <p align = "center" style="padding-top:0px"> <img src="image/tree.png"> </p> ] --- # Regression trees .pull-left45[ <ul> <li class="m1"><span>Classification regression trees are created using a relatively simple <high>three-step algorithm</high>:</span></li><br> <ul> <li><span>1 - <high>Split</high> nodes to maxmize <b>homogeneity</b> (regression) or purity (classification within nodes.</span></li><br> <li><span>2 - <high>Repeat</high> until splits are no longer possible.</span></li><br> <li><span>3 - <high>Prune</high> tree to reasonable size.</mono></span></li> </ul> </ul> ] .pull-right45[ <p align = "center" style="padding-top:0px"> <img src="image/tree.png"> </p> ] --- # Homogeneity .pull-left45[ <ul> <li class="m1"><span>Regression trees attempt to maximize homogeneity, which means in turn to <high>minimize within-node variance</high></span></li> </ul> <br><br> `$$\large SSE = \sum_{i \in S_1}(y_i - \bar{y}_1)^2+\sum_{i \in S_2}(y_i - \bar{y}_2)^2$$` ] .pull-right45[ <p align = "center" style="padding-top:0px"> <img src="image/splitting_regr.png"> </p> ] --- # Pruning trees .pull-left45[ <ul> <li class="m1"><span>Regression trees are <high>pruned</high> back such that every split has a homogeneity gain of at least <high><mono>cp</mono></high>.</span></li> </ul> <br> $$ \large `\begin{split} Loss = & Homogeneity\,+\\ &cp*(n\:terminal\:nodes)\\ \end{split}` $$ ] .pull-right45[ <p align = "center" style="padding-top:0px"> <img src="image/splitting_regr.png"> </p> ] --- class: center, middle <font color = "gray"><h1>Regression</h1></font> <font color = "gray"><h1>Decision Trees</h1></font> <high><h1>Random Forests</h1></high> --- .pull-left45[ # Random Forest <ul> <li class="m1"><span>In Random Forests the criterion is modeled as the <high>aggregate prediction of <high>many decision trees</high> each based on different features.</span></li><br> <li class="m2"><span>Algorithmus:</span></li> <ul> <li><span>1 - <high>Repeat</high> <mono>n</mono> times.</li></span><br> <ul> <li><span>1 - <high>Resample</high> data.</span></li><br> <li><span>2 - Each split <high>consider <i>m</i> features</high>.</span></li><br> </ul> <li><span>2 - <high>Average</high> predictions.</li></span><br> </ul> </ul> ] .pull-right45[ <br> <p align = "center" style="padding-top:0px"> <img src="image/rf.png"> </p> ] --- # Random Forest .pull-left45[ <p style="padding-top:1px"></p> <ul> <li class="m1"><span>Random forests make use of <high>bagging</high>, which consists of <high>resampling</high> and <high>averaging</high>.</span></li> </ul> <table style="cellspacing:0; cellpadding:0; border:none;"> <col width="30%"> <col width="70%"> <tr> <td bgcolor="white"> <b>Element</b> </td> <td bgcolor="white"> <b>Description</b> </td> </tr> <tr> <td bgcolor="white"> <i>Resampling</i> </td> <td bgcolor="white"> Creates new data sets that vary in their composition thereby <high>deemphasizing idiosyncracies</high> of the available data. </td> </tr> <tr> <td bgcolor="white"> <i>Averaging</i> </td> <td bgcolor="white"> Combining predictions typically <high>evens out idiosyncracies</high> of the models created from single data sets. </td> </tr> </table> ] .pull-right45[ <p align = "center" style="padding-top:0px"> <img src="image/tree_crowd.png"> </p> ] --- class: center, middle <br><br> <h1><a>Fitten in <mono>caret</mono></h1> --- # Key functions .pull-left45[ <table style="cellspacing:0; cellpadding:0; border:none;"> <tr> <td> <b>Function</b> </td> <td> <b>Description</b> </td> </tr> <tr> <td bgcolor="white"> <mono>trainControl()</mono> </td> <td bgcolor="white"> Choose settings for how fitting should be carried out. </td> </tr> <tr> <td> <mono>train()</mono> </td> <td> Specify the model and find <i>best</i> parameters. </td> </tr> <tr> <td bgcolor="white"> <mono>postResample()</mono> </td> <td bgcolor="white"> Evaluate model performance (fitting or prediction) for regression. </td> </tr> <tr> <td> <mono>confusionMatrix()</mono> </td> <td bgcolor="white"> Evaluate model performance (fitting or prediction) for classification. </td> </tr> </table> ] .pull-right45[ ```r # Step 1: Define control parameters # trainControl() ctrl <- trainControl(...) # Step 2: Train and explore model # train() mod <- train(...) summary(mod) mod$finalModel # see final model # Step 3: Assess fit # predict(), postResample(), fon fit <- predict(mod) postResample(fit, truth) confusionMatrix(fit, truth) ``` <!-- Caret documentation: [http://topepo.github.io/caret/](http://topepo.github.io/caret/) --> <!-- <iframe src="http://topepo.github.io/caret/" height="480px" width = "500px"></iframe> --> ] --- # `trainControl()` .pull-left45[ <ul> <li class="m1"><span>Controls how <mono>caret</mono> fits an ML model.</span></li><br> <li class="m2"><span>Until Session <b>Optimisierung</b> we use <highm>method = "none"</highm>.</span></li> </ul> <br> ```r # Fit the model without any # advanced parameter tuning methods ctrl <- trainControl(method = "none") # show help ?trainControl ``` ] .pull-right45[ <img src="image/traincontrol_help.jpg" width="100%" style="display: block; margin: auto;" /> ] --- # `train()` .pull-left4[ <ul> <li class="m1"><span><high>Workhorse</high> for fitting models, offering <high>200+ models</high> via the <high>method</high> argument!</span></li> </ul> <table style="cellspacing:0; cellpadding:0; border:none;"> <tr> <td> <b>Argument</b> </td> <td> <b>Description</b> </td> </tr> <tr> <td bgcolor="white"> <mono>form</mono> </td> <td bgcolor="white"> Formula specifying features and criterion. </td> </tr> <tr> <td> <mono>data</mono> </td> <td> Training data. </td> </tr> <tr> <td bgcolor="white"> <mono>method</mono> </td> <td bgcolor="white"> The model (algorithm). </td> </tr> <tr> <td> <mono>trControl</mono> </td> <td bgcolor="white"> Control parameters for fitting. </td> </tr> <tr> <td bgcolor="white"> <mono>tuneGrid</mono>, <mono>preProcess</mono> </td> <td bgcolor="white"> Cool stuff for later. </td> </tr> </table> ] .pull-right5[ ```r # Fit a regression model predicting Price income_mod <- train(form = income ~ ., # Formula data = basel, # Training data method = "glm", # Regression trControl = ctrl) # Control Param's income_mod ``` ``` Generalized Linear Model 1000 samples 19 predictor No pre-processing Resampling: None ``` ] --- # `train()` .pull-left4[ <ul> <li class="m1"><span><high>Workhorse</high> for fitting models, offering <high>200+ models</high> via the <high>method</high> argument!</span></li> </ul> <table style="cellspacing:0; cellpadding:0; border:none;"> <tr> <td> <b>Argument</b> </td> <td> <b>Description</b> </td> </tr> <tr> <td bgcolor="white"> <mono>form</mono> </td> <td bgcolor="white"> Formula specifying features and criterion. </td> </tr> <tr> <td> <mono>data</mono> </td> <td> Training data. </td> </tr> <tr> <td bgcolor="white"> <mono>method</mono> </td> <td bgcolor="white"> The model (algorithm). </td> </tr> <tr> <td> <mono>trControl</mono> </td> <td bgcolor="white"> Control parameters for fitting. </td> </tr> <tr> <td bgcolor="white"> <mono>tuneGrid</mono>, <mono>preProcess</mono> </td> <td bgcolor="white"> Cool stuff for later. </td> </tr> </table> ] .pull-right5[ ```r # Fit a random forest predicting Price income_mod <- train(form = income ~ .,# Formula data = basel, # Training data method = "rpart", # Decision tree trControl = ctrl) # Control Param's income_mod ``` ``` CART 1000 samples 19 predictor No pre-processing Resampling: None ``` ] --- # `train()` .pull-left4[ <ul> <li class="m1"><span><high>Workhorse</high> for fitting models, offering <high>200+ models</high> via the <high>method</high> argument!</span></li> </ul> <table style="cellspacing:0; cellpadding:0; border:none;"> <tr> <td> <b>Argument</b> </td> <td> <b>Description</b> </td> </tr> <tr> <td bgcolor="white"> <mono>form</mono> </td> <td bgcolor="white"> Formula specifying features and criterion. </td> </tr> <tr> <td> <mono>data</mono> </td> <td> Training data. </td> </tr> <tr> <td bgcolor="white"> <mono>method</mono> </td> <td bgcolor="white"> The model (algorithm). </td> </tr> <tr> <td> <mono>trControl</mono> </td> <td bgcolor="white"> Control parameters for fitting. </td> </tr> <tr> <td bgcolor="white"> <mono>tuneGrid</mono>, <mono>preProcess</mono> </td> <td bgcolor="white"> Cool stuff for later. </td> </tr> </table> ] .pull-right5[ ```r # Fit a random forest predicting Price income_mod <- train(form = income ~ .,# Formula data = basel, # Training data method = "rf", # Random Forest trControl = ctrl) # Control Param's income_mod ``` ``` Random Forest 1000 samples 19 predictor No pre-processing Resampling: None ``` ] --- .pull-left4[ # `train()` <p style="padding-top:1px"></p> <ul> <li class="m1"><span><high>Workhorse</high> for fitting models, offering <high>200+ models</high> via the <high>method</high> argument!</span></li> <li class="m2"><span>Find all 200+ Models <a href="http://topepo.github.io/caret/available-models.html">here</a>.</span></li> </ul> ] .pull-right5[ <br><br> <iframe width="600" height="480" src="https://topepo.github.io/caret/available-models.html" frameborder="0"></iframe> ] --- # `train()` .pull-left4[ <ul style="margin-bottom:-20px"> <li class="m1"><span>The criterion must be the right type: <br><br> <ul class="level"> <li><span><high><mono>numeric</mono></high> criterion → <high>Regression</high><br></span></li> <li><span><high><mono>factor</mono></high> criterion → <high>Klassifkation</high><br></span></li> </ul> </span></li> </ul> ] .pull-right5[ ```r # Will be a regression task loan_mod <- train(form = Default ~ ., data = Loans, method = "glm", trControl = ctrl) # Will be a classification task load_mod <- train(form = factor(Default) ~ ., data = Loans, method = "glm", trControl = ctrl) ``` ] --- # <mono>.$finalModel</mono> .pull-left4[ <ul> <li class="m1"><span><mono>train()</mono> returns a <mono>list</mono> with an object called <mono>finalModel</mono> containing the best fitting model.</span></li> <li class="m2"><span><mono>.$finalModel</mono> <high>extracts the model</high>, then explore using:</span></li> </ul> <br> <table style="cellspacing:0; cellpadding:0; border:none;"> <tr> <td> <b>Function</b> </td> <td> <b>Description</b> </td> </tr> <tr> <td bgcolor="white"> <mono>summary()</mono> </td> <td bgcolor="white"> Overview of the most important results. </td> </tr> <tr> <td bgcolor="white"> <mono>names()</mono> </td> <td bgcolor="white"> See all named elements you can access with $. </td> </tr> </table> ] .pull-right5[ ```r # Fit regression model eink_mod <- train(form = income ~ age + height, data = basel, # Data method = "glm", # Regression trControl = ctrl) # Train control # Show names of final model names(eink_mod$finalModel) ``` ``` [1] "coefficients" "residuals" "fitted.values" [4] "effects" "R" "rank" [ reached getOption("max.print") -- omitted 28 entries ] ``` ] --- # <mono>.$finalModel</mono> .pull-left4[ <ul> <li class="m1"><span><mono>train()</mono> returns a <mono>list</mono> with an object called <mono>finalModel</mono> containing the best fitting model.</span></li> <li class="m2"><span><mono>.$finalModel</mono> <high>extracts the model</high>, then explore using:</span></li> </ul> <br> <table style="cellspacing:0; cellpadding:0; border:none;"> <tr> <td> <b>Function</b> </td> <td> <b>Description</b> </td> </tr> <tr> <td bgcolor="white"> <mono>summary()</mono> </td> <td bgcolor="white"> Overview of the most important results. </td> </tr> <tr> <td bgcolor="white"> <mono>names()</mono> </td> <td bgcolor="white"> See all named elements you can access with $. </td> </tr> </table> ] .pull-right5[ ```r # Fit regression model eink_mod <- train(form = income ~ age + height, data = basel, # Data method = "glm", # Regression trControl = ctrl) # Train control # Show coefficients eink_mod$finalModel$coefficients ``` ``` (Intercept) age height 177.084 151.786 3.466 ``` ] --- .pull-left4[ # <mono>.$finalModel</mono> <ul> <li class="m1"><span><mono>train()</mono> returns a <mono>list</mono> with an object called <mono>finalModel</mono> containing the best fitting model.</span></li> <li class="m2"><span><mono>.$finalModel</mono> <high>extracts the model</high>, then explore using:</span></li> </ul> <br> <table style="cellspacing:0; cellpadding:0; border:none;"> <tr> <td> <b>Function</b> </td> <td> <b>Description</b> </td> </tr> <tr> <td bgcolor="white"> <mono>summary()</mono> </td> <td bgcolor="white"> Overview of the most important results. </td> </tr> <tr> <td bgcolor="white"> <mono>names()</mono> </td> <td bgcolor="white"> See all named elements you can access with $. </td> </tr> </table> ] .pull-right5[ <br><br> ```r # Show model output summary(eink_mod) ``` ``` Call: NULL Deviance Residuals: Min 1Q Median 3Q Max -3567 -773 30 793 4186 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 177.08 498.81 0.36 0.72 [ reached getOption("max.print") -- omitted 2 rows ] --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for gaussian family taken to be 1380255) Null deviance: 7687291073 on 999 degrees of freedom Residual deviance: 1376114147 on 997 degrees of freedom AIC: 16981 Number of Fisher Scoring iterations: 2 ``` ] --- # `predict()` .pull-left5[ <ul> <li class="m1"><span><high>Produces predictions</high> from a model. Simply put model object as the first argument.</span></li> </ul> <br> ```r # Extrahiere gefittete Werte glm_fits <- predict(object = eink_mod) glm_fits[1:8] ``` ``` 1 2 3 4 5 6 7 8 5508 6960 6982 8645 5325 10648 8663 4592 ``` ] .pull-right45[ <img src="Fitting_files/figure-html/unnamed-chunk-23-1.png" width="90%" style="display: block; margin: auto;" /> ] --- # `postResample()` .pull-left45[ <ul> <li class="m1"><span><high>Gives a summary</high> of a models' performance in a <high>regression task</high.</span></li> <li class="m2"><span>Specify redicted values and the true values inside the function.</span></li> </ul> <br> ```r # evaluate postResample(glm_fits, basel$income) ``` ``` RMSE Rsquared MAE 1173.079 0.821 937.113 ``` ] .pull-right45[ <img src="Fitting_files/figure-html/unnamed-chunk-25-1.png" width="90%" style="display: block; margin: auto;" /> ] --- class: middle, center <h1><a href=https://therbootcamp.github.io/ML-DHLab/_sessions/Fitting/Fitting_practical.html>Practical</a></h1>