Data Analytics for Psychology and Business
[Image: comic from xkcd.com]
By the end of this practical you will know how to:
- fit regression, decision tree, and random forest models with caret,
- evaluate fitting and prediction performance with postResample() and confusionMatrix(),
- tune model parameters using cross-validation.
- Open your dataanalytics R project. It should already have the folders 1_Data and 2_Code.
- Download the college.csv dataset and place it into your 1_Data folder.
- Open a new R script. At the top of the script, using comments, write your name and the date. Save it as a new file called supervised_learning_practical.R in the 2_Code folder.
- Using library(), load the set of packages for this practical listed in the packages section above. If you are missing packages, install them first using install.packages().
# install.packages("tidyverse")
# install.packages("caret")
# install.packages("party")
# install.packages("partykit")
# Load packages necessary for this script
library(tidyverse)
library(caret)
library(party)
library(partykit)
Using read_csv(), load the college.csv data into R as a new object called college.
# College data
college <- read_csv(file = "1_Data/college.csv")
# Print dataframes to the console
college
Use View() and names() to get a feel for the variables contained within the dataset. Also see the Dataset tab for feature descriptions.
Using the code below, convert all character columns to factors.
# Convert all character columns to factor
college <- college %>% mutate_if(is.character, factor)
Using createDataPartition(), split the data into a training set (college_train) and a test set (college_test). createDataPartition() needs to be supplied with the criterion, which is the graduation rate (Grad.Rate).
# split index
train_index <- createDataPartition(XX$XX, p = .2, list = FALSE)
# train and test sets
college_train <- XX %>% slice(train_index)
college_test <- XX %>% slice(-train_index)
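One possible way to fill in the template above (a sketch, using the college object loaded earlier):
# split index
train_index <- createDataPartition(college$Grad.Rate, p = .2, list = FALSE)
# train and test sets
college_train <- college %>% slice(train_index)
college_test <- college %>% slice(-train_index)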
Before fitting models, caret needs to be supplied with a trainControl() object. To start with, define an object called ctrl_none that sets the training method to "none".
# Set training method to "none" for simple fitting
ctrl_none <- trainControl(method = "XX")
Using train(), fit a regression model called grad_glm predicting Grad.Rate as a function of all features. Specifically,…
- for the form argument, use Grad.Rate ~ .
- for the data argument, use college_train.
- for the method argument, use method = "glm".
- for the trControl argument, use your ctrl_none object.
.grad_glm <- train(form = XX ~ .,
data = XX,
method = "XX",
trControl = ctrl_none)
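One possible completion of this template (a sketch, assuming the objects defined earlier in this practical):
grad_glm <- train(form = Grad.Rate ~ .,
                  data = college_train,
                  method = "glm",
                  trControl = ctrl_none)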
Explore your grad_glm object by looking at grad_glm$finalModel and using summary(). What do you find?
grad_glm$XX
summary(XX)
Using predict(), save the fitted values of the grad_glm object as glm_fit.
# Save fitted values of regression model
glm_fit <- predict(XX)
Print your glm_fit object, look at summary statistics with summary(glm_fit), and create a histogram with hist(). Do the values make sense?
# Explore regression model fits
XX
summary(XX)
hist(XX)
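A sketch of the last two steps filled in (assuming the grad_glm model from the previous step):
# Save and explore regression model fits
glm_fit <- predict(grad_glm)
glm_fit
summary(glm_fit)
hist(glm_fit)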
Using train(), fit a decision tree model called grad_rpart. Specifically,…
- for the form argument, use Grad.Rate ~ .
- for the data argument, use college_train.
- for the method argument, use method = "rpart".
- for the trControl argument, use your ctrl_none object.
- for the tuneGrid argument, use cp = 0.01 to specify the value of the complexity parameter. This is a pretty low value, which means your trees will be relatively complex, i.e., deep.
grad_rpart <- train(form = XX ~ .,
data = XX,
method = "XX",
trControl = XX,
tuneGrid = expand.grid(cp = XX))
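One possible completion (a sketch, using the objects defined earlier):
grad_rpart <- train(form = Grad.Rate ~ .,
                    data = college_train,
                    method = "rpart",
                    trControl = ctrl_none,
                    tuneGrid = expand.grid(cp = 0.01))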
Explore your grad_rpart object by looking at grad_rpart$finalModel and plotting it with plot(as.party(grad_rpart$finalModel)). What do you find?
Using predict(), save the fitted values of the grad_rpart object as rpart_fit.
# Save fitted values of decision tree model
rpart_fit <- predict(XX)
Print your rpart_fit object, look at summary statistics with summary(), and create a histogram with hist(). Do the values make sense?
# Explore decision tree fits
XX
summary(XX)
hist(XX)
Using train(), fit a random forest model called grad_rf. Specifically,…
- for the form argument, use Grad.Rate ~ .
- for the data argument, use college_train.
- for the method argument, use method = "rf".
- for the trControl argument, use your ctrl_none object.
- for the mtry parameter (via the tuneGrid argument), use mtry = 2. This is a relatively low value, so the forest will be very diverse.
grad_rf <- train(form = XX ~ .,
data = XX,
method = "XX",
trControl = XX,
tuneGrid = expand.grid(mtry = XX))
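One possible completion (a sketch, using the objects defined earlier):
grad_rf <- train(form = Grad.Rate ~ .,
                 data = college_train,
                 method = "rf",
                 trControl = ctrl_none,
                 tuneGrid = expand.grid(mtry = 2))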
Using predict(), save the fitted values of the grad_rf object as rf_fit.
# Save fitted values of random forest model
rf_fit <- predict(XX)
Print your rf_fit object, look at summary statistics with summary(), and create a histogram with hist(). Do the values make sense?
# Explore random forest fits
XX
summary(XX)
hist(XX)
Save the training criterion values (college_train$Grad.Rate) as a vector called criterion_train.
# Save training criterion values
criterion_train <- XX$XX
Using postResample(), determine the fitting performance of each of your models separately. Make sure to set your criterion_train values to the obs argument, and your model fits (glm_fit, rpart_fit, and rf_fit) to the pred argument.
# Regression
postResample(pred = XX, obs = XX)
# Decision Trees
postResample(pred = XX, obs = XX)
# Random Forests
postResample(pred = XX, obs = XX)
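A sketch of the fitting-performance step with the blanks filled in (assuming the fit objects defined earlier):
# Save training criterion values
criterion_train <- college_train$Grad.Rate
# Regression
postResample(pred = glm_fit, obs = criterion_train)
# Decision Trees
postResample(pred = rpart_fit, obs = criterion_train)
# Random Forests
postResample(pred = rf_fit, obs = criterion_train)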
Save the test criterion values college_test$Grad.Rate as a new vector called criterion_test.
# Save criterion values
criterion_test <- XX$XX
Using predict(), save the predicted values of each model for the test data college_test as glm_pred, rpart_pred, and rf_pred. Set the newdata argument to the test data.
# Regression
glm_pred <- predict(XX, newdata = XX)
# Decision Trees
rpart_pred <- predict(XX, newdata = XX)
# Random Forests
rf_pred <- predict(XX, newdata = XX)
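One possible completion (a sketch, assuming the models fitted earlier):
# Regression
glm_pred <- predict(grad_glm, newdata = college_test)
# Decision Trees
rpart_pred <- predict(grad_rpart, newdata = college_test)
# Random Forests
rf_pred <- predict(grad_rf, newdata = college_test)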
Using postResample(), determine the prediction performance of each of your models against the test criterion criterion_test.
# Regression
postResample(pred = XX, obs = XX)
# Decision Trees
postResample(pred = XX, obs = XX)
# Random Forests
postResample(pred = XX, obs = XX)
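A filled-in sketch of the prediction-performance step:
postResample(pred = glm_pred, obs = criterion_test)   # Regression
postResample(pred = rpart_pred, obs = criterion_test) # Decision Trees
postResample(pred = rf_pred, obs = criterion_test)    # Random Forests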
How does each model’s prediction or test performance (on the college_test data) compare to its fitting or training performance (on the college_train data)? Is it worse? Better? The same? What does the change tell you about the models?
Which of the three models has the best prediction performance?
Create a new training control object called ctrl_cv that sets the training method to cross-validation ("cv") and the number of folds to 10. Use this new control object to re-fit the three models.
# Set training method to "cv" for 10-fold cross-validation
ctrl_cv <- trainControl(method = "XX", number = XX)
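Filled in, this becomes (a sketch):
ctrl_cv <- trainControl(method = "cv", number = 10)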
Before fitting the regularized regression, specify which values of the regularization parameter lambda to try. Using the code below, create a vector called lambda_vec which contains 100 values spanning a wide range, from very close to 0 up to 10,000.
# Vector of lambda values to try
lambda_vec <- 10 ^ (seq(-4, 4, length = 100))
Fit a regularized (lasso) regression model predicting Grad.Rate as a function of all features. Specifically,…
- for the form argument, use Grad.Rate ~ .
- for the data argument, use college_train.
- for the method argument, use "glmnet" for regularized regression.
- for the trControl argument, use your ctrl_cv object.
- set the preProcess argument to c("center", "scale") to make sure the variables are standardised when estimating the beta weights, which is specifically necessary for regularized regression.
- for the tuneGrid argument, make sure alpha is 1 (for lasso regression), and lambda is the vector you specified in lambda_vec (see below).
# lasso regression
grad_tune_glm <- train(form = XX ~ .,
data = XX,
method = "XX",
trControl = XX,
preProcess = c("XX", "XX"),
tuneGrid = expand.grid(alpha = 1,
lambda = lambda_vec))
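One possible completion of this template (a sketch, assuming the objects defined earlier):
grad_tune_glm <- train(form = Grad.Rate ~ .,
                       data = college_train,
                       method = "glmnet",
                       trControl = ctrl_cv,
                       preProcess = c("center", "scale"),
                       tuneGrid = expand.grid(alpha = 1,
                                              lambda = lambda_vec))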
Print your grad_tune_glm
object. What do you see?
Plot your grad_tune_glm
object. What do you see? Which value of the regularization parameter seems to be the best?
# Plot grad_tune_glm object
plot(XX)
# Get coefficients from best lambda value
coef(grad_tune_glm$finalModel,
grad_tune_glm$bestTune$lambda)
Before fitting the decision tree, specify which values of the complexity parameter cp
to try. Using the code below, create a vector called cp_vec
which contains 100 values between 0 and 1.
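One way to create such a vector (a sketch consistent with the instruction of 100 evenly spaced values between 0 and 1):
# Vector of cp values to try
cp_vec <- seq(from = 0, to = 1, length = 100)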
Fit a decision tree model predicting Grad.Rate
as a function of all features. Specifically,…
- for the form argument, use Grad.Rate ~ .
- for the data argument, use college_train.
- for the method argument, use "rpart".
- for the trControl argument, use your ctrl_cv object.
- for the tuneGrid argument, use the cp values you specified in cp_vec.
# Decision tree
grad_tune_rpart <- train(form = XX ~ .,
data = XX,
method = "XX",
trControl = XX,
tuneGrid = expand.grid(cp = cp_vec))
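One possible completion (a sketch):
grad_tune_rpart <- train(form = Grad.Rate ~ .,
                         data = college_train,
                         method = "rpart",
                         trControl = ctrl_cv,
                         tuneGrid = expand.grid(cp = cp_vec))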
Print your grad_tune_rpart
object. What do you see?
Plot your grad_tune_rpart
object. What do you see? Which value of the complexity parameter seems to be the best?
# Plot grad_tune_rpart object
plot(XX)
Print the best value of cp by running the following code. Does this match what you saw in the plot above?
# Print best complexity parameter
grad_tune_rpart$bestTune$cp
# Visualise your trees
plot(as.party(grad_tune_rpart$finalModel))
Before fitting a random forest model, specify which values of the diversity parameter mtry
to try. Using the code below, create a vector called mtry_vec
which is a sequence of numbers from 1 to 10.
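One way to create such a vector (a sketch):
# Vector of mtry values to try
mtry_vec <- seq(from = 1, to = 10, by = 1)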
Fit a random forest model predicting Grad.Rate
as a function of all features. Specifically,…
- for the form argument, use Grad.Rate ~ .
- for the data argument, use college_train.
- for the method argument, use "rf".
- for the trControl argument, use your ctrl_cv object.
- for the tuneGrid argument, use the mtry values you specified in mtry_vec.
# Random Forest
grad_tune_rf <- train(form = XX ~ .,
data = XX,
method = "XX",
trControl = XX,
tuneGrid = expand.grid(mtry = mtry_vec))
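One possible completion (a sketch):
grad_tune_rf <- train(form = Grad.Rate ~ .,
                      data = college_train,
                      method = "rf",
                      trControl = ctrl_cv,
                      tuneGrid = expand.grid(mtry = mtry_vec))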
Print your grad_tune_rf
object. What do you see?
Plot your grad_tune_rf
object. What do you see? Which value of the regularization parameter seems to be the best?
# Plot grad_rf object
plot(XX)
Print the best value of mtry by running the following code. Does this match what you saw in the plot above?
# Print best mtry parameter
grad_tune_rf$bestTune$mtry
Using predict(), save the predicted values of each tuned model for the test data college_test as glm_tune_pred, rpart_tune_pred, and rf_tune_pred. Set the newdata argument to the test data.
# Regression
glm_tune_pred <- predict(XX, newdata = XX)
# Decision Trees
rpart_tune_pred <- predict(XX, newdata = XX)
# Random Forests
rf_tune_pred <- predict(XX, newdata = XX)
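One possible completion (a sketch, assuming the tuned models defined earlier):
glm_tune_pred <- predict(grad_tune_glm, newdata = college_test)
rpart_tune_pred <- predict(grad_tune_rpart, newdata = college_test)
rf_tune_pred <- predict(grad_tune_rf, newdata = college_test)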
Using postResample(), determine the prediction performance of each of your tuned models against the test criterion criterion_test.
# Regression
postResample(pred = XX, obs = XX)
# Decision Trees
postResample(pred = XX, obs = XX)
# Random Forests
postResample(pred = XX, obs = XX)
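A filled-in sketch of the tuned prediction-performance step:
postResample(pred = glm_tune_pred, obs = criterion_test)   # Regression
postResample(pred = rpart_tune_pred, obs = criterion_test) # Decision Trees
postResample(pred = rf_tune_pred, obs = criterion_test)    # Random Forests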
How does each model’s tuned prediction performance compare to its prediction performance without tuning? Is it worse? Better? The same? What does the change tell you about the models?
Which of the three models has the best prediction performance?
Now repeat the modelling steps above, this time predicting the classification criterion Private. Note, it is not necessary to perform the splitting again. Jump right in using the existing college_train and college_test datasets. You can follow the same steps as above. Simply change the criterion to Private and use confusionMatrix() instead of postResample().
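For example, the regression part of the classification task could look like this (a sketch; the object names priv_glm and priv_pred are illustrative):
# Classification with a glm model (illustrative names)
priv_glm <- train(form = Private ~ .,
                  data = college_train,
                  method = "glm",
                  trControl = ctrl_none)
# Predicted classes for the test data
priv_pred <- predict(priv_glm, newdata = college_test)
# Classification performance
confusionMatrix(data = priv_pred, reference = college_test$Private)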
The following example script applies the same workflow to the mpg dataset.
# Fitting, tuning, and evaluating regression, decision trees, and random forests
# Step 0: Load packages-----------
library(tidyverse)
library(caret)
library(partykit)
library(party)
# Step 1: Load, Clean, Split, and Explore data ----------------------
# mpg data
data(mpg)
# Explore data
mpg
View(mpg)
dim(mpg)
names(mpg)
# Convert all characters to factor
mpg <- mpg %>% mutate_if(is.character, factor)
# split index
train_index <- createDataPartition(mpg$hwy, p = .2, list = FALSE)
# train and test sets
data_train <- mpg %>% slice(train_index)
data_test <- mpg %>% slice(-train_index)
# Step 2: Define training control parameters -------------
# Set method = "none" for now
ctrl_none <- trainControl(method = "none")
# Step 3: Train model: -----------------------------
# Regression -------
# Fit model
hwy_glm <- train(form = hwy ~ year + cyl + displ,
data = data_train,
method = "glm",
trControl = ctrl_none)
# Look at summary information
hwy_glm$finalModel
summary(hwy_glm)
# Save fitted values
glm_fit <- predict(hwy_glm)
# Define training criterion
criterion_train <- data_train$hwy
# Calculate fitting accuracies
postResample(pred = glm_fit,
             obs = criterion_train)
# Decision Trees -------
# Fit model
hwy_rpart <- train(form = hwy ~ year + cyl + displ,
data = data_train,
method = "rpart",
trControl = ctrl_none,
tuneGrid = expand.grid(cp = .01)) # Set complexity parameter
# Look at summary information
hwy_rpart$finalModel
plot(as.party(hwy_rpart$finalModel)) # Visualise your trees
# Save fitted values
rpart_fit <- predict(hwy_rpart)
# Calculate fitting accuracies
postResample(pred = rpart_fit, obs = criterion_train)
# Random Forests -------
# fit model
hwy_rf <- train(form = hwy ~ year + cyl + displ,
data = data_train,
method = "rf",
trControl = ctrl_none,
tuneGrid = expand.grid(mtry = 2)) # Set number of features randomly selected
# Look at summary information
hwy_rf$finalModel
# Save fitted values
rf_fit <- predict(hwy_rf)
# Calculate fitting accuracies
postResample(pred = rf_fit, obs = criterion_train)
# Step 5: Evaluate prediction ------------------------------
# Define criterion_test
criterion_test <- data_test$hwy
# Save predicted values
glm_pred <- predict(hwy_glm, newdata = data_test)
rpart_pred <- predict(hwy_rpart, newdata = data_test)
rf_pred <- predict(hwy_rf, newdata = data_test)
# Calculate prediction accuracies
postResample(pred = glm_pred, obs = criterion_test)
postResample(pred = rpart_pred, obs = criterion_test)
postResample(pred = rf_pred, obs = criterion_test)
# Step 6: Modeling tuning ------------------------------
# Use 10-fold cross validation
ctrl_cv <- trainControl(method = "cv",
number = 10)
# Lasso
lambda_vec <- 10 ^ seq(-3, 3, length = 100)
hwy_lasso <- train(form = hwy ~ year + cyl + displ,
data = data_train,
method = "glmnet",
trControl = ctrl_cv,
preProcess = c("center", "scale"),
tuneGrid = expand.grid(alpha = 1,
lambda = lambda_vec))
# decision tree
cp_vec <- seq(0, .1, length = 100)
hwy_rpart <- train(form = hwy ~ year + cyl + displ,
data = data_train,
method = "rpart",
trControl = ctrl_cv,
tuneGrid = expand.grid(cp = cp_vec))
# random forest
mtry_vec <- seq(2, 5, 1)
hwy_rf <- train(form = hwy ~ year + cyl + displ,
data = data_train,
method = "rf",
trControl = ctrl_cv,
tuneGrid = expand.grid(mtry = mtry_vec))
# Step 7: Evaluate tuned models ------------------------------
# you know how ...
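A possible completion of Step 7 (a sketch; the prediction object names here are illustrative and assume the tuned models hwy_lasso, hwy_rpart, and hwy_rf from Step 6):
# Save predicted values of the tuned models
lasso_pred    <- predict(hwy_lasso, newdata = data_test)
rpart_cv_pred <- predict(hwy_rpart, newdata = data_test)
rf_cv_pred    <- predict(hwy_rf, newdata = data_test)
# Calculate prediction accuracies
postResample(pred = lasso_pred, obs = criterion_test)
postResample(pred = rpart_cv_pred, obs = criterion_test)
postResample(pred = rf_cv_pred, obs = criterion_test)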
File | Rows | Columns |
---|---|---|
college.csv | 777 | 18 |
The college data are taken from the College dataset in the ISLR package. They contain statistics for a large number of US Colleges from the 1995 issue of US News and World Report.
Name | Description |
---|---|
Private | A factor with levels No and Yes indicating private or public university. |
Apps | Number of applications received. |
Accept | Number of applications accepted. |
Enroll | Number of new students enrolled. |
Top10perc | Pct. new students from top 10% of H.S. class. |
Top25perc | Pct. new students from top 25% of H.S. class. |
F.Undergrad | Number of fulltime undergraduates. |
P.Undergrad | Number of parttime undergraduates. |
Outstate | Out-of-state tuition. |
Room.Board | Room and board costs. |
Books | Estimated book costs. |
Personal | Estimated personal spending. |
PhD | Pct. of faculty with Ph.D.’s. |
Terminal | Pct. of faculty with terminal degree. |
S.F.Ratio | Student/faculty ratio. |
perc.alumni | Pct. alumni who donate. |
Expend | Instructional expenditure per student. |
Grad.Rate | Graduation rate. |
Package | Installation |
---|---|
tidyverse | install.packages("tidyverse") |
caret | install.packages("caret") |
partykit | install.packages("partykit") |
party | install.packages("party") |
Function | Package | Description |
---|---|---|
trainControl() | caret | Define modelling control parameters |
train() | caret | Train a model |
predict(object, newdata) | stats | Predict the criterion values of newdata based on object |
postResample() | caret | Calculate aggregate model performance in regression tasks |
confusionMatrix() | caret | Calculate aggregate model performance in classification tasks |