In this practical you’ll practice “data wrangling” with the dplyr
and tidyr
packages.
By the end of this practical you will know how to:
Open your dataanalytics
R project. It should already have the folders 1_Data
and 2_Code
. Make sure that the data files listed in the Datasets
section above are in your 1_Data
folder
Open a new R script. At the top of the script, using comments, write your name and the date. Save it as a new file called wrangling_practical.R
in the 2_Code
folder.
Using library()
load the tidyverse
package (if you don’t have it, you’ll need to install it with install.packages()
)!
# Load packages necessary for this script
library(tidyverse)
trial_act.csv
data into R and store it as a new object called trial_act
# Load trial_act.csv from the data folder in your
trial_act <- read_csv(file = "XX/XX")
trial_act_demo_fake.csv
data as a new dataframe called trial_act_demo
.# Load trial_act_demo_fake.csv from the data folder in your
XX <- XX(XX = "XX/XX")
# Print trial_act object
trial_act
# Print trial_act_demo object
trial_act_demo
trial_act
data with names(XXX)
names(XXX)
wtkg
. Using rename()
, change the name of this column to weight_kg
. Be sure to assign the result back to trial_act
to change it!# Change the name to weight_kg from wtkg
trial_act <- trial_act %>%
rename(XX = XX) # NEW = OLD
Look at the names of your trial_act
dataframe again using names()
, do you now see the column weight_kg
?
One of the columns is called age
, change it to age_y
(to specify that age is in years).
select()
function, select only the column age_y
and print the result (but don’t assign it to anything). Do you see only the age_y
column now?XX %>%
select(XX)
select()
select the columns pidnum
, age_y
, gender
, and weight_kg
(but don’t assign the result to anything)XX %>%
select(XX, XX, XX, XX)
trial_act_anon_
that does not contain the columns pidnum
and age_y
. Create this dataframe by selecting all columns in trial_act
except pidnum
and age_y
(hint: use the notation select(-XXX, -XXX))
to select everything except specified columns). Assign the result to a new object called trial_act_anon
XX <- XX %>%
select(-XX, -XX)
CD4_wide
which contains the following columns: pidnum
, arms
, cd40
, cd420
, and cd496
. The cd40
, cd420
, and cd496
columns show patient’s CD4 T cell counts at baseline, 20 weeks, and 96 weeks. After you create CD4_wide
, print it to make sure it worked!XX <- trial_act %>%
select(XX, XX, XX, XX, XX)
starts_with()
? Try adapting the following code to get the same result you got before.CD4_wide <- trial_act %>%
select(pidnum, arms, starts_with("XXX"))
mutate()
function, add the column age_m
which shows each patient’s age in months instead of years (Hint: Just multiply age_y
by 12!)trial_act <- trial_act %>%
mutate(XX = XX * 12)
mutate
, add the following new columns to trial_act
. (Try combining these into one call to the mutate()
function!)weight_lb
: Weight in lbs instead of kilograms. You can do this by multiplying weight_kg
by 2.2.cd_change_20
: Change in CD4 T cell count from baseline to 20 weeks. You can do this by taking cd420 - cd40
cd_change_960
: Change in CD4 T cell count from baseline to 96 weeks. You can do this by taking cd496 - cd40
XXX <- XXX %>%
mutate(weight_lb = XXX,
cd_change_20 = XXX,
XXX = XXX)
gender
column, you will see that it is numeric (0s and 1s). Using the mutate()
and case_when()
functions, create a new column called gender_char
which shows the gender as a string, where 0 = “female” and 1 = “male”:# Create gender_char which shows gender as a string
trial_act <- trial_act %>%
mutate(
gender_char = case_when(
gender == XX ~ "XX",
gender == XX ~ "XX"))
arms
is also numeric and not very meaningful. Create a new column arms_char
contains the names of the arms. Here is a table of the mappingarms | arms_char |
---|---|
0 | zidovudine |
1 | zidovudine and didanosine |
2 | zidovudine and zalcitabine |
3 | didanosine |
mutate()
. That is, in one block of code, create age_m
, weight_lb
, cd_change_20
, cd_change_960
and gender_char
and arms_char
using the mutate()
function only once. Here’s how your code should look:trial_act <- trial_act %>%
mutate(
age_m = XXX,
weight_lb = XXX,
cd_change_20 = XXX,
cd_change_960 = XXX,
gender_char = XXX,
arms_char = XXX
)
arrange()
function, arrange the trial_act
data in ascending order of age_y
(from lowest to highest). After you do, print the data to make sure it worked!trial_act <- trial_act %>%
arrange(XXX)
age_y
(from highest to lowest). After, look the data to make sure it worked. To arrange data in descending order, just include desc()
around the variabletrial_act <- trial_act %>%
arrange(desc(XXX))
arrange()
. Now sort the data by arms (arms
) and then age_y (age_y
). Print the result to make sure it looks right!trial_act <- trial_act %>%
arrange(XXX, XXX)
filter()
filter()
function, filter only the rows from males but don’t save the result (Hint: gender_char == "male"
)trial_act %>%
filter(XXX == "XXX")
Create a new dataframe called trial_act_male
that only contains rows from males (hint: just assign what you did in the previous question to a new object!). After you create the object, print it to make sure it looks right.
Filter only rows for patients under the age of 60 and save the result to trial_act_young
. After you create it, print the object to make sure it looks right.
A colleague of yours named Tracy wants a dataframe only containing data from females over the age of 40. Create this dataframe with filter()
and call it trial_act_tracy
. After you create the object, print it to make sure it looks right.
trial_act_tracy <- XXX %>%
filter(XXX > XXX & XXX == XXX)
left_join()
The trial_act_demo.csv
file contains additional (fictional) demographic data about the patients, namely the number of days of exercise they get per week, and their highest level of education. Our goal is to add the demographic information to our trial_act
data.
In order to combine the two dataframes, we need to find one ‘key’ column that we can use to match rows. Look at both the trial_act
and trial_act_demo
dataframes. Which column can we use as the ‘key’ column?
Use the left_join()
function to combine the trial_act
and trial_act_demo
datasets, set the by
argument to the name of the key column that is common in both data sets. Assign the result to trial_act
.
trial_act <- trial_act %>%
left_join(XX, by = "XX")
trial_act
dataframe. Do you now see the demographic data?Remember the CD4_wide
dataframe you created before? Currently it is in the wide format, where each row is a patient, where key data (different CD4 T cell counts) are in different columns like this:
# Data is in a 'wide' format
CD4_wide
# A tibble: 2,139 x 5
pidnum arms cd40 cd420 cd496
<dbl> <dbl> <dbl> <dbl> <dbl>
1 10056 2 422 477 660
2 10059 3 162 218 NA
3 10089 3 326 274 122
4 10093 3 287 394 NA
5 10124 0 504 353 660
6 10140 1 235 339 264
7 10165 0 244 225 106
8 10190 0 401 366 453
9 10198 3 214 107 8
10 10229 0 221 132 NA
# … with 2,129 more rows
Our goal is to convert this data to a ‘long’ format, where each row represents a single CD4 T cell count for a specific patient, like this:
# This is the same data in 'long' format
CD4_long
# A tibble: 6,417 x 4
pidnum arms name value
<dbl> <dbl> <chr> <dbl>
1 10056 2 cd40 422
2 10056 2 cd420 477
3 10056 2 cd496 660
4 10059 3 cd40 162
5 10059 3 cd420 218
6 10059 3 cd496 NA
7 10089 3 cd40 326
8 10089 3 cd420 274
9 10089 3 cd496 122
10 10093 3 cd40 287
# … with 6,407 more rows
pivot_longer()
function, create a new dataframe called CD4_long
that shows the CD4_wide
data in the ‘long’ format. To do this, use the following template.CD4_long <- CD4_wide %>%
pivot_longer(starts_with("XX"))
Print your CD4_long
dataframe! Do you now see that each row is a specific observation for a patient?
Now use the pivot_wider()
function to bring the data back into the wide format!
CD4_long %>%
pivot_wider(names_from = XX,
values_from = XX)
group_by()
- summarize()
idiom to the CD4_wide
dataset to calculate the average (mean()
) CD4 T cell counts for the various treament arms
.CD4_wide %>%
group_by(XX) %>%
summarize(XX = XX,
XX = XX,
XX = XX)
Handle the missing values by including na.rm = TRUE
in the mean()
functions.
Use n()
to add another summary showing the number of cases per arm.
Now try to calculate the same values, but using the CD4_long
dataset. How do you need to adjust the code?
Finally, use pivot_wider
to change the format of the summarized results to the one obtained from CD4_wide
.
File | Rows | Columns |
---|---|---|
trial_act.csv | 2139 | 27 |
trial_act_demo_fake.csv | 2139 | 3 |
First 5 rows of trial_act.csv
pidnum | age | wtkg | hemo | homo | drugs | karnof | oprior | z30 | zprior | preanti | race | gender | str2 | strat | symptom | treat | offtrt | cd40 | cd420 | cd496 | r | cd80 | cd820 | cens | days | arms |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10056 | 48 | 89.8 | 0 | 0 | 0 | 100 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 422 | 477 | 660 | 1 | 566 | 324 | 0 | 948 | 2 |
10059 | 61 | 49.4 | 0 | 0 | 0 | 90 | 0 | 1 | 1 | 895 | 0 | 0 | 1 | 3 | 0 | 1 | 0 | 162 | 218 | NA | 0 | 392 | 564 | 1 | 1002 | 3 |
10089 | 45 | 88.5 | 0 | 1 | 1 | 90 | 0 | 1 | 1 | 707 | 0 | 1 | 1 | 3 | 0 | 1 | 1 | 326 | 274 | 122 | 1 | 2063 | 1893 | 0 | 961 | 3 |
10093 | 47 | 85.3 | 0 | 1 | 0 | 100 | 0 | 1 | 1 | 1399 | 0 | 1 | 1 | 3 | 0 | 1 | 0 | 287 | 394 | NA | 0 | 1590 | 966 | 0 | 1166 | 3 |
10124 | 43 | 66.7 | 0 | 1 | 0 | 100 | 0 | 1 | 1 | 1352 | 0 | 1 | 1 | 3 | 0 | 0 | 0 | 504 | 353 | 660 | 1 | 870 | 782 | 0 | 1090 | 0 |
The trial_act.csv
data set contains a randomized clinical trial to compare monotherapy with zidovudine or didanosine with combination therapy with zidovudine and didanosine or zidovudine and zalcitabine in adults infected with the human immunodeficiency virus type I whose CD4 T cell counts were between 200 and 500 per cubic millimeter.
Name | Description |
---|---|
pidnum | patient’s ID number |
age | age in years at baseline |
wtkg | weight in kg at baseline |
hemo | hemophilia (0=no, 1=yes) |
homo | homosexual activity (0=no, 1=yes) |
drugs | history of intravenous drug use (0=no, 1=yes) |
karnof | Karnofsky score (on a scale of 0-100) |
oprior | non-zidovudine antiretroviral therapy prior to initiation of study treatment (0=no, 1=yes) |
z30 | zidovudine use in the 30 days prior to treatment initiation (0=no, 1=yes) |
zprior | zidovudine use prior to treatment initiation (0=no, 1=yes) |
preanti | number of days of previously received antiretroviral therapy |
race | race (0=white, 1=non-white) |
gender | gender (0=female, 1=male) |
str2 | antiretroviral history (0=naive, 1=experienced) |
strat | antiretroviral history stratification (1=’antiretroviral naive’, 2=’> 1 but ≤ 52 weeks of prior antiretroviral therapy’, 3=’> 52 weeks’) |
symptom | symptomatic indicator (0=asymptomatic, 1=symptomatic) |
treat | treatment indicator (0=zidovudine only, 1=other therapies) |
offtrt | indicator of off-treatment before 96±5 weeks (0=no,1=yes) |
cd40 | CD4 T cell count at baseline |
cd420 | CD4 T cell count at 20±5 weeks |
cd496 | CD4 T cell count at 96±5 weeks (=NA if missing) |
r | missing CD4 T cell count at 96±5 weeks (0=missing, 1=observed) |
cd80 | CD8 T cell count at baseline |
cd820 | CD8 T cell count at 20±5 weeks |
cens | indicator of observing the event in days |
days | number of days until the first occurrence of: (i) a decline in CD4 T cell count of at least 50 (ii) an event indicating progression to AIDS, or (iii) death. |
arms | treatment arm (0=zidovudine, 1=zidovudine and didanosine, 2=zidovudine and zalcitabine, 3=didanosine). |
First 5 rows of trial_act_demo_fake.csv
pidnum | exercise | education |
---|---|---|
10931 | 4 | HS |
11971 | 1 | <HS |
330244 | 1 | HS |
270879 | 1 | BA |
11435 | 0 | <HS |
The trial_act_demo_fake.csv
data set contains fake demogrpahic information corresponding to the patients in trial_act.csv
Name | Description |
---|---|
pidnum | patient’s ID number |
exercise | the number of days per week that the patient exercises |
education | the patient’s education level |
Package | Installation |
---|---|
tidyverse |
install.packages("tidyverse") |
Function | Package | Description |
---|---|---|
rename() |
dplyr |
Rename columns |
select() |
dplyr |
Select columns based on name or index |
filter() |
dplyr |
Select rows based on some logical criteria |
arrange() |
dplyr |
Sort rows |
mutate() |
dplyr |
Add new columns |
case_when() |
dplyr |
Recode values of a column |
group_by(), summarise() |
dplyr |
Group data and then calculate summary statistics |
left_join() |
dplyr |
Combine multiple data sets using a key column |
spread() |
tidyr |
Convert long data to wide format - from rows to columns |
gather() |
tidyr |
Convert wide data to long format - from columns to rows |
dplyr
vignetteSee https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html for the full dplyr vignette with lots of wrangling tips and tricks.
from R Studio