class: center, middle, inverse, title-slide # Wrangling ###
Data Analytics for Psychology and Business
### February 2019 --- layout: true <div class="my-footer"> <span style="text-align:center"> <span> <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/by-sa.png" height=14 style="vertical-align: middle"/> </span> <a href="https://cdsbasel.github.io/dataanalytics/"> <span style="padding-left:82px"> <font color="#7E7E7E"> cdsbasel.github.io/dataanalytics/ </font> </span> </a> <a href="https://cdsbasel.github.io/dataanalytics/"> <font color="#7E7E7E"> Data Analytics for Psychology and Business | February 2019 </font> </a> </span> </div> --- # Datentypen ausserhalb von R <table class="tg" style="cellspacing:0; cellpadding:0; border:none;" width="95%"> <col width=30%> <col width=30%> <col width=30%> <tr> <td bgcolor = 'white' style='vertical-align:top'> <ul> <li class="m1"><span><high>Structured data</high> <ul class="level"> <li><span>Delimiter separated: <mono>.csv</mono>, <mono>.txt</mono>, etc.</span></li> <li><span>Relational databases: <mono>SQL</mono></span></li> </ul> <br><img src="image/structured.png" height=250px> </span></li> </ul> </td> <td bgcolor = 'white' style='vertical-align:top'> <ul> <li class="m2"><span><high>Semi-structured data</high> <ul class="level"> <li><span>Markup: <mono>.xml</mono>, <mono>.xls</mono>, <mono>.html</mono> etc.</span></li> <li><span>Non markup: <mono>JSON</mono>, <mono>MongoDB</mono></span></li> </ul> <br><img src="image/html.png" height=250px> </span></li> </ul> </td> <td bgcolor = 'white' style='vertical-align:top'> <ul> <li class="m3"><span><high>Unstructued data</high> <ul class="level"> <li><span>z.B. Text</span></li> </ul> <br><br><br><br><br2><img src="image/text.png" height=250px> </span></li> </ul> </td> </tr> </table> --- # Delimiter separated data .pull-left45[ <ul> <li class="m1"><span><high>Delimiter</high> separates columns.</span></li> <li class="m2"><span>Typically available as <high>local text file</high>.</span></li> <li class="m3"><span><high>data types</high> are inferred.</span></li> </ul> <br> <p align="center"> <img src="image/readr.png" height=200> </p> ] .pull-right45[ <p align="center"> <img src="image/structured.png" height=400> </p> ] --- # Delimiter separated data .pull-left45[ <ul> <li class="m1"><span><high>Delimiter</high> separates columns.</span></li> <li class="m2"><span>Typically available as <high>local text file</high>.</span></li> <li class="m3"><span><high>data types</high> are inferred.</span></li> </ul> <br> <p align="center"> <img src="image/readr.png" height=200> </p> ] .pull-right45[ ```r # Lese Basel Datensatz ein basel <- read_csv("1_Data/basel.csv") # Benutze expliziten Delimiter basel <- read_delim("1_Data/basel.csv", delim = ",") basel ``` ``` ## # A tibble: 10,000 x 20 ## id geschlecht alter groesse ## <dbl> <chr> <dbl> <dbl> ## 1 1 f 87 165 ## 2 2 m 54 175. ## 3 3 f 34 147. ## 4 4 m 31 166. ## 5 5 m 24 180. ## # … with 9,995 more rows, and 16 ## # more variables ``` ] --- # Identify file path .pull-left45[ <ul> <li class="m1"><span>Identify file path using RStudio's <high>auto-complete</high>.</span></li> <li class="m2"><span>Place cursor between quotation marks and press <highm>tab</highm>.</span></li> </ul> <p align="center"> <img src="image/tab.png" height="150px"></img> </p> ] .pull-right45[ <p align="center"> <img src="image/auto_complete_1.png"></img> </p> ] --- # Den Filepath finden .pull-left45[ <ul> <li class="m1"><span>Identify file path using RStudio's <high>auto-complete</high>.</span></li> <li class="m2"><span>Place cursor between quotation marks and press <highm>tab</highm>.</span></li> </ul> <p align="center"> <img src="image/tab.png" height="150px"></img> </p> ] .pull-right45[ <p align="center"> <img src="image/auto_complete_2.png"></img> </p> ] --- # Inferred data types .pull-left45[ ```r # Read basel data set basel <- read_csv("1_Data/basel.csv") ``` ``` ## Parsed with column specification: ## cols( ## .default = col_double(), ## geschlecht = col_character(), ## bildung = col_character(), ## konfession = col_character(), ## fasnacht = col_character(), ## sehhilfe = col_character() ## ) ``` ``` ## See spec(...) for full column specifications. ``` ] .pull-right45[ <img src="image/structured.png"></image> ] --- # Inferred data types .pull-left45[ <ul> <li class="m1"><span>Sometimes <mono>readr</mono> needs a little help to <high>correctly identify data types</high>.</span></li> </ul> ```r # Explicitly define missing values basel <- read_csv("1_Data/basel.csv", na = c('NA')) # Re-infer data types basel <- type_convert(basel) ``` ] .pull-right45[ <img src="image/structured.png"></image> ] --- # Semi-structured data <font style="font-size: 16px">mit <a href="https://github.com/r-lib/xml2"><mono>xml2</mono></a> und <a href="https://github.com/hadley/rvest"><mono>rvest</mono></a></font> ```r # Load table from Wikipedia (with xml2 and rvest) read_html("https://en.wikipedia.org/wiki/R_(programming_language)") %>% html_node(xpath = '//*[@id="mw-content-text"]/div/table[2]') %>% html_table() %>% as_tibble() ``` ``` ## # A tibble: 17 x 3 ## Release Date Description ## <chr> <chr> <chr> ## 1 0.16 "" "This is the last alpha version developed primarily by Ihaka and G… ## 2 0.49 "1997-04-2… "This is the oldest source release which is currently available on… ## 3 0.60 "1997-12-0… "R becomes an official part of the GNU Project. The code is hosted… ## 4 0.65.1 "1999-10-0… "First versions of update.packages and install.packages functions … ## 5 1.0 "2000-02-2… "Considered by its developers stable enough for production use.[50… ## 6 1.4 "2001-12-1… "S4 methods are introduced and the first version for Mac OS X is m… ## 7 1.8 "2003-10-0… "Introduced a flexible condition handling mechanism for signalling… ## 8 2.0 "2004-10-0… "Introduced lazy loading, which enables fast loading of data with … ## 9 2.1 "2005-04-1… "Support for UTF-8 encoding, and the beginnings of internationaliz… ## 10 2.11 "2010-04-2… "Support for Windows 64 bit systems." ## # … with 7 more rows ``` --- # Other data <font style="font-size: 16px">siehe <a href="https://cran.r-project.org/web/packages/rio/vignettes/rio.html">rio</a></font> .pull-left45[ ### `readr` <img src="image/readr.png" width="50", align="right"> ```r # read fixed width files (can be fast) data <- read_fwf(file, ...) # read Apache style log files data <- read_log(file, ...) ``` ### `haven` <img src="image/haven.png" width="50" align="right"> ```r # read SAS's .sas7bat and sas7bcat files data <- read_sas(file, ...) # read SPSS's .sav files data <- read_sav(file, ...) # etc ``` ] .pull-right45[ ### `readxl` <img src="image/readxl.png" width="50" align="right"> ```r # read Excel's .xls and xlsx files data <- read_excel(file, ...) ``` <br> ### Other ```r # Read Matlab .mat files data <- R.matlab::readMat(file, ...) # Read and wrangle .xml and .html data <- XML::xmlParseParse(file, ...) # from package jsonlite: read .json files data <- jsonlite::read_json(file, ...) ``` ] --- # What is Wrangling? <p align = "center"> <img src="image/wrangling.jpeg" height=450px><br> <font style="font-size:10px">from <a href="https://datasciencebe.com/tag/data-wrangling/">datasciencebe.com</a></font> </p> --- .pull-left45[ # This is Wrangling! <ul> <li class="m1"><span><high>Transform</high> <br><br> <ul class="level"> <li><span>Rename columns names</span></li> <li><span>Create new variables</span></li> </ul></span></li> <li class="m2"><span><high>Organize</high> <br><br> <ul class="level"> <li><span>Sort</span></li> <li><span>Join data sets</span></li> <li><span>Flip columns and rows</span></li> </ul></span></li> <li class="m3"><span><high>Aggregate</high> <br><br> <ul class="level"> <li><span>Create groups</span></li> <li><span>Calculate statistics for groups</span></li> </ul></span></li> </ul> ] .pull-right5[ <br> <p align="center"> <img src="image/wrangling.png" height = "530px"> </p> ] --- .pull-left4[ # 2 'dirty' data sets <ul> <li class="m1"><span><high>Rename</high>: Add intuitive column names</span></li> <li class="m2"><span><high>Recode</high>: Change to appropriate units.</span></li> <li class="m3"><span><high>Join</high>: Join datasets.</span></li> <li class="m4"><span><high>Sort</high>: Sort datasets.</span></li> <li class="m5"><span><high>Filter</high>: Select relevant cases.</span></li> <li class="m6"><span><high>Select</high>: Select relevant variables.</span></li> </ul> ] .pull-right45[ <br> ```r patients ``` ``` ## # A tibble: 5 x 3 ## id X1 X2 ## <dbl> <dbl> <dbl> ## 1 1 37 1 ## 2 2 65 2 ## 3 3 57 2 ## 4 4 34 1 ## 5 5 45 2 ``` ```r results ``` ``` ## # A tibble: 5 x 3 ## id t_1 t_2 ## <dbl> <dbl> <dbl> ## 1 4 100 105 ## 2 92 134 150 ## 3 1 123 135 ## 4 2 143 140 ## 5 99 102 68 ``` ] --- # The mighty `tidyverse` The [`tidyverse`](https://www.tidyverse.org/) is a collection of high-performing, user-friendly R packages, created explicitly for efficient data analytics. 1. `ggplot2` for graphics 2. <high><mono>dplyr</mono> for data wrangling</high>. 3. <high><mono>tidyr</mono> for data wrangling</high>. 4. `readr` for data I/O. 5. `purrr` for function programming. 6. `tibble` for modern `data.frame`s. <br><br> <table style="cellspacing:0; cellpadding:0; border:none;"> <col width="15%"> <col width="15%"> <col width="15%"> <col width="15%"> <col width="15%"> <col width="15%"> <tr> <td bgcolor="white"> <img src="image/hex-ggplot2.png" height=160px style="opacity:.2"></img> </td> <td bgcolor="white"> <img src="image/hex-dplyr.png"height=160px></img> </td> <td bgcolor="white"> <img src="image/hex-tidyr.png"height=160px></img> </td> <td bgcolor="white"> <img src="image/hex-readr.png"height=160px style="opacity:.2"></img> </td> <td bgcolor="white"> <img src="image/hex-purrr.png"height=160px style="opacity:.2"></img> </td> <td bgcolor="white"> <img src="image/hex-tibble.png"height=160px style="opacity:.2"></img> </td> </tr> </table> --- # <mono>`%>%`</mono> .pull-left45[ <ul> <li class="m1"><span>The preferred way of using `dplyr` includes a <high>novel operator</high>, the pipe <highm>%>%</highm>.</span></li> </ul> <br> <p align="center"> <img src="image/pipe.jpg" width = "300px"><br> <font style="font-size:10px">from <a href="https://upload.wikimedia.org/wikipedia/en/thumb/b/b9/MagrittePipe.jpg">wikimedia.org</a></font> </p> ] .pull-right45[ ```r # Numerical vector score <- c(8, 4, 6, 3, 7, 3) score ``` ``` ## [1] 8 4 6 3 7 3 ``` ```r # mean: Base-R-style mean(score) ``` ``` ## [1] 5.167 ``` ```r # mean: dplyr-style score %>% mean() ``` ``` ## [1] 5.167 ``` ] --- # `%>%` .pull-left45[ <ul> <li class="m1"><span>The preferred way of using `dplyr` includes a <high>novel operator</high>, the pipe <highm>%>%</highm>.</span></li> </ul> <br> <p align="center"> <img src="image/pipe.jpg" width = "300px"><br> <font style="font-size:10px">from <a href="https://upload.wikimedia.org/wikipedia/en/thumb/b/b9/MagrittePipe.jpg">wikimedia.org</a></font> </p> ] .pull-right45[ <p align="center"> <img src="image/pipe.png" height = "400px"> </p> ] --- .pull-left4[ # Transform <ul> <li class="m1"><span><high>Rename</high>: Choose intuitive column names. <br><br> <ul class="level"> <li><span><mono>rename()</mono></span></li> </ul> </span></li> <li class="m2"><span><high>Recode</high>: Choose appropriate units and labels. <br><br> <ul class="level"> <li><span><mono>mutate()</mono></span></li> <li><span><mono>case_when()</mono></span></li> </ul> </span></li> <li class="m3"><span><high>Join</high>: Join datasets <br><br> <ul class="level"> <li><span><mono>left_join()</mono></span></li> </ul> </span></li> </ul> ] .pull-right45[ <br> ```r patients ``` ``` ## # A tibble: 5 x 3 ## id X1 X2 ## <dbl> <dbl> <dbl> ## 1 1 37 1 ## 2 2 65 2 ## 3 3 57 2 ## 4 4 34 1 ## 5 5 45 2 ``` ```r results ``` ``` ## # A tibble: 5 x 3 ## id t_1 t_2 ## <dbl> <dbl> <dbl> ## 1 4 100 105 ## 2 92 134 150 ## 3 1 123 135 ## 4 2 143 140 ## 5 99 102 68 ``` ] --- # `rename()` .pull-left45[ ```r patients %>% rename(NEW = OLD, NEW = OLD) ``` ] .pull-right45[ ```r # Starte mit Datensatz patients %>% # Ändere Spaltennamen rename(age = X1, condition = X2) ``` ``` ## # A tibble: 5 x 3 ## id age condition ## <dbl> <dbl> <dbl> ## 1 1 37 1 ## 2 2 65 2 ## 3 3 57 2 ## 4 4 34 1 ## 5 5 45 2 ``` ] --- # `mutate()` .pull-left45[ ```r tibble %>% mutate( NAME1 = DEFINITION1, NAME2 = DEFINITION2, NAME3 = DEFINITION3, ... ) ``` ] .pull-right45[ ```r patients %>% rename(age = X1, condition = X2) %>% # Kreiere neue Variablen mutate(monate = age * 12, dekaden = age / 10) ``` ``` ## # A tibble: 5 x 5 ## id age condition monate dekaden ## <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1 37 1 444 3.7 ## 2 2 65 2 780 6.5 ## 3 3 57 2 684 5.7 ## 4 4 34 1 408 3.4 ## 5 5 45 2 540 4.5 ``` ] --- # `case_when()` .pull-left45[ ```r tibble %>% mutate( NAME = case_when( LOGICAL1 ~ VALUE1, LOGICAL2 ~ VALUE2, ... ) ) ``` ] .pull-right45[ ```r patients %>% rename(age = X1, condition = X2) %>% # Create cond_label from condition mutate(cond_label = case_when( condition == 1 ~ "placebo", condition == 2 ~ "medication")) ``` ``` ## # A tibble: 5 x 4 ## id age condition cond_label ## <dbl> <dbl> <dbl> <chr> ## 1 1 37 1 placebo ## 2 2 65 2 medication ## 3 3 57 2 medication ## 4 4 34 1 placebo ## 5 5 45 2 medication ``` ] --- # Join datasets <p align="center"> <img src="image/joining_data.png" height="450px"> </p> --- # `inner_join()` .pull-left4[ ```r TIBBLE1 %>% inner_join(TIBBLE2, by = c("KEY")) ``` ] .pull-right5[ ```r patients %>% rename(age = X1, condition = X2) %>% mutate(cond_label = case_when( condition == 1 ~ "placebo", condition == 2 ~ "medication")) %>% # Verbinde mit ergebnisse inner_join(ergebnisse, by = "id") ``` ``` ## Error in tbl_vars_dispatch(x): object 'ergebnisse' not found ``` ] --- # `left_join()` .pull-left4[ ```r TIBBLE1 %>% left_join(TIBBLE2, by = c("KEY")) ``` ] .pull-right5[ ```r patients %>% rename(age = X1, condition = X2) %>% mutate(cond_label = case_when( condition == 1 ~ "placebo", condition == 2 ~ "medication")) %>% # Verbinde mit ergebnisse left_join(ergebnisse, by = "id") ``` ``` ## Error in tbl_vars_dispatch(x): object 'ergebnisse' not found ``` ] --- # Organize .pull-left4[ <ul> <li class="m4"><span><high>Sortieren</high>: Sort data. <br><br> <ul class="level"> <li><span><mono>arrange()</mono></span></li> </ul> </span></li> <li class="m5"><span><high>Filtern</high>: Select relevant cases. <br><br> <ul class="level"> <li><span><mono>slice()</mono></span></li> <li><span><mono>filter()</mono></span></li> </ul> </span></li> <li class="m6"><span><high>Auswählen</high>: Select relevant variables. <br><br> <ul class="level"> <li><span><mono>select()</mono></span></li> </ul> </span></li> </ul> ] .pull-right55[ ```r # Joined tibble patient_results ``` ``` ## # A tibble: 5 x 6 ## id age condition cond_label t_1 t_2 ## <dbl> <dbl> <dbl> <chr> <dbl> <dbl> ## 1 1 37 1 placebo 123 135 ## 2 2 65 2 medication 143 140 ## 3 3 57 2 medication NA NA ## 4 4 34 1 placebo 100 105 ## 5 5 45 2 medication NA NA ``` ] --- # `arrange()` .pull-left4[ ```r # Sort ascending tibble %>% arrange(VAR1, VAR2) # Sort descending (with desc()) tibble %>% arrange(desc(VAR1), VAR2) ``` ] .pull-right5[ ```r patient_results %>% # Sort according to condition arrange(condition) ``` ``` ## # A tibble: 5 x 6 ## id age condition cond_label t_1 t_2 ## <dbl> <dbl> <dbl> <chr> <dbl> <dbl> ## 1 1 37 1 placebo 123 135 ## 2 4 34 1 placebo 100 105 ## 3 2 65 2 medication 143 140 ## 4 3 57 2 medication NA NA ## 5 5 45 2 medication NA NA ``` ] --- # `arrange()` .pull-left4[ ```r # Sort ascending tibble %>% arrange(VAR1, VAR2) # Sort descending (with desc()) tibble %>% arrange(desc(VAR1), VAR2) ``` ] .pull-right5[ ```r patient_results %>% # Sort according to both arrange(condition, age) ``` ``` ## # A tibble: 5 x 6 ## id age condition cond_label t_1 t_2 ## <dbl> <dbl> <dbl> <chr> <dbl> <dbl> ## 1 4 34 1 placebo 100 105 ## 2 1 37 1 placebo 123 135 ## 3 5 45 2 medication NA NA ## 4 3 57 2 medication NA NA ## 5 2 65 2 medication 143 140 ``` ] --- # `slice()` .pull-left4[ ```r # Slice with sequence patient_results %>% slice(INDEX_START:INDEX_STOP) # Slice with vector patient_results %>% slice(c(INDEX1, INDEX2, ...)) ``` ] .pull-right5[ ```r patient_results %>% arrange(condition, age) %>% # Lines 3 and 5 slice(c(3, 5)) ``` ``` ## # A tibble: 2 x 6 ## id age condition cond_label t_1 t_2 ## <dbl> <dbl> <dbl> <chr> <dbl> <dbl> ## 1 5 45 2 medication NA NA ## 2 2 65 2 medication 143 140 ``` ] --- # `slice()` .pull-left4[ ```r # Slice with sequence patient_results %>% slice(INDEX_START:INDEX_STOP) # Slice with vector patient_results %>% slice(c(INDEX1, INDEX2, ...)) ``` ] .pull-right5[ ```r patient_results %>% arrange(condition, age) %>% # First 4 rows slice(1:4) ``` ``` ## # A tibble: 4 x 6 ## id age condition cond_label t_1 t_2 ## <dbl> <dbl> <dbl> <chr> <dbl> <dbl> ## 1 4 34 1 placebo 100 105 ## 2 1 37 1 placebo 123 135 ## 3 5 45 2 medication NA NA ## 4 3 57 2 medication NA NA ``` ] --- # `filter()` .pull-left4[ ```r patient_results %>% filter(VAR1 == VALUE1, VAR2 > VALUE2, VAR3 < VALUE3, VAR4 == VALUE4 | VAR5 < VALUE5) ``` ] .pull-right5[ ```r patient_results %>% # Patients with age > 35 filter(age > 35) ``` ``` ## # A tibble: 4 x 6 ## id age condition cond_label t_1 t_2 ## <dbl> <dbl> <dbl> <chr> <dbl> <dbl> ## 1 1 37 1 placebo 123 135 ## 2 2 65 2 medication 143 140 ## 3 3 57 2 medication NA NA ## 4 5 45 2 medication NA NA ``` ] --- # `filter()` .pull-left4[ ```r patient_results %>% filter(VAR1 == VALUE1, VAR2 > VALUE2, VAR3 < VALUE3, VAR4 == VALUE4 | VAR5 < VALUE5) ``` ] .pull-right5[ ```r # Age larger 35 & cond_label is medication patient_results %>% filter(age > 35, cond_label == "medication") ``` ``` ## # A tibble: 3 x 6 ## id age condition cond_label t_1 t_2 ## <dbl> <dbl> <dbl> <chr> <dbl> <dbl> ## 1 2 65 2 medication 143 140 ## 2 3 57 2 medication NA NA ## 3 5 45 2 medication NA NA ``` ] --- # `select()` .pull-left4[ ```r # Select variables tibble %>% select(VAR1, VAR2) # Select all but one variable tibble %>% select(-VAR1) ``` ] .pull-right5[ ```r patient_results %>% # Select id and condition select(id, condition) ``` ``` ## # A tibble: 5 x 2 ## id condition ## <dbl> <dbl> ## 1 1 1 ## 2 2 2 ## 3 3 2 ## 4 4 1 ## 5 5 2 ``` ] --- # `select()` .pull-left4[ ```r # Select variables tibble %>% select(VAR1, VAR2) # Select all but one variable tibble %>% select(-VAR1) ``` ] .pull-right5[ ```r patient_results %>% # Everything but id select(-id) ``` ``` ## # A tibble: 5 x 5 ## age condition cond_label t_1 t_2 ## <dbl> <dbl> <chr> <dbl> <dbl> ## 1 37 1 placebo 123 135 ## 2 65 2 medication 143 140 ## 3 57 2 medication NA NA ## 4 34 1 placebo 100 105 ## 5 45 2 medication NA NA ``` ] --- # `starts_with()` .pull-left4[ ```r # Select variables tibble %>% select(starts_with("PATTERN")) ``` ] .pull-right5[ ```r patient_results %>% # Select variables starting with "t" select(starts_with("t")) ``` ``` ## # A tibble: 5 x 2 ## t_1 t_2 ## <dbl> <dbl> ## 1 123 135 ## 2 143 140 ## 3 NA NA ## 4 100 105 ## 5 NA NA ``` ] --- # `contains()` .pull-left4[ ```r # Select variables tibble %>% select(contains("PATTERN")) ``` ] .pull-right5[ ```r patient_results %>% # Select variables that contain "-" select(contains("_")) ``` ``` ## # A tibble: 5 x 3 ## cond_label t_1 t_2 ## <chr> <dbl> <dbl> ## 1 placebo 123 135 ## 2 medication 143 140 ## 3 medication NA NA ## 4 placebo 100 105 ## 5 medication NA NA ``` ] --- # `pivot_longer()` .pull-left4[ ```r # wide to long tibble %>% pivot_longer(cols = VARS, names_to = NAME1, values_to = NAME2) ``` ] .pull-right5[ ```r # wide to long patient_results %>% filter(cond_label == "placebo") ``` ``` ## # A tibble: 2 x 6 ## id age condition cond_label t_1 t_2 ## <dbl> <dbl> <dbl> <chr> <dbl> <dbl> ## 1 1 37 1 placebo 123 135 ## 2 4 34 1 placebo 100 105 ``` ] --- # `pivot_longer()` .pull-left4[ ```r # wide to long DATA %>% pivot_longer(cols = VARS, names_to = NAME1, values_to = NAME2) ``` ] .pull-right5[ ```r # wide to long patient_results %>% filter(cond_label == "placebo") %>% pivot_longer(cols = c("t_1", "t_2"), names_to = "zeit", values_to = "messung") ``` ``` ## # A tibble: 4 x 6 ## id age condition cond_label zeit messung ## <dbl> <dbl> <dbl> <chr> <chr> <dbl> ## 1 1 37 1 placebo t_1 123 ## 2 1 37 1 placebo t_2 135 ## 3 4 34 1 placebo t_1 100 ## 4 4 34 1 placebo t_2 105 ``` ] --- # `pivot_wider()` .pull-left4[ ```r # long to wide tibble %>% pivot_wider(names_from = VAR1, values_from = VAR2) ``` ] .pull-right5[ ```r # long to wide patient_results_lang ``` ``` ## # A tibble: 4 x 6 ## id age condition cond_label zeit messung ## <dbl> <dbl> <dbl> <chr> <chr> <dbl> ## 1 1 37 1 placebo t_1 123 ## 2 1 37 1 placebo t_2 135 ## 3 4 34 1 placebo t_1 100 ## 4 4 34 1 placebo t_2 105 ``` ] --- # `pivot_wider()` .pull-left4[ ```r # long to wide tibble %>% pivot_wider(names_from = VAR1, values_from = VAR2) ``` ] .pull-right5[ ```r # long to wide patient_results_lang %>% pivot_wider(names_from = "zeit", values_from = "messung") ``` ``` ## # A tibble: 2 x 6 ## id age condition cond_label t_1 t_2 ## <dbl> <dbl> <dbl> <chr> <dbl> <dbl> ## 1 1 37 1 placebo 123 135 ## 2 4 34 1 placebo 100 105 ``` ] --- # Aggregate .pull-left4[ <ul> <li class="m1"><span><high>Aggregate</high> <br><br> <ul class="level"> <li><span><mono>summarise()</mono></span></li> <li><span><mono>summarise_if()</mono></span></li> <li><span><mono>group_by(), summarise()</mono></span></li> <li><span><mono>n(), first(), last(), nth()</mono></span></li> <li><span><mono>pull()</mono></span></li> </ul> </span></li> </ul> ] .pull-right5[ ```r patient_results ``` ``` ## # A tibble: 5 x 6 ## id age condition cond_label t_1 ## <dbl> <dbl> <dbl> <chr> <dbl> ## 1 1 37 1 placebo 123 ## 2 2 65 2 medication 143 ## 3 3 57 2 medication NA ## 4 4 34 1 placebo 100 ## 5 5 45 2 medication NA ## # … with 1 more variable: t_2 <dbl> ``` ] --- # `summarise()` .pull-left4[ ```r tibble %>% summarise( NAME1 = SUMMARY_FUN(VAR1), NAME2 = SUMMARY_FUN(VAR2) ) ``` ] .pull-right5[ ```r patient_results %>% # descriptive stats summarise( mean_age = mean(age), median_t1 = median(t_1, na.rm = TRUE) ) ``` ``` ## # A tibble: 1 x 2 ## mean_age median_t1 ## <dbl> <dbl> ## 1 47.6 123 ``` ] --- # Grouped Aggregation <p align="center"> <img src="image/summarsed_data_diagram.png" height="470px"> </p> --- # `group_by()`, `summarise()` .pull-left4[ ```r tibble %>% group_by(GROUP_VAR) %>% summarise( NAME1 = SUMMARY_FUN(VAR1), NAME2 = SUMMARY_FUN(VAR2) ) ``` ] .pull-right5[ ```r patient_results %>% # Group according to cond group_by(cond_label) %>% # Descriptive stats summarise( mean_age = mean(age), median_t1 = median(t_1, na.rm = TRUE) ) ``` ``` ## # A tibble: 2 x 3 ## cond_label mean_age median_t1 ## <chr> <dbl> <dbl> ## 1 medication 55.7 143 ## 2 placebo 35.5 112. ``` ] --- # `n()` .pull-left4[ ```r tibble %>% group_by(GRUPPEN_VAR) %>% summarise( NAME1 = SUMMARY_FUN(VAR1), NAME2 = SUMMARY_FUN(VAR2) ) ``` ] .pull-right5[ ```r patient_results %>% # Group according to cond group_by(cond_label) %>% # Descriptive stats summarise( N = n() ) ``` ``` ## # A tibble: 2 x 2 ## cond_label N ## <chr> <int> ## 1 medication 3 ## 2 placebo 2 ``` ] --- class: middle, center <h1><a href="https://therbootcamp.github.io/R4DS_2019Feb/_sessions/Wrangling/Wrangling_practical.html">Practical</a></h1>