class: inverse, center, middle
background-image: url("img/tidyverse.png")
background-position: 95% 95%
background-size: 25%

# Using the _tidyverse_

### Maximilian H.K. Hesselbarth

#### University of Michigan (EEB) 2022/10/24

---

# The _tidyverse_

.pull-left[

<img src="img/hexlogos.png" width="75%" style="display: block; margin: auto;" />

.ref[www.tidyverse.org]

]

.pull-right[

<img src="img/tasks.png" width="100%" style="display: block; margin: auto;" />

.ref[Wickham, H., Grolemund, G., 2016. R for Data Science, 1st ed. O’Reilly, Newton (USA).]

]

---

# Tidy data

--

1. Each **variable** forms a **column**.

--

2. Each **observation** forms a **row**.

--

3. Each **value** must have its own **cell**.

.ref[Wickham, H., 2014. Tidy Data. Journal of Statistical Software 59, 1–23. https://doi.org/10.18637/jss.v059.i10]

--

<img src="img/tidydata.png" width="75%" style="display: block; margin: auto;" />

.ref[Illustration from the Openscapes blog Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst]

---

# Tidy data

<img src="img/table.png" width="100%" style="display: block; margin: auto;" />

.ref[Wickham, H., Grolemund, G., 2016. R for Data Science, 1st ed. O’Reilly, Newton (USA).]

---

# Install and load

```r
install.packages("tidyverse")
```

```r
library(tidyverse)
```

```
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.5
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1
## ✔ readr   2.1.3      ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
```

---
background-image: url("img/tidyverse.png")
background-position: 95% 10%
background-size: 25%

# Core packages

`readr` : Read rectangular data

`tibble` : Modern re-imagining of data frames

`stringr` : Functions to work with strings (i.e.
sequences of characters)

`forcats` : Functions to modify factors (i.e. categorical data)

`tidyr` : Functions to tidy/reshape data

`dplyr` : Functions for data manipulation

`purrr` : Functional programming

`ggplot2` : Data visualization

---
background-image: url("img/maggritr.png")
background-position: 95% 10%
background-size: 10%

# Pipe operator `%>%`

--

- `f(x)` is equivalent to `x %>% f`

--

- `f(x, y)` is equivalent to `x %>% f(y)`

--

- `f(y, x)` is equivalent to `x %>% f(y, .)`

--

```r
set.seed(42)
x <- runif(n = 10)

*min(log(sort(x)))
## [1] -2.004953

*x %>% sort() %>% log() %>% min()
## [1] -2.004953

set.seed(42)
*10 %>% runif() %>% sort() %>% log() %>% min()
## [1] -2.004953
```

---

# But, surprise!

--

- Since `R v4.1` there is a **base** pipe: `|>`

--

- **Similar** behavior to the `magrittr` pipe.

--

- First time I am using this as well...

--

```r
set.seed(42)
x <- runif(n = 10)

*min(log(sort(x)))
## [1] -2.004953

*x |> sort() |> log() |> min()
## [1] -2.004953

set.seed(42)
*10 |> runif() |> sort() |> log() |> min()
## [1] -2.004953
```

---
class: inverse

# Palmer penguins dataset

<!-- - `palmerpenguins` dataset -->

<!-- - Nesting observations, penguin size data, and isotope measurements from blood samples for adult Adélie, Chinstrap, and Gentoo penguins. -->

<img src="img/penguins.png" width="65%" style="display: block; margin: auto;" />

.ref[Horst A.M., Hill A.P., Gorman K.B., 2020. palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0.]
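---

# Aside: `%>%` vs. `|>`

A minimal sketch of where the two pipes shown earlier agree and where they differ. It assumes R >= 4.2 (the `_` placeholder of the base pipe is not available in R 4.1) and that `magrittr` is installed. All object names are made up for illustration.

```r
library(magrittr)  # provides %>%; assumed to be installed

x <- c(4, 9, 16)

# Both pipes pass the left-hand side as the first argument
a <- x %>% sqrt() %>% sum()
b <- x |> sqrt() |> sum()

# Placeholder for a non-first argument:
# magrittr uses `.`, the base pipe uses `_` (R >= 4.2, named arguments only)
by_dot        <- 2 %>% seq(from = 10, to = 20, by = .)
by_underscore <- 2 |> seq(from = 10, to = 20, by = _)

# magrittr accepts a bare function name; the base pipe requires a call
d <- x %>% sum
# x |> sum   # would be a syntax error with the base pipe
```

The rest of the deck only pipes into the first argument, so either pipe would work there.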
---
background-image: url("img/readr.png")
background-position: 95% 10%
background-size: 10%

# Read data: _readr_

--

- Reads data automatically into a `tibble`

--

- Different `readr::read_*()` functions for different file formats (e.g. `read_csv()`, `read_tsv()`)

--

- Functions to write data `readr::write_*()`

--

- `readxl` as an alternative for Excel data

--

```r
*df_penguins <- readr::read_csv("data/penguins_raw.csv")
## Rows: 344 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (9): studyName, Species, Region, Island, Stage, Individual ID, Clutch C...
## dbl  (7): Sample Number, Culmen Length (mm), Culmen Depth (mm), Flipper Leng...
## date (1): Date Egg
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

---
background-image: url("img/stringr.png")
background-position: 95% 10%
background-size: 10%

# Work with strings: _stringr_

--

- Allows all sorts of **string manipulations**

--

- Avoids the need for base `grep`-style functions

--

```r
names(df_penguins) <- names(df_penguins) |>
  stringr::str_remove_all(pattern = " ") |>
  stringr::str_remove_all(pattern = "\\([^()]+\\)") |>
  stringr::str_to_lower()

head(df_penguins, n = 3)
```

```
## # A tibble: 3 × 17
##   study…¹ sampl…² species region island stage indiv…³ clutc…⁴ dateegg    culme…⁵
##   <chr>     <dbl> <chr>   <chr>  <chr>  <chr> <chr>   <chr>   <date>       <dbl>
## 1 PAL0708       1 Adelie… Anvers Torge… Adul… N1A1    Yes     2007-11-11    39.1
## 2 PAL0708       2 Adelie… Anvers Torge… Adul… N1A2    Yes     2007-11-11    39.5
## 3 PAL0708       3 Adelie… Anvers Torge… Adul… N2A1    Yes     2007-11-16    40.3
## # … with 7 more variables: culmendepth <dbl>, flipperlength <dbl>,
## #   bodymass <dbl>, sex <chr>, delta15n <dbl>, delta13c <dbl>, comments <chr>,
## #   and abbreviated variable names ¹studyname, ²samplenumber, ³individualid,
## #   ⁴clutchcompletion, ⁵culmenlength
```

---
background-image: url("img/tidyr.png")
background-position: 95% 10%
background-size: 10%

# Tidy data: _tidyr_

--

- `pivot_longer()` & `pivot_wider()` are the most important functions to **reshape** data .yellow[(I will come back to this later!)]

--

- Some functions to deal with `NA` values

```r
df_penguins <- tidyr::drop_na(df_penguins, bodymass)
```

--

- `nest()`/`unnest()` to organize data

```r
tidyr::nest(df_penguins, data = -island)
```

```
## # A tibble: 3 × 2
##   island    data
##   <chr>     <list>
## 1 Torgersen <tibble [51 × 16]>
## 2 Biscoe    <tibble [167 × 16]>
## 3 Dream     <tibble [124 × 16]>
```

---
background-image: url("img/dplyr.png")
background-position: 95% 10%
background-size: 10%

# Wrangle data: _dplyr_

--

- `filter()` to subset **rows**; `select()` to subset **columns**

--

```r
*dplyr::filter(df_penguins, bodymass >= quantile(bodymass, 0.75), sex != "FEMALE") |>
* dplyr::select_if(is.numeric) |>
  head(n = 3)
```

```
## # A tibble: 3 × 7
##   samplenumber culmenlength culmendepth flipperlength bodymass delta15n delta13c
##          <dbl>        <dbl>       <dbl>         <dbl>    <dbl>    <dbl>    <dbl>
## 1          110         43.2        19             197     4775     9.32    -25.5
## 2            2         50          16.3           230     5700     8.15    -25.4
## 3            4         50          15.2           218     5700     8.26    -25.4
```

--

- `pull()` to extract **one** column as a vector

```r
*dplyr::pull(df_penguins, flipperlength) |>
  head(n = 10)
```

```
##  [1] 181 186 195 193 190 181 195 193 190 186
```

---
background-image: url("img/dplyr.png")
background-position: 95% 10%
background-size: 10%

# Wrangle data: _dplyr_

--

- `mutate()` to **create/modify** columns
- `case_when()` as vectorised **ifelse** statements
- `slice()` to subset **rows** by position

--

```r
dplyr::select(df_penguins, individualid, species, culmenlength) |>
* dplyr::mutate(culmenlength_cm = culmenlength / 10, culmenlength_class =
*   dplyr::case_when(culmenlength < 35 ~ "small",
                     culmenlength >= 35 & culmenlength <= 45 ~ "med",
                     culmenlength > 45 ~ "large")) |>
  dplyr::slice(sample(1:nrow(df_penguins), size = 3))
```

```
## # A tibble: 3 × 5
##   individualid species                                   culme…¹ culme…² culme…³
##   <chr>        <chr>                                       <dbl>   <dbl> <chr>
## 1 N62A1        Chinstrap penguin (Pygoscelis antarctica)    46.4    4.64 large
## 2 N13A1        Adelie Penguin (Pygoscelis adeliae)          38.8    3.88 med
## 3 N92A1        Chinstrap penguin (Pygoscelis antarctica)    45.7    4.57 large
## # … with abbreviated variable names ¹culmenlength, ²culmenlength_cm,
## #   ³culmenlength_class
```

---
background-image: url("img/dplyr.png")
background-position: 95% 10%
background-size: 10%

# Wrangle data: _dplyr_

--

- `group_by()` to group by **column**
- `summarise()` to **summarize** each group
- `n()` to **count** observations within each group (context dependent)

--

```r
*(df_penguins_sum <- dplyr::group_by(df_penguins, island, species) |>
*   dplyr::summarise(n = dplyr::n(), flipperlength_mn = mean(flipperlength),
                     flipperlength_sd = sd(flipperlength),
*                    .groups = "drop"))
```

```
## # A tibble: 5 × 5
##   island    species                                       n flipperlen…¹ flipp…²
##   <chr>     <chr>                                     <int>        <dbl>   <dbl>
## 1 Biscoe    Adelie Penguin (Pygoscelis adeliae)          44         189.    6.73
## 2 Biscoe    Gentoo penguin (Pygoscelis papua)           123         217.    6.48
## 3 Dream     Adelie Penguin (Pygoscelis adeliae)          56         190.    6.59
## 4 Dream     Chinstrap penguin (Pygoscelis antarctica)    68         196.    7.13
## 5 Torgersen Adelie Penguin (Pygoscelis adeliae)          51         191.    6.23
## # … with abbreviated variable names ¹flipperlength_mn, ²flipperlength_sd
```

---
background-image: url("img/dplyr.png")
background-position: 95% 10%
background-size: 10%

# Wrangle data: _dplyr_

--

- `*_join()` to combine columns from `x` and `y` using matching **keys**

--

```r
*dplyr::left_join(x = df_penguins, y = df_penguins_sum, by = c("island", "species")) |>
  dplyr::select(species, island, tidyselect::starts_with("flipper")) |>
  dplyr::slice(sample(1:nrow(df_penguins), size = 5))
```

```
## # A tibble: 5 × 5
##   species                                   island flipperlength flipp…¹ flipp…²
##   <chr>                                     <chr>          <dbl>   <dbl>   <dbl>
## 1 Adelie Penguin (Pygoscelis adeliae)       Dream            190    190.    6.59
## 2 Gentoo penguin (Pygoscelis papua)         Biscoe           213    217.    6.48
## 3 Adelie Penguin (Pygoscelis adeliae)       Biscoe           198    189.    6.73
## 4 Adelie Penguin (Pygoscelis adeliae)       Biscoe           174    189.    6.73
## 5 Chinstrap penguin (Pygoscelis antarctica) Dream            187    196.    7.13
## # … with abbreviated variable names ¹flipperlength_mn, ²flipperlength_sd
```

---
background-image: url("img/dplyr.png")
background-position: 95% 10%
background-size: 10%

# Wrangle data: _dplyr_

<img src="img/joins.png" width="75%" style="display: block; margin: auto;" />

.ref[https://statisticsglobe.com/r-dplyr-join-inner-left-right-full-semi-anti]

---
background-image: url("img/tidyr.png")
background-position: 95% 10%
background-size: 10%

# Tidy data: _tidyr_

--

- Create a summarized `data.frame` with the mean bodymass for each species, island, and year

--

```r
(df_pen_sum <- dplyr::mutate(df_penguins, year = format(dateegg, "%Y"),
                             species = stringr::str_split(df_penguins$species, pattern = " ",
                                                          simplify = TRUE)[, 1]) |>
  dplyr::group_by(species, island, year) |>
  dplyr::summarise(bodymass = mean(bodymass), .groups = "drop") |>
* dplyr::filter(year %in% c(2007, 2009)))
```

```
## # A tibble: 10 × 4
##    species   island    year  bodymass
##    <chr>     <chr>     <chr>    <dbl>
##  1 Adelie    Biscoe    2007     3620
##  2 Adelie    Biscoe    2009     3858.
##  3 Adelie    Dream     2007     3671.
##  4 Adelie    Dream     2009     3651.
##  5 Adelie    Torgersen 2007     3763.
##  6 Adelie    Torgersen 2009     3489.
##  7 Chinstrap Dream     2007     3694.
##  8 Chinstrap Dream     2009     3725
##  9 Gentoo    Biscoe    2007     5071.
## 10 Gentoo    Biscoe    2009     5141.
```

---
background-image: url("img/tidyr.png")
background-position: 95% 10%
background-size: 10%

# Tidy data: _tidyr_

--

- Reshape from **long** to **wide** specifying column names and values

--

```r
(df_pen_wide <- tidyr::pivot_wider(df_pen_sum,
*                                  names_from = year, values_from = bodymass,
                                   names_prefix = "yr_") |>
  dplyr::mutate(diff = (yr_2009 - yr_2007) / yr_2007 * 100))
```

```
## # A tibble: 5 × 5
##   species   island    yr_2007 yr_2009   diff
##   <chr>     <chr>       <dbl>   <dbl>  <dbl>
## 1 Adelie    Biscoe      3620    3858.  6.57
## 2 Adelie    Dream       3671.   3651. -0.545
## 3 Adelie    Torgersen   3763.   3489. -7.28
## 4 Chinstrap Dream       3694.   3725   0.833
## 5 Gentoo    Biscoe      5071.   5141.  1.38
```

---
background-image: url("img/tidyr.png")
background-position: 95% 10%
background-size: 10%

# Tidy data: _tidyr_

--

- Reshape from **wide** to **long** specifying which columns _not_ to reshape

--

```r
*tidyr::pivot_longer(df_pen_wide, -c(species, island, diff),
                     names_to = "years", values_to = "bodymass") |>
  head(10)
```

```
## # A tibble: 10 × 5
##    species   island      diff years   bodymass
##    <chr>     <chr>      <dbl> <chr>      <dbl>
##  1 Adelie    Biscoe     6.57  yr_2007    3620
##  2 Adelie    Biscoe     6.57  yr_2009    3858.
##  3 Adelie    Dream     -0.545 yr_2007    3671.
##  4 Adelie    Dream     -0.545 yr_2009    3651.
##  5 Adelie    Torgersen -7.28  yr_2007    3763.
##  6 Adelie    Torgersen -7.28  yr_2009    3489.
##  7 Chinstrap Dream      0.833 yr_2007    3694.
##  8 Chinstrap Dream      0.833 yr_2009    3725
##  9 Gentoo    Biscoe     1.38  yr_2007    5071.
## 10 Gentoo    Biscoe     1.38  yr_2009    5141.
```

---
background-image: url("img/purrr.png")
background-position: 95% 10%
background-size: 10%

# Functional programming: _purrr_

--

- `map_*()` to apply a function to each element of a vector/list

--

```r
species_names <- unique(df_penguins$species)

*purrr::map(species_names, function(i) {
  dplyr::filter(df_penguins, species == i) |>
    dplyr::pull(island) |>
    unique() |>
    stringr::str_sort() |>
    paste(collapse = ", ")})
```

```
## [[1]]
## [1] "Biscoe, Dream, Torgersen"
##
## [[2]]
## [1] "Biscoe"
##
## [[3]]
## [1] "Dream"
```

--

```r
foo <- function(i) {dplyr::filter(df_penguins, species == i) |>
    dplyr::pull(island) |> unique() |> length()}

*purrr::map_int(species_names, foo)
```

```
## [1] 3 1 1
```

---
background-image: url("img/purrr.png")
background-position: 95% 10%
background-size: 10%

# Functional programming: _purrr_

--

- `reduce()` to **combine** the elements of a list into a single object

--

```r
pick <- c("culmenlength", "culmendepth", "flipperlength")

foo <- function(i, j) {
  df <- dplyr::select(i, studyname, bodymass, j)
  dplyr::bind_cols(study = unique(df$studyname), cor(df[, 2], df[, 3]))
}

dplyr::group_by(df_penguins, island) |>
  dplyr::group_split() |>
* purrr::map2(pick, foo) |>
* purrr::reduce(dplyr::left_join, by = "study")
```

```
## Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
## ℹ Please use `all_of()` or `any_of()` instead.
##   # Was:
##   data %>% select(j)
##
##   # Now:
##   data %>% select(all_of(j))
##
## See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
```

```
## # A tibble: 3 × 4
##   study   culmenlength culmendepth flipperlength
##   <chr>          <dbl>       <dbl>         <dbl>
## 1 PAL0708        0.868       0.564         0.436
## 2 PAL0809        0.868       0.564         0.436
## 3 PAL0910        0.868       0.564         0.436
```

---
class: inverse

## Thank you for your attention

### Questions?

.pull-left[

Further resources: [https://mhesselbarth.github.io/advanced-r-workshop/resources](https://mhesselbarth.github.io/advanced-r-workshop/resources)

Exercise: [https://mhesselbarth.github.io/advanced-r-workshop/exercise-tidyverse](https://mhesselbarth.github.io/advanced-r-workshop/exercise-tidyverse)

]

.pull-right[

[mhessel@umich.edu](mailto:mhessel@umich.edu)

[https://www.maxhesselbarth.com](https://www.maxhesselbarth.com)

[@MHKHesselbarth](https://twitter.com/MHKHesselbarth)

]