tidyverse

class: inverse, center, middle
background-image: url("img/tidyverse.png")
background-position: 95% 95%
background-size: 25%

# Using the _tidyverse_

### Maximilian H.K. Hesselbarth

#### University of Michigan (EEB)

2022/10/24

---

# The _tidyverse_

.pull-left[

.ref[www.tidyverse.org]

]

.pull-right[

.ref[Wickham, H., Grolemund, G., 2016. R for Data Science, 1st ed. O’Reilly, Newton (USA).]

]

---

# Tidy data

1. Each **variable** forms a **column**.

2. Each **observation** forms a **row**.

3. Each **value** must have its own **cell**.

.ref[Wickham, H., 2014. Tidy Data. Journal of Statistical Software 59, 1–23. https://doi.org/10.18637/jss.v059.i10]

.ref[Illustration from the Openscapes blog Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst]
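
---

# Tidy data

The three rules can be illustrated with a small sketch. The table below is made-up toy data (the object names `counts_wide` and `counts_tidy` are hypothetical): the variable "year" is hidden in the column names, so the table is not tidy until it is reshaped.

```r
# Hypothetical toy data: one row per site, one column per year (untidy,
# because the variable "year" is spread across the column names)
counts_wide <- tibble::tibble(
  site   = c("A", "B"),
  `2007` = c(12, 7),
  `2009` = c(15, 9)
)

# Tidy version: each variable (site, year, count) is a column,
# each observation (one site in one year) is a row
counts_tidy <- tidyr::pivot_longer(counts_wide, cols = -site,
                                   names_to = "year", values_to = "count")
counts_tidy
```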

---

# Tidy data

.ref[Wickham, H., Grolemund, G., 2016. R for Data Science, 1st ed. O’Reilly, Newton (USA).]

---

# Install and load

```r
install.packages("tidyverse")
```

```r
library(tidyverse)
```

```
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
```
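
---

# Install and load

The conflicts message means that two attached packages export a function with the same name. A minimal sketch, using made-up data, of how the `package::function()` notation keeps both versions reachable:

```r
# stats::filter() applies a linear filter (here a 3-point moving average)
ma <- stats::filter(1:5, rep(1 / 3, 3))

# dplyr::filter() subsets the rows of a data frame
df <- data.frame(id = 1:5, value = c(2, 8, 3, 9, 1))
big <- dplyr::filter(df, value > 5)
```

Writing the namespace explicitly also works without calling `library()` first.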

---

background-image: url("img/tidyverse.png")
background-position: 95% 10%
background-size: 25%

# Core packages

`readr`   : Read rectangular data

`tibble`  : Modern re-imagining of data frames

`stringr` : Functions to work with strings (i.e. sequence of characters)

`forcats` : Functions to modify factors (i.e. categorical data)

`tidyr`   : Functions to tidy/reshape data

`dplyr`   : Functions for data manipulation

`purrr`   : Functional programming

`ggplot2` : Data visualization

---

background-image: url("img/maggritr.png")
background-position: 95% 10%
background-size: 10%

# Pipe operator `%>%`

- `f(x)` is equivalent to `x %>% f`

- `f(x, y)` is equivalent to `x %>% f(y)`

- `f(y, x)` is equivalent to `x %>% f(y, .)`

```r
set.seed(42)
x <- runif(n = 10)

*min(log(sort(x)))
## [1] -2.004953

*x %>% sort() %>% log() %>% min()
## [1] -2.004953

set.seed(42)
*10 %>% runif() %>% sort() %>% log() %>% min()
## [1] -2.004953
```
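
---

# Pipe operator `%>%`

The three equivalences can be sketched one by one. The values are arbitrary and only serve as illustration.

```r
library(magrittr)  # provides %>%

x <- c(16, 4, 9)

# f(x): the left-hand side becomes the first argument
a <- x %>% sqrt()

# f(x, y): further arguments follow as usual
b <- x %>% sort(decreasing = TRUE)

# f(y, x): the dot places the left-hand side elsewhere
c_val <- 2 %>% round(3.14159, digits = .)
```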

---

# But, surprise!

- Since `R v4.1` there is a **base** pipe: `|>`

- **Similar** behavior to the `magrittr` pipe.

- First time I am using this as well...

```r
set.seed(42)
x <- runif(n = 10)

*min(log(sort(x)))
## [1] -2.004953

*x |> sort() |> log() |> min()
## [1] -2.004953

set.seed(42)
*10 |> runif() |> sort() |> log() |> min()
## [1] -2.004953
```
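
---

# Base pipe `|>`

"Similar" is not "identical". A short sketch of two differences, using a hypothetical helper `power()`. The `_` placeholder is assumed to need R 4.2 or newer, whereas the anonymous-function form should already work in R 4.1.

```r
power <- function(base, exponent) base^exponent

# The right-hand side must be a call, so parentheses are required
a <- 3 |> power(2)                       # power(3, 2)

# Placeholder `_` (assumes R >= 4.2) must be passed to a named argument
b <- 3 |> power(base = 2, exponent = _)  # power(2, 3)

# An anonymous function works from R 4.1 onwards
c_val <- 3 |> (\(e) power(2, e))()
```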

---

class: inverse

# Palmer penguins dataset

.ref[Horst A.M., Hill A.P., Gorman K.B., 2020. palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0.]

---

background-image: url("img/readr.png")
background-position: 95% 10%
background-size: 10%

# Read data: _readr_

- Reads data directly into a `tibble`

- Different `readr::read_*()` functions for different file formats (e.g. `read_csv()`, `read_tsv()`, `read_delim()`)

- Matching `readr::write_*()` functions to write data

- `readxl` as an alternative for Excel files

```r
*df_penguins <- readr::read_csv("data/penguins_raw.csv")
## Rows: 344 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (9): studyName, Species, Region, Island, Stage, Individual ID, Clutch C...
## dbl  (7): Sample Number, Culmen Length (mm), Culmen Depth (mm), Flipper Leng...
## date (1): Date Egg
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```
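
---

# Read data: _readr_

The message above suggests specifying column types. A minimal sketch with made-up inline data (column names are hypothetical), assuming readr 2.0 or newer, where literal text can be passed via `I()`:

```r
# Literal CSV text wrapped in I() (assumes readr >= 2.0); values are made up
csv_text <- "id,island,mass\n1,Dream,3750\n2,Biscoe,NA\n"

df <- readr::read_csv(
  I(csv_text),
  col_types = readr::cols(
    id     = readr::col_integer(),
    island = readr::col_character(),
    mass   = readr::col_double()
  )
)

# Write to a temporary file and read it back
path <- tempfile(fileext = ".csv")
readr::write_csv(df, path)
df_back <- readr::read_csv(path, show_col_types = FALSE)
```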

---

background-image: url("img/stringr.png")
background-position: 95% 10%
background-size: 10%

# Work with strings: _stringr_

- Allows all sorts of **string manipulations**

- Consistent alternative to the base `grep()` family of functions

```r
names(df_penguins) <- names(df_penguins) |> 
  stringr::str_remove_all(pattern = " ") |> 
  stringr::str_remove_all(pattern = "\\([^()]+\\)") |>
  stringr::str_to_lower()

head(df_penguins, n = 3)
```

```
## # A tibble: 3 × 17
##   study…¹ sampl…² species region island stage indiv…³ clutc…⁴ dateegg    culme…⁵
##   <chr>     <dbl> <chr>   <chr>  <chr>  <chr> <chr>   <chr>   <date>       <dbl>
## 1 PAL0708       1 Adelie… Anvers Torge… Adul… N1A1    Yes     2007-11-11    39.1
## 2 PAL0708       2 Adelie… Anvers Torge… Adul… N1A2    Yes     2007-11-11    39.5
## 3 PAL0708       3 Adelie… Anvers Torge… Adul… N2A1    Yes     2007-11-16    40.3
## # … with 7 more variables: culmendepth <dbl>, flipperlength <dbl>,
## #   bodymass <dbl>, sex <chr>, delta15n <dbl>, delta13c <dbl>, comments <chr>,
## #   and abbreviated variable names ¹studyname, ²samplenumber, ³individualid,
## #   ⁴clutchcompletion, ⁵culmenlength
```
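
---

# Work with strings: _stringr_

To see what each step does, the same cleaning can be sketched on three of the raw column names, next to a possible base R equivalent. The regular expression `\\([^()]+\\)` matches a parenthesised unit such as `(mm)`.

```r
raw_names <- c("Culmen Length (mm)", "Body Mass (g)", "Date Egg")

clean_stringr <- raw_names |>
  stringr::str_remove_all(pattern = " ") |>          # drop spaces
  stringr::str_remove_all(pattern = "\\([^()]+\\)") |>  # drop "(unit)"
  stringr::str_to_lower()

# Base R equivalent: remove spaces and "(unit)" in one gsub() call
clean_base <- tolower(gsub(" |\\([^()]+\\)", "", raw_names))
```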

---

background-image: url("img/tidyr.png")
background-position: 95% 10%
background-size: 10%

# Tidy data: _tidyr_

- `pivot_longer()` & `pivot_wider()` most important functions to **reshape** data

.yellow[(I will come back to this later!)]

- Some functions to deal with `NA` values

```r
df_penguins <- tidyr::drop_na(df_penguins, bodymass)
```

- `nest()`/`unnest()` to organize data

```r
tidyr::nest(df_penguins, data = -island)
```

```
## # A tibble: 3 × 2
##   island    data               
##   <chr>     <list>             
## 1 Torgersen <tibble [51 × 16]> 
## 2 Biscoe    <tibble [167 × 16]>
## 3 Dream     <tibble [124 × 16]>
```
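
---

# Tidy data: _tidyr_

A small sketch with made-up data (names are hypothetical) of two related helpers: `replace_na()` as an alternative to dropping rows, and `unnest()` as the inverse of `nest()`.

```r
df <- tibble::tibble(
  island = c("Dream", "Dream", "Biscoe"),
  mass   = c(3750, NA, 5000)
)

# Replace missing values instead of dropping the rows
# (-1 is an arbitrary placeholder value)
df_filled <- tidyr::replace_na(df, list(mass = -1))

# nest() and unnest() are inverse operations
df_nested <- tidyr::nest(df, data = -island)
df_flat   <- tidyr::unnest(df_nested, cols = data)
```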

---

background-image: url("img/dplyr.png")
background-position: 95% 10%
background-size: 10%

# Wrangle data: _dplyr_

- `filter()` to subset **rows**; `select()` to subset **columns**

```r
*dplyr::filter(df_penguins, bodymass >= quantile(bodymass, 0.75), sex != "FEMALE") |>
* dplyr::select(tidyselect::where(is.numeric)) |>
  head(n = 3)
```

```
## # A tibble: 3 × 7
##   samplenumber culmenlength culmendepth flipperlength bodymass delta15n delta13c
##          <dbl>        <dbl>       <dbl>         <dbl>    <dbl>    <dbl>    <dbl>
## 1          110         43.2        19             197     4775     9.32    -25.5
## 2            2         50          16.3           230     5700     8.15    -25.4
## 3            4         50          15.2           218     5700     8.26    -25.4
```

- `pull()` to extract **one** column as a vector

```r
*dplyr::pull(df_penguins, flipperlength) |>
  head(n = 10)
```

```
##  [1] 181 186 195 193 190 181 195 193 190 186
```
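
---

# Wrangle data: _dplyr_

A minimal sketch on made-up data (names are hypothetical) of how `filter()` treats several conditions and `NA` values. `tidyselect::where()` is assumed to be exported, which should hold for recent tidyselect versions.

```r
df <- tibble::tibble(
  id   = c("a", "b", "c", "d"),
  sex  = c("MALE", "FEMALE", NA, "MALE"),
  mass = c(4000, 3500, 4100, 4200)
)

# Comma-separated conditions are combined with AND;
# rows where a condition evaluates to NA are dropped
heavy_males <- dplyr::filter(df, mass >= 4000, sex != "FEMALE")

# Keep all numeric columns
num_cols <- dplyr::select(df, tidyselect::where(is.numeric))

# Extract one column as a plain vector
masses <- dplyr::pull(df, mass)
```

Row `c` is removed although its mass is large enough, because `NA != "FEMALE"` is `NA`.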

---

background-image: url("img/dplyr.png")
background-position: 95% 10%
background-size: 10%

# Wrangle data: _dplyr_

- `mutate()` to **create/modify** columns

- `case_when()` as vectorised **ifelse** statements

- `slice()` to subset **rows** by position

```r
dplyr::select(df_penguins, individualid, species, culmenlength) |>
* dplyr::mutate(culmenlength_cm = culmenlength / 10,
                culmenlength_class = 
*                 dplyr::case_when(culmenlength < 35  ~ "small",
                                   culmenlength >= 35 & culmenlength <= 45 ~ "med", 
                                   culmenlength > 45 ~ "large")) |>
  dplyr::slice(sample(1:nrow(df_penguins), size = 3))
```

```
## # A tibble: 3 × 5
##   individualid species                                   culme…¹ culme…² culme…³
##   <chr>        <chr>                                       <dbl>   <dbl> <chr>  
## 1 N62A1        Chinstrap penguin (Pygoscelis antarctica)    46.4    4.64 large  
## 2 N13A1        Adelie Penguin (Pygoscelis adeliae)          38.8    3.88 med    
## 3 N92A1        Chinstrap penguin (Pygoscelis antarctica)    45.7    4.57 large  
## # … with abbreviated variable names ¹culmenlength, ²culmenlength_cm,
## #   ³culmenlength_class
```
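
---

# Wrangle data: _dplyr_

`case_when()` checks its conditions in order and uses the first one that is `TRUE`. A sketch on made-up values (names are hypothetical) that also adds a fallback branch, which the example above does not have:

```r
df <- tibble::tibble(length_mm = c(33.1, 40.2, 47.5, NA))

df_class <- dplyr::mutate(
  df,
  length_cm = length_mm / 10,
  size = dplyr::case_when(
    length_mm < 35  ~ "small",
    length_mm <= 45 ~ "med",     # only reached if the first test failed
    length_mm > 45  ~ "large",
    TRUE            ~ "unknown"  # fallback, here used for the NA value
  )
)
```

Without the last branch, the `NA` row would get `NA` as its class.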

---

background-image: url("img/dplyr.png")
background-position: 95% 10%
background-size: 10%

# Wrangle data: _dplyr_

- `group_by()` to group by **column**

- `summarise()` to **summarize** each group

- `n()` to **count** observations within each group (context dependent)

```r
*(df_penguins_sum <- dplyr::group_by(df_penguins, island, species) |>
* dplyr::summarise(n = dplyr::n(),
                   flipperlength_mn = mean(flipperlength), 
                   flipperlength_sd = sd(flipperlength), 
*                  .groups = "drop"))
```

```
## # A tibble: 5 × 5
##   island    species                                       n flipperlen…¹ flipp…²
##   <chr>     <chr>                                     <int>        <dbl>   <dbl>
## 1 Biscoe    Adelie Penguin (Pygoscelis adeliae)          44         189.    6.73
## 2 Biscoe    Gentoo penguin (Pygoscelis papua)           123         217.    6.48
## 3 Dream     Adelie Penguin (Pygoscelis adeliae)          56         190.    6.59
## 4 Dream     Chinstrap penguin (Pygoscelis antarctica)    68         196.    7.13
## 5 Torgersen Adelie Penguin (Pygoscelis adeliae)          51         191.    6.23
## # … with abbreviated variable names ¹flipperlength_mn, ²flipperlength_sd
```
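
---

# Wrangle data: _dplyr_

A reduced sketch of the same pattern on made-up data (names are hypothetical). It also shows `na.rm = TRUE`, which is needed as soon as a group contains missing values, and `count()` as a shortcut when only the group sizes matter.

```r
df <- tibble::tibble(
  island  = c("Dream", "Dream", "Biscoe", "Biscoe", "Biscoe"),
  flipper = c(190, 196, 210, NA, 220)
)

df_sum <- dplyr::group_by(df, island) |>
  dplyr::summarise(n = dplyr::n(),
                   flipper_mn = mean(flipper, na.rm = TRUE),
                   .groups = "drop")

# count() is a shortcut for group_by() + summarise(n = n())
df_n <- dplyr::count(df, island)
```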

---

background-image: url("img/dplyr.png")
background-position: 95% 10%
background-size: 10%

# Wrangle data: _dplyr_

- `*_join()` to combine columns from `x` and `y` using matching **keys**

```r
*dplyr::left_join(x = df_penguins, y = df_penguins_sum, by = c("island", "species")) |>
  dplyr::select(species, island, tidyselect::starts_with("flipper")) |> 
  dplyr::slice(sample(1:nrow(df_penguins), size = 5))
```

```
## # A tibble: 5 × 5
##   species                                   island flipperlength flipp…¹ flipp…²
##   <chr>                                     <chr>          <dbl>   <dbl>   <dbl>
## 1 Adelie Penguin (Pygoscelis adeliae)       Dream            190    190.    6.59
## 2 Gentoo penguin (Pygoscelis papua)         Biscoe           213    217.    6.48
## 3 Adelie Penguin (Pygoscelis adeliae)       Biscoe           198    189.    6.73
## 4 Adelie Penguin (Pygoscelis adeliae)       Biscoe           174    189.    6.73
## 5 Chinstrap penguin (Pygoscelis antarctica) Dream            187    196.    7.13
## # … with abbreviated variable names ¹flipperlength_mn, ²flipperlength_sd
```

---

background-image: url("img/dplyr.png")
background-position: 95% 10%
background-size: 10%

# Wrangle data: _dplyr_

.ref[https://statisticsglobe.com/r-dplyr-join-inner-left-right-full-semi-anti]
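
---

# Wrangle data: _dplyr_

The join types differ only in which keys they keep. A minimal sketch with two made-up tables `x` and `y` that share the keys 1 and 2:

```r
x <- tibble::tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble::tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y4"))

left  <- dplyr::left_join(x, y, by = "key")   # all rows of x, NA where no match
inner <- dplyr::inner_join(x, y, by = "key")  # only keys present in both
full  <- dplyr::full_join(x, y, by = "key")   # all keys from x and y
anti  <- dplyr::anti_join(x, y, by = "key")   # rows of x without a match in y
```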

---

background-image: url("img/tidyr.png")
background-position: 95% 10%
background-size: 10%

# Tidy data: _tidyr_

- Create a summarized `data.frame` with the mean body mass for each species, island, and year

```r
(df_pen_sum <- dplyr::mutate(df_penguins, year = format(dateegg, "%Y"),
                             species = stringr::str_split(df_penguins$species, 
                                pattern = " ", simplify = TRUE)[, 1]) |> 
  dplyr::group_by(species, island, year) |> 
  dplyr::summarise(bodymass = mean(bodymass), .groups = "drop") |> 
* dplyr::filter(year %in% c(2007, 2009)))
```

```
## # A tibble: 10 × 4
##    species   island    year  bodymass
##    <chr>     <chr>     <chr>    <dbl>
##  1 Adelie    Biscoe    2007     3620 
##  2 Adelie    Biscoe    2009     3858.
##  3 Adelie    Dream     2007     3671.
##  4 Adelie    Dream     2009     3651.
##  5 Adelie    Torgersen 2007     3763.
##  6 Adelie    Torgersen 2009     3489.
##  7 Chinstrap Dream     2007     3694.
##  8 Chinstrap Dream     2009     3725 
##  9 Gentoo    Biscoe    2007     5071.
## 10 Gentoo    Biscoe    2009     5141.
```

---

background-image: url("img/tidyr.png")
background-position: 95% 10%
background-size: 10%

# Tidy data: _tidyr_

- Reshape from **long** to **wide** specifying column names and values

```r
(df_pen_wide <- tidyr::pivot_wider(df_pen_sum, 
*                                  names_from = year, values_from = bodymass,
                                   names_prefix = "yr_") |>
   dplyr::mutate(diff = (yr_2009 - yr_2007) / yr_2007 * 100))
```

```
## # A tibble: 5 × 5
##   species   island    yr_2007 yr_2009   diff
##   <chr>     <chr>       <dbl>   <dbl>  <dbl>
## 1 Adelie    Biscoe      3620    3858.  6.57 
## 2 Adelie    Dream       3671.   3651. -0.545
## 3 Adelie    Torgersen   3763.   3489. -7.28 
## 4 Chinstrap Dream       3694.   3725   0.833
## 5 Gentoo    Biscoe      5071.   5141.  1.38
```

---

background-image: url("img/tidyr.png")
background-position: 95% 10%
background-size: 10%

# Tidy data: _tidyr_

- Reshape from **wide** to **long** specifying which columns _not_ to reshape

```r
*tidyr::pivot_longer(df_pen_wide, -c(species, island, diff),
                    names_to = "years", values_to = "bodymass") |> 
  head(10)
```

```
## # A tibble: 10 × 5
##    species   island      diff years   bodymass
##    <chr>     <chr>      <dbl> <chr>      <dbl>
##  1 Adelie    Biscoe     6.57  yr_2007    3620 
##  2 Adelie    Biscoe     6.57  yr_2009    3858.
##  3 Adelie    Dream     -0.545 yr_2007    3671.
##  4 Adelie    Dream     -0.545 yr_2009    3651.
##  5 Adelie    Torgersen -7.28  yr_2007    3763.
##  6 Adelie    Torgersen -7.28  yr_2009    3489.
##  7 Chinstrap Dream      0.833 yr_2007    3694.
##  8 Chinstrap Dream      0.833 yr_2009    3725 
##  9 Gentoo    Biscoe     1.38  yr_2007    5071.
## 10 Gentoo    Biscoe     1.38  yr_2009    5141.
```
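
---

# Tidy data: _tidyr_

When reshaping back to long format, the `yr_` prefix can be removed again. A sketch on made-up values (names are hypothetical), assuming a tidyr version that supports `names_transform` (1.1.0 or newer):

```r
df_wide <- tibble::tibble(
  species = c("Adelie", "Gentoo"),
  yr_2007 = c(3700, 5070),
  yr_2009 = c(3650, 5140)
)

df_long <- tidyr::pivot_longer(
  df_wide,
  cols = tidyselect::starts_with("yr_"),
  names_to = "year",
  names_prefix = "yr_",                       # drop the prefix again
  names_transform = list(year = as.integer),  # store year as integer
  values_to = "bodymass"
)

# Reshaping back recovers the wide layout
df_back <- tidyr::pivot_wider(df_long, names_from = year,
                              values_from = bodymass, names_prefix = "yr_")
```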

---

background-image: url("img/purrr.png")
background-position: 95% 10%
background-size: 10%

# Functional programming: _purrr_

- `map_*()` to apply a function to each element of a vector or list; the suffix sets the return type (`map()` returns a list, `map_int()` an integer vector)

```r
species_names <- unique(df_penguins$species)
*purrr::map(species_names, function(i) {
  dplyr::filter(df_penguins, species == i) |> dplyr::pull(island) |> 
    unique() |> stringr::str_sort() |> paste(collapse = ", ")})
```

```
## [[1]]
## [1] "Biscoe, Dream, Torgersen"
## 
## [[2]]
## [1] "Biscoe"
## 
## [[3]]
## [1] "Dream"
```

```r
foo <- function(i) {dplyr::filter(df_penguins, species == i) |>
    dplyr::pull(island) |> unique() |> length()}
*purrr::map_int(species_names, foo)
```

```
## [1] 3 1 1
```
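
---

# Functional programming: _purrr_

The suffix of `map_*()` fixes the type of the result. A short sketch on a made-up list (names are hypothetical); the `\(m)` shorthand for anonymous functions needs R 4.1 or newer.

```r
masses <- list(adelie = c(3700, 3800), gentoo = c(5000, 5200, 5100))

means  <- purrr::map_dbl(masses, mean)    # named numeric vector
sizes  <- purrr::map_int(masses, length)  # named integer vector
labels <- purrr::map_chr(masses, \(m) paste(range(m), collapse = "-"))
```

If a function returns the wrong type, e.g. a character in `map_int()`, purrr throws an error instead of converting silently.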

---

background-image: url("img/purrr.png")
background-position: 95% 10%
background-size: 10%

# Functional programming: _purrr_

- `reduce()` to **combine** the elements of a list into a single result by repeatedly applying a two-argument function

```r
pick <- c("culmenlength", "culmendepth", "flipperlength")
foo <- function(i, j) {
  df <- dplyr::select(i, studyname, bodymass, tidyselect::all_of(j)) 
  dplyr::bind_cols(study = unique(df$studyname), cor(df[, 2], df[, 3]))
}

dplyr::group_by(df_penguins, island) |> dplyr::group_split() |> 
* purrr::map2(pick, foo) |>
* purrr::reduce(dplyr::left_join, by = "study")
```

```
## # A tibble: 3 × 4
##   study   culmenlength culmendepth flipperlength
##   <chr>          <dbl>       <dbl>         <dbl>
## 1 PAL0708        0.868       0.564         0.436
## 2 PAL0809        0.868       0.564         0.436
## 3 PAL0910        0.868       0.564         0.436
```
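
---

# Functional programming: _purrr_

What `reduce()` does is easier to see on a small input. A minimal sketch with made-up values and tables (names are hypothetical):

```r
# reduce() folds a list from left to right: ((1 + 2) + 3) + 4
total <- purrr::reduce(1:4, `+`)

# accumulate() keeps the intermediate results
running <- purrr::accumulate(1:4, `+`)

# Joining several tables on a shared key, one pair at a time
tbls <- list(
  tibble::tibble(study = c("s1", "s2"), a = c(1, 2)),
  tibble::tibble(study = c("s1", "s2"), b = c(3, 4)),
  tibble::tibble(study = c("s1", "s2"), c = c(5, 6))
)
joined <- purrr::reduce(tbls, dplyr::left_join, by = "study")
```

Extra arguments such as `by = "study"` are passed on to every call of the joining function.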

---

class: inverse

## Thank you for your attention

### Questions?

.pull-left[

Further resources: [https://mhesselbarth.github.io/advanced-r-workshop/resources](https://mhesselbarth.github.io/advanced-r-workshop/resources)

Exercise: [https://mhesselbarth.github.io/advanced-r-workshop/exercise-tidyverse](https://mhesselbarth.github.io/advanced-r-workshop/exercise-tidyverse)

]

.pull-right[

]