Introduction to Tidyverse

Ha Khanh Nguyen

Welcome to Part 2 of STAT 385!

  • Topics covered include:
    • tibbles - the new data frame
    • Import data using readr
    • Pipe operator %>%
    • Graphics using ggplot2
    • Tidy data
    • Relational data
    • Strings/text data
    • Factors - categorical data
    • Dates and times
    • Functional programming

What is Tidyverse?

Tibbles

  • Tibbles are data frames, but they tweak some older behaviors to make life a little easier.
  • It’s difficult to change base R without breaking existing code, so most innovation occurs in packages.
    • tidyverse is a “mega” package (including many smaller packages).
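
The code that produced the two printouts below is not shown in these notes. A minimal sketch that would print a tibble and a base R data frame of the same shape, assuming the columns were built from 1:5 and 6:10 (the object names tbl and df are illustrative):

```r
library(tibble)

# Assumed inputs: two integer columns, x = 1:5 and y = 6:10
tbl <- tibble(x = 1:5, y = 6:10)
df  <- data.frame(x = 1:5, y = 6:10)

tbl  # tibble print: dimensions and column types appear in the header
df   # base R data frame print: values only
```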
## # A tibble: 5 x 2
##       x     y
##   <int> <int>
## 1     1     6
## 2     2     7
## 3     3     8
## 4     4     9
## 5     5    10
##   x  y
## 1 1  6
## 2 2  7
## 3 3  8
## 4 4  9
## 5 5 10

Prerequisites

  • Install tidyverse
    • You only need to do this once.
  • “Call” (load) the package with library() before using its functions.
    • You need to do this every time you start a new R session, for example after you quit RStudio and open it again.
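
The two steps above could look like this. The install line is commented out so the snippet can be re-run safely; uncomment it the first time.

```r
# Install once per machine (uncomment to run; needs an internet connection)
# install.packages("tidyverse")

# Load at the start of every new R session
library(tidyverse)
```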

Creating tibbles

  • To create a tibble, use tibble().
    • Similar to data.frame().
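
A sketch of calls consistent with the output and error message below. The data.frame() call is taken from the error text; the matching tibble() call is an assumption. The error is caught with try() so the script keeps running.

```r
library(tibble)

# tibble() builds columns in order, so z can refer to x and y;
# the length-1 value y = 1 is recycled to the length of x
tb <- tibble(x = 1:5, y = 1, z = x^2 + y)
tb

# data.frame() cannot see columns that are still being created, so the
# same call fails (assuming no objects named x or y exist in the workspace)
res <- try(data.frame(x = 1:5, y = 1, z = x^2 + y), silent = TRUE)
inherits(res, "try-error")
```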
## # A tibble: 5 x 3
##       x     y     z
##   <int> <dbl> <dbl>
## 1     1     1     2
## 2     2     1     5
## 3     3     1    10
## 4     4     1    17
## 5     5     1    26
## Error in data.frame(x = 1:5, y = 1, z = x^2 + y): object 'x' not found
  • Almost all functions in tidyverse and many new R packages produce tibbles.
  • Most other (old) R packages use regular data frames.
    • To coerce a data frame to a tibble, use as_tibble().
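
A minimal sketch of the coercion, assuming a small data frame like the one printed below (the names df and tb are illustrative):

```r
library(tibble)

# A base R data frame; y = 1 is recycled to five rows
df <- data.frame(x = 1:5, y = 1)
df

# Coerce it to a tibble; the data are unchanged, only the class and printing differ
tb <- as_tibble(df)
tb
```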
##   x y
## 1 1 1
## 2 2 1
## 3 3 1
## 4 4 1
## 5 5 1
## # A tibble: 5 x 2
##       x     y
##   <int> <dbl>
## 1     1     1
## 2     2     1
## 3     3     1
## 4     4     1
## 5     5     1
  • Note: functions in tidyverse use _ to separate words instead of .
    • read_csv() instead of read.csv()
  • If you’re already familiar with data.frame(), note that tibble() does much less:
    • It never changes the type of the inputs (e.g. it never converts strings to factors!).
    • It never changes the names of variables.
    • And it never creates row names.
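
These differences can be checked directly. The sketch below uses a made-up column name (my var) and made-up values purely for illustration.

```r
library(tibble)

# data.frame() rewrites non-syntactic names by default; tibble() keeps them
names(data.frame(`my var` = 1:3))  # "my.var"
names(tibble(`my var` = 1:3))      # "my var"

# tibble() leaves character input as character
tb <- tibble(s = c("a", "b"))
class(tb$s)                        # "character"

# tibble() does not create row names
has_rownames(tb)                   # FALSE
```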

Tibbles vs. data.frame

  • There are two main differences in the usage of a tibble vs. a classic data.frame: printing and subsetting.

Printing

  • Tibbles have a refined print method that shows:
    • only the first 10 rows,
    • all the columns that fit on screen,
    • and the type of each column.
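
A sketch of a 1,000-row tibble with the same five column types as the printout below. The construction is an assumption; the values are random and depend on the current date, so the printed rows will differ.

```r
library(tibble)

# Assumed construction: one column each of date-time, date, integer,
# double, and character
big <- tibble(
  a = Sys.time() + runif(1e3) * 86400,      # date-times within the next day
  b = Sys.Date() + runif(1e3) * 30,         # dates within the next 30 days
  c = 1:1e3,
  d = runif(1e3),
  e = sample(letters, 1e3, replace = TRUE)
)

big  # prints only the first 10 rows, with a type under each column name
```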
## # A tibble: 1,000 x 5
##    a                   b              c      d e    
##    <dttm>              <date>     <int>  <dbl> <chr>
##  1 2020-03-05 05:53:31 2020-03-10     1 0.667  m    
##  2 2020-03-04 20:46:55 2020-03-19     2 0.370  c    
##  3 2020-03-05 07:19:40 2020-03-09     3 0.833  u    
##  4 2020-03-05 10:17:40 2020-03-04     4 0.996  o    
##  5 2020-03-04 18:48:37 2020-03-17     5 0.0250 t    
##  6 2020-03-05 03:41:33 2020-03-19     6 0.573  h    
##  7 2020-03-04 16:58:14 2020-03-29     7 0.477  z    
##  8 2020-03-05 09:34:39 2020-03-07     8 0.727  w    
##  9 2020-03-04 15:29:27 2020-03-29     9 0.226  n    
## 10 2020-03-04 21:59:14 2020-04-01    10 0.817  o    
## # … with 990 more rows
  • But sometimes you need more output than the default display.
    • You can explicitly print() the data frame and control the number of rows (n) and the width of the display.
    • width = Inf will display all columns:
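
A sketch of calls that would give a default printout followed by a full-width one, assuming the flights tibble from the nycflights13 package:

```r
library(tibble)
library(nycflights13)  # provides the flights dataset

# Default print: 10 rows, only the columns that fit on screen
flights

# Explicit print: 10 rows, all columns regardless of screen width
print(flights, n = 10, width = Inf)
```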
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
##    arr_delay carrier flight tailnum origin dest  air_time distance  hour minute
##        <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>    <dbl> <dbl>  <dbl>
##  1        11 UA        1545 N14228  EWR    IAH        227     1400     5     15
##  2        20 UA        1714 N24211  LGA    IAH        227     1416     5     29
##  3        33 AA        1141 N619AA  JFK    MIA        160     1089     5     40
##  4       -18 B6         725 N804JB  JFK    BQN        183     1576     5     45
##  5       -25 DL         461 N668DN  LGA    ATL        116      762     6      0
##  6        12 UA        1696 N39463  EWR    ORD        150      719     5     58
##  7        19 B6         507 N516JB  EWR    FLL        158     1065     6      0
##  8       -14 EV        5708 N829AS  LGA    IAD         53      229     6      0
##  9        -8 B6          79 N593JB  JFK    MCO        140      944     6      0
## 10         8 AA         301 N3ALAA  LGA    ORD        138      733     6      0
##    time_hour          
##    <dttm>             
##  1 2013-01-01 05:00:00
##  2 2013-01-01 05:00:00
##  3 2013-01-01 05:00:00
##  4 2013-01-01 05:00:00
##  5 2013-01-01 06:00:00
##  6 2013-01-01 05:00:00
##  7 2013-01-01 06:00:00
##  8 2013-01-01 06:00:00
##  9 2013-01-01 06:00:00
## 10 2013-01-01 06:00:00
## # … with 3.368e+05 more rows
  • Note: for the above code to work, you first have to install the nycflights13 package (which includes the flights dataset).
  • Another note: from now on, I will use data frame and tibble interchangeably. I will say base R data frame when I specifically mean a classic data frame (not a tibble).

Subsetting

  • Review: what will the following code return?
  • What about these?
  • What is the difference?
    • Review data frame!
  • Now, let’s convert iris to a tibble.
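
The code for this section is not shown in these notes. A sketch of the comparison, assuming the review subsets the built-in iris data by column position and by $ (the name iris_tbl is illustrative):

```r
library(tibble)

# Base R data frame: the result of [ ] depends on how many columns are selected
class(iris[, 1])           # one column is dropped to a numeric vector
class(iris[, 1:2])         # two columns stay a data.frame
class(iris$Sepal.Length)   # $ gives a vector

# Tibble: [ ] returns a tibble, even for a single column
iris_tbl <- as_tibble(iris)
iris_tbl[, 1]              # 150 x 1 tibble
iris_tbl$Sepal.Length      # $ still gives a vector
iris_tbl[, 1:2]            # 150 x 2 tibble
```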
## # A tibble: 150 x 1
##    Sepal.Length
##           <dbl>
##  1          5.1
##  2          4.9
##  3          4.7
##  4          4.6
##  5          5  
##  6          5.4
##  7          4.6
##  8          5  
##  9          4.4
## 10          4.9
## # … with 140 more rows
##   [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
##  [19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
##  [37] 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5
##  [55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1
##  [73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5
##  [91] 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3
## [109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2
## [127] 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
## [145] 6.7 6.7 6.3 6.5 6.2 5.9
## # A tibble: 150 x 2
##    Sepal.Length Sepal.Width
##           <dbl>       <dbl>
##  1          5.1         3.5
##  2          4.9         3  
##  3          4.7         3.2
##  4          4.6         3.1
##  5          5           3.6
##  6          5.4         3.9
##  7          4.6         3.4
##  8          5           3.4
##  9          4.4         2.9
## 10          4.9         3.1
## # … with 140 more rows

Summary:

  • $ always returns a vector.
  • With base R data frames, [ ] sometimes returns a data frame, and sometimes returns a vector.
  • With tibbles, [ ] always returns another tibble.

  • An even easier way to choose specific columns
    • we will learn more about this on Friday
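
The function is not named here; a sketch assuming it is select() from dplyr, which picks columns by name and returns a tibble:

```r
library(tibble)
library(dplyr)

iris_tbl <- as_tibble(iris)

# Choose columns by name, without quotes or positions
cols <- select(iris_tbl, Sepal.Length, Sepal.Width)
cols
```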
## # A tibble: 150 x 2
##    Sepal.Length Sepal.Width
##           <dbl>       <dbl>
##  1          5.1         3.5
##  2          4.9         3  
##  3          4.7         3.2
##  4          4.6         3.1
##  5          5           3.6
##  6          5.4         3.9
##  7          4.6         3.4
##  8          5           3.4
##  9          4.4         2.9
## 10          4.9         3.1
## # … with 140 more rows

References