1 Introduction to vroom

1.1 vroom basics

Load data into R using vroom

  1. Load the vroom() library

  2. Use the vroom() function to read the transactions_1.csv file from the /usr/share/class/files folder

    ## Observations: 50,000
    ## Variables: 14
    ## chr  [4]: customer_name, customer_phone, date_month_name, date_day
    ## dbl  [9]: order_id, customer_id, customer_cc, customer_lon, customer_lat, date_year, date_...
    ## date [1]: date
    ## 
    ## Call `spec()` for a copy-pastable column specification
    ## Specify the column types with `col_types` to quiet this message
    ## # A tibble: 50,000 x 14
    ##    order_id customer_id customer_name customer_phone customer_cc customer_lon
    ##       <dbl>       <dbl> <chr>         <chr>                <dbl>        <dbl>
    ##  1     1001          22 Dr. Birdie K… 684.226.0455       6.01e18        -122.
    ##  2     1001          22 Dr. Birdie K… 684.226.0455       6.01e18        -122.
    ##  3     1001          22 Dr. Birdie K… 684.226.0455       6.01e18        -122.
    ##  4     1001          22 Dr. Birdie K… 684.226.0455       6.01e18        -122.
    ##  5     1002           6 Meggan Bruen  326-151-4331       4.96e15        -122.
    ##  6     1002           6 Meggan Bruen  326-151-4331       4.96e15        -122.
    ##  7     1002           6 Meggan Bruen  326-151-4331       4.96e15        -122.
    ##  8     1002           6 Meggan Bruen  326-151-4331       4.96e15        -122.
    ##  9     1003          80 Jessee Rodri… 539.176.3896       4.50e12        -122.
    ## 10     1003          80 Jessee Rodri… 539.176.3896       4.50e12        -122.
    ## # … with 49,990 more rows, and 8 more variables: customer_lat <dbl>, date <date>,
    ## #   date_year <dbl>, date_month <dbl>, date_month_name <chr>, date_day <chr>,
    ## #   product_id <dbl>, price <dbl>
  3. Use the id argument to add the file name to the data frame. Use file_name as the argument’s value

    ## Observations: 50,000
    ## Variables: 15
    ## chr  [4]: customer_name, customer_phone, date_month_name, date_day
    ## dbl  [9]: order_id, customer_id, customer_cc, customer_lon, customer_lat, date_year, date_...
    ## date [1]: date
    ## 
    ## Call `spec()` for a copy-pastable column specification
    ## Specify the column types with `col_types` to quiet this message
    ## # A tibble: 50,000 x 15
    ##    file_name order_id customer_id customer_name customer_phone customer_cc
    ##    <chr>        <dbl>       <dbl> <chr>         <chr>                <dbl>
    ##  1 /usr/sha…     1001          22 Dr. Birdie K… 684.226.0455       6.01e18
    ##  2 /usr/sha…     1001          22 Dr. Birdie K… 684.226.0455       6.01e18
    ##  3 /usr/sha…     1001          22 Dr. Birdie K… 684.226.0455       6.01e18
    ##  4 /usr/sha…     1001          22 Dr. Birdie K… 684.226.0455       6.01e18
    ##  5 /usr/sha…     1002           6 Meggan Bruen  326-151-4331       4.96e15
    ##  6 /usr/sha…     1002           6 Meggan Bruen  326-151-4331       4.96e15
    ##  7 /usr/sha…     1002           6 Meggan Bruen  326-151-4331       4.96e15
    ##  8 /usr/sha…     1002           6 Meggan Bruen  326-151-4331       4.96e15
    ##  9 /usr/sha…     1003          80 Jessee Rodri… 539.176.3896       4.50e12
    ## 10 /usr/sha…     1003          80 Jessee Rodri… 539.176.3896       4.50e12
    ## # … with 49,990 more rows, and 9 more variables: customer_lon <dbl>,
    ## #   customer_lat <dbl>, date <date>, date_year <dbl>, date_month <dbl>,
    ## #   date_month_name <chr>, date_day <chr>, product_id <dbl>, price <dbl>
  4. Load the prior command into a variable called vr_transactions

    ## Observations: 50,000
    ## Variables: 15
    ## chr  [4]: customer_name, customer_phone, date_month_name, date_day
    ## dbl  [9]: order_id, customer_id, customer_cc, customer_lon, customer_lat, date_year, date_...
    ## date [1]: date
    ## 
    ## Call `spec()` for a copy-pastable column specification
    ## Specify the column types with `col_types` to quiet this message
    ## # A tibble: 50,000 x 15
    ##    file_name order_id customer_id customer_name customer_phone customer_cc
    ##    <chr>        <dbl>       <dbl> <chr>         <chr>                <dbl>
    ##  1 /usr/sha…     1001          22 Dr. Birdie K… 684.226.0455       6.01e18
    ##  2 /usr/sha…     1001          22 Dr. Birdie K… 684.226.0455       6.01e18
    ##  3 /usr/sha…     1001          22 Dr. Birdie K… 684.226.0455       6.01e18
    ##  4 /usr/sha…     1001          22 Dr. Birdie K… 684.226.0455       6.01e18
    ##  5 /usr/sha…     1002           6 Meggan Bruen  326-151-4331       4.96e15
    ##  6 /usr/sha…     1002           6 Meggan Bruen  326-151-4331       4.96e15
    ##  7 /usr/sha…     1002           6 Meggan Bruen  326-151-4331       4.96e15
    ##  8 /usr/sha…     1002           6 Meggan Bruen  326-151-4331       4.96e15
    ##  9 /usr/sha…     1003          80 Jessee Rodri… 539.176.3896       4.50e12
    ## 10 /usr/sha…     1003          80 Jessee Rodri… 539.176.3896       4.50e12
    ## # … with 49,990 more rows, and 9 more variables: customer_lon <dbl>,
    ## #   customer_lat <dbl>, date <date>, date_year <dbl>, date_month <dbl>,
    ## #   date_month_name <chr>, date_day <chr>, product_id <dbl>, price <dbl>
  5. Load the file spec into a variable called vr_spec, using the spec() command

    ## cols(
    ##   order_id = col_double(),
    ##   customer_id = col_double(),
    ##   customer_name = col_character(),
    ##   customer_phone = col_character(),
    ##   customer_cc = col_double(),
    ##   customer_lon = col_double(),
    ##   customer_lat = col_double(),
    ##   date = col_date(format = ""),
    ##   date_year = col_double(),
    ##   date_month = col_double(),
    ##   date_month_name = col_character(),
    ##   date_day = col_character(),
    ##   product_id = col_double(),
    ##   price = col_double()
    ## )

1.2 Load multiple files

  1. Load the fs and dplyr libraries

  2. List files in the /usr/share/class/files folder using the dir_ls() function

    ## /usr/share/class/files/transactions_1.csv
    ## /usr/share/class/files/transactions_2.csv
    ## /usr/share/class/files/transactions_3.csv
    ## /usr/share/class/files/transactions_4.csv
    ## /usr/share/class/files/transactions_5.csv
  3. In the dir_ls() function, use the glob argument to pass a wildcard to list CSV files only. Load to a variable named files

  4. Pass the files variable to vroom. Set the n_max argument to 1,000 to limit the data load for now

    ## Observations: 5,000
    ## Variables: 14
    ## chr  [4]: customer_name, customer_phone, date_month_name, date_day
    ## dbl  [9]: order_id, customer_id, customer_cc, customer_lon, customer_lat, date_year, date_...
    ## date [1]: date
    ## 
    ## Call `spec()` for a copy-pastable column specification
    ## Specify the column types with `col_types` to quiet this message
    ## # A tibble: 5,000 x 14
    ##    order_id customer_id customer_name customer_phone customer_cc customer_lon
    ##       <dbl>       <dbl> <chr>         <chr>                <dbl>        <dbl>
    ##  1     1001          22 Dr. Birdie K… 684.226.0455       6.01e18        -122.
    ##  2     1001          22 Dr. Birdie K… 684.226.0455       6.01e18        -122.
    ##  3     1001          22 Dr. Birdie K… 684.226.0455       6.01e18        -122.
    ##  4     1001          22 Dr. Birdie K… 684.226.0455       6.01e18        -122.
    ##  5     1002           6 Meggan Bruen  326-151-4331       4.96e15        -122.
    ##  6     1002           6 Meggan Bruen  326-151-4331       4.96e15        -122.
    ##  7     1002           6 Meggan Bruen  326-151-4331       4.96e15        -122.
    ##  8     1002           6 Meggan Bruen  326-151-4331       4.96e15        -122.
    ##  9     1003          80 Jessee Rodri… 539.176.3896       4.50e12        -122.
    ## 10     1003          80 Jessee Rodri… 539.176.3896       4.50e12        -122.
    ## # … with 4,990 more rows, and 8 more variables: customer_lat <dbl>, date <date>,
    ## #   date_year <dbl>, date_month <dbl>, date_month_name <chr>, date_day <chr>,
    ## #   product_id <dbl>, price <dbl>
  5. Add a col_types argument with vr_specs as its value

    ## # A tibble: 5,000 x 14
    ##    order_id customer_id customer_name customer_phone customer_cc customer_lon
    ##       <dbl>       <dbl> <chr>         <chr>                <dbl>        <dbl>
    ##  1     1001          22 Dr. Birdie K… 684.226.0455       6.01e18        -122.
    ##  2     1001          22 Dr. Birdie K… 684.226.0455       6.01e18        -122.
    ##  3     1001          22 Dr. Birdie K… 684.226.0455       6.01e18        -122.
    ##  4     1001          22 Dr. Birdie K… 684.226.0455       6.01e18        -122.
    ##  5     1002           6 Meggan Bruen  326-151-4331       4.96e15        -122.
    ##  6     1002           6 Meggan Bruen  326-151-4331       4.96e15        -122.
    ##  7     1002           6 Meggan Bruen  326-151-4331       4.96e15        -122.
    ##  8     1002           6 Meggan Bruen  326-151-4331       4.96e15        -122.
    ##  9     1003          80 Jessee Rodri… 539.176.3896       4.50e12        -122.
    ## 10     1003          80 Jessee Rodri… 539.176.3896       4.50e12        -122.
    ## # … with 4,990 more rows, and 8 more variables: customer_lat <dbl>, date <date>,
    ## #   date_year <dbl>, date_month <dbl>, date_month_name <chr>, date_day <chr>,
    ## #   product_id <dbl>, price <dbl>
  6. Use the col_select argument to pass a list object containing the following variables: order_id, date, customer_name, and price

    ## # A tibble: 5,000 x 4
    ##    order_id date       customer_name        price
    ##       <dbl> <date>     <chr>                <dbl>
    ##  1     1001 2016-01-01 Dr. Birdie Kessler    9.88
    ##  2     1001 2016-01-01 Dr. Birdie Kessler    7.53
    ##  3     1001 2016-01-01 Dr. Birdie Kessler    5.64
    ##  4     1001 2016-01-01 Dr. Birdie Kessler    4.89
    ##  5     1002 2016-01-01 Meggan Bruen          6.48
    ##  6     1002 2016-01-01 Meggan Bruen          6.7 
    ##  7     1002 2016-01-01 Meggan Bruen          4.27
    ##  8     1002 2016-01-01 Meggan Bruen          7.38
    ##  9     1003 2016-01-01 Jessee Rodriguez Jr.  7.53
    ## 10     1003 2016-01-01 Jessee Rodriguez Jr.  5.21
    ## # … with 4,990 more rows

1.3 Load and modify multiple files

For files that are too large to have in memory, keep a summarization

  1. Use a for() loop to print the content of each vector inside files

    ## /usr/share/class/files/transactions_1.csv
    ## /usr/share/class/files/transactions_2.csv
    ## /usr/share/class/files/transactions_3.csv
    ## /usr/share/class/files/transactions_4.csv
    ## /usr/share/class/files/transactions_5.csv
  2. Switch the print() command with the vroom command, using the same arguments, except the file name. Use the files variable. Load the results into a variable called transactions.

  3. Group transactions by order_id and get the total of price and the number of records. Name them total_sales and no_items respectively. Name the new variable orders

  4. Define the orders variable as NULL prior to the for loop and add a bind_rows() step to orders to preserve each summarized view.

  5. Remove the transactions variable at the end of each cycle

  6. Preview the orders variable

    ## # A tibble: 715 x 3
    ##    order_id total_sales no_items
    ##       <dbl>       <dbl>    <int>
    ##  1    41865        50.9        8
    ##  2    41866        97.4       14
    ##  3    41867       123.        16
    ##  4    41868        91.9       14
    ##  5    41869        63.2       10
    ##  6    41870        75.2       10
    ##  7    41871        70.6       10
    ##  8    41872        60.4        8
    ##  9    41873        76.2       10
    ## 10    41874        75.7       12
    ## # … with 705 more rows