1 Introduction to `vroom`

1.1 `vroom` basics

Load data into R using vroom

Load the vroom() library
```
library(vroom)
```

Use the vroom() function to read the transactions_1.csv file from the /usr/share/class/files folder

vroom("/usr/share/class/files/transactions_1.csv")

## Observations: 50,000
## Variables: 14
## chr  [4]: customer_name, customer_phone, date_month_name, date_day
## dbl  [9]: order_id, customer_id, customer_cc, customer_lon, customer_lat, date_year, date_...
## date [1]: date
## 
## Call `spec()` for a copy-pastable column specification
## Specify the column types with `col_types` to quiet this message

## # A tibble: 50,000 x 14
##    order_id customer_id customer_name customer_phone customer_cc customer_lon
##       <dbl>       <dbl> <chr>         <chr>                <dbl>        <dbl>
##  1     1001          22 Dr. Birdie K… 684.226.0455       6.01e18        -122.
##  2     1001          22 Dr. Birdie K… 684.226.0455       6.01e18        -122.
##  3     1001          22 Dr. Birdie K… 684.226.0455       6.01e18        -122.
##  4     1001          22 Dr. Birdie K… 684.226.0455       6.01e18        -122.
##  5     1002           6 Meggan Bruen  326-151-4331       4.96e15        -122.
##  6     1002           6 Meggan Bruen  326-151-4331       4.96e15        -122.
##  7     1002           6 Meggan Bruen  326-151-4331       4.96e15        -122.
##  8     1002           6 Meggan Bruen  326-151-4331       4.96e15        -122.
##  9     1003          80 Jessee Rodri… 539.176.3896       4.50e12        -122.
## 10     1003          80 Jessee Rodri… 539.176.3896       4.50e12        -122.
## # … with 49,990 more rows, and 8 more variables: customer_lat <dbl>, date <date>,
## #   date_year <dbl>, date_month <dbl>, date_month_name <chr>, date_day <chr>,
## #   product_id <dbl>, price <dbl>

Use the id argument to add the file name to the data frame. Use file_name as the argument’s value

vroom("/usr/share/class/files/transactions_1.csv", id = "file_name")

## Observations: 50,000
## Variables: 15
## chr  [4]: customer_name, customer_phone, date_month_name, date_day
## dbl  [9]: order_id, customer_id, customer_cc, customer_lon, customer_lat, date_year, date_...
## date [1]: date
## 
## Call `spec()` for a copy-pastable column specification
## Specify the column types with `col_types` to quiet this message

## # A tibble: 50,000 x 15
##    file_name order_id customer_id customer_name customer_phone customer_cc
##    <chr>        <dbl>       <dbl> <chr>         <chr>                <dbl>
##  1 /usr/sha…     1001          22 Dr. Birdie K… 684.226.0455       6.01e18
##  2 /usr/sha…     1001          22 Dr. Birdie K… 684.226.0455       6.01e18
##  3 /usr/sha…     1001          22 Dr. Birdie K… 684.226.0455       6.01e18
##  4 /usr/sha…     1001          22 Dr. Birdie K… 684.226.0455       6.01e18
##  5 /usr/sha…     1002           6 Meggan Bruen  326-151-4331       4.96e15
##  6 /usr/sha…     1002           6 Meggan Bruen  326-151-4331       4.96e15
##  7 /usr/sha…     1002           6 Meggan Bruen  326-151-4331       4.96e15
##  8 /usr/sha…     1002           6 Meggan Bruen  326-151-4331       4.96e15
##  9 /usr/sha…     1003          80 Jessee Rodri… 539.176.3896       4.50e12
## 10 /usr/sha…     1003          80 Jessee Rodri… 539.176.3896       4.50e12
## # … with 49,990 more rows, and 9 more variables: customer_lon <dbl>,
## #   customer_lat <dbl>, date <date>, date_year <dbl>, date_month <dbl>,
## #   date_month_name <chr>, date_day <chr>, product_id <dbl>, price <dbl>

Load the prior command into a variable called vr_transactions

vr_transactions <- vroom("/usr/share/class/files/transactions_1.csv", id = "file_name")

## Observations: 50,000
## Variables: 15
## chr  [4]: customer_name, customer_phone, date_month_name, date_day
## dbl  [9]: order_id, customer_id, customer_cc, customer_lon, customer_lat, date_year, date_...
## date [1]: date
## 
## Call `spec()` for a copy-pastable column specification
## Specify the column types with `col_types` to quiet this message

vr_transactions

## # A tibble: 50,000 x 15
##    file_name order_id customer_id customer_name customer_phone customer_cc
##    <chr>        <dbl>       <dbl> <chr>         <chr>                <dbl>
##  1 /usr/sha…     1001          22 Dr. Birdie K… 684.226.0455       6.01e18
##  2 /usr/sha…     1001          22 Dr. Birdie K… 684.226.0455       6.01e18
##  3 /usr/sha…     1001          22 Dr. Birdie K… 684.226.0455       6.01e18
##  4 /usr/sha…     1001          22 Dr. Birdie K… 684.226.0455       6.01e18
##  5 /usr/sha…     1002           6 Meggan Bruen  326-151-4331       4.96e15
##  6 /usr/sha…     1002           6 Meggan Bruen  326-151-4331       4.96e15
##  7 /usr/sha…     1002           6 Meggan Bruen  326-151-4331       4.96e15
##  8 /usr/sha…     1002           6 Meggan Bruen  326-151-4331       4.96e15
##  9 /usr/sha…     1003          80 Jessee Rodri… 539.176.3896       4.50e12
## 10 /usr/sha…     1003          80 Jessee Rodri… 539.176.3896       4.50e12
## # … with 49,990 more rows, and 9 more variables: customer_lon <dbl>,
## #   customer_lat <dbl>, date <date>, date_year <dbl>, date_month <dbl>,
## #   date_month_name <chr>, date_day <chr>, product_id <dbl>, price <dbl>

Load the file spec into a variable called vr_spec, using the spec() command

vr_spec <- spec(vr_transactions)

vr_spec

## cols(
##   order_id = col_double(),
##   customer_id = col_double(),
##   customer_name = col_character(),
##   customer_phone = col_character(),
##   customer_cc = col_double(),
##   customer_lon = col_double(),
##   customer_lat = col_double(),
##   date = col_date(format = ""),
##   date_year = col_double(),
##   date_month = col_double(),
##   date_month_name = col_character(),
##   date_day = col_character(),
##   product_id = col_double(),
##   price = col_double()
## )

1.2 Load multiple files

Load the fs and dplyr libraries
```
library(fs)
library(dplyr)
```

List files in the /usr/share/class/files folder using the dir_ls() function

dir_ls("/usr/share/class/files")

## /usr/share/class/files/transactions_1.csv
## /usr/share/class/files/transactions_2.csv
## /usr/share/class/files/transactions_3.csv
## /usr/share/class/files/transactions_4.csv
## /usr/share/class/files/transactions_5.csv

In the dir_ls() function, use the glob argument to pass a wildcard to list CSV files only. Load to a variable named files
```
files <- dir_ls("/usr/share/class/files", glob = "*.csv")
```

Pass the files variable to vroom. Set the n_max argument to 1,000 to limit the data load for now

vroom(files, n_max = 1000)

## Observations: 5,000
## Variables: 14
## chr  [4]: customer_name, customer_phone, date_month_name, date_day
## dbl  [9]: order_id, customer_id, customer_cc, customer_lon, customer_lat, date_year, date_...
## date [1]: date
## 
## Call `spec()` for a copy-pastable column specification
## Specify the column types with `col_types` to quiet this message

## # A tibble: 5,000 x 14
##    order_id customer_id customer_name customer_phone customer_cc customer_lon
##       <dbl>       <dbl> <chr>         <chr>                <dbl>        <dbl>
##  1     1001          22 Dr. Birdie K… 684.226.0455       6.01e18        -122.
##  2     1001          22 Dr. Birdie K… 684.226.0455       6.01e18        -122.
##  3     1001          22 Dr. Birdie K… 684.226.0455       6.01e18        -122.
##  4     1001          22 Dr. Birdie K… 684.226.0455       6.01e18        -122.
##  5     1002           6 Meggan Bruen  326-151-4331       4.96e15        -122.
##  6     1002           6 Meggan Bruen  326-151-4331       4.96e15        -122.
##  7     1002           6 Meggan Bruen  326-151-4331       4.96e15        -122.
##  8     1002           6 Meggan Bruen  326-151-4331       4.96e15        -122.
##  9     1003          80 Jessee Rodri… 539.176.3896       4.50e12        -122.
## 10     1003          80 Jessee Rodri… 539.176.3896       4.50e12        -122.
## # … with 4,990 more rows, and 8 more variables: customer_lat <dbl>, date <date>,
## #   date_year <dbl>, date_month <dbl>, date_month_name <chr>, date_day <chr>,
## #   product_id <dbl>, price <dbl>

Add a col_types argument with vr_specs as its value

vroom(files, n_max = 1000, col_types = vr_spec)

## # A tibble: 5,000 x 14
##    order_id customer_id customer_name customer_phone customer_cc customer_lon
##       <dbl>       <dbl> <chr>         <chr>                <dbl>        <dbl>
##  1     1001          22 Dr. Birdie K… 684.226.0455       6.01e18        -122.
##  2     1001          22 Dr. Birdie K… 684.226.0455       6.01e18        -122.
##  3     1001          22 Dr. Birdie K… 684.226.0455       6.01e18        -122.
##  4     1001          22 Dr. Birdie K… 684.226.0455       6.01e18        -122.
##  5     1002           6 Meggan Bruen  326-151-4331       4.96e15        -122.
##  6     1002           6 Meggan Bruen  326-151-4331       4.96e15        -122.
##  7     1002           6 Meggan Bruen  326-151-4331       4.96e15        -122.
##  8     1002           6 Meggan Bruen  326-151-4331       4.96e15        -122.
##  9     1003          80 Jessee Rodri… 539.176.3896       4.50e12        -122.
## 10     1003          80 Jessee Rodri… 539.176.3896       4.50e12        -122.
## # … with 4,990 more rows, and 8 more variables: customer_lat <dbl>, date <date>,
## #   date_year <dbl>, date_month <dbl>, date_month_name <chr>, date_day <chr>,
## #   product_id <dbl>, price <dbl>

Use the col_select argument to pass a list object containing the following variables: order_id, date, customer_name, and price

vroom(files, n_max = 1000, col_types = vr_spec,
      col_select = list(order_id, date, customer_name, price)
      )

## # A tibble: 5,000 x 4
##    order_id date       customer_name        price
##       <dbl> <date>     <chr>                <dbl>
##  1     1001 2016-01-01 Dr. Birdie Kessler    9.88
##  2     1001 2016-01-01 Dr. Birdie Kessler    7.53
##  3     1001 2016-01-01 Dr. Birdie Kessler    5.64
##  4     1001 2016-01-01 Dr. Birdie Kessler    4.89
##  5     1002 2016-01-01 Meggan Bruen          6.48
##  6     1002 2016-01-01 Meggan Bruen          6.7 
##  7     1002 2016-01-01 Meggan Bruen          4.27
##  8     1002 2016-01-01 Meggan Bruen          7.38
##  9     1003 2016-01-01 Jessee Rodriguez Jr.  7.53
## 10     1003 2016-01-01 Jessee Rodriguez Jr.  5.21
## # … with 4,990 more rows

1.3 Load and modify multiple files

For files that are too large to have in memory, keep a summarization

Use a for() loop to print the content of each vector inside files

for(i in seq_along(files)) {
  print(files[i])
}

## /usr/share/class/files/transactions_1.csv
## /usr/share/class/files/transactions_2.csv
## /usr/share/class/files/transactions_3.csv
## /usr/share/class/files/transactions_4.csv
## /usr/share/class/files/transactions_5.csv

Switch the print() command with the vroom command, using the same arguments, except the file name. Use the files variable. Load the results into a variable called transactions.

for(i in seq_along(files)) {
  transactions <- vroom(files[i], n_max = 1000, col_types = vr_spec, 
                       col_select = list(order_id, date, customer_name, price))

}

Group transactions by order_id and get the total of price and the number of records. Name them total_sales and no_items respectively. Name the new variable orders

for(i in seq_along(files)) {
  transactions <- vroom(files[i], n_max = 1000, col_types = vr_spec, 
                       col_select = list(order_id, date, customer_name, price))
  orders <- transactions %>%
    group_by(order_id) %>%
    summarise(total_sales = sum(price), no_items = n())
}

Define the orders variable as NULL prior to the for loop and add a bind_rows() step to orders to preserve each summarized view.

orders <- NULL
for(i in seq_along(files)) {
  transactions <- vroom(files[i], n_max = 1000, col_types = vr_spec, 
                       col_select = list(order_id, date, customer_name, price))
  orders <- transactions %>%
    group_by(order_id) %>%
    summarise(total_sales = sum(price), no_items = n()) %>%
    bind_rows(orders)
}

Remove the transactions variable at the end of each cycle

orders <- NULL
for(i in seq_along(files)) {
  transactions <- vroom(files[i], n_max = 1000, col_types = vr_spec, 
                       col_select = list(order_id, date, customer_name, price))
  orders <- transactions %>%
    group_by(order_id) %>%
    summarise(total_sales = sum(price), no_items = n()) %>%
    bind_rows(orders)
  rm(transactions)
}

Preview the orders variable

orders

## # A tibble: 715 x 3
##    order_id total_sales no_items
##       <dbl>       <dbl>    <int>
##  1    41865        50.9        8
##  2    41866        97.4       14
##  3    41867       123.        16
##  4    41868        91.9       14
##  5    41869        63.2       10
##  6    41870        75.2       10
##  7    41871        70.6       10
##  8    41872        60.4        8
##  9    41873        76.2       10
## 10    41874        75.7       12
## # … with 705 more rows

1 Introduction to vroom

1.1 vroom basics

1.2 Load multiple files

1.3 Load and modify multiple files

1 Introduction to `vroom`

1.1 `vroom` basics