1 Introduction to vroom
1.1 vroom
basics
Load data into R using vroom
Load the
vroom()
libraryUse the
vroom()
function to read the transactions_1.csv file from the /usr/share/class/files folder## Observations: 50,000 ## Variables: 14 ## chr [4]: customer_name, customer_phone, date_month_name, date_day ## dbl [9]: order_id, customer_id, customer_cc, customer_lon, customer_lat, date_year, date_... ## date [1]: date ## ## Call `spec()` for a copy-pastable column specification ## Specify the column types with `col_types` to quiet this message
## # A tibble: 50,000 x 14 ## order_id customer_id customer_name customer_phone customer_cc customer_lon ## <dbl> <dbl> <chr> <chr> <dbl> <dbl> ## 1 1001 22 Dr. Birdie K… 684.226.0455 6.01e18 -122. ## 2 1001 22 Dr. Birdie K… 684.226.0455 6.01e18 -122. ## 3 1001 22 Dr. Birdie K… 684.226.0455 6.01e18 -122. ## 4 1001 22 Dr. Birdie K… 684.226.0455 6.01e18 -122. ## 5 1002 6 Meggan Bruen 326-151-4331 4.96e15 -122. ## 6 1002 6 Meggan Bruen 326-151-4331 4.96e15 -122. ## 7 1002 6 Meggan Bruen 326-151-4331 4.96e15 -122. ## 8 1002 6 Meggan Bruen 326-151-4331 4.96e15 -122. ## 9 1003 80 Jessee Rodri… 539.176.3896 4.50e12 -122. ## 10 1003 80 Jessee Rodri… 539.176.3896 4.50e12 -122. ## # … with 49,990 more rows, and 8 more variables: customer_lat <dbl>, date <date>, ## # date_year <dbl>, date_month <dbl>, date_month_name <chr>, date_day <chr>, ## # product_id <dbl>, price <dbl>
Use the
id
argument to add the file name to the data frame. Use file_name as the argument’s value## Observations: 50,000 ## Variables: 15 ## chr [4]: customer_name, customer_phone, date_month_name, date_day ## dbl [9]: order_id, customer_id, customer_cc, customer_lon, customer_lat, date_year, date_... ## date [1]: date ## ## Call `spec()` for a copy-pastable column specification ## Specify the column types with `col_types` to quiet this message
## # A tibble: 50,000 x 15 ## file_name order_id customer_id customer_name customer_phone customer_cc ## <chr> <dbl> <dbl> <chr> <chr> <dbl> ## 1 /usr/sha… 1001 22 Dr. Birdie K… 684.226.0455 6.01e18 ## 2 /usr/sha… 1001 22 Dr. Birdie K… 684.226.0455 6.01e18 ## 3 /usr/sha… 1001 22 Dr. Birdie K… 684.226.0455 6.01e18 ## 4 /usr/sha… 1001 22 Dr. Birdie K… 684.226.0455 6.01e18 ## 5 /usr/sha… 1002 6 Meggan Bruen 326-151-4331 4.96e15 ## 6 /usr/sha… 1002 6 Meggan Bruen 326-151-4331 4.96e15 ## 7 /usr/sha… 1002 6 Meggan Bruen 326-151-4331 4.96e15 ## 8 /usr/sha… 1002 6 Meggan Bruen 326-151-4331 4.96e15 ## 9 /usr/sha… 1003 80 Jessee Rodri… 539.176.3896 4.50e12 ## 10 /usr/sha… 1003 80 Jessee Rodri… 539.176.3896 4.50e12 ## # … with 49,990 more rows, and 9 more variables: customer_lon <dbl>, ## # customer_lat <dbl>, date <date>, date_year <dbl>, date_month <dbl>, ## # date_month_name <chr>, date_day <chr>, product_id <dbl>, price <dbl>
Load the prior command into a variable called
vr_transactions
## Observations: 50,000 ## Variables: 15 ## chr [4]: customer_name, customer_phone, date_month_name, date_day ## dbl [9]: order_id, customer_id, customer_cc, customer_lon, customer_lat, date_year, date_... ## date [1]: date ## ## Call `spec()` for a copy-pastable column specification ## Specify the column types with `col_types` to quiet this message
## # A tibble: 50,000 x 15 ## file_name order_id customer_id customer_name customer_phone customer_cc ## <chr> <dbl> <dbl> <chr> <chr> <dbl> ## 1 /usr/sha… 1001 22 Dr. Birdie K… 684.226.0455 6.01e18 ## 2 /usr/sha… 1001 22 Dr. Birdie K… 684.226.0455 6.01e18 ## 3 /usr/sha… 1001 22 Dr. Birdie K… 684.226.0455 6.01e18 ## 4 /usr/sha… 1001 22 Dr. Birdie K… 684.226.0455 6.01e18 ## 5 /usr/sha… 1002 6 Meggan Bruen 326-151-4331 4.96e15 ## 6 /usr/sha… 1002 6 Meggan Bruen 326-151-4331 4.96e15 ## 7 /usr/sha… 1002 6 Meggan Bruen 326-151-4331 4.96e15 ## 8 /usr/sha… 1002 6 Meggan Bruen 326-151-4331 4.96e15 ## 9 /usr/sha… 1003 80 Jessee Rodri… 539.176.3896 4.50e12 ## 10 /usr/sha… 1003 80 Jessee Rodri… 539.176.3896 4.50e12 ## # … with 49,990 more rows, and 9 more variables: customer_lon <dbl>, ## # customer_lat <dbl>, date <date>, date_year <dbl>, date_month <dbl>, ## # date_month_name <chr>, date_day <chr>, product_id <dbl>, price <dbl>
Load the file spec into a variable called
vr_spec
, using thespec()
command## cols( ## order_id = col_double(), ## customer_id = col_double(), ## customer_name = col_character(), ## customer_phone = col_character(), ## customer_cc = col_double(), ## customer_lon = col_double(), ## customer_lat = col_double(), ## date = col_date(format = ""), ## date_year = col_double(), ## date_month = col_double(), ## date_month_name = col_character(), ## date_day = col_character(), ## product_id = col_double(), ## price = col_double() ## )
1.2 Load multiple files
Load the
fs
anddplyr
librariesList files in the /usr/share/class/files folder using the
dir_ls()
function## /usr/share/class/files/transactions_1.csv ## /usr/share/class/files/transactions_2.csv ## /usr/share/class/files/transactions_3.csv ## /usr/share/class/files/transactions_4.csv ## /usr/share/class/files/transactions_5.csv
In the
dir_ls()
function, use theglob
argument to pass a wildcard to list CSV files only. Load to a variable namedfiles
Pass the
files
variable tovroom
. Set then_max
argument to 1,000 to limit the data load for now## Observations: 5,000 ## Variables: 14 ## chr [4]: customer_name, customer_phone, date_month_name, date_day ## dbl [9]: order_id, customer_id, customer_cc, customer_lon, customer_lat, date_year, date_... ## date [1]: date ## ## Call `spec()` for a copy-pastable column specification ## Specify the column types with `col_types` to quiet this message
## # A tibble: 5,000 x 14 ## order_id customer_id customer_name customer_phone customer_cc customer_lon ## <dbl> <dbl> <chr> <chr> <dbl> <dbl> ## 1 1001 22 Dr. Birdie K… 684.226.0455 6.01e18 -122. ## 2 1001 22 Dr. Birdie K… 684.226.0455 6.01e18 -122. ## 3 1001 22 Dr. Birdie K… 684.226.0455 6.01e18 -122. ## 4 1001 22 Dr. Birdie K… 684.226.0455 6.01e18 -122. ## 5 1002 6 Meggan Bruen 326-151-4331 4.96e15 -122. ## 6 1002 6 Meggan Bruen 326-151-4331 4.96e15 -122. ## 7 1002 6 Meggan Bruen 326-151-4331 4.96e15 -122. ## 8 1002 6 Meggan Bruen 326-151-4331 4.96e15 -122. ## 9 1003 80 Jessee Rodri… 539.176.3896 4.50e12 -122. ## 10 1003 80 Jessee Rodri… 539.176.3896 4.50e12 -122. ## # … with 4,990 more rows, and 8 more variables: customer_lat <dbl>, date <date>, ## # date_year <dbl>, date_month <dbl>, date_month_name <chr>, date_day <chr>, ## # product_id <dbl>, price <dbl>
Add a
col_types
argument withvr_specs
as its value## # A tibble: 5,000 x 14 ## order_id customer_id customer_name customer_phone customer_cc customer_lon ## <dbl> <dbl> <chr> <chr> <dbl> <dbl> ## 1 1001 22 Dr. Birdie K… 684.226.0455 6.01e18 -122. ## 2 1001 22 Dr. Birdie K… 684.226.0455 6.01e18 -122. ## 3 1001 22 Dr. Birdie K… 684.226.0455 6.01e18 -122. ## 4 1001 22 Dr. Birdie K… 684.226.0455 6.01e18 -122. ## 5 1002 6 Meggan Bruen 326-151-4331 4.96e15 -122. ## 6 1002 6 Meggan Bruen 326-151-4331 4.96e15 -122. ## 7 1002 6 Meggan Bruen 326-151-4331 4.96e15 -122. ## 8 1002 6 Meggan Bruen 326-151-4331 4.96e15 -122. ## 9 1003 80 Jessee Rodri… 539.176.3896 4.50e12 -122. ## 10 1003 80 Jessee Rodri… 539.176.3896 4.50e12 -122. ## # … with 4,990 more rows, and 8 more variables: customer_lat <dbl>, date <date>, ## # date_year <dbl>, date_month <dbl>, date_month_name <chr>, date_day <chr>, ## # product_id <dbl>, price <dbl>
Use the
col_select
argument to pass alist
object containing the following variables: order_id, date, customer_name, and pricevroom(files, n_max = 1000, col_types = vr_spec, col_select = list(order_id, date, customer_name, price) )
## # A tibble: 5,000 x 4 ## order_id date customer_name price ## <dbl> <date> <chr> <dbl> ## 1 1001 2016-01-01 Dr. Birdie Kessler 9.88 ## 2 1001 2016-01-01 Dr. Birdie Kessler 7.53 ## 3 1001 2016-01-01 Dr. Birdie Kessler 5.64 ## 4 1001 2016-01-01 Dr. Birdie Kessler 4.89 ## 5 1002 2016-01-01 Meggan Bruen 6.48 ## 6 1002 2016-01-01 Meggan Bruen 6.7 ## 7 1002 2016-01-01 Meggan Bruen 4.27 ## 8 1002 2016-01-01 Meggan Bruen 7.38 ## 9 1003 2016-01-01 Jessee Rodriguez Jr. 7.53 ## 10 1003 2016-01-01 Jessee Rodriguez Jr. 5.21 ## # … with 4,990 more rows
1.3 Load and modify multiple files
For files that are too large to have in memory, keep a summarization
Use a
for()
loop to print the content of each vector insidefiles
## /usr/share/class/files/transactions_1.csv ## /usr/share/class/files/transactions_2.csv ## /usr/share/class/files/transactions_3.csv ## /usr/share/class/files/transactions_4.csv ## /usr/share/class/files/transactions_5.csv
Switch the
print()
command with thevroom
command, using the same arguments, except the file name. Use thefiles
variable. Load the results into a variable calledtransactions
.Group
transactions
byorder_id
and get the total ofprice
and the number of records. Name themtotal_sales
andno_items
respectively. Name the new variableorders
Define the
orders
variable asNULL
prior to the for loop and add abind_rows()
step toorders
to preserve each summarized view.Remove the
transactions
variable at the end of each cycleorders <- NULL for(i in seq_along(files)) { transactions <- vroom(files[i], n_max = 1000, col_types = vr_spec, col_select = list(order_id, date, customer_name, price)) orders <- transactions %>% group_by(order_id) %>% summarise(total_sales = sum(price), no_items = n()) %>% bind_rows(orders) rm(transactions) }
Preview the
orders
variable## # A tibble: 715 x 3 ## order_id total_sales no_items ## <dbl> <dbl> <int> ## 1 41865 50.9 8 ## 2 41866 97.4 14 ## 3 41867 123. 16 ## 4 41868 91.9 14 ## 5 41869 63.2 10 ## 6 41870 75.2 10 ## 7 41871 70.6 10 ## 8 41872 60.4 8 ## 9 41873 76.2 10 ## 10 41874 75.7 12 ## # … with 705 more rows