1 Introduction to vroom
1.1 vroom basics
Load data into R using vroom
Load the
vroom()libraryUse the
vroom()function to read the transactions_1.csv file from the /usr/share/class/files folder## Observations: 50,000 ## Variables: 14 ## chr [4]: customer_name, customer_phone, date_month_name, date_day ## dbl [9]: order_id, customer_id, customer_cc, customer_lon, customer_lat, date_year, date_... ## date [1]: date ## ## Call `spec()` for a copy-pastable column specification ## Specify the column types with `col_types` to quiet this message## # A tibble: 50,000 x 14 ## order_id customer_id customer_name customer_phone customer_cc customer_lon ## <dbl> <dbl> <chr> <chr> <dbl> <dbl> ## 1 1001 22 Dr. Birdie K… 684.226.0455 6.01e18 -122. ## 2 1001 22 Dr. Birdie K… 684.226.0455 6.01e18 -122. ## 3 1001 22 Dr. Birdie K… 684.226.0455 6.01e18 -122. ## 4 1001 22 Dr. Birdie K… 684.226.0455 6.01e18 -122. ## 5 1002 6 Meggan Bruen 326-151-4331 4.96e15 -122. ## 6 1002 6 Meggan Bruen 326-151-4331 4.96e15 -122. ## 7 1002 6 Meggan Bruen 326-151-4331 4.96e15 -122. ## 8 1002 6 Meggan Bruen 326-151-4331 4.96e15 -122. ## 9 1003 80 Jessee Rodri… 539.176.3896 4.50e12 -122. ## 10 1003 80 Jessee Rodri… 539.176.3896 4.50e12 -122. ## # … with 49,990 more rows, and 8 more variables: customer_lat <dbl>, date <date>, ## # date_year <dbl>, date_month <dbl>, date_month_name <chr>, date_day <chr>, ## # product_id <dbl>, price <dbl>Use the
idargument to add the file name to the data frame. Use file_name as the argument’s value## Observations: 50,000 ## Variables: 15 ## chr [4]: customer_name, customer_phone, date_month_name, date_day ## dbl [9]: order_id, customer_id, customer_cc, customer_lon, customer_lat, date_year, date_... ## date [1]: date ## ## Call `spec()` for a copy-pastable column specification ## Specify the column types with `col_types` to quiet this message## # A tibble: 50,000 x 15 ## file_name order_id customer_id customer_name customer_phone customer_cc ## <chr> <dbl> <dbl> <chr> <chr> <dbl> ## 1 /usr/sha… 1001 22 Dr. Birdie K… 684.226.0455 6.01e18 ## 2 /usr/sha… 1001 22 Dr. Birdie K… 684.226.0455 6.01e18 ## 3 /usr/sha… 1001 22 Dr. Birdie K… 684.226.0455 6.01e18 ## 4 /usr/sha… 1001 22 Dr. Birdie K… 684.226.0455 6.01e18 ## 5 /usr/sha… 1002 6 Meggan Bruen 326-151-4331 4.96e15 ## 6 /usr/sha… 1002 6 Meggan Bruen 326-151-4331 4.96e15 ## 7 /usr/sha… 1002 6 Meggan Bruen 326-151-4331 4.96e15 ## 8 /usr/sha… 1002 6 Meggan Bruen 326-151-4331 4.96e15 ## 9 /usr/sha… 1003 80 Jessee Rodri… 539.176.3896 4.50e12 ## 10 /usr/sha… 1003 80 Jessee Rodri… 539.176.3896 4.50e12 ## # … with 49,990 more rows, and 9 more variables: customer_lon <dbl>, ## # customer_lat <dbl>, date <date>, date_year <dbl>, date_month <dbl>, ## # date_month_name <chr>, date_day <chr>, product_id <dbl>, price <dbl>Load the prior command into a variable called
vr_transactions## Observations: 50,000 ## Variables: 15 ## chr [4]: customer_name, customer_phone, date_month_name, date_day ## dbl [9]: order_id, customer_id, customer_cc, customer_lon, customer_lat, date_year, date_... ## date [1]: date ## ## Call `spec()` for a copy-pastable column specification ## Specify the column types with `col_types` to quiet this message## # A tibble: 50,000 x 15 ## file_name order_id customer_id customer_name customer_phone customer_cc ## <chr> <dbl> <dbl> <chr> <chr> <dbl> ## 1 /usr/sha… 1001 22 Dr. Birdie K… 684.226.0455 6.01e18 ## 2 /usr/sha… 1001 22 Dr. Birdie K… 684.226.0455 6.01e18 ## 3 /usr/sha… 1001 22 Dr. Birdie K… 684.226.0455 6.01e18 ## 4 /usr/sha… 1001 22 Dr. Birdie K… 684.226.0455 6.01e18 ## 5 /usr/sha… 1002 6 Meggan Bruen 326-151-4331 4.96e15 ## 6 /usr/sha… 1002 6 Meggan Bruen 326-151-4331 4.96e15 ## 7 /usr/sha… 1002 6 Meggan Bruen 326-151-4331 4.96e15 ## 8 /usr/sha… 1002 6 Meggan Bruen 326-151-4331 4.96e15 ## 9 /usr/sha… 1003 80 Jessee Rodri… 539.176.3896 4.50e12 ## 10 /usr/sha… 1003 80 Jessee Rodri… 539.176.3896 4.50e12 ## # … with 49,990 more rows, and 9 more variables: customer_lon <dbl>, ## # customer_lat <dbl>, date <date>, date_year <dbl>, date_month <dbl>, ## # date_month_name <chr>, date_day <chr>, product_id <dbl>, price <dbl>Load the file spec into a variable called
vr_spec, using thespec()command## cols( ## order_id = col_double(), ## customer_id = col_double(), ## customer_name = col_character(), ## customer_phone = col_character(), ## customer_cc = col_double(), ## customer_lon = col_double(), ## customer_lat = col_double(), ## date = col_date(format = ""), ## date_year = col_double(), ## date_month = col_double(), ## date_month_name = col_character(), ## date_day = col_character(), ## product_id = col_double(), ## price = col_double() ## )
1.2 Load multiple files
Load the
fsanddplyrlibrariesList files in the /usr/share/class/files folder using the
dir_ls()function## /usr/share/class/files/transactions_1.csv ## /usr/share/class/files/transactions_2.csv ## /usr/share/class/files/transactions_3.csv ## /usr/share/class/files/transactions_4.csv ## /usr/share/class/files/transactions_5.csvIn the
dir_ls()function, use theglobargument to pass a wildcard to list CSV files only. Load to a variable namedfilesPass the
filesvariable tovroom. Set then_maxargument to 1,000 to limit the data load for now## Observations: 5,000 ## Variables: 14 ## chr [4]: customer_name, customer_phone, date_month_name, date_day ## dbl [9]: order_id, customer_id, customer_cc, customer_lon, customer_lat, date_year, date_... ## date [1]: date ## ## Call `spec()` for a copy-pastable column specification ## Specify the column types with `col_types` to quiet this message## # A tibble: 5,000 x 14 ## order_id customer_id customer_name customer_phone customer_cc customer_lon ## <dbl> <dbl> <chr> <chr> <dbl> <dbl> ## 1 1001 22 Dr. Birdie K… 684.226.0455 6.01e18 -122. ## 2 1001 22 Dr. Birdie K… 684.226.0455 6.01e18 -122. ## 3 1001 22 Dr. Birdie K… 684.226.0455 6.01e18 -122. ## 4 1001 22 Dr. Birdie K… 684.226.0455 6.01e18 -122. ## 5 1002 6 Meggan Bruen 326-151-4331 4.96e15 -122. ## 6 1002 6 Meggan Bruen 326-151-4331 4.96e15 -122. ## 7 1002 6 Meggan Bruen 326-151-4331 4.96e15 -122. ## 8 1002 6 Meggan Bruen 326-151-4331 4.96e15 -122. ## 9 1003 80 Jessee Rodri… 539.176.3896 4.50e12 -122. ## 10 1003 80 Jessee Rodri… 539.176.3896 4.50e12 -122. ## # … with 4,990 more rows, and 8 more variables: customer_lat <dbl>, date <date>, ## # date_year <dbl>, date_month <dbl>, date_month_name <chr>, date_day <chr>, ## # product_id <dbl>, price <dbl>Add a
col_typesargument withvr_specsas its value## # A tibble: 5,000 x 14 ## order_id customer_id customer_name customer_phone customer_cc customer_lon ## <dbl> <dbl> <chr> <chr> <dbl> <dbl> ## 1 1001 22 Dr. Birdie K… 684.226.0455 6.01e18 -122. ## 2 1001 22 Dr. Birdie K… 684.226.0455 6.01e18 -122. ## 3 1001 22 Dr. Birdie K… 684.226.0455 6.01e18 -122. ## 4 1001 22 Dr. Birdie K… 684.226.0455 6.01e18 -122. ## 5 1002 6 Meggan Bruen 326-151-4331 4.96e15 -122. ## 6 1002 6 Meggan Bruen 326-151-4331 4.96e15 -122. ## 7 1002 6 Meggan Bruen 326-151-4331 4.96e15 -122. ## 8 1002 6 Meggan Bruen 326-151-4331 4.96e15 -122. ## 9 1003 80 Jessee Rodri… 539.176.3896 4.50e12 -122. ## 10 1003 80 Jessee Rodri… 539.176.3896 4.50e12 -122. ## # … with 4,990 more rows, and 8 more variables: customer_lat <dbl>, date <date>, ## # date_year <dbl>, date_month <dbl>, date_month_name <chr>, date_day <chr>, ## # product_id <dbl>, price <dbl>Use the
col_selectargument to pass alistobject containing the following variables: order_id, date, customer_name, and pricevroom(files, n_max = 1000, col_types = vr_spec, col_select = list(order_id, date, customer_name, price) )## # A tibble: 5,000 x 4 ## order_id date customer_name price ## <dbl> <date> <chr> <dbl> ## 1 1001 2016-01-01 Dr. Birdie Kessler 9.88 ## 2 1001 2016-01-01 Dr. Birdie Kessler 7.53 ## 3 1001 2016-01-01 Dr. Birdie Kessler 5.64 ## 4 1001 2016-01-01 Dr. Birdie Kessler 4.89 ## 5 1002 2016-01-01 Meggan Bruen 6.48 ## 6 1002 2016-01-01 Meggan Bruen 6.7 ## 7 1002 2016-01-01 Meggan Bruen 4.27 ## 8 1002 2016-01-01 Meggan Bruen 7.38 ## 9 1003 2016-01-01 Jessee Rodriguez Jr. 7.53 ## 10 1003 2016-01-01 Jessee Rodriguez Jr. 5.21 ## # … with 4,990 more rows
1.3 Load and modify multiple files
For files that are too large to have in memory, keep a summarization
Use a
for()loop to print the content of each vector insidefiles## /usr/share/class/files/transactions_1.csv ## /usr/share/class/files/transactions_2.csv ## /usr/share/class/files/transactions_3.csv ## /usr/share/class/files/transactions_4.csv ## /usr/share/class/files/transactions_5.csvSwitch the
print()command with thevroomcommand, using the same arguments, except the file name. Use thefilesvariable. Load the results into a variable calledtransactions.Group
transactionsbyorder_idand get the total ofpriceand the number of records. Name themtotal_salesandno_itemsrespectively. Name the new variableordersDefine the
ordersvariable asNULLprior to the for loop and add abind_rows()step toordersto preserve each summarized view.Remove the
transactionsvariable at the end of each cycleorders <- NULL for(i in seq_along(files)) { transactions <- vroom(files[i], n_max = 1000, col_types = vr_spec, col_select = list(order_id, date, customer_name, price)) orders <- transactions %>% group_by(order_id) %>% summarise(total_sales = sum(price), no_items = n()) %>% bind_rows(orders) rm(transactions) }Preview the
ordersvariable## # A tibble: 715 x 3 ## order_id total_sales no_items ## <dbl> <dbl> <int> ## 1 41865 50.9 8 ## 2 41866 97.4 14 ## 3 41867 123. 16 ## 4 41868 91.9 14 ## 5 41869 63.2 10 ## 6 41870 75.2 10 ## 7 41871 70.6 10 ## 8 41872 60.4 8 ## 9 41873 76.2 10 ## 10 41874 75.7 12 ## # … with 705 more rows