11 Spark data caching
11.1 Map data
See the mechanics of how Spark is able to use files as a data source
Examine the contents of the /usr/share/class/files folder
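The folder can also be inspected from the R console (a minimal sketch; only the folder path comes from the exercise):

```r
# List the files available to use as a data source
list.files("/usr/share/class/files")
```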
Load the sparklyr library
Use spark_connect() to create a new local Spark session

## * Using Spark: 2.4.0
Load the readr and purrr libraries
Read the top 5 rows of the transactions_1 CSV file

## Parsed with column specification:
## cols(
##   order_id = col_double(),
##   customer_id = col_double(),
##   customer_name = col_character(),
##   customer_phone = col_character(),
##   customer_cc = col_double(),
##   customer_lon = col_double(),
##   customer_lat = col_double(),
##   date = col_date(format = ""),
##   date_year = col_double(),
##   date_month = col_double(),
##   date_month_name = col_character(),
##   date_day = col_character(),
##   product_id = col_double(),
##   price = col_double()
## )
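One way to preview the file with readr (a sketch; the exact path of the transactions_1 CSV under /usr/share/class/files is an assumption):

```r
library(readr)
library(purrr)

# Read only the first 5 rows so readr reports the column specification
# without loading the whole file; the file path is an assumption
top_rows <- read_csv(
  "/usr/share/class/files/transactions_1.csv",
  n_max = 5
)
```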
Create a list based on the column names, with "character" as the value of each list item. Name the variable file_columns
Preview the contents of the file_columns variable

## $order_id
## [1] "character"
##
## $customer_id
## [1] "character"
##
## $customer_name
## [1] "character"
##
## $customer_phone
## [1] "character"
##
## $customer_cc
## [1] "character"
##
## $customer_lon
## [1] "character"
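A sketch of one way to build that list with purrr; `top_rows` is the assumed name of the 5-row preview read in the previous step:

```r
library(purrr)

# Every column name becomes a list item whose value is "character"
file_columns <- map(top_rows, ~ "character")

# Preview the resulting list
file_columns
```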
Use spark_read() to “map” the file’s structure and location to the Spark context. Assign it to the spark_lineitems variable
In the Connections pane, click on the table icon by the transactions variable
Verify that the new variable pointer works by using tally()

## # Source: spark<?> [?? x 1]
##        n
##    <dbl>
## 1 250000
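A sketch of the mapping and verification steps, assuming the CSV-specific reader `spark_read_csv()` is the variant meant here, the same file path as above, and `"transactions"` as the registered table name; `memory = FALSE` is what makes this a mapping rather than an import:

```r
library(dplyr)

# Register the file's location and structure with the Spark context
# without caching any data yet; path and table name are assumptions
spark_lineitems <- spark_read_csv(
  sc,
  name = "transactions",
  path = "/usr/share/class/files/transactions_1.csv",
  columns = file_columns,
  infer_schema = FALSE,
  memory = FALSE
)

# The row count runs inside Spark, confirming that the pointer works
spark_lineitems %>%
  tally()
```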
11.2 Caching data
Learn how to cache a subset of the data in Spark
Create a subset of the orders table object. Summarize by date, calculating the total price and the number of items sold.
Use compute() to extract the data into Spark memory
Confirm that the new variable pointer works

## # Source: spark<?> [?? x 3]
##   date       total_sales no_items
##   <chr>            <dbl>    <dbl>
## 1 2016-01-27      39311.     5866
## 2 2016-01-28      38424.     5771
## 3 2016-02-03      37666.     5659
## 4 2016-01-29      37582.     5652
## 5 2016-02-04      38193.     5719
## 6 2016-02-10      38500.     5686
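A sketch of the caching step, assuming the `date` and `price` column names shown above, that `price` needs to be converted back to a number because every column was mapped as character, and that `"orders"` is passed to `compute()` as the name of the cached table:

```r
library(dplyr)

# Aggregate per day and materialize the result in Spark memory;
# price was read as character, so cast it before summing
orders <- spark_lineitems %>%
  mutate(price = as.numeric(price)) %>%
  group_by(date) %>%
  summarise(
    total_sales = sum(price, na.rm = TRUE),
    no_items    = n()
  ) %>%
  compute("orders")

# The new variable points at the cached subset
head(orders)
```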
Go to the Spark UI
Click the Storage button
Notice that “orders” is now cached into Spark memory