11 Spark data caching

11.1 Map data

See the mechanics of how Spark uses files as a data source. A code sketch covering these steps appears after the list.

  1. Examine the contents of the /usr/share/class/files folder

  2. Load the sparklyr library

  3. Use spark_connect() to create a new local Spark session

    ## * Using Spark: 2.4.0
  4. Load the readr and purrr libraries

  5. Read the top 5 rows of the transactions_1 CSV file

    ## Parsed with column specification:
    ## cols(
    ##   order_id = col_double(),
    ##   customer_id = col_double(),
    ##   customer_name = col_character(),
    ##   customer_phone = col_character(),
    ##   customer_cc = col_double(),
    ##   customer_lon = col_double(),
    ##   customer_lat = col_double(),
    ##   date = col_date(format = ""),
    ##   date_year = col_double(),
    ##   date_month = col_double(),
    ##   date_month_name = col_character(),
    ##   date_day = col_character(),
    ##   product_id = col_double(),
    ##   price = col_double()
    ## )
  6. Create a named list based on the column names, with “character” as each item’s value. Name the variable file_columns

  7. Preview the contents of the file_columns variable

    ## $order_id
    ## [1] "character"
    ## 
    ## $customer_id
    ## [1] "character"
    ## 
    ## $customer_name
    ## [1] "character"
    ## 
    ## $customer_phone
    ## [1] "character"
    ## 
    ## $customer_cc
    ## [1] "character"
    ## 
    ## $customer_lon
    ## [1] "character"
  8. Use spark_read_csv() to “map” the file’s structure and location to the Spark context. Assign the result to the spark_lineitems variable

  9. In the Connections pane, click the table icon next to the transactions table

  10. Verify that the new variable pointer works by using tally()

    ## # Source: spark<?> [?? x 1]
    ##        n
    ##    <dbl>
    ## 1 250000
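
The following is a minimal sketch of the steps above, not a definitive solution: it assumes a local Spark master, that the transactions CSV files live in /usr/share/class/files with the first file named transactions_1.csv, and it loads dplyr alongside readr and purrr to supply the pipe and tally().

    library(sparklyr)
    library(readr)
    library(purrr)
    library(dplyr)

    # Step 3: create a new local Spark session
    sc <- spark_connect(master = "local")

    # Step 5: read the top 5 rows of the first file to learn its layout
    # (the exact file name is an assumption based on the step text)
    top_rows <- read_csv("/usr/share/class/files/transactions_1.csv", n_max = 5)

    # Step 6: named list mapping every column name to "character"
    file_columns <- top_rows %>%
      map(~ "character")

    # Step 8: map the files' structure and location to the Spark context;
    # memory = FALSE registers the files without importing the data
    spark_lineitems <- spark_read_csv(
      sc,
      name = "transactions",
      path = "/usr/share/class/files/",
      memory = FALSE,
      columns = file_columns,
      infer_schema = FALSE
    )

    # Step 10: verify that the new variable pointer works
    spark_lineitems %>%
      tally()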

11.2 Caching data

Learn how to cache a subset of the data in Spark. A code sketch covering these steps appears after the list.

  1. Create a subset of the orders table object. Summarize by date, creating a total price and the number of items sold.

  2. Use compute() to cache the results in Spark memory

  3. Confirm that the new variable pointer works

    ## # Source: spark<?> [?? x 3]
    ##   date       total_sales no_items
    ##   <chr>            <dbl>    <dbl>
    ## 1 2016-01-27      39311.     5866
    ## 2 2016-01-28      38424.     5771
    ## 3 2016-02-03      37666.     5659
    ## 4 2016-01-29      37582.     5652
    ## 5 2016-02-04      38193.     5719
    ## 6 2016-02-10      38500.     5686
  4. Go to the Spark UI

  5. Click the Storage button

  6. Notice that “orders” is now cached in Spark memory
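
A minimal sketch of the caching steps, assuming the summary is built from the spark_lineitems variable mapped in the previous section and that price, read as character, must be cast before summing; total_sales and no_items match the column names in the output above.

    library(dplyr)

    # Step 1: summarize by date, with total sales and number of items sold
    daily_orders <- spark_lineitems %>%
      group_by(date) %>%
      summarise(
        total_sales = sum(as.numeric(price), na.rm = TRUE),
        no_items = n()
      )

    # Step 2: compute() materializes the query results into Spark memory,
    # registering the cached table under the name "orders"
    cached_orders <- compute(daily_orders, name = "orders")

    # Step 3: confirm that the new variable pointer works
    cached_orders %>%
      head()

After compute() runs, the Storage page of the Spark UI should list the cached “orders” table (steps 4 to 6).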