11 Spark data caching

11.1 Map data

See the mechanics of how Spark uses files as a data source. A code sketch covering these steps appears after the list.

  1. Examine the contents of the /usr/share/class/files folder

  2. Load the sparklyr library

  3. Use spark_connect() to create a new local Spark session

    ## * Using Spark: 2.4.0
  4. Load the readr and purrr libraries

  5. Read the top 5 rows of the transactions_1 CSV file

    ## Parsed with column specification:
    ## cols(
    ##   order_id = col_double(),
    ##   customer_id = col_double(),
    ##   customer_name = col_character(),
    ##   customer_phone = col_character(),
    ##   customer_cc = col_double(),
    ##   customer_lon = col_double(),
    ##   customer_lat = col_double(),
    ##   date = col_date(format = ""),
    ##   date_year = col_double(),
    ##   date_month = col_double(),
    ##   date_month_name = col_character(),
    ##   date_day = col_character(),
    ##   product_id = col_double(),
    ##   price = col_double()
    ## )
  6. Create a named list based on the column names, with “character” as each item’s value. Name the variable file_columns

  7. Preview the contents of the file_columns variable

    ## $order_id
    ## [1] "character"
    ## 
    ## $customer_id
    ## [1] "character"
    ## 
    ## $customer_name
    ## [1] "character"
    ## 
    ## $customer_phone
    ## [1] "character"
    ## 
    ## $customer_cc
    ## [1] "character"
    ## 
    ## $customer_lon
    ## [1] "character"
  8. Use spark_read_csv() to “map” the file’s structure and location to the Spark context. Assign the result to the spark_lineitems variable

  9. In the Connections pane, click the table icon next to the transactions table

  10. Verify that the new variable pointer works by using tally()

    ## # Source: spark<?> [?? x 1]
    ##        n
    ##    <dbl>
    ## 1 250000
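
The following is a minimal sketch of the steps above, not a definitive solution: it assumes a local Spark master, that the transactions CSV files live in /usr/share/class/files with the first file named transactions_1.csv, and it loads dplyr alongside readr and purrr to supply the pipe and tally().

    library(sparklyr)
    library(readr)
    library(purrr)
    library(dplyr)

    # Step 3: create a new local Spark session
    sc <- spark_connect(master = "local")

    # Step 5: read the top 5 rows of the first file to learn its layout
    # (the exact file name is an assumption based on the step text)
    top_rows <- read_csv("/usr/share/class/files/transactions_1.csv", n_max = 5)

    # Step 6: named list mapping every column name to "character"
    file_columns <- top_rows %>%
      map(~ "character")

    # Step 8: map the files' structure and location to the Spark context;
    # memory = FALSE registers the files without importing the data
    spark_lineitems <- spark_read_csv(
      sc,
      name = "transactions",
      path = "/usr/share/class/files/",
      memory = FALSE,
      columns = file_columns,
      infer_schema = FALSE
    )

    # Step 10: verify that the new variable pointer works
    spark_lineitems %>%
      tally()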

11.2 Caching data

Learn how to cache a subset of the data in Spark. A code sketch covering these steps appears after the list.

  1. Create a subset of the orders table object. Summarize by date, creating a total price and the number of items sold.

  2. Use compute() to cache the results in Spark memory

  3. Confirm that the new variable pointer works

    ## # Source: spark<?> [?? x 3]
    ##   date       total_sales no_items
    ##   <chr>            <dbl>    <dbl>
    ## 1 2016-01-27      39311.     5866
    ## 2 2016-01-28      38424.     5771
    ## 3 2016-02-03      37666.     5659
    ## 4 2016-01-29      37582.     5652
    ## 5 2016-02-04      38193.     5719
    ## 6 2016-02-10      38500.     5686
  4. Go to the Spark UI

  5. Click the Storage button

  6. Notice that “orders” is now cached in Spark memory
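
A minimal sketch of the caching steps, assuming the summary is built from the spark_lineitems variable mapped in the previous section and that price, read as character, must be cast before summing; total_sales and no_items match the column names in the output above.

    library(dplyr)

    # Step 1: summarize by date, with total sales and number of items sold
    daily_orders <- spark_lineitems %>%
      group_by(date) %>%
      summarise(
        total_sales = sum(as.numeric(price), na.rm = TRUE),
        no_items = n()
      )

    # Step 2: compute() materializes the query results into Spark memory,
    # registering the cached table under the name "orders"
    cached_orders <- compute(daily_orders, name = "orders")

    # Step 3: confirm that the new variable pointer works
    cached_orders %>%
      head()

After compute() runs, the Storage page of the Spark UI should list the cached “orders” table (steps 4 to 6).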