9 Intro to `sparklyr`

9.1 New Spark session

Learn to open a new Spark session

Load the sparklyr library
```
library(sparklyr)
```

Use spark_connect() to create a new local Spark session

```
sc <- spark_connect(master = "local")
```
```
## * Using Spark: 2.4.0
```

Click on the Spark button to view the current Spark session’s UI
Click on the Log button to see the message history
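
The Spark and Log buttons also have programmatic counterparts in sparklyr, which can help when working outside the RStudio Connections pane. The chunk below is a minimal sketch, assuming a local Spark installation is already available (for example, one set up with spark_install()).

```r
library(sparklyr)

# Start a local Spark session; sparklyr normally re-uses an already open
# local connection instead of creating a second one
sc <- spark_connect(master = "local")

# Confirm the session is live and check which Spark version it runs
spark_connection_is_open(sc)
spark_version(sc)

# Open the Spark web UI in the browser (same target as the Spark button)
spark_web(sc)

# Print the last 10 log entries (same content as the Log button)
spark_log(sc, n = 10)

# When all exercises are finished, release the session with:
# spark_disconnect(sc)
```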

9.2 Data transfer

Practice uploading data to Spark

Load the dplyr library
```
library(dplyr)
```

Copy the mtcars dataset into the session

```
spark_mtcars <- copy_to(sc, mtcars, "my_mtcars")
```

In the Connections pane, expand the my_mtcars table
Go to the Spark UI, note the new jobs
In the UI, click the Storage button, note the new table
Click on the In-memory table my_mtcars link
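
The upload can also be verified from the R console. The chunk below is a small sketch of that check; the overwrite = TRUE argument and the variable name same_table are additions made here so the chunk can be re-run on its own.

```r
library(sparklyr)
library(dplyr)

# Assumption: re-create the session and table from the earlier steps so this
# chunk runs on its own; an open local connection is normally re-used
sc <- spark_connect(master = "local")
spark_mtcars <- copy_to(sc, mtcars, "my_mtcars", overwrite = TRUE)

# List the tables registered in the session; "my_mtcars" should be among them
src_tbls(sc)

# Count rows on the Spark side, without pulling the data into R
sdf_nrow(spark_mtcars)

# Get a second R reference to the existing Spark table by its name
same_table <- tbl(sc, "my_mtcars")
```

Note that spark_mtcars and same_table are only references. The data itself stays in Spark until it is explicitly collected.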

9.3 Spark and `dplyr`

See how Spark handles dplyr commands

Run the following code snippet

```
spark_mtcars %>%
  group_by(am) %>%
  summarise(mpg_mean = mean(mpg, na.rm = TRUE))
```
```
## # Source: spark<?> [?? x 2]
##      am mpg_mean
##   <dbl>    <dbl>
## 1     0     17.1
## 2     1     24.4
```

Go to the Spark UI and click the SQL button
Click on the top item inside the Completed Queries table
At the bottom of the diagram, expand Details
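
The query shown in the Spark UI comes from dplyr translating the R code into SQL. As a rough sketch, the translation can be inspected directly from R with show_query(), and the result can be brought into R with collect(). The names mpg_by_am and local_result are illustrative only.

```r
library(sparklyr)
library(dplyr)

# Assumption: re-create the session and table from the earlier steps so this
# chunk runs on its own; an open local connection is normally re-used
sc <- spark_connect(master = "local")
spark_mtcars <- copy_to(sc, mtcars, "my_mtcars", overwrite = TRUE)

# Build the query lazily; nothing runs in Spark yet
mpg_by_am <- spark_mtcars %>%
  group_by(am) %>%
  summarise(mpg_mean = mean(mpg, na.rm = TRUE))

# Print the SQL statement that dplyr generated for Spark
show_query(mpg_by_am)

# Run the query and bring the (small) result into R as a local tibble
local_result <- collect(mpg_by_am)
local_result
```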

9.4 Feature transformers

Introduction to how Spark Feature Transformers can be called from R

Use ft_binarizer() to create a new column, called over_20, that indicates whether that row’s mpg value is over 20 MPG (1) or not (0)

```
spark_mtcars %>%
  ft_binarizer("mpg", "over_20", 20)
```
```
## # Source: spark<?> [?? x 12]
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb over_20
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
##  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4       1
##  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4       1
##  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1       1
##  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1       1
##  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2       0
##  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1       0
##  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4       0
##  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2       1
##  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2       1
## 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4       0
## # … with more rows
```

Pipe the code into count() to see how the data splits between the two values

```
spark_mtcars %>%
  ft_binarizer("mpg", "over_20", 20) %>%
  count(over_20)
```
```
## # Source: spark<?> [?? x 2]
##   over_20     n
##     <dbl> <dbl>
## 1       0    18
## 2       1    14
```
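
To make the binarizer's rule concrete, the sketch below places the transformer's output next to a plain dplyr comparison. Spark's Binarizer maps values greater than the threshold to 1 and all other values to 0, so the two columns should agree on every row. The column over_20_check and the variable binarizer_check are names introduced here for illustration.

```r
library(sparklyr)
library(dplyr)

# Assumption: re-create the session and table from the earlier steps so this
# chunk runs on its own; an open local connection is normally re-used
sc <- spark_connect(master = "local")
spark_mtcars <- copy_to(sc, mtcars, "my_mtcars", overwrite = TRUE)

binarizer_check <- spark_mtcars %>%
  # Same call as in the exercise, with the arguments named explicitly
  ft_binarizer(input_col = "mpg", output_col = "over_20", threshold = 20) %>%
  # Hand-written version of the same rule, translated to SQL by dplyr
  mutate(over_20_check = ifelse(mpg > 20, 1, 0)) %>%
  # Cross-tabulate both columns; matching values mean the rules agree
  count(over_20, over_20_check) %>%
  collect()

binarizer_check
```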

Start a new code chunk. This time use ft_quantile_discretizer() to create a new column called mpg_quantile

```
spark_mtcars %>%
  ft_quantile_discretizer("mpg", "mpg_quantile")
```
```
## # Source: spark<?> [?? x 12]
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb mpg_quantile
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>        <dbl>
##  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4            1
##  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4            1
##  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1            1
##  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1            1
##  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2            0
##  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1            0
##  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4            0
##  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2            1
##  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2            1
## 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4            1
## # … with more rows
```

Add the num_buckets argument to ft_quantile_discretizer(), set its value to 5

```
spark_mtcars %>%
  ft_quantile_discretizer("mpg", "mpg_quantile", num_buckets = 5)
```
```
## # Source: spark<?> [?? x 12]
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb mpg_quantile
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>        <dbl>
##  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4            3
##  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4            3
##  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1            3
##  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1            3
##  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2            2
##  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1            2
##  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4            0
##  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2            4
##  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2            3
## 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4            2
## # … with more rows
```

Pipe the code into count() to see how the data splits between the quantiles

```
spark_mtcars %>%
  ft_quantile_discretizer("mpg", "mpg_quantile", num_buckets = 5) %>%
  count(mpg_quantile)
```
```
## # Source: spark<?> [?? x 2]
##   mpg_quantile     n
##          <dbl> <dbl>
## 1            0     6
## 2            3     7
## 3            4     7
## 4            1     6
## 5            2     6
```
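
The counts above come back in no particular order, because Spark does not guarantee row order unless one is requested. The sketch below sorts the buckets with arrange() and, as a rough guide to where the bucket boundaries may fall, prints approximate quantiles of mpg with sdf_quantile(). The transformer computes its own splits, so they may differ slightly from these values. The variable quantile_counts is a name introduced here.

```r
library(sparklyr)
library(dplyr)

# Assumption: re-create the session and table from the earlier steps so this
# chunk runs on its own; an open local connection is normally re-used
sc <- spark_connect(master = "local")
spark_mtcars <- copy_to(sc, mtcars, "my_mtcars", overwrite = TRUE)

quantile_counts <- spark_mtcars %>%
  ft_quantile_discretizer("mpg", "mpg_quantile", num_buckets = 5) %>%
  count(mpg_quantile) %>%
  # Ask Spark to sort the buckets before the result is returned
  arrange(mpg_quantile) %>%
  collect()

quantile_counts

# Approximate quantiles of mpg at 0%, 20%, ..., 100%
sdf_quantile(spark_mtcars, "mpg", probabilities = seq(0, 1, by = 0.2))
```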

9.5 Models

Introduce Spark ML models by running a couple of them in R

Use ml_kmeans() to run a model based on the following formula: wt ~ mpg. Assign the results to a variable called k_mtcars
```
k_mtcars <- spark_mtcars %>%
  ml_kmeans(wt ~ mpg)
```
Use k_mtcars$summary to view the results of the model. Pull the cluster sizes by using ...$cluster_sizes()
```
k_mtcars$summary$cluster_sizes()
```
```
## [1] 14 18
```
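
Beyond cluster sizes, the fitted model can be inspected further. The sketch below, written under the assumption that the model was fitted with the default settings as above, prints the cluster centers and uses ml_predict() to attach a cluster assignment to every row. The variable cluster_counts is a name introduced here.

```r
library(sparklyr)
library(dplyr)

# Assumption: re-create the session and table from the earlier steps so this
# chunk runs on its own; an open local connection is normally re-used
sc <- spark_connect(master = "local")
spark_mtcars <- copy_to(sc, mtcars, "my_mtcars", overwrite = TRUE)

# Same model as in the exercise
k_mtcars <- spark_mtcars %>%
  ml_kmeans(wt ~ mpg)

# Coordinates of the fitted cluster centers
k_mtcars$centers

# Score the table; ml_predict() adds a "prediction" column with the cluster id
cluster_counts <- ml_predict(k_mtcars, spark_mtcars) %>%
  count(prediction) %>%
  collect()

cluster_counts
```

The number of clusters is controlled by the k argument of ml_kmeans(); the two sizes reported above are consistent with a two-cluster fit.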
Start a new code chunk. This time use ml_linear_regression() to produce a Linear Regression model of the same formula used in the previous model. Assign the results to a variable called lr_mtcars
```
lr_mtcars <- spark_mtcars %>%
  ml_linear_regression(wt ~ mpg)
```

Use summary() to view the results of the model

```
summary(lr_mtcars)
```
```
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6516 -0.3490 -0.1381  0.3190  1.3684 
## 
## Coefficients:
## (Intercept)         mpg 
##    6.047255   -0.140862 
## 
## R-Squared: 0.7528
## Root Mean Squared Error: 0.4788
```
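
Once fitted, the regression model can be used to score data inside Spark. The following is a minimal sketch: it extracts the coefficients from the model object and uses ml_predict() to add a prediction column, comparing predicted and actual wt for a few rows. The variable scored is a name introduced here.

```r
library(sparklyr)
library(dplyr)

# Assumption: re-create the session and table from the earlier steps so this
# chunk runs on its own; an open local connection is normally re-used
sc <- spark_connect(master = "local")
spark_mtcars <- copy_to(sc, mtcars, "my_mtcars", overwrite = TRUE)

# Same model as in the exercise
lr_mtcars <- spark_mtcars %>%
  ml_linear_regression(wt ~ mpg)

# Intercept and slope as a named numeric vector
lr_mtcars$coefficients

# Score the table in Spark, then pull a few rows into R for inspection
scored <- ml_predict(lr_mtcars, spark_mtcars) %>%
  select(mpg, wt, prediction) %>%
  head(5) %>%
  collect()

scored
```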
