9 Intro to sparklyr
9.1 New Spark session
Learn to open a new Spark session
Load the sparklyr library
Use spark_connect() to create a new local Spark session
## * Using Spark: 2.4.0
Click on the Spark button to view the current Spark session’s UI
Click on the Log button to see the message history
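The steps above can be sketched in R as follows. This is a minimal sketch, assuming a local Spark installation is already available (spark_install() can download one if it is not); the connection name sc and the master = "local" argument are assumptions, not taken from the exercise.

```r
library(sparklyr)

# Assumption: Spark is installed locally; "local" runs Spark on this machine.
sc <- spark_connect(master = "local")

# Check that the session is live before moving on.
connection_open <- connection_is_open(sc)
print(connection_open)

# Programmatic counterparts of the Spark and Log buttons (left commented out
# because spark_web() opens a browser window):
# spark_web(sc)
# spark_log(sc, n = 10)

spark_disconnect(sc)
```

Calling spark_disconnect() at the end closes the session; leave it out while working through the remaining exercises.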
9.2 Data transfer
Practice uploading data to Spark
Load the dplyr library
Copy the mtcars dataset into the session
In the Connections pane, expand the my_mtcars table
Go to the Spark UI, note the new jobs
In the UI, click the Storage button, note the new table
Click on the In-memory table my_mtcars link
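One possible way to carry out the copy step is shown here. It is a sketch under assumptions: the connection name sc, the use of copy_to() rather than another upload function, and overwrite = TRUE are illustrative choices. The table name my_mtcars and the R variable spark_mtcars follow the names used elsewhere in this chapter.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation is available.
sc <- spark_connect(master = "local")

# Copy the local mtcars data frame into Spark as the table "my_mtcars".
# spark_mtcars is an R-side reference to the remote table, not a local copy.
spark_mtcars <- copy_to(sc, mtcars, name = "my_mtcars", overwrite = TRUE)

# List the tables registered in the session and count rows on the Spark side.
tables <- src_tbls(sc)
n_rows <- sdf_nrow(spark_mtcars)
print(tables)
print(n_rows)
```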
9.3 Spark and dplyr
See how Spark handles dplyr commands
Run the following code snippet
## # Source: spark<?> [?? x 2]
##      am mpg_mean
##   <dbl>    <dbl>
## 1     0     17.1
## 2     1     24.4
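The snippet itself is not reproduced in this text. A sketch that is consistent with the output shown (one row per value of am, with a column named mpg_mean) could look like the following; the exact pipeline is an assumption inferred from that output, as are the connection name sc and the setup lines.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation is available.
sc <- spark_connect(master = "local")
spark_mtcars <- copy_to(sc, mtcars, name = "my_mtcars", overwrite = TRUE)

# Group and summarise with ordinary dplyr verbs; the work runs inside Spark.
by_am <- spark_mtcars %>%
  group_by(am) %>%
  summarise(mpg_mean = mean(mpg, na.rm = TRUE))

# show_query() prints the SQL that dplyr generated for Spark.
show_query(by_am)

# collect() brings the small result back into R as a local tibble.
local_by_am <- collect(by_am)
print(local_by_am)
```

The SQL printed by show_query() corresponds to the query that appears under the SQL button in the Spark UI.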
Go to the Spark UI and click the SQL button
Click on the top item inside the Completed Queries table
At the bottom of the diagram, expand Details
9.4 Feature transformers
Introduction to how Spark Feature Transformers can be called from R
Use ft_binarizer() to create a new column, called over_20, that indicates if that row’s mpg value is over or under 20 MPG
## # Source: spark<?> [?? x 12]
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb over_20
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
##  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4       1
##  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4       1
##  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1       1
##  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1       1
##  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2       0
##  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1       0
##  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4       0
##  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2       1
##  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2       1
## 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4       0
## # … with more rows
Pipe the code into count() to see how the data splits between the two values
## # Source: spark<?> [?? x 2]
##   over_20     n
##     <dbl> <dbl>
## 1       0    18
## 2       1    14
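The two binarizer steps above can be sketched as a single pipeline. This is an illustration under assumptions: threshold = 20 is inferred from the "over or under 20 MPG" wording, and the connection name sc and the setup lines are not part of the exercise text.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation is available.
sc <- spark_connect(master = "local")
spark_mtcars <- copy_to(sc, mtcars, name = "my_mtcars", overwrite = TRUE)

# ft_binarizer() writes 1 when mpg is above the threshold and 0 otherwise.
over_20_counts <- spark_mtcars %>%
  ft_binarizer("mpg", "over_20", threshold = 20) %>%
  count(over_20) %>%
  collect()

print(over_20_counts)
```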
Start a new code chunk. This time use ft_quantile_discretizer() to create a new column called mpg_quantile
## # Source: spark<?> [?? x 12]
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb mpg_quantile
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>        <dbl>
##  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4            1
##  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4            1
##  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1            1
##  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1            1
##  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2            0
##  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1            0
##  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4            0
##  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2            1
##  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2            1
## 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4            1
## # … with more rows
Add the num_buckets argument to ft_quantile_discretizer(), set its value to 5
## # Source: spark<?> [?? x 12]
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb mpg_quantile
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>        <dbl>
##  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4            3
##  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4            3
##  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1            3
##  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1            3
##  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2            2
##  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1            2
##  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4            0
##  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2            4
##  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2            3
## 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4            2
## # … with more rows
Pipe the code into count() to see how the data splits between the quantiles
spark_mtcars %>%
  ft_quantile_discretizer("mpg", "mpg_quantile", num_buckets = 5) %>%
  count(mpg_quantile)
## # Source: spark<?> [?? x 2]
##   mpg_quantile     n
##          <dbl> <dbl>
## 1            0     6
## 2            3     7
## 3            4     7
## 4            1     6
## 5            2     6
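For comparison with the five-bucket result, the first discretizer step (with no num_buckets argument) can be sketched as below. It relies on ft_quantile_discretizer() defaulting to two buckets; the connection name sc and the setup lines are assumptions for the sake of a self-contained example.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation is available.
sc <- spark_connect(master = "local")
spark_mtcars <- copy_to(sc, mtcars, name = "my_mtcars", overwrite = TRUE)

# With the default num_buckets = 2, mpg is split into a lower and an upper group.
default_buckets <- spark_mtcars %>%
  ft_quantile_discretizer("mpg", "mpg_quantile") %>%
  count(mpg_quantile) %>%
  collect()

print(default_buckets)
```

Bucket boundaries come from approximate quantiles, so the groups need not be exactly equal in size.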
9.5 Models
Introduce Spark ML models by running a couple of them in R
Use ml_kmeans() to run a model based on the following formula: wt ~ mpg. Assign the results to a variable called k_mtcars
Use k_mtcars$summary to view the results of the model. Pull the cluster sizes by using ...$cluster_sizes()
## [1] 14 18
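The k-means steps above might be written as follows. This is a sketch, assuming the connection name sc and the setup lines; it leaves the number of clusters at the ml_kmeans() default of 2, which matches the two cluster sizes shown.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation is available.
sc <- spark_connect(master = "local")
spark_mtcars <- copy_to(sc, mtcars, name = "my_mtcars", overwrite = TRUE)

# Fit k-means in Spark using the formula interface.
k_mtcars <- spark_mtcars %>%
  ml_kmeans(wt ~ mpg)

# The summary object exposes cluster_sizes() as a function.
sizes <- k_mtcars$summary$cluster_sizes()
print(sizes)
```

Because k-means starts from random centers, the order of the cluster sizes can differ between runs.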
Start a new code chunk. This time use ml_linear_regression() to produce a Linear Regression model of the same formula used in the previous model. Assign the results to a variable called lr_mtcars
Use summary() to view the results of the model
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max
## -0.6516 -0.3490 -0.1381  0.3190  1.3684
##
## Coefficients:
## (Intercept)         mpg
##    6.047255   -0.140862
##
## R-Squared: 0.7528
## Root Mean Squared Error: 0.4788
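The regression steps can be sketched in the same way. The connection name sc and the setup lines are assumptions; the formula and the variable name lr_mtcars come from the exercise text.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation is available.
sc <- spark_connect(master = "local")
spark_mtcars <- copy_to(sc, mtcars, name = "my_mtcars", overwrite = TRUE)

# Fit the linear model in Spark with the same formula as the k-means step.
lr_mtcars <- spark_mtcars %>%
  ml_linear_regression(wt ~ mpg)

# Print residuals, coefficients, R-squared and RMSE.
summary(lr_mtcars)

# The fitted coefficients are also available as a named numeric vector.
coefs <- lr_mtcars$coefficients
print(coefs)
```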