9 Intro to sparklyr
9.1 New Spark session
Learn to open a new Spark session
Load the sparklyr library
Use spark_connect() to create a new local Spark session
## * Using Spark: 2.4.0
Click on the Spark button to view the current Spark session’s UI
Click on the Log button to see the message history
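The steps above can be sketched in a single chunk. This is a minimal sketch, assuming Spark is already installed locally (for example through an earlier `spark_install()` call); the connection name `sc` is an assumption and does not come from the exercise text.

```r
library(sparklyr)

# Assumption: a local Spark installation is already available.
# `sc` is a hypothetical name for the connection object.
sc <- spark_connect(master = "local")

# Confirm the session is live and report the Spark version in use
spark_connection_is_open(sc)
spark_version(sc)
```

From the console, `spark_web(sc)` and `spark_log(sc)` open roughly the same views as the Spark and Log buttons in the Connections pane.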
9.2 Data transfer
Practice uploading data to Spark
Load the dplyr library
Copy the mtcars dataset into the session
In the Connections pane, expand the my_mtcars table
Go to the Spark UI, note the new jobs
In the UI, click the Storage button, note the new table
Click on the In-memory table my_mtcars link
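One way to do the copy is sketched below. The table name `my_mtcars` and the variable `spark_mtcars` both appear in this chapter; the choice of `copy_to()` and the `overwrite = TRUE` argument are assumptions made so the chunk can be re-run safely.

```r
library(sparklyr)
library(dplyr)

# Assumption: `sc` is a local Spark connection, as opened in section 9.1
sc <- spark_connect(master = "local")

# Copy the local data frame into Spark as a table named "my_mtcars".
# overwrite = TRUE lets the chunk be re-run without a "table exists" error.
spark_mtcars <- copy_to(sc, mtcars, name = "my_mtcars", overwrite = TRUE)

# List the tables registered in the session; "my_mtcars" should appear
src_tbls(sc)
```

`spark_mtcars` is only a reference to the remote table. The data itself now lives in Spark, which is why the copy shows up as new jobs and as a cached table under Storage.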
9.3 Spark and dplyr
See how Spark handles dplyr commands
Run the following code snippet
## # Source: spark<?> [?? x 2]
##      am mpg_mean
##   <dbl>    <dbl>
## 1     0     17.1
## 2     1     24.4
Go to the Spark UI and click the SQL button
Click on the top item inside the Completed Queries table
At the bottom of the diagram, expand Details
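The snippet itself is missing from this page. The sketch below is a reconstruction inferred from the printed output (columns `am` and `mpg_mean`), so treat it as an assumption rather than the original code. The name `mpg_by_am` is hypothetical.

```r
library(sparklyr)
library(dplyr)

# Assumption: same local connection and table as in sections 9.1 and 9.2
sc <- spark_connect(master = "local")
spark_mtcars <- copy_to(sc, mtcars, name = "my_mtcars", overwrite = TRUE)

# Build a lazy query: average mpg for each value of am.
# Nothing runs in Spark until the result is printed or collected.
mpg_by_am <- spark_mtcars %>%
  group_by(am) %>%
  summarise(mpg_mean = mean(mpg, na.rm = TRUE))

# Show the SQL that dplyr generates for Spark
show_query(mpg_by_am)

# Printing triggers execution; this is the query that appears in the SQL tab
mpg_by_am
```

Comparing the `show_query()` text with the Details section in the Spark UI is a quick way to see that the dplyr verbs were translated to SQL and executed inside Spark rather than in R.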
9.4 Feature transformers
Introduction to how Spark Feature Transformers can be called from R
Use ft_binarizer() to create a new column, called over_20, that indicates if that row’s mpg value is over or under 20 MPG
## # Source: spark<?> [?? x 12]
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb over_20
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
##  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4       1
##  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4       1
##  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1       1
##  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1       1
##  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2       0
##  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1       0
##  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4       0
##  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2       1
##  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2       1
## 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4       0
## # … with more rows
Pipe the code into count() to see how the data splits between the two values
## # Source: spark<?> [?? x 2]
##   over_20     n
##     <dbl> <dbl>
## 1       0    18
## 2       1    14
Start a new code chunk. This time use ft_quantile_discretizer() to create a new column called mpg_quantile
## # Source: spark<?> [?? x 12]
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb mpg_quantile
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>        <dbl>
##  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4            1
##  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4            1
##  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1            1
##  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1            1
##  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2            0
##  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1            0
##  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4            0
##  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2            1
##  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2            1
## 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4            1
## # … with more rows
Add the num_buckets argument to ft_quantile_discretizer(), set its value to 5
## # Source: spark<?> [?? x 12]
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb mpg_quantile
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>        <dbl>
##  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4            3
##  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4            3
##  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1            3
##  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1            3
##  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2            2
##  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1            2
##  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4            0
##  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2            4
##  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2            3
## 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4            2
## # … with more rows
Pipe the code into count() to see how the data splits between the quantiles
spark_mtcars %>%
  ft_quantile_discretizer("mpg", "mpg_quantile", num_buckets = 5) %>%
  count(mpg_quantile)
## # Source: spark<?> [?? x 2]
##   mpg_quantile     n
##          <dbl> <dbl>
## 1            0     6
## 2            3     7
## 3            4     7
## 4            1     6
## 5            2     6
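The first three steps of this section are shown without code, so here is a minimal sketch of how they might look. The threshold of 20 follows the exercise text. The variable names `binarized` and `bucketed` are hypothetical, and the connection setup repeats the assumptions from sections 9.1 and 9.2.

```r
library(sparklyr)
library(dplyr)

# Assumption: same local connection and table as in sections 9.1 and 9.2
sc <- spark_connect(master = "local")
spark_mtcars <- copy_to(sc, mtcars, name = "my_mtcars", overwrite = TRUE)

# ft_binarizer(): values above the threshold become 1, all others become 0
binarized <- spark_mtcars %>%
  ft_binarizer(input_col = "mpg", output_col = "over_20", threshold = 20)

# Count rows on each side of the threshold
binarized %>% count(over_20)

# ft_quantile_discretizer(): with no num_buckets argument the default of 2
# is used, which splits mpg into a lower and an upper group
bucketed <- spark_mtcars %>%
  ft_quantile_discretizer(input_col = "mpg", output_col = "mpg_quantile")

bucketed
```

The binarizer uses a fixed cut-off, while the quantile discretizer derives its cut-offs from the data. That is why the two-bucket `mpg_quantile` column can disagree with `over_20` for rows close to 20 MPG, as row 10 does in the output above.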
9.5 Models
Introduce Spark ML models by running a couple of them in R
Use ml_kmeans() to run a model based on the following formula: wt ~ mpg. Assign the results to a variable called k_mtcars
Use k_mtcars$summary to view the results of the model. Pull the cluster sizes by using ...$cluster_sizes()
## [1] 14 18
Start a new code chunk. This time use ml_linear_regression() to produce a Linear Regression model of the same formula used in the previous model. Assign the results to a variable called lr_mtcars
Use summary() to view the results of the model
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max
## -0.6516 -0.3490 -0.1381  0.3190  1.3684
##
## Coefficients:
## (Intercept)         mpg
##    6.047255   -0.140862
##
## R-Squared: 0.7528
## Root Mean Squared Error: 0.4788
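The modeling steps above might look like the sketch below. The formula and the variable names `k_mtcars` and `lr_mtcars` come from the exercise text. The connection setup is an assumption carried over from the earlier sections, and `ml_kmeans()` is left at its default number of clusters.

```r
library(sparklyr)
library(dplyr)

# Assumption: same local connection and table as in sections 9.1 and 9.2
sc <- spark_connect(master = "local")
spark_mtcars <- copy_to(sc, mtcars, name = "my_mtcars", overwrite = TRUE)

# K-means with the formula from the exercise; k is left at its default
k_mtcars <- spark_mtcars %>%
  ml_kmeans(wt ~ mpg)

# Model summary object, then the number of rows assigned to each cluster
k_mtcars$summary
k_mtcars$summary$cluster_sizes()

# Linear regression of wt on mpg, fitted by Spark ML
lr_mtcars <- spark_mtcars %>%
  ml_linear_regression(wt ~ mpg)

summary(lr_mtcars)
```

Both models are fitted inside Spark. Only the fitted model object and its summary statistics come back to R, so the same calls should work unchanged on tables too large to collect.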