7 Modeling with databases
7.1 Single step sampling
Use PostgreSQL TABLESAMPLE clause
Use
connection_open()to open a Database connectionSet the
ordersvariable to point to the orders tableSet the
orders_viewvariable to point to the v_orders tablePipe
ordersinto the functionshow_query()## <SQL> ## SELECT * ## FROM retail.ordersPipe the previous command into the
class()function to see the kind of outputshow_query()returns## <SQL> ## SELECT * ## FROM retail.orders## [1] "tbl_conn" "tbl_PqConnection" "tbl_dbi" "tbl_sql" ## [5] "tbl_lazy" "tbl"Replace
show_query()withremote_query()to compare the output types## [1] "sql" "character"Replace
class()withbuild_sql(). Useconas the value for theconargument## <SQL> SELECT * ## FROM retail.ordersAdd " TABLESAMPLE BERNOULLI (0.1)" to
build_sql()as another...argument## <SQL> SELECT * ## FROM retail.orders TABLESAMPLE BERNOULLI (0.1)Pipe the code into
tbl(). Useconfor theconargument, and.for the rest## # Source: SQL [?? x 3] ## # Database: postgres [rstudio_admin@localhost:5432/postgres] ## order_id customer_id step_id ## <int> <int> <dbl> ## 1 969600 39 965 ## 2 970974 89 966 ## 3 973046 86 967 ## 4 973200 78 967 ## 5 975219 75 968 ## 6 975885 62 968 ## 7 977001 21 969 ## 8 979327 53 970 ## 9 973304 26 972 ## 10 973555 48 972 ## # … with more rowsUse
inner_join()to add the information from theorders_viewpointer, useorder_idas the matching fieldorders %>% remote_query() %>% build_sql(con = con, " TABLESAMPLE BERNOULLI (0.1)") %>% tbl(con, .) %>% inner_join(orders_view, by = "order_id")## # Source: lazy query [?? x 12] ## # Database: postgres [rstudio_admin@localhost:5432/postgres] ## order_id customer_id.x step_id date date_year date_month customer_id.y ## <int> <int> <dbl> <chr> <int> <int> <int> ## 1 5693 33 3 2016… 2016 1 33 ## 2 7095 69 4 2016… 2016 1 69 ## 3 7280 78 4 2016… 2016 1 78 ## 4 11260 20 6 2016… 2016 1 20 ## 5 11260 20 6 2016… 2016 1 22 ## 6 15317 22 13 2016… 2016 1 87 ## 7 15317 22 13 2016… 2016 1 22 ## 8 15841 41 8 2016… 2016 1 41 ## 9 15841 41 8 2016… 2016 1 51 ## 10 21315 54 16 2016… 2016 1 54 ## # … with more rows, and 5 more variables: customer_name <chr>, ## # customer_lon <dbl>, customer_lat <dbl>, order_total <dbl>, order_qty <int>Assign the resulting code to a variable
orders_sample_dbUse
collect()to load the results oforders_sample_dbto a new variable calledorders_sampleLoad the
dbplotlibraryUse
dbplot_histogram()to visualize the distribution oforder_totalfromorders_sample
Use
dbplot_histogram()to visualize the distribution oforder_totalfromorders_view
7.2 Using tidymodels for modeling
Fit and measure the model’s performance using functions from parsnip and yardstick
Load the
tidymodelslibraryStart with the
linear_reg()command, pipe intoset_engine(), and use “lm” as its sole argument## Linear Regression Model Specification (regression) ## ## Computational engine: lmPipe into the
fit()command. Use the formula:order_total ~ order_qty, andorders_sampleas thedataargument## parsnip model object ## ## Fit in: 126ms ## Call: ## stats::lm(formula = formula, data = data) ## ## Coefficients: ## (Intercept) order_qty ## 0.09443 6.62037Assign the previous code to a variable called
parsnip_modelUse
bind_cols()to add the predictions toorder_sample. Calculate the prediction withpredict()## # A tibble: 1,964 x 13 ## order_id customer_id.x step_id date date_year date_month customer_id.y ## <int> <int> <dbl> <chr> <int> <int> <int> ## 1 1864 35 1 2016… 2016 1 35 ## 2 3617 13 2 2016… 2016 1 13 ## 3 5279 81 3 2016… 2016 1 81 ## 4 5426 42 3 2016… 2016 1 42 ## 5 7472 1 4 2016… 2016 1 1 ## 6 7967 50 4 2016… 2016 1 50 ## 7 11174 74 6 2016… 2016 1 74 ## 8 11174 74 6 2016… 2016 1 50 ## 9 11532 22 11 2016… 2016 1 62 ## 10 11532 22 11 2016… 2016 1 22 ## # … with 1,954 more rows, and 6 more variables: customer_name <chr>, ## # customer_lon <dbl>, customer_lat <dbl>, order_total <dbl>, order_qty <int>, ## # .pred <dbl>Pipe the code into the
metrics()function. Useorder_totalas thetruthargument, and.predas theestimateargumentorders_sample %>% bind_cols(predict(parsnip_model, orders_sample)) %>% metrics(truth = order_total, estimate = .pred)## # A tibble: 3 x 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 rmse standard 3.50 ## 2 rsq standard 0.939 ## 3 mae standard 2.78
7.3 Score with tidypredict
Load the
tidypredictlibraryUse the
parse_model()function to parseparsnip_model, and assign it to a variable calledparsed_parsnipUse
str()to see theparsed_parsnipobject’s structure## List of 2 ## $ general:List of 6 ## ..$ model : chr "lm" ## ..$ version : num 2 ## ..$ type : chr "regression" ## ..$ residual: int 1962 ## ..$ sigma2 : num 12.3 ## ..$ is_glm : num 0 ## $ terms :List of 2 ## ..$ :List of 5 ## .. ..$ label : chr "(Intercept)" ## .. ..$ coef : num 0.0944 ## .. ..$ is_intercept: num 1 ## .. ..$ fields :List of 1 ## .. .. ..$ :List of 2 ## .. .. .. ..$ type: chr "ordinary" ## .. .. .. ..$ col : chr "(Intercept)" ## .. ..$ qr :List of 2 ## .. .. ..$ qr_1: num -0.0226 ## .. .. ..$ qr_2: num -0.0639 ## ..$ :List of 5 ## .. ..$ label : chr "order_qty" ## .. ..$ coef : num 6.62 ## .. ..$ is_intercept: num 0 ## .. ..$ fields :List of 1 ## .. .. ..$ :List of 2 ## .. .. .. ..$ type: chr "ordinary" ## .. .. .. ..$ col : chr "order_qty" ## .. ..$ qr :List of 2 ## .. .. ..$ qr_1: num 0 ## .. .. ..$ qr_2: num 0.0109 ## - attr(*, "class")= chr [1:3] "parsed_model" "pm_regression" "list"Use
tidypredict_fit()to view thedplyrformula that calculates the prediction## 0.0944307069267399 + (order_qty * 6.62036536915535)Use
head()to get the first 10 records fromorders_view## # Source: lazy query [?? x 10] ## # Database: postgres [rstudio_admin@localhost:5432/postgres] ## order_id date date_year date_month customer_id customer_name customer_lon ## <int> <chr> <int> <int> <int> <chr> <dbl> ## 1 1001 2016… 2016 1 22 Dr. Birdie K… -122. ## 2 1002 2016… 2016 1 6 Meggan Bruen -122. ## 3 1003 2016… 2016 1 80 Jessee Rodri… -122. ## 4 1004 2016… 2016 1 55 Kathryn Stehr -122. ## 5 1005 2016… 2016 1 73 Merlyn Runol… -122. ## 6 1006 2016… 2016 1 70 Reggie Mills -122. ## 7 1007 2016… 2016 1 55 Kathryn Stehr -122. ## 8 1008 2016… 2016 1 40 Dr. Trace Gl… -122. ## 9 1009 2016… 2016 1 78 Pricilla Goo… -122. ## 10 1010 2016… 2016 1 35 Mr. Commodor… -122. ## # … with 3 more variables: customer_lat <dbl>, order_total <dbl>, order_qty <int>Pipe the code into
mutate(). Assign to a newmy_predvariable the results oftidypredict_fit(). Make sure to prefixtidypredict_fit()with the bang-bang operator so that the formula is evaluated.## # Source: lazy query [?? x 11] ## # Database: postgres [rstudio_admin@localhost:5432/postgres] ## order_id date date_year date_month customer_id customer_name customer_lon ## <int> <chr> <int> <int> <int> <chr> <dbl> ## 1 1001 2016… 2016 1 22 Dr. Birdie K… -122. ## 2 1002 2016… 2016 1 6 Meggan Bruen -122. ## 3 1003 2016… 2016 1 80 Jessee Rodri… -122. ## 4 1004 2016… 2016 1 55 Kathryn Stehr -122. ## 5 1005 2016… 2016 1 73 Merlyn Runol… -122. ## 6 1006 2016… 2016 1 70 Reggie Mills -122. ## 7 1007 2016… 2016 1 55 Kathryn Stehr -122. ## 8 1008 2016… 2016 1 40 Dr. Trace Gl… -122. ## 9 1009 2016… 2016 1 78 Pricilla Goo… -122. ## 10 1010 2016… 2016 1 35 Mr. Commodor… -122. ## # … with 4 more variables: customer_lat <dbl>, order_total <dbl>, ## # order_qty <int>, my_pred <dbl>Replace the
mutate()command withtidypredict_to_column()## # Source: lazy query [?? x 11] ## # Database: postgres [rstudio_admin@localhost:5432/postgres] ## order_id date date_year date_month customer_id customer_name customer_lon ## <int> <chr> <int> <int> <int> <chr> <dbl> ## 1 1001 2016… 2016 1 22 Dr. Birdie K… -122. ## 2 1002 2016… 2016 1 6 Meggan Bruen -122. ## 3 1003 2016… 2016 1 80 Jessee Rodri… -122. ## 4 1004 2016… 2016 1 55 Kathryn Stehr -122. ## 5 1005 2016… 2016 1 73 Merlyn Runol… -122. ## 6 1006 2016… 2016 1 70 Reggie Mills -122. ## 7 1007 2016… 2016 1 55 Kathryn Stehr -122. ## 8 1008 2016… 2016 1 40 Dr. Trace Gl… -122. ## 9 1009 2016… 2016 1 78 Pricilla Goo… -122. ## 10 1010 2016… 2016 1 35 Mr. Commodor… -122. ## # … with 4 more variables: customer_lat <dbl>, order_total <dbl>, ## # order_qty <int>, fit <dbl>Load the
yamllibraryUse
write_yaml()to save the contents ofparsed_parsnipinto a file called model.yamlUsing
read_yaml(), read the contents of the model.yaml file into the a new variable calledloaded_modelUse
as_parsed_model()to convert theloaded_modelvariable into atidypredictparsed model object, assign the results toloaded_model_2
7.4 Run predictions in DB
Load the
modeldblibraryUse
select()to pick theorder_totalandorder_qtyfields from theorders_sample_dbtable pointer## # Source: lazy query [?? x 2] ## # Database: postgres [rstudio_admin@localhost:5432/postgres] ## order_total order_qty ## <dbl> <int> ## 1 23.2 3 ## 2 7.53 1 ## 3 13.7 2 ## 4 17.6 3 ## 5 12.6 2 ## 6 31.7 5 ## 7 31.7 5 ## 8 21.2 3 ## 9 21.2 3 ## 10 37.1 6 ## # … with more rowsPipe the code into the
linear_regression_db()function, passorder_totalas the only argument## # A tibble: 1 x 2 ## `(Intercept)` order_qty ## <dbl> <dbl> ## 1 0.105 6.64Assign the model results to a new variable called
db_modelUse
as_parsed_model()to convertdb_modelto a parsed model object. Assign to new a variable calledpmUse
head()to get the top 10 records, and then pipe intotidypredict_to_column()to add the results frompm## # Source: lazy query [?? x 11] ## # Database: postgres [rstudio_admin@localhost:5432/postgres] ## order_id date date_year date_month customer_id customer_name customer_lon ## <int> <chr> <int> <int> <int> <chr> <dbl> ## 1 1001 2016… 2016 1 22 Dr. Birdie K… -122. ## 2 1002 2016… 2016 1 6 Meggan Bruen -122. ## 3 1003 2016… 2016 1 80 Jessee Rodri… -122. ## 4 1004 2016… 2016 1 55 Kathryn Stehr -122. ## 5 1005 2016… 2016 1 73 Merlyn Runol… -122. ## 6 1006 2016… 2016 1 70 Reggie Mills -122. ## 7 1007 2016… 2016 1 55 Kathryn Stehr -122. ## 8 1008 2016… 2016 1 40 Dr. Trace Gl… -122. ## 9 1009 2016… 2016 1 78 Pricilla Goo… -122. ## 10 1010 2016… 2016 1 35 Mr. Commodor… -122. ## # … with 4 more variables: customer_lat <dbl>, order_total <dbl>, ## # order_qty <int>, fit <dbl>