7 Modeling with databases
7.1 Single step sampling
Use PostgreSQL TABLESAMPLE clause
Use
connection_open()
to open a Database connectionSet the
orders
variable to point to the orders tableSet the
orders_view
variable to point to the v_orders tablePipe
orders
into the functionshow_query()
## <SQL> ## SELECT * ## FROM retail.orders
Pipe the previous command into the
class()
function to see the kind of outputshow_query()
returns## <SQL> ## SELECT * ## FROM retail.orders
## [1] "tbl_conn" "tbl_PqConnection" "tbl_dbi" "tbl_sql" ## [5] "tbl_lazy" "tbl"
Replace
show_query()
withremote_query()
to compare the output types## [1] "sql" "character"
Replace
class()
withbuild_sql()
. Usecon
as the value for thecon
argument## <SQL> SELECT * ## FROM retail.orders
Add " TABLESAMPLE BERNOULLI (0.1)" to
build_sql()
as another...
argument## <SQL> SELECT * ## FROM retail.orders TABLESAMPLE BERNOULLI (0.1)
Pipe the code into
tbl()
. Usecon
for thecon
argument, and.
for the rest## # Source: SQL [?? x 3] ## # Database: postgres [rstudio_admin@localhost:5432/postgres] ## order_id customer_id step_id ## <int> <int> <dbl> ## 1 969600 39 965 ## 2 970974 89 966 ## 3 973046 86 967 ## 4 973200 78 967 ## 5 975219 75 968 ## 6 975885 62 968 ## 7 977001 21 969 ## 8 979327 53 970 ## 9 973304 26 972 ## 10 973555 48 972 ## # … with more rows
Use
inner_join()
to add the information from theorders_view
pointer, useorder_id
as the matching fieldorders %>% remote_query() %>% build_sql(con = con, " TABLESAMPLE BERNOULLI (0.1)") %>% tbl(con, .) %>% inner_join(orders_view, by = "order_id")
## # Source: lazy query [?? x 12] ## # Database: postgres [rstudio_admin@localhost:5432/postgres] ## order_id customer_id.x step_id date date_year date_month customer_id.y ## <int> <int> <dbl> <chr> <int> <int> <int> ## 1 5693 33 3 2016… 2016 1 33 ## 2 7095 69 4 2016… 2016 1 69 ## 3 7280 78 4 2016… 2016 1 78 ## 4 11260 20 6 2016… 2016 1 20 ## 5 11260 20 6 2016… 2016 1 22 ## 6 15317 22 13 2016… 2016 1 87 ## 7 15317 22 13 2016… 2016 1 22 ## 8 15841 41 8 2016… 2016 1 41 ## 9 15841 41 8 2016… 2016 1 51 ## 10 21315 54 16 2016… 2016 1 54 ## # … with more rows, and 5 more variables: customer_name <chr>, ## # customer_lon <dbl>, customer_lat <dbl>, order_total <dbl>, order_qty <int>
Assign the resulting code to a variable
orders_sample_db
Use
collect()
to load the results oforders_sample_db
to a new variable calledorders_sample
Load the
dbplot
libraryUse
dbplot_histogram()
to visualize the distribution oforder_total
fromorders_sample
Use
dbplot_histogram()
to visualize the distribution oforder_total
fromorders_view
7.2 Using tidymodels
for modeling
Fit and measure the model’s performance using functions from parsnip
and yardstick
Load the
tidymodels
libraryStart with the
linear_reg()
command, pipe intoset_engine()
, and use “lm” as its sole argument## Linear Regression Model Specification (regression) ## ## Computational engine: lm
Pipe into the
fit()
command. Use the formula:order_total ~ order_qty
, andorders_sample
as thedata
argument## parsnip model object ## ## Fit in: 126ms ## Call: ## stats::lm(formula = formula, data = data) ## ## Coefficients: ## (Intercept) order_qty ## 0.09443 6.62037
Assign the previous code to a variable called
parsnip_model
Use
bind_cols()
to add the predictions toorder_sample
. Calculate the prediction withpredict()
## # A tibble: 1,964 x 13 ## order_id customer_id.x step_id date date_year date_month customer_id.y ## <int> <int> <dbl> <chr> <int> <int> <int> ## 1 1864 35 1 2016… 2016 1 35 ## 2 3617 13 2 2016… 2016 1 13 ## 3 5279 81 3 2016… 2016 1 81 ## 4 5426 42 3 2016… 2016 1 42 ## 5 7472 1 4 2016… 2016 1 1 ## 6 7967 50 4 2016… 2016 1 50 ## 7 11174 74 6 2016… 2016 1 74 ## 8 11174 74 6 2016… 2016 1 50 ## 9 11532 22 11 2016… 2016 1 62 ## 10 11532 22 11 2016… 2016 1 22 ## # … with 1,954 more rows, and 6 more variables: customer_name <chr>, ## # customer_lon <dbl>, customer_lat <dbl>, order_total <dbl>, order_qty <int>, ## # .pred <dbl>
Pipe the code into the
metrics()
function. Useorder_total
as thetruth
argument, and.pred
as theestimate
argumentorders_sample %>% bind_cols(predict(parsnip_model, orders_sample)) %>% metrics(truth = order_total, estimate = .pred)
## # A tibble: 3 x 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 rmse standard 3.50 ## 2 rsq standard 0.939 ## 3 mae standard 2.78
7.3 Score with tidypredict
Load the
tidypredict
libraryUse the
parse_model()
function to parseparsnip_model
, and assign it to a variable calledparsed_parsnip
Use
str()
to see theparsed_parsnip
object’s structure## List of 2 ## $ general:List of 6 ## ..$ model : chr "lm" ## ..$ version : num 2 ## ..$ type : chr "regression" ## ..$ residual: int 1962 ## ..$ sigma2 : num 12.3 ## ..$ is_glm : num 0 ## $ terms :List of 2 ## ..$ :List of 5 ## .. ..$ label : chr "(Intercept)" ## .. ..$ coef : num 0.0944 ## .. ..$ is_intercept: num 1 ## .. ..$ fields :List of 1 ## .. .. ..$ :List of 2 ## .. .. .. ..$ type: chr "ordinary" ## .. .. .. ..$ col : chr "(Intercept)" ## .. ..$ qr :List of 2 ## .. .. ..$ qr_1: num -0.0226 ## .. .. ..$ qr_2: num -0.0639 ## ..$ :List of 5 ## .. ..$ label : chr "order_qty" ## .. ..$ coef : num 6.62 ## .. ..$ is_intercept: num 0 ## .. ..$ fields :List of 1 ## .. .. ..$ :List of 2 ## .. .. .. ..$ type: chr "ordinary" ## .. .. .. ..$ col : chr "order_qty" ## .. ..$ qr :List of 2 ## .. .. ..$ qr_1: num 0 ## .. .. ..$ qr_2: num 0.0109 ## - attr(*, "class")= chr [1:3] "parsed_model" "pm_regression" "list"
Use
tidypredict_fit()
to view thedplyr
formula that calculates the prediction## 0.0944307069267399 + (order_qty * 6.62036536915535)
Use
head()
to get the first 10 records fromorders_view
## # Source: lazy query [?? x 10] ## # Database: postgres [rstudio_admin@localhost:5432/postgres] ## order_id date date_year date_month customer_id customer_name customer_lon ## <int> <chr> <int> <int> <int> <chr> <dbl> ## 1 1001 2016… 2016 1 22 Dr. Birdie K… -122. ## 2 1002 2016… 2016 1 6 Meggan Bruen -122. ## 3 1003 2016… 2016 1 80 Jessee Rodri… -122. ## 4 1004 2016… 2016 1 55 Kathryn Stehr -122. ## 5 1005 2016… 2016 1 73 Merlyn Runol… -122. ## 6 1006 2016… 2016 1 70 Reggie Mills -122. ## 7 1007 2016… 2016 1 55 Kathryn Stehr -122. ## 8 1008 2016… 2016 1 40 Dr. Trace Gl… -122. ## 9 1009 2016… 2016 1 78 Pricilla Goo… -122. ## 10 1010 2016… 2016 1 35 Mr. Commodor… -122. ## # … with 3 more variables: customer_lat <dbl>, order_total <dbl>, order_qty <int>
Pipe the code into
mutate()
. Assign to a newmy_pred
variable the results oftidypredict_fit()
. Make sure to prefixtidypredict_fit()
with the bang-bang operator so that the formula is evaluated.## # Source: lazy query [?? x 11] ## # Database: postgres [rstudio_admin@localhost:5432/postgres] ## order_id date date_year date_month customer_id customer_name customer_lon ## <int> <chr> <int> <int> <int> <chr> <dbl> ## 1 1001 2016… 2016 1 22 Dr. Birdie K… -122. ## 2 1002 2016… 2016 1 6 Meggan Bruen -122. ## 3 1003 2016… 2016 1 80 Jessee Rodri… -122. ## 4 1004 2016… 2016 1 55 Kathryn Stehr -122. ## 5 1005 2016… 2016 1 73 Merlyn Runol… -122. ## 6 1006 2016… 2016 1 70 Reggie Mills -122. ## 7 1007 2016… 2016 1 55 Kathryn Stehr -122. ## 8 1008 2016… 2016 1 40 Dr. Trace Gl… -122. ## 9 1009 2016… 2016 1 78 Pricilla Goo… -122. ## 10 1010 2016… 2016 1 35 Mr. Commodor… -122. ## # … with 4 more variables: customer_lat <dbl>, order_total <dbl>, ## # order_qty <int>, my_pred <dbl>
Replace the
mutate()
command withtidypredict_to_column()
## # Source: lazy query [?? x 11] ## # Database: postgres [rstudio_admin@localhost:5432/postgres] ## order_id date date_year date_month customer_id customer_name customer_lon ## <int> <chr> <int> <int> <int> <chr> <dbl> ## 1 1001 2016… 2016 1 22 Dr. Birdie K… -122. ## 2 1002 2016… 2016 1 6 Meggan Bruen -122. ## 3 1003 2016… 2016 1 80 Jessee Rodri… -122. ## 4 1004 2016… 2016 1 55 Kathryn Stehr -122. ## 5 1005 2016… 2016 1 73 Merlyn Runol… -122. ## 6 1006 2016… 2016 1 70 Reggie Mills -122. ## 7 1007 2016… 2016 1 55 Kathryn Stehr -122. ## 8 1008 2016… 2016 1 40 Dr. Trace Gl… -122. ## 9 1009 2016… 2016 1 78 Pricilla Goo… -122. ## 10 1010 2016… 2016 1 35 Mr. Commodor… -122. ## # … with 4 more variables: customer_lat <dbl>, order_total <dbl>, ## # order_qty <int>, fit <dbl>
Load the
yaml
libraryUse
write_yaml()
to save the contents ofparsed_parsnip
into a file called model.yamlUsing
read_yaml()
, read the contents of the model.yaml file into the a new variable calledloaded_model
Use
as_parsed_model()
to convert theloaded_model
variable into atidypredict
parsed model object, assign the results toloaded_model_2
7.4 Run predictions in DB
Load the
modeldb
libraryUse
select()
to pick theorder_total
andorder_qty
fields from theorders_sample_db
table pointer## # Source: lazy query [?? x 2] ## # Database: postgres [rstudio_admin@localhost:5432/postgres] ## order_total order_qty ## <dbl> <int> ## 1 23.2 3 ## 2 7.53 1 ## 3 13.7 2 ## 4 17.6 3 ## 5 12.6 2 ## 6 31.7 5 ## 7 31.7 5 ## 8 21.2 3 ## 9 21.2 3 ## 10 37.1 6 ## # … with more rows
Pipe the code into the
linear_regression_db()
function, passorder_total
as the only argument## # A tibble: 1 x 2 ## `(Intercept)` order_qty ## <dbl> <dbl> ## 1 0.105 6.64
Assign the model results to a new variable called
db_model
Use
as_parsed_model()
to convertdb_model
to a parsed model object. Assign to new a variable calledpm
Use
head()
to get the top 10 records, and then pipe intotidypredict_to_column()
to add the results frompm
## # Source: lazy query [?? x 11] ## # Database: postgres [rstudio_admin@localhost:5432/postgres] ## order_id date date_year date_month customer_id customer_name customer_lon ## <int> <chr> <int> <int> <int> <chr> <dbl> ## 1 1001 2016… 2016 1 22 Dr. Birdie K… -122. ## 2 1002 2016… 2016 1 6 Meggan Bruen -122. ## 3 1003 2016… 2016 1 80 Jessee Rodri… -122. ## 4 1004 2016… 2016 1 55 Kathryn Stehr -122. ## 5 1005 2016… 2016 1 73 Merlyn Runol… -122. ## 6 1006 2016… 2016 1 70 Reggie Mills -122. ## 7 1007 2016… 2016 1 55 Kathryn Stehr -122. ## 8 1008 2016… 2016 1 40 Dr. Trace Gl… -122. ## 9 1009 2016… 2016 1 78 Pricilla Goo… -122. ## 10 1010 2016… 2016 1 35 Mr. Commodor… -122. ## # … with 4 more variables: customer_lat <dbl>, order_total <dbl>, ## # order_qty <int>, fit <dbl>