7 Modeling with databases

7.1 Single step sampling

Use PostgreSQL TABLESAMPLE clause

  1. Use connection_open() to open a Database connection

  2. Set the orders variable to point to the orders table

  3. Set the orders_view variable to point to the v_orders table

  4. Pipe orders into the function show_query()

    ## <SQL>
    ## SELECT *
    ## FROM retail.orders
  5. Pipe the previous command into the class() function to see the kind of output show_query() returns

    ## <SQL>
    ## SELECT *
    ## FROM retail.orders
    ## [1] "tbl_conn"         "tbl_PqConnection" "tbl_dbi"          "tbl_sql"         
    ## [5] "tbl_lazy"         "tbl"
  6. Replace show_query() with remote_query() to compare the output types

    ## [1] "sql"       "character"
  7. Replace class() with build_sql(). Use con as the value for the con argument

    ## <SQL> SELECT *
    ## FROM retail.orders
  8. Add " TABLESAMPLE BERNOULLI (0.1)" to build_sql() as another ... argument

    ## <SQL> SELECT *
    ## FROM retail.orders TABLESAMPLE BERNOULLI (0.1)
  9. Pipe the code into tbl(). Use con for the con argument, and . for the rest

    ## # Source:   SQL [?? x 3]
    ## # Database: postgres [rstudio_admin@localhost:5432/postgres]
    ##    order_id customer_id step_id
    ##       <int>       <int>   <dbl>
    ##  1   969600          39     965
    ##  2   970974          89     966
    ##  3   973046          86     967
    ##  4   973200          78     967
    ##  5   975219          75     968
    ##  6   975885          62     968
    ##  7   977001          21     969
    ##  8   979327          53     970
    ##  9   973304          26     972
    ## 10   973555          48     972
    ## # … with more rows
  10. Use inner_join() to add the information from the orders_view pointer, use order_id as the matching field

    ## # Source:   lazy query [?? x 12]
    ## # Database: postgres [rstudio_admin@localhost:5432/postgres]
    ##    order_id customer_id.x step_id date  date_year date_month customer_id.y
    ##       <int>         <int>   <dbl> <chr>     <int>      <int>         <int>
    ##  1     5693            33       3 2016…      2016          1            33
    ##  2     7095            69       4 2016…      2016          1            69
    ##  3     7280            78       4 2016…      2016          1            78
    ##  4    11260            20       6 2016…      2016          1            20
    ##  5    11260            20       6 2016…      2016          1            22
    ##  6    15317            22      13 2016…      2016          1            87
    ##  7    15317            22      13 2016…      2016          1            22
    ##  8    15841            41       8 2016…      2016          1            41
    ##  9    15841            41       8 2016…      2016          1            51
    ## 10    21315            54      16 2016…      2016          1            54
    ## # … with more rows, and 5 more variables: customer_name <chr>,
    ## #   customer_lon <dbl>, customer_lat <dbl>, order_total <dbl>, order_qty <int>
  11. Assign the resulting code to a variable orders_sample_db

  12. Use collect() to load the results of orders_sample_db to a new variable called orders_sample

  13. Load the dbplot library

  14. Use dbplot_histogram() to visualize the distribution of order_total from orders_sample

  15. Use dbplot_histogram() to visualize the distribution of order_total from orders_view

7.2 Using tidymodels for modeling

Fit and measure the model’s performance using functions from parsnip and yardstick

  1. Load the tidymodels library

  2. Start with the linear_reg() command, pipe into set_engine(), and use “lm” as its sole argument

    ## Linear Regression Model Specification (regression)
    ## 
    ## Computational engine: lm
  3. Pipe into the fit() command. Use the formula: order_total ~ order_qty, and orders_sample as the data argument

    ## parsnip model object
    ## 
    ## Fit in:  126ms
    ## Call:
    ## stats::lm(formula = formula, data = data)
    ## 
    ## Coefficients:
    ## (Intercept)    order_qty  
    ##     0.09443      6.62037
  4. Assign the previous code to a variable called parsnip_model

  5. Use bind_cols() to add the predictions to order_sample. Calculate the prediction with predict()

    ## # A tibble: 1,964 x 13
    ##    order_id customer_id.x step_id date  date_year date_month customer_id.y
    ##       <int>         <int>   <dbl> <chr>     <int>      <int>         <int>
    ##  1     1864            35       1 2016…      2016          1            35
    ##  2     3617            13       2 2016…      2016          1            13
    ##  3     5279            81       3 2016…      2016          1            81
    ##  4     5426            42       3 2016…      2016          1            42
    ##  5     7472             1       4 2016…      2016          1             1
    ##  6     7967            50       4 2016…      2016          1            50
    ##  7    11174            74       6 2016…      2016          1            74
    ##  8    11174            74       6 2016…      2016          1            50
    ##  9    11532            22      11 2016…      2016          1            62
    ## 10    11532            22      11 2016…      2016          1            22
    ## # … with 1,954 more rows, and 6 more variables: customer_name <chr>,
    ## #   customer_lon <dbl>, customer_lat <dbl>, order_total <dbl>, order_qty <int>,
    ## #   .pred <dbl>
  6. Pipe the code into the metrics() function. Use order_total as the truth argument, and .pred as the estimate argument

    ## # A tibble: 3 x 3
    ##   .metric .estimator .estimate
    ##   <chr>   <chr>          <dbl>
    ## 1 rmse    standard       3.50 
    ## 2 rsq     standard       0.939
    ## 3 mae     standard       2.78

7.3 Score with tidypredict

  1. Load the tidypredict library

  2. Use the parse_model() function to parse parsnip_model, and assign it to a variable called parsed_parsnip

  3. Use str() to see the parsed_parsnip object’s structure

    ## List of 2
    ##  $ general:List of 6
    ##   ..$ model   : chr "lm"
    ##   ..$ version : num 2
    ##   ..$ type    : chr "regression"
    ##   ..$ residual: int 1962
    ##   ..$ sigma2  : num 12.3
    ##   ..$ is_glm  : num 0
    ##  $ terms  :List of 2
    ##   ..$ :List of 5
    ##   .. ..$ label       : chr "(Intercept)"
    ##   .. ..$ coef        : num 0.0944
    ##   .. ..$ is_intercept: num 1
    ##   .. ..$ fields      :List of 1
    ##   .. .. ..$ :List of 2
    ##   .. .. .. ..$ type: chr "ordinary"
    ##   .. .. .. ..$ col : chr "(Intercept)"
    ##   .. ..$ qr          :List of 2
    ##   .. .. ..$ qr_1: num -0.0226
    ##   .. .. ..$ qr_2: num -0.0639
    ##   ..$ :List of 5
    ##   .. ..$ label       : chr "order_qty"
    ##   .. ..$ coef        : num 6.62
    ##   .. ..$ is_intercept: num 0
    ##   .. ..$ fields      :List of 1
    ##   .. .. ..$ :List of 2
    ##   .. .. .. ..$ type: chr "ordinary"
    ##   .. .. .. ..$ col : chr "order_qty"
    ##   .. ..$ qr          :List of 2
    ##   .. .. ..$ qr_1: num 0
    ##   .. .. ..$ qr_2: num 0.0109
    ##  - attr(*, "class")= chr [1:3] "parsed_model" "pm_regression" "list"
  4. Use tidypredict_fit() to view the dplyr formula that calculates the prediction

    ## 0.0944307069267399 + (order_qty * 6.62036536915535)
  5. Use head() to get the first 10 records from orders_view

    ## # Source:   lazy query [?? x 10]
    ## # Database: postgres [rstudio_admin@localhost:5432/postgres]
    ##    order_id date  date_year date_month customer_id customer_name customer_lon
    ##       <int> <chr>     <int>      <int>       <int> <chr>                <dbl>
    ##  1     1001 2016…      2016          1          22 Dr. Birdie K…        -122.
    ##  2     1002 2016…      2016          1           6 Meggan Bruen         -122.
    ##  3     1003 2016…      2016          1          80 Jessee Rodri…        -122.
    ##  4     1004 2016…      2016          1          55 Kathryn Stehr        -122.
    ##  5     1005 2016…      2016          1          73 Merlyn Runol…        -122.
    ##  6     1006 2016…      2016          1          70 Reggie Mills         -122.
    ##  7     1007 2016…      2016          1          55 Kathryn Stehr        -122.
    ##  8     1008 2016…      2016          1          40 Dr. Trace Gl…        -122.
    ##  9     1009 2016…      2016          1          78 Pricilla Goo…        -122.
    ## 10     1010 2016…      2016          1          35 Mr. Commodor…        -122.
    ## # … with 3 more variables: customer_lat <dbl>, order_total <dbl>, order_qty <int>
  6. Pipe the code into mutate(). Assign to a new my_pred variable the results of tidypredict_fit(). Make sure to prefix tidypredict_fit() with the bang-bang operator so that the formula is evaluated.

    ## # Source:   lazy query [?? x 11]
    ## # Database: postgres [rstudio_admin@localhost:5432/postgres]
    ##    order_id date  date_year date_month customer_id customer_name customer_lon
    ##       <int> <chr>     <int>      <int>       <int> <chr>                <dbl>
    ##  1     1001 2016…      2016          1          22 Dr. Birdie K…        -122.
    ##  2     1002 2016…      2016          1           6 Meggan Bruen         -122.
    ##  3     1003 2016…      2016          1          80 Jessee Rodri…        -122.
    ##  4     1004 2016…      2016          1          55 Kathryn Stehr        -122.
    ##  5     1005 2016…      2016          1          73 Merlyn Runol…        -122.
    ##  6     1006 2016…      2016          1          70 Reggie Mills         -122.
    ##  7     1007 2016…      2016          1          55 Kathryn Stehr        -122.
    ##  8     1008 2016…      2016          1          40 Dr. Trace Gl…        -122.
    ##  9     1009 2016…      2016          1          78 Pricilla Goo…        -122.
    ## 10     1010 2016…      2016          1          35 Mr. Commodor…        -122.
    ## # … with 4 more variables: customer_lat <dbl>, order_total <dbl>,
    ## #   order_qty <int>, my_pred <dbl>
  7. Replace the mutate() command with tidypredict_to_column()

    ## # Source:   lazy query [?? x 11]
    ## # Database: postgres [rstudio_admin@localhost:5432/postgres]
    ##    order_id date  date_year date_month customer_id customer_name customer_lon
    ##       <int> <chr>     <int>      <int>       <int> <chr>                <dbl>
    ##  1     1001 2016…      2016          1          22 Dr. Birdie K…        -122.
    ##  2     1002 2016…      2016          1           6 Meggan Bruen         -122.
    ##  3     1003 2016…      2016          1          80 Jessee Rodri…        -122.
    ##  4     1004 2016…      2016          1          55 Kathryn Stehr        -122.
    ##  5     1005 2016…      2016          1          73 Merlyn Runol…        -122.
    ##  6     1006 2016…      2016          1          70 Reggie Mills         -122.
    ##  7     1007 2016…      2016          1          55 Kathryn Stehr        -122.
    ##  8     1008 2016…      2016          1          40 Dr. Trace Gl…        -122.
    ##  9     1009 2016…      2016          1          78 Pricilla Goo…        -122.
    ## 10     1010 2016…      2016          1          35 Mr. Commodor…        -122.
    ## # … with 4 more variables: customer_lat <dbl>, order_total <dbl>,
    ## #   order_qty <int>, fit <dbl>
  8. Load the yaml library

  9. Use write_yaml() to save the contents of parsed_parsnip into a file called model.yaml

  10. Using read_yaml(), read the contents of the model.yaml file into the a new variable called loaded_model

  11. Use as_parsed_model() to convert the loaded_model variable into a tidypredict parsed model object, assign the results to loaded_model_2

7.4 Run predictions in DB

  1. Load the modeldb library

  2. Use select() to pick the order_total and order_qty fields from the orders_sample_db table pointer

    ## # Source:   lazy query [?? x 2]
    ## # Database: postgres [rstudio_admin@localhost:5432/postgres]
    ##    order_total order_qty
    ##          <dbl>     <int>
    ##  1       23.2          3
    ##  2        7.53         1
    ##  3       13.7          2
    ##  4       17.6          3
    ##  5       12.6          2
    ##  6       31.7          5
    ##  7       31.7          5
    ##  8       21.2          3
    ##  9       21.2          3
    ## 10       37.1          6
    ## # … with more rows
  3. Pipe the code into the linear_regression_db() function, pass order_total as the only argument

    ## # A tibble: 1 x 2
    ##   `(Intercept)` order_qty
    ##           <dbl>     <dbl>
    ## 1         0.105      6.64
  4. Assign the model results to a new variable called db_model

  5. Use as_parsed_model() to convert db_model to a parsed model object. Assign to new a variable called pm

  6. Use head() to get the top 10 records, and then pipe into tidypredict_to_column() to add the results from pm

    ## # Source:   lazy query [?? x 11]
    ## # Database: postgres [rstudio_admin@localhost:5432/postgres]
    ##    order_id date  date_year date_month customer_id customer_name customer_lon
    ##       <int> <chr>     <int>      <int>       <int> <chr>                <dbl>
    ##  1     1001 2016…      2016          1          22 Dr. Birdie K…        -122.
    ##  2     1002 2016…      2016          1           6 Meggan Bruen         -122.
    ##  3     1003 2016…      2016          1          80 Jessee Rodri…        -122.
    ##  4     1004 2016…      2016          1          55 Kathryn Stehr        -122.
    ##  5     1005 2016…      2016          1          73 Merlyn Runol…        -122.
    ##  6     1006 2016…      2016          1          70 Reggie Mills         -122.
    ##  7     1007 2016…      2016          1          55 Kathryn Stehr        -122.
    ##  8     1008 2016…      2016          1          40 Dr. Trace Gl…        -122.
    ##  9     1009 2016…      2016          1          78 Pricilla Goo…        -122.
    ## 10     1010 2016…      2016          1          35 Mr. Commodor…        -122.
    ## # … with 4 more variables: customer_lat <dbl>, order_total <dbl>,
    ## #   order_qty <int>, fit <dbl>