5 Databases and dplyr

5.1 Intro to connections

Use connections to open open a database connection

  1. Load the connections package

  2. Use connection_open() to open a Database connection

  3. The RStudio Connections pane should show the tables in the database

5.2 Table reference

Use the dplyr’s tbl() command

  1. Load the dplyr package

  2. Add in_schema() as an argument to tbl() to specify the schema

    ## # Source:   table<retail.customer> [?? x 6]
    ## # Database: postgres [rstudio_admin@localhost:5432/postgres]
    ##    customer_id customer_name  customer_phone customer_cc customer_lon customer_lat
    ##          <int> <chr>          <chr>          <chr>              <dbl>        <dbl>
    ##  1           1 Marilou Donne… 046-995-9387x… 4054106117…        -122.         37.8
    ##  2           2 Aubrey Gulgow… (020)136-2064  6759766520…        -122.         37.7
    ##  3           3 Arlis Koss     145.574.8189   8699968904…        -122.         37.8
    ##  4           4 Duwayne Walsh  737-897-1968x… 4091991124…        -122.         37.7
    ##  5           5 Nehemiah Doyl… (035)642-3662… 3709535249…        -122.         37.7
    ##  6           6 Meggan Bruen   326-151-4331   4964180480…        -122.         37.7
    ##  7           7 Tracie Swift … 776.442.3270x… 4354911637…        -122.         37.8
    ##  8           8 Karrie Donnel… 883.024.5322x… 4232403376…        -122.         37.8
    ##  9           9 Kip Eichmann   (619)169-8761… 5177848238…        -122.         37.7
    ## 10          10 Ms. Ciarra Bo… 964-240-3124   4893126879…        -122.         37.8
    ## # … with more rows
  3. Load the results from the tbl() command that points the table called orders to a variable called orders

  4. Use the class function to determine the object type of orders

    ## [1] "tbl_conn"         "tbl_PqConnection" "tbl_dbi"          "tbl_sql"         
    ## [5] "tbl_lazy"         "tbl"

5.3 Under the hood

Use show_query() to preview the SQL statement that will be sent to the database

  1. Use show_query() to preview SQL statement that actually runs when we run orders as a command

    ## <SQL>
    ## SELECT *
    ## FROM retail.orders
  2. When executed, orders returns the first 1000 rows of the remote orders table

    ## # Source:   table<retail.orders> [?? x 3]
    ## # Database: postgres [rstudio_admin@localhost:5432/postgres]
    ##    order_id customer_id step_id
    ##       <int>       <int>   <dbl>
    ##  1   801473          58     796
    ##  2   801474          19     796
    ##  3   801475           4     796
    ##  4   801476          10     796
    ##  5   801477          81     796
    ##  6   801478          89     796
    ##  7   801479          56     796
    ##  8   801480          53     796
    ##  9   801481          70     796
    ## 10   801482          37     796
    ## # … with more rows
  3. Full results of a remote query can be brought into R with collect

  4. Easily view the resulting query by adding show_query() in another piped command

    ## <SQL>
    ## SELECT *
    ## FROM retail.orders
  5. Insert head() in between the two statements to see how the SQL changes

    ## <SQL>
    ## SELECT *
    ## FROM retail.orders
    ## LIMIT 6
  6. Queries can be assigned to variables. Create a variable called orders_head that contains the previous query

    ## <SQL>
    ## SELECT *
    ## FROM retail.orders
    ## LIMIT 6
  7. Use sql_render() and simulate_mssql() to see how the SQL statement changes from vendor to vendor

    ## <SQL> SELECT TOP(6) *
    ## FROM retail.orders
  8. Use explain() to explore the query plan

    ## <SQL>
    ## SELECT *
    ## FROM retail.orders
    ## LIMIT 6
    ## 
    ## <PLAN>
    ## Limit  (cost=0.00..0.09 rows=6 width=12)
    ##   ->  Seq Scan on orders  (cost=0.00..15406.47 rows=1000047 width=12)

5.4 Un-translated R commands

Review of how dbplyr handles R commands that have not been translated into a like-SQL command

  1. Preview how mean is translated

    ## <SQL>
    ## SELECT "order_id", "customer_id", "step_id", AVG("order_id") OVER () AS "avg_id"
    ## FROM retail.orders
  2. Preview how Sys.Date() is translated

    ## <SQL>
    ## SELECT "order_id", "customer_id", "step_id", Sys.Date() AS "today"
    ## FROM retail.orders
  3. Use PostgreSQL native commands, in this case date

    ## <SQL>
    ## SELECT "order_id", "customer_id", "step_id", date('now') AS "today"
    ## FROM retail.orders
  4. Run the dplyr code to confirm it works

    ## # Source:   lazy query [?? x 4]
    ## # Database: postgres [rstudio_admin@localhost:5432/postgres]
    ##   order_id customer_id step_id today     
    ##      <int>       <int>   <dbl> <date>    
    ## 1   801473          58     796 2020-01-23
    ## 2   801474          19     796 2020-01-23
    ## 3   801475           4     796 2020-01-23
    ## 4   801476          10     796 2020-01-23
    ## 5   801477          81     796 2020-01-23
    ## 6   801478          89     796 2020-01-23

5.5 Using bang-bang

Intro on passing unevaluated code to a dplyr verb

  1. Preview how Sys.Date() is translated when prefixing !!

    ## <SQL>
    ## SELECT "order_id", "customer_id", "step_id", '2020-01-23' AS "today"
    ## FROM retail.orders
  2. View resulting table when Sys.Date() is translated when prefixing !!

    ## # Source:   lazy query [?? x 4]
    ## # Database: postgres [rstudio_admin@localhost:5432/postgres]
    ##   order_id customer_id step_id today     
    ##      <int>       <int>   <dbl> <chr>     
    ## 1   801473          58     796 2020-01-23
    ## 2   801474          19     796 2020-01-23
    ## 3   801475           4     796 2020-01-23
    ## 4   801476          10     796 2020-01-23
    ## 5   801477          81     796 2020-01-23
    ## 6   801478          89     796 2020-01-23
  3. Disconnect from the database using connection_close