A Crash Course on PostgreSQL for R Users

Updated on 2020-09-19: I changed the "Creating tables" section. In my original script I used copy_to(), which creates (unique) indexes, while when I adapted my code here I wrote dbWriteTable(), which doesn't create indexes and only appends records.

Updated on 2020-08-11: @jbkunst suggested a more concise approach to writing large tables with the tidyverse. I added an explanation of how to do that and avoid obscure syntax. Thanks Joshua!

Motivation

If, like me, you have well-organized workflows, even with tidy GitHub repositories, and you still create a lot of CSV or RDS files to share periodic data (e.g. weekly or monthly sales), then you can benefit from using PostgreSQL or any other SQL database engine such as MariaDB.

PostgreSQL and MariaDB are the database engines that I like, but this also applies to MySQL and even to proprietary databases such as Oracle Database or SQL Server.

Nycflights13

Let's start with the flights table from the quite popular nycflights13 package:

library(nycflights13)
flights
# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr>
1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR IAH
2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA IAH
3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK MIA
4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK BQN
5 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA ATL
6 2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR ORD
7 2013 1 1 555 600 -5 913 854 19 B6 507 N516JB EWR FLL
8 2013 1 1 557 600 -3 709 723 -14 EV 5708 N829AS LGA IAD
9 2013 1 1 557 600 -3 838 846 -8 B6 79 N593JB JFK MCO
10 2013 1 1 558 600 -2 753 745 8 AA 301 N3ALAA LGA ORD
# … with 336,766 more rows, and 5 more variables: air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Assume that you work at a NYC airport and you are in charge of updating the flights table every day. Instead of creating one CSV file per day with flight records, the process can be better organized by appending new records to an existing SQL table.

nycflights13 provides several tables besides flights. You can explore the details in the superb book R4DS. For this tutorial, the relations between the tables are summarised in the next diagram:
SQL indexes
Skipping a lot of details, an SQL index is an auxiliary structure that stores just a few columns of a table, helping the engine identify and filter observations much faster than scanning the table without an index.

For example, in the flights table, the month column is really useful to create an index that boosts the speed of SQL operations similar to this dplyr statement:

flights %>% filter(month == 3)
To create indexes, let’s start differentiating between unique and non-unique indexes:
unique_index <- list(
  airlines = list("carrier"),
  planes = list("tailnum")
)

index <- list(
  airports = list("faa"),
  flights = list(
    c("year", "month", "day"), "carrier", "tailnum", "origin", "dest"
  ),
  weather = list(c("year", "month", "day"), "origin")
)
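For reference, when copy_to() receives these lists it issues the corresponding CREATE INDEX statements for you. If you ever need to add an index by hand, you can send the equivalent SQL through DBI's dbExecute(); this is just a sketch, assuming an open DBI connection conn and illustrative index names:

```r
# unique index: also enforces that each carrier code appears only once
dbExecute(conn, "CREATE UNIQUE INDEX airlines_carrier_key ON airlines (carrier)")

# non-unique composite index for date-based filters on flights
dbExecute(conn, "CREATE INDEX flights_date_idx ON flights (year, month, day)")
```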
SQL connections
Let's assume that the NYC airport gave you access to a PostgreSQL database named postgres, hosted on a server with local IP 142.93.236.14, where the user is postgres and the password is S4cr3tP4ssw0rd007. Of course, this is an example with slightly different settings than the defaults.

Database information such as passwords should be stored in your ~/.Renviron and never in your scripts! You can type usethis::edit_r_environ() to edit your .Renviron file, add the credentials, and then restart your R session to apply the changes.
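For this example, the credential entries would look like the following (the variable names match the Sys.getenv() calls used later in this tutorial):

```
tutorial_db=postgres
tutorial_user=postgres
tutorial_pass=S4cr3tP4ssw0rd007
tutorial_host=142.93.236.14
```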
To access the database and start creating tables to better organize your data, you need to open and close connections, and the RPostgres package is really useful for that and more:
library(RPostgres)
library(dplyr)
fun_connect <- function() {
  dbConnect(
    Postgres(),
    dbname = Sys.getenv("tutorial_db"),
    user = Sys.getenv("tutorial_user"),
    password = Sys.getenv("tutorial_pass"),
    host = Sys.getenv("tutorial_host")
  )
}
By default, PostgreSQL uses the public schema in a database, which is what you'll write to. You can explore the existing tables in the public schema of the postgres database by running:
schema <- "public"

conn <- fun_connect()
remote_tables <- dbGetQuery(
  conn,
  sprintf("SELECT table_name
           FROM information_schema.tables
           WHERE table_schema = '%s'", schema)
)
dbDisconnect(conn)

remote_tables <- as.character(remote_tables$table_name)
remote_tables

character(0)
Because this is an example with a freshly created server just for this tutorial, the result shows that there are no existing tables in the database.
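As a shortcut, DBI also provides dbListTables(), which returns the visible tables without writing any SQL (it lists tables from every schema on the search path, so the information_schema query above gives finer control):

```r
conn <- fun_connect()
dbListTables(conn)
dbDisconnect(conn)
```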
Creating tables
Now you should be ready to start writing the tables from nycflights13 to the database:
local_tables <- utils::data(package = "nycflights13")$results[, "Item"]
tables <- setdiff(local_tables, remote_tables)

for (table in tables) {
  df <- getExportedValue("nycflights13", table)

  message("Creating table: ", table)

  table_name <- table

  conn <- fun_connect()
  copy_to(
    conn,
    df,
    table_name,
    unique_indexes = unique_index[[table]],
    indexes = index[[table]],
    temporary = FALSE
  )
  dbDisconnect(conn)
}
This approach prevents errors in case a table already exists. It also prevents a bigger mistake: overwriting an existing table instead of appending new observations.
Accessing tables
You can use dplyr in much the same way you would with a local data.frame. Let's say you need to count the observations per month and save the result locally:
library(dplyr)

conn <- fun_connect()
obs_per_month <- tbl(conn, "flights") %>%
  group_by(year, month) %>%
  count() %>%
  collect()
dbDisconnect(conn)

obs_per_month
# A tibble: 12 x 3
# Groups: year [1]
year month n
<int> <int> <int64>
1 2013 1 27004
2 2013 2 24951
3 2013 3 28834
4 2013 4 28330
5 2013 5 28796
6 2013 6 28243
7 2013 7 29425
8 2013 8 29327
9 2013 9 27574
10 2013 10 28889
11 2013 11 27268
12 2013 12 28135
The previous chunk uses the collect() function, which stores the result of the query as a local data.frame that can then be used with ggplot2, saved as a CSV with readr, etc.
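If you are curious about the SQL that dplyr generates behind the scenes, you can replace collect() with show_query(), which prints the translated query instead of retrieving the data:

```r
conn <- fun_connect()
tbl(conn, "flights") %>%
  group_by(year, month) %>%
  count() %>%
  show_query()
dbDisconnect(conn)
```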
Appending new records
Let's say you have a table flights14 with records for the year 2014. To append those records to the existing flights table in the database, you need to run:
conn <- fun_connect()
dbWriteTable(conn, "flights", flights14, append = TRUE, overwrite = FALSE, row.names = FALSE)
dbDisconnect(conn)
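After appending, a quick sanity check is to count the rows per year on the server side; if the append worked, the 2014 records should now show up alongside the 2013 ones:

```r
conn <- fun_connect()
tbl(conn, "flights") %>%
  count(year) %>%
  collect()
dbDisconnect(conn)
```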
Appending large tables
Let's say your flights14 table contains 12,345,678 rows. A better approach than writing the full table at once is to split it into chunks of 2,500,000 rows (or some other smaller number).
Using base R
N <- 2500000
divisions <- seq(N, N * ((nrow(flights14) %/% N) + 1), by = N)
# cap the last division at the number of rows to avoid out-of-range indexing
divisions <- unique(pmin(divisions, nrow(flights14)))

flights14_2 <- list()

for (j in seq_along(divisions)) {
  if (j == 1) {
    flights14_2[[j]] <- flights14[1:divisions[j], ]
  } else {
    flights14_2[[j]] <- flights14[(divisions[j - 1] + 1):divisions[j], ]
  }
}

rm(flights14)
for (j in seq_along(divisions)) {
  message(sprintf("Writing fragment %s of %s", j, length(divisions)))
  conn <- fun_connect()
  dbWriteTable(conn, "flights", flights14_2[[j]],
               append = TRUE, overwrite = FALSE, row.names = FALSE)
  dbDisconnect(conn)
}
Using the Tidyverse
library(tidyverse)

N <- 2500000

flights14 <- flights14 %>%
  mutate(p = floor(row_number() / N) + 1) %>%
  group_by(p) %>%
  nest() %>%
  ungroup() %>%
  select(data) %>%
  pull()

map(
  seq_along(flights14),
  function(x) {
    message(sprintf("Writing fragment %s of %s", x, length(flights14)))
    conn <- fun_connect()
    dbWriteTable(conn, "flights", flights14[[x]],
                 append = TRUE, overwrite = FALSE, row.names = FALSE)
    dbDisconnect(conn)
  }
)
Resources used for this tutorial
I created a 1GB RAM + 25GB disk VPS by using the Supabase Postgres Image from Digital Ocean Marketplace. You can even explore the code on GitHub.
Support more open source projects
You can Buy me a coffee. If you want to try DigitalOcean, please use my referral link, which gives you $100 in credits for 60 days and helps me cover the hosting costs for my open source projects.