I’ve been busy with the field exams, so I haven’t had much time to work on the blog.
The spuriouscorrelations package started as a fun project for one of my tutorials.
Here is one interesting correlation from it: the number of people who drowned by falling into a pool and the number of films Nicolas Cage appeared in.
if (!require(spuriouscorrelations)) install.packages("spuriouscorrelations")
Loading required package: spuriouscorrelations
if (!require(dplyr)) install.packages("dplyr")
Loading required package: dplyr
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
if (!require(ggplot2)) install.packages("ggplot2")
Loading required package: ggplot2
library(spuriouscorrelations)
library(dplyr)
library(ggplot2)
unique(spurious_correlations$var1)
[1] US spending on science, space, and technology
[2] Number of people who drowned by falling into a pool
[3] Per capita cheese consumption
[4] Divorce rate in Maine
[5] Age of Miss America
[6] Total revenue generated by arcades
[7] Worldwide non-commercial space launches
[8] Per capita consumption of mozzarella cheese
[9] People who drowned after falling out of a fishing boat
[10] US crude oil imports from Norway
[11] Per capita consumption of chicken
[12] Number of people who drowned while in a swimming-pool
[13] Japanese cars sold in the US
[14] Letters in the winning word of the Scripps National Spelling Bee
[15] Mathematics doctorates awarded
15 Levels: Age of Miss America ... Worldwide non-commercial space launches
drownings <- spurious_correlations %>%
  filter(
    var1 == "Number of people who drowned by falling into a pool"
  ) %>%
  select(year, var1, var2, var1_value, var2_value)
cor(drownings$var1_value, drownings$var2_value)
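With only a handful of yearly observations, a single correlation coefficient carries a lot of uncertainty. As a rough sketch, base R's `cor.test()` reports a confidence interval and a p-value alongside the estimate. The vectors `x` and `y` below are made-up toy numbers, not values from the package.

```r
# Toy data: hypothetical values, NOT taken from spurious_correlations
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2, 1, 4, 3, 6, 5, 8, 7)

# Pearson correlation test with the default 95% confidence level
res <- cor.test(x, y, method = "pearson")

res$estimate  # the same value cor(x, y) returns
res$conf.int  # 95% confidence interval for the correlation
res$p.value   # p-value for the null hypothesis of zero correlation
```

On the filtered data, the equivalent call would be `cor.test(drownings$var1_value, drownings$var2_value)`.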
Now let’s plot the data.
# compute a scale factor so that max(var2_value * factor) ≈ max(var1_value)
max1 <- max(drownings$var1_value)
max2 <- max(drownings$var2_value)
ratio <- max1 / max2
ggplot(drownings, aes(x = year)) +
  geom_line(aes(y = var1_value, color = "Drownings")) +
  geom_line(aes(y = var2_value * ratio, color = "Films")) +
  scale_y_continuous(
    name = "Number of drownings",
    sec.axis = sec_axis(~ . / ratio,
      name = "Number of films"
    ),
    limits = c(0, NA)
  ) +
  scale_color_manual(
    name = "",
    values = c(
      "Drownings" = "blue",
      "Films" = "red"
    )
  ) +
  theme_minimal() +
  labs(
    title = "Number of people who drowned by falling into a pool vs.\nNumber of films Nicolas Cage appeared in",
    caption = "Source: Spurious Correlations (Vigen 2015)"
  )
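The dual-axis trick in the plot rests on one piece of arithmetic: the second series is multiplied by `ratio` before it is drawn, and the secondary axis divides by `ratio` again so its labels show the original units. Here is a minimal sketch of that round trip, using toy vectors (`left` and `right` are hypothetical names and values, not package data):

```r
# Toy series on very different scales (hypothetical values)
left  <- c(100, 120, 90, 110)  # plotted against the primary axis
right <- c(2, 4, 1, 3)         # plotted against the secondary axis

# Scale factor so the largest right-hand value maps onto the largest left-hand value
ratio <- max(left) / max(right)

scaled    <- right * ratio     # what geom_line() draws on the primary axis
recovered <- scaled / ratio    # what sec_axis(~ . / ratio) shows as labels

max(scaled) == max(left)       # both series peak at the same height
all.equal(recovered, right)    # the secondary axis recovers the original units
```

Because the scaling is chosen only to match the two peaks, how closely the lines appear to track each other partly reflects the axis choice and not just the data.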
Interested? You can install the package from GitHub:
pak::pkg_install("pachadotdev/spuriouscorrelations")