spuriouscorrelations: An R package with examples of spurious correlations

R
Statistics
Correlation is not causation.
Author

Mauricio “Pachá” Vargas S.

Published

May 17, 2025

I’ve been busy with the field exams, so I haven’t had much time to work on the blog.

The spuriouscorrelations package started as a fun project for one of my tutorials.

Here is an interesting example: the correlation between the number of people who drowned by falling into a pool and the number of films Nicolas Cage appeared in.

library(spuriouscorrelations)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(ggplot2)

unique(spurious_correlations$var1)
 [1] Suicides by hanging, strangulation and suffocation              
 [2] Number of people who drowned by falling into a pool             
 [3] Number of people who died by becoming tangled in their bedsheets
 [4] Murders by steam, hot vapours and hot objects                   
 [5] Computer science doctorates awarded in the US                   
 [6] Sociology doctorates awarded in the US                          
 [7] Civil engineering doctorates awarded in the US                  
 [8] People who drowned after falling out of a fishing boat          
 [9] Drivers killed in collision with railway train                  
[10] Total US crude oil imports                                      
[11] Number of people who drowned while in a swimming-pool           
[12] Suicides by crashing of motor vehicle                           
[13] Number of people killed by venomous spiders                     
[14] Mathematics doctorates awarded                                  
14 Levels: Civil engineering doctorates awarded in the US ...
drownings <- spurious_correlations %>%
  filter(
    var1 == "Number of people who drowned by falling into a pool"
  ) %>%
  select(year, var1, var2, var1_value, var2_value)

cor(drownings$var1_value, drownings$var2_value)
[1] 0.6660043
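
A correlation of about 0.67 between two unrelated series may be less surprising than it looks. As a rough sketch, using simulated data only (nothing below comes from the package), the following draws pairs of independent random walks and checks how often their correlation is at least that strong. The series length of 11, the number of simulations, and the helper name random_walk_cor are assumptions made for illustration.

```r
# Simulated data only: nothing here comes from spuriouscorrelations.
set.seed(1)

n_years <- 11    # assumed length of a short yearly series
n_sims  <- 2000  # assumed number of simulated pairs

# Hypothetical helper: correlation between two independent random walks
random_walk_cor <- function(n) {
  x <- cumsum(rnorm(n))
  y <- cumsum(rnorm(n))
  cor(x, y)
}

r <- replicate(n_sims, random_walk_cor(n_years))

# Share of unrelated pairs with |r| at least as large as the value above
mean(abs(r) >= 0.666)
```

If the printed share is well above zero, short trending series can reach this level of correlation by chance alone.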

Now let’s plot both series on one chart, with the film counts rescaled onto a secondary axis.

# compute a scale factor so that max(var2_value * factor) ≈ max(var1_value)
max1 <- max(drownings$var1_value)
max2 <- max(drownings$var2_value)
ratio <- max1 / max2

ggplot(drownings, aes(x = year)) +
  geom_line(aes(y = var1_value, color = "Drownings")) +
  geom_line(aes(y = var2_value * ratio, color = "Films")) +
  scale_y_continuous(
    name = "Number of drownings",
    sec.axis = sec_axis(~ . / ratio,
      name = "Number of films"
    ),
    limits = c(0, NA)
  ) +
  scale_color_manual(
    name = "",
    values = c(
      "Drownings" = "blue",
      "Films" = "red"
    )
  ) +
  theme_minimal() +
  labs(
    title = "Number of people who drowned by falling into a pool vs.\nNumber of films Nicolas Cage appeared in",
    caption = "Source: Spurious Correlations (Vigen 2015)"
  )
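
The rescaling used for the secondary axis can be checked with plain numbers. This is a minimal sketch with hypothetical values (the vectors primary and secondary are made up and are not taken from the package): the second series is multiplied by ratio to be drawn on the left-axis scale, and the secondary axis divides by ratio to print the original units.

```r
# Hypothetical values, chosen only to illustrate the rescaling.
primary   <- c(100, 120, 95, 110)  # series shown on the left axis
secondary <- c(2, 4, 1, 3)         # series shown on the right axis

ratio <- max(primary) / max(secondary)

scaled <- secondary * ratio  # what gets drawn on the left-axis scale
labels <- scaled / ratio     # what the secondary axis maps back to

# The rescaled series peaks at the same height as the primary series
all.equal(max(scaled), max(primary))

# Dividing by the ratio recovers the original values
all.equal(labels, secondary)
```

Because the two transformations are inverses, the lines share the plotting area while each axis still reports its own units.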

Interested? You can install the package from GitHub:

pak::pkg_install("pachadotdev/spuriouscorrelations")