Working With SPSS Data in R
Sat, Jun 24, 2017. Updated 2018-03-26.
Introduction
I needed to import SPSS© data for work. There are several options, and I’ve used both the foreign and haven R packages. I prefer haven because it integrates better with R’s tidyverse. I started using it instead of foreign once I verified that it behaves well with factors and avoids the deprecated factor labels problem in newer R versions.
The Data
For this post I use the Diego Portales University National Survey. It is a publicly available survey, conducted nationwide since 2005, that asks people about their trust in institutions (e.g. government, police, firefighters) and about their opinions on same-sex marriage, restrictions on smoking areas, and more.
Importing Data
#devtools::install_github("ropenscilabs/skimr")
# Exploratory Data Analysis tools
library(ggplot2)
library(dplyr)
library(sjlabelled)
library(skimr)
# Read different formats
library(readr) # csv/tsv/txt
library(haven) # sav
# Data
url <- "http://encuesta.udp.cl/descargas/banco%20de%20datos/2015/Encuesta%20Nacional%20UDP%202015.sav"
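sav <- "../../data/2017-06-24-working-with-spss-data-in-r/udp_national_survey_2015.sav"
# Create the folder that will hold the file (and its parents) if it is missing
dir.create(dirname(sav), recursive = TRUE, showWarnings = FALSE)
# mode = "wb" keeps the binary .sav file intact on Windows
if (!file.exists(sav)) {download.file(url, sav, mode = "wb")}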
survey <- read_sav(sav)
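To see what read_sav() hands back, here is a minimal sketch that does not need the survey file. It assumes a hypothetical two-column toy table (the names sexo and edad and their values are made up for illustration), writes it to a temporary .sav file, and reads it back. The point is only that labelled SPSS columns arrive as haven_labelled vectors carrying their value labels and variable label as attributes.

```r
library(haven)

# Hypothetical toy data standing in for the survey (not the real file)
toy <- tibble::tibble(
  sexo = labelled(
    c(1, 2, 2),
    labels = c(Hombre = 1, Mujer = 2),
    label = "Sexo Entrevistado"
  ),
  edad = c(25, 40, 63)
)

# Round-trip through a temporary .sav file
tmp <- tempfile(fileext = ".sav")
write_sav(toy, tmp)
toy_back <- read_sav(tmp)

class(toy_back$sexo)                        # includes "haven_labelled"
attr(toy_back$sexo, "labels")               # value labels (named codes)
attr(toy_back$sexo, "label", exact = TRUE)  # variable label
```

The same attr() calls should work on any column of the real survey object.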
Exploring data
To explore the data, keep in mind that the survey is in Spanish: “fecha” means date, “edad” means age, and “sexo” means sex.
# How many surveys do I have by day?
daily <- survey %>%
  mutate(Fecha = as.Date(Fecha, "%d-%m-%Y")) %>%
  rename(date = Fecha) %>%
  group_by(date) %>%
  summarise(n = n())
ggplot(daily, aes(date, n)) +
  geom_line()
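One thing worth checking in the date step: as.Date() returns NA, without an error, for any string that does not match the format. The sketch below uses hypothetical date strings (not taken from the survey) with the same "%d-%m-%Y" format string to show how unparsed values can be counted before plotting.

```r
# Hypothetical date strings; the last one deliberately breaks the pattern
fechas <- c("24-06-2015", "24-06-2015", "25-06-2015", "2015/06/26")

parsed <- as.Date(fechas, format = "%d-%m-%Y")

# Strings that did not match the format come back as NA
sum(is.na(parsed))

# Surveys per day, ignoring the unparsed value
table(parsed)
```

If the NA count on the real Fecha column is not zero, the format string is probably not the right one for every row.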
# How is the age distributed?
summary(survey$Edad_Entrevistado)
Min. 1st Qu. Median Mean 3rd Qu. Max.
18.00 32.00 48.00 47.92 61.00 89.00
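age <- survey %>%
  # Assign the result back to the column; an unnamed mutate() would add a new column instead
  mutate(Edad_Entrevistado = as.integer(Edad_Entrevistado)) %>%
  rename(age = Edad_Entrevistado) %>%
  group_by(age) %>%
  summarise(n = n())
ggplot(age, aes(age, n)) +
  geom_line()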
# How is the sex distributed?
survey %>%
  rename(sex_id = Sexo_Entrevistado) %>%
  group_by(sex_id) %>%
  summarise(n = n())
# A tibble: 2 x 2
sex_id n
<dbl+lbl> <int>
1 1 [Hombre] 651
2 2 [Mujer] 651
Exploring labels
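The tibble of counts by sex stores sex_id as the numeric codes 1 and 2; what each code means comes from the value labels attached to the column. as_factor() turns those labels into a regular factor:
survey %>%
  select(Sexo_Entrevistado) %>%
  rename(sex_id = Sexo_Entrevistado) %>%
  distinct() %>%
  mutate(sex = as_factor(sex_id))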
# A tibble: 2 x 2
sex_id sex
<dbl+lbl> <fct>
1 2 [Mujer] Mujer
2 1 [Hombre] Hombre
The last column (in Spanish) shows that in this survey “1 = Male” and “2 = Female”.
To report the counts in English, I could recode the values:
survey %>%
  rename(sex = Sexo_Entrevistado) %>%
  mutate(sex = as.integer(sex)) %>%
  mutate(sex = recode(sex, `1` = "Male", `2` = "Female")) %>%
  group_by(sex) %>%
  summarise(n = n())
# A tibble: 2 x 2
sex n
<chr> <int>
1 Female 651
2 Male 651
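Recoding by numeric code works, but it hard-codes the mapping between 1/2 and the labels. A possible alternative, sketched here on a hypothetical labelled vector that mimics Sexo_Entrevistado, is to convert the codes to their labels first and then translate the labels themselves, so the translation does not depend on which number was assigned to which category.

```r
library(haven)
library(dplyr)

# Hypothetical labelled vector mimicking Sexo_Entrevistado
sexo <- labelled(c(1, 2, 2, 1), labels = c(Hombre = 1, Mujer = 2))

# Convert codes to their labels, then translate the labels
sex <- recode(
  as.character(as_factor(sexo)),
  Hombre = "Male",
  Mujer = "Female"
)
sex
```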
The columns carry descriptive labels as well. Here sjlabelled helps if I want to know, for example, what “P12” means. But instead of just translating labels, I’ll describe the complete dataset.
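For a quick look at a single column, the variable label can also be read directly, since haven stores it in the "label" attribute. This is a minimal sketch on a hypothetical toy table; only the P12 label text comes from the survey, the rest is made up for illustration.

```r
library(haven)

# Hypothetical toy table; the values are invented
toy <- tibble::tibble(
  P12 = labelled(c(1, 2, 1), label = "Apoyo a la democracia"),
  edad = c(25, 40, 63)
)

# Label of a single column
attr(toy$P12, "label", exact = TRUE)

# Labels of every column; unlabelled columns get NA
var_labels <- vapply(
  toy,
  function(x) {
    lbl <- attr(x, "label", exact = TRUE)
    if (is.null(lbl)) NA_character_ else lbl
  },
  character(1)
)
var_labels
```

exact = TRUE is used because attr() otherwise matches partially, and "label" is a prefix of the "labels" attribute that holds the value labels.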
Describing the dataset
survey %>%
  skim() %>%
  filter(stat == "complete") %>%
  mutate(description = get_label(survey)) %>%
  rename(pcent_valid = value) %>%
  mutate(pcent_valid = paste0(100*round(pcent_valid / nrow(survey),2),'%'))
# A tibble: 203 x 7
variable type stat level pcent_valid formatted description
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 PONDERADOR numeric compl… .all 100% 1302 Ponderador
2 Folio numeric compl… .all 100% 1302 Folio
3 Región charac… compl… .all 100% 1302 Región
4 Comuna charac… compl… .all 100% 1302 Comuna
5 Fecha charac… compl… .all 100% 1302 Fecha entrevis…
6 Sexo_Encuest… charac… compl… .all 91% 1186 Sexo Entrevist…
7 GSE charac… compl… .all 100% 1302 GSE Visual
8 Sexo_Entrevi… charac… compl… .all 100% 1302 Sexo Entrevist…
9 Edad_Entrevi… numeric compl… .all 100% 1302 Edad Entrevist…
10 Hora_Inicio charac… compl… .all 100% 1302 Hora Inicio Me…
# … with 193 more rows
The last tibble reveals some interesting questions. For example, P12 refers to “Apoyo a la democracia”, that is, “Do you support democracy?”.
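The skim() pipeline above relies on the long output with stat and value columns shown in the tibble. If the installed skimr version returns a different layout, a similar description table can be sketched without skimr. The example below is an assumption-laden stand-in, not the method used above: it runs on a hypothetical toy table with one missing answer and computes the share of non-missing values and the variable label per column.

```r
library(haven)

# Hypothetical toy data with one missing answer (not the real survey)
toy <- tibble::tibble(
  P12 = labelled(c(1, NA, 2, 1), label = "Apoyo a la democracia"),
  edad = c(25, 40, 63, 31)
)

# Helper (hypothetical name): variable label, or NA when there is none
label_or_na <- function(x) {
  lbl <- attr(x, "label", exact = TRUE)
  if (is.null(lbl)) NA_character_ else lbl
}

description <- data.frame(
  variable = names(toy),
  # Share of non-missing values per column, formatted as a percentage
  pcent_valid = paste0(
    round(100 * vapply(toy, function(x) mean(!is.na(x)), numeric(1))),
    "%"
  ),
  description = vapply(toy, label_or_na, character(1)),
  row.names = NULL,
  stringsAsFactors = FALSE
)
description
```

Replacing toy with survey should give one row per survey column.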