import censo2017
# I'll skip this as I've already downloaded the DB
# censo2017.descargar()
Comparing CENSO 2017 versions for Python and R
Motivation
Language wars are pretty counterproductive, in my opinion. Different languages have different strengths. Here I’ll show the equivalence of Python and R for a particular context, which is creating and querying a DuckDB (embedded SQL) database.
A simple exercise with the census
The general idea here is to explore the question labelled “p15” (Highest Educational Attainment Obtained) to obtain how many people with completed high school we have in each region in Chile (and also how many with college diploma, etc). To do this we use the library CENSO 2017 to perform three steps.
Here I won’t use pandas, dplyr or anything besides the CENSO 2017 library and DuckDB (i.e., I’ll run standard SQL queries on different languages).
Step 1
The download function will delete any previous version of the database that exists. It needs to be run only once. This will create a DuckDB (SQL) database.
In Python we would type:
In R we would type:
library(censo2017)
# I'll skip this as I've already downloaded the DB
# censo_descargar()
Step 2
Once the tables have been imported, it is possible to read selectively from SQL. For this you have to open a connection. Let’s look, for example, at the variable “p15” in the table “variables”.
We can also see the coding of the variable in the table “variables_coding”.
In Python we would type:
import duckdb
= duckdb.connect(database = censo2017.archivo_sql())
con
# read the table variables in the connection and filter for the variable p15
"SELECT * FROM variables WHERE variable = 'p15'").df()
con.execute(
"SELECT * FROM variables_codificacion WHERE variable = 'p15'").df() con.execute(
In R we would type:
library(duckdb)
<- censo_conectar()
con
# read the table variables in the connection and filter for the variable p15
dbGetQuery(con, "SELECT * FROM variables WHERE variable = 'p15'")
dbGetQuery(con, "SELECT * FROM variables_codificacion WHERE variable = 'p15'")
tabla variable valor1 personas p15 1
2 personas p15 2
3 personas p15 3
4 personas p15 4
5 personas p15 5
6 personas p15 6
7 personas p15 7
8 personas p15 8
9 personas p15 9
10 personas p15 10
11 personas p15 11
12 personas p15 12
13 personas p15 13
14 personas p15 14
15 personas p15 99
16 personas p15 98
descripcion1 Sala Cuna o Jardín Infantil
2 Prekínder
3 Kínder
4 Especial o Diferencial
5 Educación Básica
6 Primaria o Preparatoria (Sistema Antiguo)
7 Científico-Humanista
8 Técnica Profesional
9 Humanidades (Sistema Antiguo)
10 Técnica Comercial, Industrial/Normalista (Sistema Antiguo)
11 Técnico Superior (1-3 Años)
12 Profesional (4 o Más Años)
13 Magíster
14 Doctorado
15 Valor Perdido
16 No Aplica
Step 3
To get detailed information for each region in relation to question “p15”, we need to think of the REDATAM data as a tree, and we need to join “zones” with “dwellings” by zone ID, then join “dwellings” with “households” by household ID, and then “households” with “persons” by household ID. This is done quickly with the DuckDB backend.
In Python we would type:
"""
con.execute(SELECT "region", "p15", COUNT(*) AS "n"
FROM (SELECT "region", "geocodigo", "zonaloc_ref_id", "vivienda_ref_id", "LHS"."hogar_ref_id" AS "hogar_ref_id", "p15"
FROM (SELECT "region", "geocodigo", "zonaloc_ref_id", "LHS"."vivienda_ref_id" AS "vivienda_ref_id", "hogar_ref_id"
FROM (SELECT "region", "geocodigo", "LHS"."zonaloc_ref_id" AS "zonaloc_ref_id", "vivienda_ref_id"
FROM (SELECT SUBSTR(CAST("geocodigo" AS TEXT), 1, 2) AS "region", "geocodigo", "zonaloc_ref_id"
FROM "zonas") "LHS"
INNER JOIN (SELECT "zonaloc_ref_id", "vivienda_ref_id"
FROM "viviendas") "RHS"
ON ("LHS"."zonaloc_ref_id" = "RHS"."zonaloc_ref_id")
) "LHS"
INNER JOIN (SELECT "vivienda_ref_id", "hogar_ref_id"
FROM "hogares") "RHS"
ON ("LHS"."vivienda_ref_id" = "RHS"."vivienda_ref_id")
) "LHS"
INNER JOIN (SELECT "hogar_ref_id", "p15"
FROM "personas") "RHS"
ON ("LHS"."hogar_ref_id" = "RHS"."hogar_ref_id")
) "q01"
GROUP BY "region", "p15"
""").df()
con.close()
In R we would type:
dbGetQuery(con,
'
SELECT "region", "p15", COUNT(*) AS "n"
FROM (SELECT "region", "geocodigo", "zonaloc_ref_id", "vivienda_ref_id", "LHS"."hogar_ref_id" AS "hogar_ref_id", "p15"
FROM (SELECT "region", "geocodigo", "zonaloc_ref_id", "LHS"."vivienda_ref_id" AS "vivienda_ref_id", "hogar_ref_id"
FROM (SELECT "region", "geocodigo", "LHS"."zonaloc_ref_id" AS "zonaloc_ref_id", "vivienda_ref_id"
FROM (SELECT SUBSTR(CAST("geocodigo" AS TEXT), 1, 2) AS "region", "geocodigo", "zonaloc_ref_id"
FROM "zonas") "LHS"
INNER JOIN (SELECT "zonaloc_ref_id", "vivienda_ref_id"
FROM "viviendas") "RHS"
ON ("LHS"."zonaloc_ref_id" = "RHS"."zonaloc_ref_id")
) "LHS"
INNER JOIN (SELECT "vivienda_ref_id", "hogar_ref_id"
FROM "hogares") "RHS"
ON ("LHS"."vivienda_ref_id" = "RHS"."vivienda_ref_id")
) "LHS"
INNER JOIN (SELECT "hogar_ref_id", "p15"
FROM "personas") "RHS"
ON ("LHS"."hogar_ref_id" = "RHS"."hogar_ref_id")
) "q01"
GROUP BY "region", "p15"
')
dbDisconnect(con, shutdown = T)
region p15 n1 02 8 91434
2 02 12 90268
3 02 7 143032
4 02 11 47194
5 02 5 124812
6 02 13 5559
7 02 10 4368
8 02 3 13215
9 02 9 8624
10 02 1 15229
...