Comparing CENSO 2017 versions for Python and R

Using DuckDB with different languages

Author

Mauricio “Pachá” Vargas S.

Published

June 20, 2022

Motivation

Language wars are pretty counterproductive, in my opinion. Different languages have different strengths. Here I’ll show the equivalence of Python and R for a particular context, which is creating and querying a DuckDB (embedded SQL) database.

A simple exercise with the census

The general idea here is to explore the question labelled “p15” (Highest Educational Attainment Obtained) to obtain how many people with completed high school we have in each region in Chile (and also how many with college diploma, etc). To do this we use the library CENSO 2017 to perform three steps.

Here I won’t use pandas, dplyr or anything besides the CENSO 2017 library and DuckDB (i.e., I’ll run standard SQL queries on different languages).

Step 1

The download function will delete any previous version of the database that exists. It needs to be run only once. This will create a DuckDB (SQL) database.

In Python we would type:

import censo2017
# I'll skip this as I've already downloaded the DB
# censo2017.descargar()

In R we would type:

library(censo2017)
# I'll skip this as I've already downloaded the DB
# censo_descargar()

Step 2

Once the tables have been imported, it is possible to read selectively from SQL. For this you have to open a connection. Let’s look, for example, at the variable “p15” in the table “variables”.

We can also see the coding of the variable in the table “variables_coding”.

In Python we would type:

import duckdb
con = duckdb.connect(database = censo2017.archivo_sql())

# read the table variables in the connection and filter for the variable p15
con.execute("SELECT * FROM variables WHERE variable = 'p15'").df()

con.execute("SELECT * FROM variables_codificacion WHERE variable = 'p15'").df()

In R we would type:

library(duckdb)
con <- censo_conectar()

# read the table variables in the connection and filter for the variable p15
dbGetQuery(con, "SELECT * FROM variables WHERE variable = 'p15'")
dbGetQuery(con, "SELECT * FROM variables_codificacion WHERE variable = 'p15'")

      tabla variable valor
1  personas      p15     1
2  personas      p15     2
3  personas      p15     3
4  personas      p15     4
5  personas      p15     5
6  personas      p15     6
7  personas      p15     7
8  personas      p15     8
9  personas      p15     9
10 personas      p15    10
11 personas      p15    11
12 personas      p15    12
13 personas      p15    13
14 personas      p15    14
15 personas      p15    99
16 personas      p15    98

                                                  descripcion
1                                 Sala Cuna o Jardín Infantil
2                                                   Prekínder
3                                                      Kínder
4                                      Especial o Diferencial
5                                            Educación Básica
6                   Primaria o Preparatoria (Sistema Antiguo)
7                                        Científico-Humanista
8                                         Técnica Profesional
9                               Humanidades (Sistema Antiguo)
10 Técnica Comercial, Industrial/Normalista (Sistema Antiguo)
11                                Técnico Superior (1-3 Años)
12                                 Profesional (4 o Más Años)
13                                                   Magíster
14                                                  Doctorado
15                                              Valor Perdido
16                                                  No Aplica

Step 3

To get detailed information for each region in relation to question “p15”, we need to think of the REDATAM data as a tree, and we need to join “zones” with “dwellings” by zone ID, then join “dwellings” with “households” by household ID, and then “households” with “persons” by household ID. This is done quickly with the DuckDB backend.

In Python we would type:

con.execute("""
SELECT "region", "p15", COUNT(*) AS "n"
FROM (SELECT "region", "geocodigo", "zonaloc_ref_id", "vivienda_ref_id", "LHS"."hogar_ref_id" AS "hogar_ref_id", "p15"
FROM (SELECT "region", "geocodigo", "zonaloc_ref_id", "LHS"."vivienda_ref_id" AS "vivienda_ref_id", "hogar_ref_id"
FROM (SELECT "region", "geocodigo", "LHS"."zonaloc_ref_id" AS "zonaloc_ref_id", "vivienda_ref_id"
FROM (SELECT SUBSTR(CAST("geocodigo" AS TEXT), 1, 2) AS "region", "geocodigo", "zonaloc_ref_id"
FROM "zonas") "LHS"
INNER JOIN (SELECT "zonaloc_ref_id", "vivienda_ref_id"
FROM "viviendas") "RHS"
ON ("LHS"."zonaloc_ref_id" = "RHS"."zonaloc_ref_id")
) "LHS"
INNER JOIN (SELECT "vivienda_ref_id", "hogar_ref_id"
FROM "hogares") "RHS"
ON ("LHS"."vivienda_ref_id" = "RHS"."vivienda_ref_id")
) "LHS"
INNER JOIN (SELECT "hogar_ref_id", "p15"
FROM "personas") "RHS"
ON ("LHS"."hogar_ref_id" = "RHS"."hogar_ref_id")
) "q01"
GROUP BY "region", "p15"
""").df()

con.close()

In R we would type:

dbGetQuery(con,
'
SELECT "region", "p15", COUNT(*) AS "n"
FROM (SELECT "region", "geocodigo", "zonaloc_ref_id", "vivienda_ref_id", "LHS"."hogar_ref_id" AS "hogar_ref_id", "p15"
FROM (SELECT "region", "geocodigo", "zonaloc_ref_id", "LHS"."vivienda_ref_id" AS "vivienda_ref_id", "hogar_ref_id"
FROM (SELECT "region", "geocodigo", "LHS"."zonaloc_ref_id" AS "zonaloc_ref_id", "vivienda_ref_id"
FROM (SELECT SUBSTR(CAST("geocodigo" AS TEXT), 1, 2) AS "region", "geocodigo", "zonaloc_ref_id"
FROM "zonas") "LHS"
INNER JOIN (SELECT "zonaloc_ref_id", "vivienda_ref_id"
FROM "viviendas") "RHS"
ON ("LHS"."zonaloc_ref_id" = "RHS"."zonaloc_ref_id")
) "LHS"
INNER JOIN (SELECT "vivienda_ref_id", "hogar_ref_id"
FROM "hogares") "RHS"
ON ("LHS"."vivienda_ref_id" = "RHS"."vivienda_ref_id")
) "LHS"
INNER JOIN (SELECT "hogar_ref_id", "p15"
FROM "personas") "RHS"
ON ("LHS"."hogar_ref_id" = "RHS"."hogar_ref_id")
) "q01"
GROUP BY "region", "p15"
')

dbDisconnect(con, shutdown = T)

    region p15       n
1       02   8   91434
2       02  12   90268
3       02   7  143032
4       02  11   47194
5       02   5  124812
6       02  13    5559
7       02  10    4368
8       02   3   13215
9       02   9    8624
10      02   1   15229
...