R Para Ciencia de Datos: How to do data science with R in Spanish

Why translate this book?

‘R for Data Science’ is a hands-on book that many people use to learn the fundamentals of the R language. However, many Spanish speakers struggle to use it as a resource because of the English language barrier.

I’m very glad to have worked with my friend and colleague Riva Quiroga, who invited me to join her, Edgar Ruiz and others in addressing this accessibility gap by creating ‘R Para Ciencia de Datos’. As she highlighted in her talk at rstudio::conf 2020, initiatives such as this are not just software, but tools to make the community stronger.

In Chile, an elementary-level English course costs around 500 USD/month, while the minimum wage is 350 USD/month and the median is around 550 USD/month according to the latest reports by the National Bureau for Statistics (INE Chile). Rather than placing the burden of learning English on the learner, as community leaders and educators we can take action to reduce language barriers with social and technological solutions.

Additionally, we created the R package ‘datos’ to automatically translate datasets from English to Spanish using computational tools that already exist in the R ecosystem. Together, the book ‘R Para Ciencia de Datos’ and the ‘datos’ package allow Spanish speakers to spend their energy not on understanding English but on learning data science in R.

What lessons did I learn from this process?

I was very happy to work with people from different countries and to realize that we needed to agree on shared conventions in order to write in neutral Spanish (does it exist?) and translate every chapter of the book.

My role was to translate chapters 2, 3, 12, 13 and 28, redraw and translate the diagrams for the whole book, and act as the ‘gatekeeper’ for the repository.

The core lesson is: Respect diversity! Diversity, as I always insist, is more than a set of beautifully written tweets. The rest is about creating the right ‘sandbox’ to prevent people from deleting branches or pushing to the main translation branch without a peer-reviewed PR, and giving them more and more freedom as they learn how to use the different GitHub features (something that is not trivial to learn!).

Can I measure my contribution to the project?

Yes, here’s an analysis of the 112 PRs merged during the project. It does not include Edgar Ruiz’s contributions, because he worked on translating the datasets rather than translating the chapters directly.

For the analysis, I started by loading the packages required to read the GitHub history.

library(gh)
library(purrr)
library(glue)
library(readr)
library(dplyr)
library(lubridate)
library(stringr)
library(ggplot2)

Then I defined which repository to read and a function to get the information needed for a basic analysis.

user <- "cienciadedatos"
repo <- "r4ds"
limit <- 500

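# Fetch every pull request (open and closed) from the repository and
# keep the author, creation date and state of each one
get_prs <- function() {
  res <- gh(
    "GET /repos/{user}/{repo}/pulls",
    user = user, repo = repo, state = "all",
    # GitHub caps per_page at 100; gh follows the pagination until
    # .limit records have been retrieved
    per_page = 100, .limit = limit,
    .token = Sys.getenv("GITHUB_TOKEN")
  )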

  author <- res %>%
    map_chr(~ .x$user$login)

  created <- res %>%
    map_chr(~ .x$created_at)

  state <- res %>%
    map_chr(~ .x$state)

  return(
    tibble(
      author = author,
      created = created,
      state = state
    )
  )
}
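
One caveat is worth a short sketch: a PR in the `closed` state is not necessarily a merged one. The GitHub pulls endpoint also returns a `merged_at` field, which is null for PRs that were closed without merging. The snippet below is a hypothetical illustration on a hand-made list (the logins `alice` and `bob` and the dates are made up), not part of the original analysis, and it assumes the response is parsed into nested lists as in `get_prs()`.

```r
library(purrr)
library(tibble)

# Mock of two parsed API entries, with only the fields used here.
# These values are invented for illustration.
mock_res <- list(
  list(user = list(login = "alice"), state = "closed",
       merged_at = "2019-01-11T09:30:00Z"),
  list(user = list(login = "bob"), state = "closed",
       merged_at = NULL)
)

prs_mock <- tibble(
  author = map_chr(mock_res, ~ .x$user$login),
  state  = map_chr(mock_res, ~ .x$state),
  # A JSON null becomes NULL in R, so a non-NULL merged_at means "merged"
  merged = map_lgl(mock_res, ~ !is.null(.x$merged_at))
)

prs_mock
```

With a column like `merged`, one could filter on it instead of on `state == "closed"` if the distinction matters for the count.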

I captured the data once, and then read it many times until I polished my plots.

prs_rds <- "~/github/r4ds-es-analysis/prs.rds"

if (!file.exists(prs_rds)) {
  prs <- get_prs()
  saveRDS(prs, prs_rds)
} else {
  prs <- readRDS(prs_rds)
}

Finally, I made the plots using Pokemon colours. I found many ties for 3rd, 4th and 5th place in the contributions ranking.

cols <- c("#78c850", "#f08030", "#6890f0", "#a8b820", "#a8a878",
          "#a040a0", "#f8d030", "#e0c068", "#ee99ac", "#c03028",
          "#f85888", "#b8a038", "#705898", "#98d8d8", "#7038f8")

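# Keep closed PRs (used here as a proxy for merged ones) and build a
# "YYYY-MM" key to group them by month
prs2 <- prs %>%
  filter(state == "closed") %>%
  mutate(
    created = as_datetime(created),
    month = paste(year(created),
                  str_pad(month(created), 2, "left", "0"),
                  sep = "-")
  )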

# y is the running (cumulative) total of closed PRs up to each month
prs2 %>%
  count(month) %>%
  mutate(y = cumsum(n)) %>%
  ggplot() +
  geom_col(aes(x = month, y = y)) +
  labs(x = "Month", y = "No. of PRs",
       title = "Merged PRs per month") +
  theme_minimal(base_size = 10, base_family = "Source Sans Pro") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

# Rank authors by number of PRs; dense_rank() gives tied authors the same rank
prs2_top <- prs2 %>%
  count(author, sort = TRUE) %>%
  mutate(rank = dense_rank(desc(n))) %>%
  arrange(rank)
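
Because of the ties, filtering on `rank <= 5` can return more than five authors. The toy example below is a minimal sketch with made-up counts (the authors `a` to `g` are hypothetical) showing how `dense_rank()` assigns the same rank to tied values without leaving gaps.

```r
library(dplyr)

# Made-up PR counts with ties, for illustration only
toy <- data.frame(
  author = c("a", "b", "c", "d", "e", "f", "g"),
  n      = c(40, 25, 10, 10, 5, 5, 1)
)

toy_ranked <- toy %>%
  mutate(rank = dense_rank(desc(n)))

# Authors c and d share rank 3, e and f share rank 4,
# so the "top 5" ranks cover all seven authors
top5 <- toy_ranked %>%
  filter(rank <= 5)

toy_ranked
```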

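# semi_join() keeps the PRs of the selected authors without adding the
# totals column `n`, which would otherwise clash with the count below
prs2 %>%
  semi_join(
    prs2_top %>% filter(rank <= 5),
    by = "author"
  ) %>%
  count(month, author) %>%
  group_by(author) %>%
  mutate(y = cumsum(n)) %>%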
  ggplot() +
  geom_col(aes(x = month, y = y, fill = author)) +
  scale_fill_manual(values = cols) +
  labs(x = "Month", y = "No. of PRs",
       title = "Merged PRs for top 5 ranked contributors") +
  theme_minimal(base_size = 10, base_family = "Source Sans Pro") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

# Same approach for the remaining contributors (rank below the top 5)
prs2 %>%
  semi_join(
    prs2_top %>% filter(rank > 5),
    by = "author"
  ) %>%
  count(month, author) %>%
  group_by(author) %>%
  mutate(y = cumsum(n)) %>%
  ggplot() +
  geom_col(aes(x = month, y = y, fill = author)) +
  scale_fill_manual(values = cols) +
  labs(x = "Month", y = "No. of PRs",
       title = "Merged PRs for non-top 5 ranked contributors") +
  theme_minimal(base_size = 10, base_family = "Source Sans Pro")

I’m happy to hold 2nd place in the contributions ranking!

How can I use the Spanish translation to translate the book into another language?

The infrastructure for ‘R4DS in Spanish’ motivated the creation of R4DS in Portuguese. At the GitHub organization Ciencia de Datos (Data Science), the following resources are available:

  • R4DS in Spanish: Book translation. See the ‘traduccion’ branch; ‘main’ is kept intact to ease synchronization with the original version.
  • Datos: Dataset translations into Spanish.
  • Dados: Dataset translations into Portuguese.

The R4DS-ES repository also contains a redrawing of all the diagrams in SVG format, which can be edited with Inkscape regardless of the operating system that you use. The diagrams are provided in both English and Spanish. The book itself is a standard ‘bookdown’ project in R.

The ‘datos’ package uses YAML specifications to automatically translate datasets that are originally available in other R packages. The translated data can be used together with the R4DS book or independently as a source of practice data in Spanish. The YAML specification for each dataset provides the dataset name, the translation for each variable, and the descriptions used in the documentation. This process translates not only the dataset but also its help page, which is very useful for people who are learning. ‘datos’ translates the datasets on the fly, thanks to delayedAssign() from base R. The datasets themselves are therefore not stored in the package: it only contains the YAML files with the translation specifications and the functions that translate the datasets loaded from other packages.
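
To illustrate the on-the-fly idea, here is a minimal sketch of lazy translation with delayedAssign(). It is not the actual ‘datos’ implementation: the helper `translate_dataset()`, the object name `autos_demo` and the column translations are hypothetical, and the built-in `mtcars` dataset stands in for data from another package.

```r
# Counter used only to show when the translation actually runs
calls <- 0

# Hypothetical translator: renames the columns of a data frame
translate_dataset <- function(df, new_names) {
  calls <<- calls + 1
  names(df) <- new_names
  df
}

# Register a promise: the expression is stored but not evaluated yet
delayedAssign(
  "autos_demo",
  translate_dataset(head(mtcars[, c("mpg", "cyl")]),
                    c("millas", "cilindros"))
)

calls_before <- calls              # nothing has been translated so far

# The first access forces the promise and runs the translation once
translated_names <- names(autos_demo)
calls_after <- calls

translated_names
```

The design choice this sketches is that the cost of translating is only paid for the datasets a user actually touches, and the original data never needs to be copied into the package.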

As an example, let’s inspect the first rows of the airlines table from ‘nycflights13’. This dataset has two columns, carrier and name, which provide a two-letter abbreviation and the full name of each airline.

head(nycflights13::airlines)

## # A tibble: 6 x 2
##   carrier name                    
##   <chr>   <chr>                   
## 1 9E      Endeavor Air Inc.       
## 2 AA      American Airlines Inc.  
## 3 AS      Alaska Airlines Inc.    
## 4 B6      JetBlue Airways         
## 5 DL      Delta Air Lines Inc.    
## 6 EV      ExpressJet Airlines Inc.

This is the specification for the airlines table from ‘nycflights13’. Here, we provide both a translation (trans:) and a description (desc:) in Spanish, as well as additional helpful information.

df:
  source: nycflights13::airlines
  name: aerolineas
variables:
  carrier:
    trans: aerolinea
    desc: "abreviaci\u00f3n de dos caracteres del nombre de la
     aerol\u00EDnea"
  name:
    trans: nombre
    desc: "nombre completo de la aerol\u00EDnea"
help:
  name: aerolineas
  alias: aerolineas
  title: "Nombres de aerol\u00EDneas"
  description: "Nombres de aerol\u00EDneas y su respectivo c\u00f3digo
   carrier de dos d\u00EDgitos."
  usage: aerolineas
  format: Un data.frame con 16 filas y 2 columnas
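
To make the mapping concrete, the following is a rough sketch of how a specification like this one could be applied to rename columns. It is an assumption about the general approach, not the code used inside ‘datos’: the function `apply_spec()` is hypothetical, the specification is written by hand as the nested list that a YAML parser (for example yaml::read_yaml()) would typically return, and a three-row copy of the table stands in for nycflights13::airlines so that the example has no dependencies.

```r
# Hand-written equivalent of the parsed YAML (only the parts used here)
spec <- list(
  df = list(source = "nycflights13::airlines", name = "aerolineas"),
  variables = list(
    carrier = list(trans = "aerolinea"),
    name    = list(trans = "nombre")
  )
)

# Small stand-in for the first rows of nycflights13::airlines
airlines_head <- data.frame(
  carrier = c("9E", "AA", "AS"),
  name    = c("Endeavor Air Inc.", "American Airlines Inc.",
              "Alaska Airlines Inc."),
  stringsAsFactors = FALSE
)

# Hypothetical helper: rename each column listed under `variables`
apply_spec <- function(df, spec) {
  new_names <- vapply(spec$variables, function(v) v$trans, character(1))
  idx <- match(names(new_names), names(df))
  found <- !is.na(idx)             # ignore variables missing from df
  names(df)[idx[found]] <- unname(new_names[found])
  df
}

aerolineas_demo <- apply_spec(airlines_head, spec)
names(aerolineas_demo)
```

The values in the table are left untouched; only the column names change, which matches the trans: entries in the specification.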