Step-by-Step Guide to Use R and Selenium to Scrape Empleos Publicos (Part 3)

Using R, Selenium and purrr to organize hundreds of HTML sections into one table.
Author

Mauricio “Pachá” Vargas S.

Published

August 21, 2025

Because of delays with my scholarship payment, if this post is useful to you I kindly ask for a small donation on Buy Me a Coffee. It will be used to continue my open source efforts. The full explanation is here: A Personal Message from an Open Source Contributor.

Continuing from the previous Selenium post, I will now process each job offer and organize its contents.

This requires the readxl package to read XLSX files:

if (!require(readxl)) install.packages("readxl")

To read the XLSX file from part 2 and open a browser session to visit each offer, I start with:

library(RSelenium)
library(rvest)
library(dplyr)
library(purrr)
library(writexl)
library(readxl)

offers_tbl <- read_xlsx("offers_20250821.xlsx")

rmDr <- remoteDriver(port = 4444L, browserName = "chrome")

rmDr$open(silent = TRUE)

From this table, I can read the HTML of each job offer and see how it is structured. Starting with the first URL:

rmDr$navigate(offers_tbl$link[1])
html <- read_html(rmDr$getPageSource()[[1]])

This specific job offer has the following contents:

> html
{html_document}
<html lang="es">
[1] <head id="Head1">\n<meta http-equiv="Content-Type" content="text/html; ch ...
[2] <body>\n        <form name="form1" method="post" action="convpostularavis ...

Inspecting the offer as in part 1 shows that the full description is contained in a single HTML division with sub-divisions:

<div class="item formatodeclaraciones">
  <div class="row top">
    <h2><span id="lblAvisoTrabajo">Medico (a) especialista en Anestesiología</span></h2>
  </div>
  <hr>
  <div class="bottom">
    <div class="row">
      <div class="col-md-6">
        <span id="lblAvisoTrabajoDatos"><div><h3> Institución</h3><p>Ministerio de Salud / Servicio de Salud Maule / Hospital de Constitución</p><h3>Convocatoria</h3><p>Medico (a) especialista en Anestesiología 44 horas</p><h3>Nº de Vacantes </h3><p>1</p><h3>Área de Trabajo</h3><p>Salud</p><h3>Región</h3><p>Región del Maule</p><h3>Ciudad</h3><p>Constitución</p><h3>Tipo de Vacante</h3><p>Contrata</p></div></span>
...

To organize this in a table, I can do the following, among other possibilities, to get the title, institution, number of vacancies, city, compensation, and educational requirements:

title <- html %>% html_element("h2 span#lblAvisoTrabajo") %>% html_text(trim = TRUE)
institution <- html %>% html_element("span#lblAvisoTrabajoDatos h3:contains('Institución') + p") %>% html_text(trim = TRUE)
positions <- html %>% html_element("span#lblAvisoTrabajoDatos h3:contains('Nº de Vacantes') + p") %>% html_text(trim = TRUE)
city <- html %>% html_element("span#lblAvisoTrabajoDatos h3:contains('Ciudad') + p") %>% html_text(trim = TRUE)
compensation <- html %>% html_element("span#lblRenta li ul li") %>% html_text(trim = TRUE)
education <- html %>% html_element("span#lblFormacion p") %>% html_text(trim = TRUE)

d <- tibble(
  title = title,
  institution = institution,
  positions = positions,
  city = city,
  compensation = compensation,
  education = education
)

The result is the following table:

> d
# A tibble: 1 × 6
  title                       institution positions city  compensation education
  <chr>                       <chr>       <chr>     <chr> <chr>        <chr>    
1 Medico (a) especialista en… Ministerio… 1         Cons… Renta Bruta… Título p…
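
The three selectors that target `lblAvisoTrabajoDatos` differ only in the heading text, so they could be wrapped in a small helper. The following is a minimal sketch and not part of the original workflow: `get_field()` is a hypothetical name, and the page is a shortened copy of the HTML shown earlier, built with `minimal_html()` so that it runs without a browser session. It relies on the same non-standard `:contains()` pseudo-class that the selectors above use.

```r
library(rvest)

# Shortened copy of the offer details block
# (assumption: the real pages follow this same h3 + p structure)
page <- minimal_html('
  <span id="lblAvisoTrabajoDatos"><div>
    <h3> Institución</h3><p>Ministerio de Salud</p>
    <h3>Ciudad</h3><p>Constitución</p>
  </div></span>
')

# Hypothetical helper: return the text of the <p> that follows the <h3>
# whose text contains `heading`, or NA when no such heading exists
get_field <- function(html, heading) {
  selector <- sprintf("span#lblAvisoTrabajoDatos h3:contains('%s') + p", heading)
  html_text(html_element(html, selector), trim = TRUE)
}

city <- get_field(page, "Ciudad")
missing_field <- get_field(page, "Región") # heading not present in this toy page
```

Because `html_element()` returns a missing node when nothing matches, the helper yields `NA` instead of an error, which is convenient when some pages lack a field. Headings that contain a single quote would break the selector string and would need escaping.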

I see that the compensation value needs tidying:

> d$compensation
[1] "Renta Bruta6.398.194"

To tidy it, I remove the leading text and the thousands separators, then convert the result to a number:

d <- d %>%
  mutate(compensation = as.numeric(gsub("Renta Bruta|\\.", "", compensation)))

which gives a numeric value suitable for later analysis:

> d$compensation
[1] 6398194
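
The same cleaning step can be written as a small vectorized function, which makes it easy to check against a few inputs before running it on every offer. This is only a sketch: `parse_compensation()` is a hypothetical name, the second input value is invented for illustration, and it assumes every value uses the "Renta Bruta" prefix and dots as thousands separators.

```r
# Hypothetical helper: strip the "Renta Bruta" prefix and the thousands
# separators, then convert to numeric (NA inputs stay NA)
parse_compensation <- function(x) {
  x <- sub("^Renta Bruta\\s*", "", x) # drop the leading label
  x <- gsub(".", "", x, fixed = TRUE) # drop the dots used as separators
  as.numeric(x)
}

# First value comes from the offer above; the others are illustrative
values <- parse_compensation(c("Renta Bruta6.398.194", "Renta Bruta 851.161", NA))
```

Anchoring the prefix with `^` avoids removing the words if they appear elsewhere in the text, and `fixed = TRUE` treats the dot literally instead of as a regular expression wildcard.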

To do the same for each of the 545 saved job offers, I repeat these steps with purrr:

descriptions_tbl <- map_df(
  seq_len(nrow(offers_tbl)),
  function(x) {
    print(x) # just to see at which iteration it fails (if it fails)

    rmDr$navigate(offers_tbl$link[x])
    html <- read_html(rmDr$getPageSource()[[1]])

    title <- html %>% html_element("h2 span#lblAvisoTrabajo") %>% html_text(trim = TRUE)
    institution <- html %>% html_element("span#lblAvisoTrabajoDatos h3:contains('Institución') + p") %>% html_text(trim = TRUE)
    positions <- html %>% html_element("span#lblAvisoTrabajoDatos h3:contains('Nº de Vacantes') + p") %>% html_text(trim = TRUE)
    city <- html %>% html_element("span#lblAvisoTrabajoDatos h3:contains('Ciudad') + p") %>% html_text(trim = TRUE)
    compensation <- html %>% html_element("span#lblRenta li ul li") %>% html_text(trim = TRUE)
    education <- html %>% html_element("span#lblFormacion p") %>% html_text(trim = TRUE)

    d <- tibble(
      title = title,
      institution = institution,
      positions = positions,
      city = city,
      compensation = compensation,
      education = education
    )

    d <- d %>%
      mutate(compensation = as.numeric(gsub("Renta Bruta|\\.", "", compensation)))

    d
  }
)
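
A loop like this stops at the first error, for example if one page fails to load. One possible safeguard, shown here as a sketch under stated assumptions and not as the approach used in this post, is to wrap the per-offer function with `purrr::possibly()` and pause briefly between requests. `scrape_offer()` is a hypothetical stand-in that fails on purpose for one input, so the example runs without Selenium.

```r
library(purrr)
library(dplyr)

# Hypothetical stand-in for the per-offer scraping function;
# it fails on purpose for the second offer
scrape_offer <- function(x) {
  if (x == 2) stop("page under maintenance")
  tibble(id = x, title = paste("Offer", x))
}

# Return a row of NAs instead of stopping when an offer fails
safe_scrape <- possibly(
  scrape_offer,
  otherwise = tibble(id = NA_integer_, title = NA_character_)
)

results <- bind_rows(map(1:3, function(x) {
  Sys.sleep(0.1) # short pause between requests (the value is arbitrary)
  safe_scrape(x)
}))
```

With this pattern a failed offer produces a blank row, the same shape as the rows produced by pages that lack the expected elements, so the rest of the table is unaffected.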

The result is the following:

> descriptions_tbl
# A tibble: 545 × 6
   title                      institution positions city  compensation education
   <chr>                      <chr>       <chr>     <chr>        <dbl> <chr>    
 1 Medico (a) especialista e… Ministerio… 1         Cons…      6398194 "Título …
 2 NA                         NA          NA        NA              NA  NA      
 3 ENFERMERA-O, JORNADA DIUR… Ministerio… 1         Reco…      1906087 "Título …
 4 Psiquiatra infanto-juveni… Ministerio… 1         La P…      2333658 ""       
 5 Neurólogo(a) adulto GES A… Ministerio… 1         Puen…       637926 ""       
 6 Médico(a) especialista en… Ministerio… 2         Cast…      5328446 "Título …
 7 Arquitecto de Software     Ministerio… 1         Ñuñoa      2256523 "Profesi…
 8 DIRECCIÓN DEL SERVICIO DE… Ministerio… 1         Chil…      1540891 ""       
 9 01 CARGO DE TENS OPERADOR… Ministerio… 1         Peña…       851161 ""       
10 TENS DE CUIDADOS PALIATIV… Ministerio… 2         San …       636462 "Titulo …
# ℹ 535 more rows

Some rows are blank because their links are under maintenance or lead to external municipal sites with a different structure.

Here is a count of the missing values in each field:

descriptions_tbl %>%
  summarise(
    across(
      everything(),
      list(
        na_count = ~sum(is.na(.))
      ),
      .names = "{.col}_{.fn}"
    )
  )

which shows the same count for every displayed field, suggesting that the missing values belong to the same observations:

# A tibble: 1 × 6
  title_na_count institution_na_count positions_na_count city_na_count
           <int>                <int>              <int>         <int>
1            187                  187                187           187
# ℹ 2 more variables: compensation_na_count <int>, education_na_count <int>
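
Note that `is.na()` does not count empty strings, and the `education` column shown earlier contains some `""` values. A minimal sketch of one way to include them, using a toy table in place of `descriptions_tbl`, is to convert empty strings to `NA` with `dplyr::na_if()` before counting:

```r
library(dplyr)

# Toy table standing in for descriptions_tbl (values are illustrative)
toy <- tibble(
  title = c("A", NA, "C"),
  education = c("Título", NA, "")
)

na_counts <- toy %>%
  mutate(across(where(is.character), ~ na_if(.x, ""))) %>% # "" becomes NA
  summarise(across(everything(), ~ sum(is.na(.x))))
```

In this toy table, `title` has one missing value and `education` has two once the empty string is treated as missing.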

I got 545 - 187 = 358 well-organized observations with a scraping process that took around five minutes. Not bad!

I save the result as an XLSX backup to avoid scraping twice:

write_xlsx(descriptions_tbl, "descriptions_20250821.xlsx")

I hope this was useful. In the next parts I will cover some analysis and plots with this data.