Because of delays with my scholarship payment, if this post is useful to you I kindly ask for a minimal donation on Buy Me a Coffee. It will be used to continue my Open Source efforts. The full explanation is here: A Personal Message from an Open Source Contributor.
Continuing from the previous Selenium post, I will now process each job offer and organize its contents.
This requires the readxl package to read XLSX files:
if (!require(readxl)) install.packages("readxl")
To read the XLSX file from part 2 and start processing each offer, I begin with:
library(RSelenium)
library(rvest)
library(dplyr)
library(purrr)
library(writexl)
library(readxl)
offers_tbl <- read_xlsx("offers_20250821.xlsx")
rmDr <- remoteDriver(port = 4444L, browserName = "chrome")
rmDr$open(silent = TRUE)
From this table, I can read the HTML of each job offer and see how it is structured. Starting with the first URL:
rmDr$navigate(offers_tbl$link[1])
html <- read_html(rmDr$getPageSource()[[1]])
This specific job offer has the following contents:
> html
{html_document}
<html lang="es">
[1] <head id="Head1">\n<meta http-equiv="Content-Type" content="text/html; ch ...
[2] <body>\n <form name="form1" method="post" action="convpostularavis ...
Inspecting the offer as in part 1, the full description is contained in a single HTML division with sub-divisions:
<div class="item formatodeclaraciones">
<div class="row top">
<h2><span id="lblAvisoTrabajo">Medico (a) especialista en Anestesiología</span></h2>
</div>
<hr>
<div class="bottom">
<div class="row">
<div class="col-md-6">
<span id="lblAvisoTrabajoDatos"><div><h3> Institución</h3><p>Ministerio de Salud / Servicio de Salud Maule / Hospital de Constitución</p><h3>Convocatoria</h3><p>Medico (a) especialista en Anestesiología 44 horas</p><h3>Nº de Vacantes </h3><p>1</p><h3>Área de Trabajo</h3><p>Salud</p><h3>Región</h3><p>Región del Maule</p><h3>Ciudad</h3><p>Constitución</p><h3>Tipo de Vacante</h3><p>Contrata</p></div></span>
...
To organize this in a table, I can do the following, among other possibilities, to get the title, institution, number of vacancies, city, compensation, and educational requirements:
title <- html %>% html_element("h2 span#lblAvisoTrabajo") %>% html_text(trim = TRUE)
institution <- html %>% html_element("span#lblAvisoTrabajoDatos h3:contains('Institución') + p") %>% html_text(trim = TRUE)
positions <- html %>% html_element("span#lblAvisoTrabajoDatos h3:contains('Nº de Vacantes') + p") %>% html_text(trim = TRUE)
city <- html %>% html_element("span#lblAvisoTrabajoDatos h3:contains('Ciudad') + p") %>% html_text(trim = TRUE)
compensation <- html %>% html_element("span#lblRenta li ul li") %>% html_text(trim = TRUE)
education <- html %>% html_element("span#lblFormacion p") %>% html_text(trim = TRUE)
d <- tibble(
  title = title,
  institution = institution,
  positions = positions,
  city = city,
  compensation = compensation,
  education = education
)
The result is the following table:
> d
# A tibble: 1 × 6
title institution positions city compensation education
<chr> <chr> <chr> <chr> <chr> <chr>
1 Medico (a) especialista en… Ministerio… 1 Cons… Renta Bruta… Título p…
I see that the compensation value needs tidying:
> d$compensation
[1] "Renta Bruta6.398.194"
To tidy it, I remove the leading text and the thousands separators, then convert the result to a number:
d <- d %>%
  mutate(compensation = as.numeric(gsub("Renta Bruta|\\.", "", compensation)))
which gives the desired numeric value for later analysis:
> d$compensation
[1] 6398194
To do the same with each of the 545 saved job offers, I repeat the steps with purrr:
descriptions_tbl <- map_df(
  seq_len(nrow(offers_tbl)),
  function(x) {
    print(x) # just to see at which iteration it fails (if it fails)
    rmDr$navigate(offers_tbl$link[x])
    html <- read_html(rmDr$getPageSource()[[1]])
    title <- html %>% html_element("h2 span#lblAvisoTrabajo") %>% html_text(trim = TRUE)
    institution <- html %>% html_element("span#lblAvisoTrabajoDatos h3:contains('Institución') + p") %>% html_text(trim = TRUE)
    positions <- html %>% html_element("span#lblAvisoTrabajoDatos h3:contains('Nº de Vacantes') + p") %>% html_text(trim = TRUE)
    city <- html %>% html_element("span#lblAvisoTrabajoDatos h3:contains('Ciudad') + p") %>% html_text(trim = TRUE)
    compensation <- html %>% html_element("span#lblRenta li ul li") %>% html_text(trim = TRUE)
    education <- html %>% html_element("span#lblFormacion p") %>% html_text(trim = TRUE)
    d <- tibble(
      title = title,
      institution = institution,
      positions = positions,
      city = city,
      compensation = compensation,
      education = education
    )
    d <- d %>%
      mutate(compensation = as.numeric(gsub("Renta Bruta|\\.", "", compensation)))
    d
  }
)
The result is the following:
> descriptions_tbl
# A tibble: 545 × 6
title institution positions city compensation education
<chr> <chr> <chr> <chr> <dbl> <chr>
1 Medico (a) especialista e… Ministerio… 1 Cons… 6398194 "Título …
2 NA NA NA NA NA NA
3 ENFERMERA-O, JORNADA DIUR… Ministerio… 1 Reco… 1906087 "Título …
4 Psiquiatra infanto-juveni… Ministerio… 1 La P… 2333658 ""
5 Neurólogo(a) adulto GES A… Ministerio… 1 Puen… 637926 ""
6 Médico(a) especialista en… Ministerio… 2 Cast… 5328446 "Título …
7 Arquitecto de Software Ministerio… 1 Ñuñoa 2256523 "Profesi…
8 DIRECCIÓN DEL SERVICIO DE… Ministerio… 1 Chil… 1540891 ""
9 01 CARGO DE TENS OPERADOR… Ministerio… 1 Peña… 851161 ""
10 TENS DE CUIDADOS PALIATIV… Ministerio… 2 San … 636462 "Titulo …
# ℹ 535 more rows
Some rows are entirely NA because their links are under maintenance or lead to external municipal sites with a different structure.
Here is a count of the missing values in each field:
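The NA rows arise because `html_element()` returns a missing node when a selector matches nothing, and `html_text()` then yields `NA`. A minimal sketch of that behavior, checked offline against a trimmed copy of the fragment shown earlier, could look like this. The helper `field_after_heading()` is a hypothetical name introduced here for illustration, not part of the original code:

```r
library(rvest)

# Hypothetical helper (not in the original post): return the text of the <p>
# that immediately follows the <h3> whose text contains `label`.
# Assumes `label` contains no single quotes.
field_after_heading <- function(html, label) {
  selector <- sprintf("span#lblAvisoTrabajoDatos h3:contains('%s') + p", label)
  html %>%
    html_element(selector) %>%
    html_text(trim = TRUE)
}

# Fragment copied from the offer shown earlier, trimmed to two fields
fragment <- minimal_html(
  '<span id="lblAvisoTrabajoDatos"><div>
     <h3>Ciudad</h3><p>Constitución</p>
     <h3>Tipo de Vacante</h3><p>Contrata</p>
   </div></span>'
)

# A heading that exists gives its value
vacancy_type <- field_after_heading(fragment, "Tipo de Vacante")

# A heading that does not exist gives NA instead of an error
missing_field <- field_after_heading(fragment, "Renta")
```

This is why a page with a different layout does not stop the loop: every field simply comes back as `NA`.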
descriptions_tbl %>%
  summarise(
    across(
      everything(),
      list(
        na_count = ~sum(is.na(.))
      ),
      .names = "{.col}_{.fn}"
    )
  )
which shows the same count for each visible field, consistent with the missing values falling on the same observations:
# A tibble: 1 × 6
title_na_count institution_na_count positions_na_count city_na_count
<int> <int> <int> <int>
1 187 187 187 187
# ℹ 2 more variables: compensation_na_count <int>, education_na_count <int>
I got 545 - 187 = 358 well-organized observations with a scraping process that took around five minutes. Not bad!
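Equal counts per column do not by themselves prove that the missing values sit on the same rows. A possible row-wise check, sketched here on a toy stand-in for `descriptions_tbl` (the name `toy_tbl` and its two-column layout are assumptions made for illustration), uses `if_all()` and `if_any()` from dplyr:

```r
library(dplyr)
library(tibble)

# Toy stand-in for descriptions_tbl: one scraped row and one failed row
toy_tbl <- tibble(
  title = c("Arquitecto de Software", NA),
  compensation = c(2256523, NA)
)

# Rows where every field is missing (failed or external pages)
all_missing <- toy_tbl %>%
  filter(if_all(everything(), is.na))

# Rows where some, but not all, fields are missing (would need a closer look)
partly_missing <- toy_tbl %>%
  filter(if_any(everything(), is.na), !if_all(everything(), is.na))

# Rows that were scraped, identified here by a non-missing title
scraped_ok <- toy_tbl %>%
  filter(!is.na(title))
```

Applied to the real table, an empty `partly_missing` would confirm that the missing values coincide row by row.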
This needs an XLSX backup to avoid scraping twice:
write_xlsx(descriptions_tbl, "descriptions_20250821.xlsx")
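One way to make the backup do its job automatically is to read the file when it already exists and scrape only otherwise. The sketch below assumes a hypothetical wrapper, `load_or_scrape()`, that receives the scraping step as a function; it is an illustration, not the code used in this post:

```r
library(readxl)
library(writexl)

# Hypothetical wrapper: reuse the XLSX backup if present, otherwise run
# `scrape_fun()` once, save its result to `path`, and return it.
load_or_scrape <- function(path, scrape_fun) {
  if (file.exists(path)) {
    read_xlsx(path)
  } else {
    result <- scrape_fun()
    write_xlsx(result, path)
    result
  }
}

# Possible usage, assuming the purrr loop above is wrapped in a function
# called scrape_descriptions() (a hypothetical name):
# descriptions_tbl <- load_or_scrape("descriptions_20250821.xlsx", scrape_descriptions)
```

Because the scraping step is passed as a function, it is never evaluated when the backup is found.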
I hope this was useful. In the next parts I will cover some analysis and plots with this data.