Motivation
I needed to extract several tables from PDF files for data analysis in R. Updating tabulizer (now retired from CRAN) to run on a Java version newer than the deprecated Java 8 turned out to be worth the effort.
tabulapdf is a reworked version of tabulizer that works with OpenJDK 11 and newer. I wanted to share it here and show how to use it to extract tables from PDF files.
About
tabulapdf provides R bindings to the Tabula Java library, which can be used to computationally extract tables from PDF documents. The main function, extract_tables(), mimics the command-line behavior of Tabula: it extracts all tables from a PDF file and, by default, returns them as a list of tibbles in R.
if (!require(tabulapdf)) install.packages("tabulapdf")
Loading required package: tabulapdf
Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
logical.return = TRUE, : there is no package called 'tabulapdf'
Installing package into '/home/pacha/R/x86_64-pc-linux-gnu-library/4.5'
(as 'lib' is unspecified)
also installing the dependencies 'png', 'rJava'
library(tabulapdf)
# set Java memory limit to 600 MB (optional)
options(java.parameters = "-Xmx600m")
f <- system.file("examples", "mtcars.pdf", package = "tabulapdf")
# extract table from first page of example PDF
tab <- extract_tables(f, pages = 1)
tab[[1]]
# A tibble: 5 × 12
  model          mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4     21       6   160   110  3.9   2.62  16.5     0     1     4     4
2 Mazda RX4 W…  21       6   160   110  3.9   2.88  17.0     0     1     4     4
3 Datsun 710    22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
4 Hornet 4 Dr…  21.4     6   258   110  3.08  3.21  19.4     1     0     3     1
5 Hornet Spor…  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2
The pages argument allows you to select which pages to attempt to extract tables from. By default, Tabula (and thus tabulapdf) checks every page for tables using a detection algorithm and returns all of them. pages can be an integer vector of any length; pages are indexed from 1.
It is possible to specify a remote file, which will be copied to R’s temporary directory before processing:
f2 <- "https://raw.githubusercontent.com/ropensci/tabulapdf/main/inst/examples/mtcars.pdf"
extract_tables(f2, pages = 2)
[[1]]
# A tibble: 1 × 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<chr> <chr> <chr> <chr> <chr>
1 "5.10\r4.90\r4.70\r4.60\r5.00" "3.50\r3.00\r… "1.40\r1.40… "0.20\r0.2… "setos…
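When a table comes back as a single row whose cells hold several values joined by carriage returns, as in the output above, the cells can be split apart in base R. The following is a minimal sketch under stated assumptions: split_cr_cells() is a hypothetical helper (not part of tabulapdf), the input is a small hand-made stand-in rather than real extract_tables() output, and every column is assumed to hold the same number of "\r"-separated values.

```r
# Hypothetical stand-in for a one-row result whose cells hold "\r"-joined values.
# A plain data.frame is used so the sketch runs without tabulapdf or Java.
raw <- data.frame(
  Sepal.Length = "5.10\r4.90\r4.70",
  Species = "setosa\rsetosa\rsetosa",
  stringsAsFactors = FALSE
)

# split_cr_cells() is a hypothetical helper, not part of tabulapdf.
split_cr_cells <- function(x) {
  # Split the single cell of every column on carriage returns
  cols <- lapply(x, function(col) strsplit(col[[1]], "\r", fixed = TRUE)[[1]])
  # Stop early if the columns disagree on the number of rows
  stopifnot(length(unique(lengths(cols))) == 1)
  out <- as.data.frame(cols, stringsAsFactors = FALSE)
  # Convert columns that look numeric; leave the rest as character
  out[] <- lapply(out, function(col) type.convert(col, as.is = TRUE))
  out
}

clean <- split_cr_cells(raw)
str(clean)
```

If the columns do not split into the same number of values, the helper stops rather than guessing how to align them; in that case selecting a different extraction method or area is probably the better fix.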
Modifying the Return Value
Tabula itself implements three “writer” methods that write extracted tables to disk as CSV, TSV, or JSON files. These can be selected with output = "csv", output = "tsv", and output = "json", respectively. For CSV and TSV, one file is written to disk for each table, and the R session’s temporary directory, tempdir(), is used by default (alternatively, a directory can be specified through the outdir argument). For JSON, a single file is written containing information about all tables. For these methods, extract_tables() returns the path to the directory containing the output files.
# extract tables to CSVs
extract_tables(f, output = "csv")
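Once the files are on disk, they can be read back into R with base functions. This is a rough sketch, not part of tabulapdf: read_extracted_csvs() is a hypothetical helper, the directory and the file name mtcars-1.csv are made up for the demonstration, and the first row of each CSV is assumed to be a header. In practice, the directory would be the path returned by extract_tables(f, output = "csv").

```r
# read_extracted_csvs() is a hypothetical helper, not part of tabulapdf.
# It reads every CSV file in a directory into a named list of data frames.
read_extracted_csvs <- function(dir) {
  files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
  # Assumption: the first row of each file is a header
  tables <- lapply(files, read.csv, stringsAsFactors = FALSE)
  names(tables) <- basename(files)
  tables
}

# Stand-in for the directory returned by extract_tables(f, output = "csv");
# the directory and file name below are invented for this demonstration.
out_dir <- file.path(tempdir(), "tabula-demo")
dir.create(out_dir, showWarnings = FALSE)
write.csv(head(mtcars, 3), file.path(out_dir, "mtcars-1.csv"), row.names = FALSE)

tables <- read_extracted_csvs(out_dir)
str(tables[["mtcars-1.csv"]])
```

Naming the list by file name keeps each table traceable to the file it came from when a PDF yields many tables.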
If none of the standard methods works well, you can specify output = "asis" to return an rJava “jobjRef” object, which is a pointer to a Java ArrayList of Tabula Table objects. Working with that object can be awkward because it requires knowledge of Java and of Tabula’s internals, but it may be useful to advanced users for debugging.
Miscellaneous Functionality
Tabula is built on top of the Java PDFBox library, which provides low-level functionality for working with PDFs. A few of these tools are exposed through tabulapdf, as they can be useful for debugging or for working with PDFs in general. These functions include:
extract_text() converts the text of an entire file or specified pages into an R character vector.
split_pdf() and merge_pdfs() split and merge PDF documents, respectively.
extract_metadata() extracts PDF metadata as a list.
get_n_pages() determines the number of pages in a document.
get_page_dims() determines the width and height of each page in points (pt), the unit used by the area and columns arguments.
make_thumbnails() converts specified pages of a PDF file to image files.
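Because page dimensions and the area and columns arguments are expressed in points, it can help to convert to and from inches when measuring a region on a printed or on-screen page. The sketch below uses only base R and rests on the standard definition of 72 points per inch. pt_to_in() and in_to_pt() are hypothetical helpers, the page size is an invented US Letter example rather than real get_page_dims() output, and the top, left, bottom, right ordering of the region is an assumption that should be checked against the extract_tables() documentation.

```r
# pt_to_in() and in_to_pt() are hypothetical helpers, not part of tabulapdf.
pt_to_in <- function(pt) pt / 72          # 72 points per inch
in_to_pt <- function(inches) inches * 72

# Hypothetical page size in pt (US Letter), given as width and height
page <- c(width = 612, height = 792)
page_in <- pt_to_in(page)
print(page_in)

# A region measured in inches, converted to pt.
# Assumption: the region is ordered top, left, bottom, right.
region_in <- c(top = 1, left = 0.5, bottom = 5, right = 8)
region_pt <- in_to_pt(region_in)
print(region_pt)
```

A vector like region_pt could then be wrapped in a list and tried as the area argument, provided the ordering assumption holds for the installed version.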