Utilities based on libpoppler for extracting text, fonts, attachments and metadata from a PDF file.
Usage
pdf_info(pdf, opw = "", upw = "")
pdf_text(pdf, opw = "", upw = "")
pdf_data(pdf, font_info = FALSE, opw = "", upw = "")
pdf_fonts(pdf, opw = "", upw = "")
pdf_attachments(pdf, opw = "", upw = "")
pdf_toc(pdf, opw = "", upw = "")
pdf_pagesize(pdf, opw = "", upw = "")
Details
The pdf_text function renders all textboxes on a text canvas and returns a character vector of equal length to the number of pages in the PDF file. On the other hand, pdf_data is more low level and returns one data frame per page, containing one row for each textbox in the PDF.
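As a small illustration of the return shape of pdf_text, the sketch below reads the example file that ships with the package (the same one used in the Examples section). It relies only on the behavior described above, namely one character string per page; the variable names are arbitrary.

```r
library(cpp11poppler)

# Example PDF bundled with the package
file <- system.file("examples", "recipes.pdf", package = "cpp11poppler")

# pdf_text returns a character vector with one element per page
text <- pdf_text(file)
n_pages <- length(text)

# Print the text of the first page as laid out on the text canvas
cat(text[1])
```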
Note that pdf_data requires a recent version of libpoppler, which might not be available on all Linux systems. When using pdf_data in R packages, make its use conditional on poppler_config()$has_pdf_data, which indicates whether this function can be used on the current system.
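One way to apply this check is sketched below. The helper read_pdf_content is hypothetical and not part of the package; falling back to pdf_text is only one possible choice, and a package might prefer to skip the feature or raise an error instead.

```r
library(cpp11poppler)

# Hypothetical helper (not part of cpp11poppler): use pdf_data when the
# installed libpoppler supports it, otherwise fall back to pdf_text.
read_pdf_content <- function(path) {
  if (isTRUE(poppler_config()$has_pdf_data)) {
    # One data frame per page, one row per textbox
    pdf_data(path)
  } else {
    # One character string per page
    pdf_text(path)
  }
}

file <- system.file("examples", "recipes.pdf", package = "cpp11poppler")
content <- read_pdf_content(file)

# Show the top-level structure without printing every textbox
str(content, max.level = 1)
```

Either branch yields one element per page, so code that only iterates over pages can treat both results alike.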
Poppler is pretty verbose when encountering minor errors in PDF files, especially in pdf_text. These messages are usually safe to ignore; use suppressMessages to hide them altogether.
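A minimal sketch of silencing these diagnostics for a single call follows. It assumes, as the paragraph above suggests, that they are signaled as regular R messages; anything the underlying library writes directly to stderr would not be caught by suppressMessages.

```r
library(cpp11poppler)

file <- system.file("examples", "recipes.pdf", package = "cpp11poppler")

# Hide Poppler's diagnostics for this call only; the result is unchanged
text <- suppressMessages(pdf_text(file))
```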
See also
Other cpp11poppler: rendering
Examples
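# Example PDF file shipped with the package
file <- system.file("examples", "recipes.pdf", package = "cpp11poppler")

# Document metadata
info <- pdf_info(file)
# Text content, one string per page
text <- pdf_text(file)
# Fonts used in the document
fonts <- pdf_fonts(file)
# Files attached to the document
files <- pdf_attachments(file)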