Extracting Text
You can extract text from a PDF file using the pdf_text function. It returns a character vector with one element per page of the PDF file.
library(cpp11poppler)
## Using poppler version 22.02.0
file <- system.file("examples", "recipes.pdf", package = "cpp11poppler")
text <- pdf_text(file)
cat(text[1])
## Traditional Recipes
## jewishfoodsociety.org
##
##
##
## Kreplach
##
## Yield
##
## 90 dumplings
##
##
## Ingredients
##
## For the filling:
##
## • 6 tablespoons canola oil
## • 1 small chicken breast (about 5 ounces; ½ cup chopped), cut into ½-inch pieces
## • 3 chicken livers (about 3 ounces; ¼ cup chopped), cut into ½-inch pieces
## • ½ teaspoon kosher salt, plus more to taste
## • 3 medium onions, roughly chopped
## • 1 egg, lightly beaten, divided
## • ½ teaspoon freshly ground black pepper
##
## For the dough:
##
## • 3½ cups all-purpose flour, sifted
## • 1 tablespoon salt
## • 1 egg, lightly beaten
## • 1 cup lukewarm water
## • Canola oil, for drizzling
##
## For serving:
##
## • Chicken soup
## • Fried onions
##
##
##
##
## 1
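Because pdf_text returns one string per page, base R string functions are usually enough for post-processing. The following is a minimal sketch, not part of the package: it uses a short hard-coded string as a stand-in for text[1] so that it runs without the package installed, and the helper name extract_bullets is hypothetical.

```r
# Stand-in for text[1]; with the package loaded you would pass text[1] instead.
page <- paste(
  "Ingredients",
  "",
  "For the dough:",
  "",
  "  \u2022 1 tablespoon salt",
  "  \u2022 1 egg, lightly beaten",
  "  \u2022 1 cup lukewarm water",
  sep = "\n"
)

# Hypothetical helper: return the bulleted lines of a page, without the bullet.
extract_bullets <- function(page_text) {
  # Split the page into lines and drop surrounding whitespace.
  lines <- trimws(strsplit(page_text, "\n", fixed = TRUE)[[1]])
  # Keep only lines that start with the bullet character.
  bullets <- lines[startsWith(lines, "\u2022")]
  # Remove the bullet and the space that follows it.
  trimws(sub("^\u2022", "", bullets))
}

ingredients <- extract_bullets(page)
print(ingredients)
```

The same helper could be applied to every page with lapply(text, extract_bullets), assuming the bullets in other pages are extracted as the same character.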
Extracting Metadata
You can extract metadata from a PDF file using the pdf_info function. It returns a named list of document properties, such as the PDF version, the page count, and the creation time.
pdf_info(file)
## $version
## [1] "1.5"
##
## $pages
## [1] 4
##
## $encrypted
## [1] FALSE
##
## $linearized
## [1] FALSE
##
## $keys
## $keys$Creator
## [1] "LaTeX via pandoc"
##
## $keys$Title
## [1] "Traditional Recipes"
##
## $keys$Author
## [1] "jewishfoodsociety.org"
##
## $keys$Producer
## [1] "xdvipdfmx (20240407)"
##
##
## $created
## [1] 1733797917
##
## $modified
## [1] -1
##
## $metadata
## integer(0)
##
## $locked
## [1] FALSE
##
## $attachments
## [1] FALSE
##
## $layout
## [1] "no_layout"
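The created and modified fields in the output above are plain numbers. A minimal sketch of turning them into date-times follows, assuming they are seconds since 1970-01-01 UTC and that a negative value such as -1 means the field is not set. Both points are assumptions based on the values shown, and the helper name as_pdf_time is hypothetical.

```r
# Values copied from the pdf_info() output above.
info <- list(created = 1733797917, modified = -1)

# Hypothetical helper: convert a timestamp to POSIXct.
# Assumption: values are seconds since 1970-01-01 UTC and negatives mean "not set".
as_pdf_time <- function(x) {
  x[x < 0] <- NA
  as.POSIXct(x, origin = "1970-01-01", tz = "UTC")
}

created <- as_pdf_time(info$created)
modified <- as_pdf_time(info$modified)

print(format(created, "%Y-%m-%d %H:%M:%S", tz = "UTC"))
print(is.na(modified))
```

Under these assumptions the creation time comes out as 2024-12-10 02:31:57 UTC, and the modification time is reported as missing.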
Extracting Fonts
You can extract font information from a PDF file using the pdf_fonts function. It returns a data frame with one row per font, giving the font name, its type, whether it is embedded, and its file.
pdf_fonts(file)
##                                  name       type embedded file
## 1     CONQID+LMSans10-Bold-Identity-H cid_type0c     TRUE
## 2 LUAZIR+LMRoman12-Regular-Identity-H cid_type0c     TRUE
## 3 CTWPJX+LMRoman10-Regular-Identity-H cid_type0c     TRUE
## 4    PXMFVJ+LMRoman10-Bold-Identity-H cid_type0c     TRUE
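Since the result is an ordinary data frame, it can be filtered and tidied with base R. The sketch below rebuilds the table by hand so that it runs without the package; the empty file column is an assumption based on the printed output. It strips the six-letter subset tag that PDF producers prepend to subsetted font names and lists any fonts that are not embedded.

```r
# Data frame rebuilt by hand from the pdf_fonts() output above.
fonts <- data.frame(
  name = c(
    "CONQID+LMSans10-Bold-Identity-H",
    "LUAZIR+LMRoman12-Regular-Identity-H",
    "CTWPJX+LMRoman10-Regular-Identity-H",
    "PXMFVJ+LMRoman10-Bold-Identity-H"
  ),
  type = "cid_type0c",
  embedded = TRUE,
  file = "",  # assumption: the column is empty for embedded fonts
  stringsAsFactors = FALSE
)

# Subset fonts carry a six-letter tag followed by "+"; strip it for readability.
fonts$base_name <- sub("^[A-Z]{6}\\+", "", fonts$name)

# Fonts that are not embedded may render differently on other machines.
not_embedded <- fonts$name[!fonts$embedded]

print(fonts$base_name)
print(length(not_embedded))
```

For the example file every font is embedded, so not_embedded is empty.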
Extracting Attachments
You can extract attachments from a PDF file using the pdf_attachments function. It returns a list with attachment information, which is empty when the file has no attachments, as with the example file.
pdf_attachments(file)
## list()
Rendering Pages
You can render a PDF page to a bitmap array using the pdf_render_page
function. This function returns a bitmap array that can be further processed in R.
page1 <- pdf_render_page(file, page = 1, dpi = 300)
png::writePNG(page1, "page1.png")
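The dpi argument determines how large the rendered bitmap is. As a rough guide, PDF page sizes are expressed in points, with 72 points per inch, so the pixel size is approximately the page size divided by 72 and multiplied by the dpi. The sketch below illustrates the arithmetic only; the helper render_size is hypothetical, the US Letter page size is an assumption rather than a property of the example file, and the renderer may round differently.

```r
# Hypothetical helper: approximate pixel size of a rendered page.
# PDF page sizes are in points, with 72 points per inch.
render_size <- function(width_pt, height_pt, dpi) {
  round(c(width = width_pt, height = height_pt) / 72 * dpi)
}

# Assumption: a US Letter page (612 x 792 points); the example file may differ.
size <- render_size(612, 792, dpi = 300)
print(size)
```

At 300 dpi a US Letter page comes to about 2550 by 3300 pixels, which helps when estimating memory use before rendering many pages.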
Converting Pages to Images
You can convert PDF pages to images using the pdf_convert
function. This function saves the rendered pages as image files.
pdf_convert(file, format = "png", pages = 1:2, dpi = 300, verbose = FALSE)