Using cpp11poppler

Extracting Text

You can extract text from a PDF file with the pdf_text function. It returns a character vector with one element per page, each holding that page's text as a single string.

library(cpp11poppler)

## Using poppler version 22.02.0

file <- system.file("examples", "recipes.pdf", package = "cpp11poppler")
text <- pdf_text(file)
cat(text[1])

##                            Traditional Recipes
##                                  jewishfoodsociety.org
## 
## 
## 
## Kreplach
## 
## Yield
## 
## 90 dumplings
## 
## 
## Ingredients
## 
## For the filling:
## 
##    • 6 tablespoons canola oil
##    • 1 small chicken breast (about 5 ounces; ½ cup chopped), cut into ½-inch pieces
##    • 3 chicken livers (about 3 ounces; ¼ cup chopped), cut into ½-inch pieces
##    • ½ teaspoon kosher salt, plus more to taste
##    • 3 medium onions, roughly chopped
##    • 1 egg, lightly beaten, divided
##    • ½ teaspoon freshly ground black pepper
## 
## For the dough:
## 
##    • 3½ cups all-purpose flour, sifted
##    • 1 tablespoon salt
##    • 1 egg, lightly beaten
##    • 1 cup lukewarm water
##    • Canola oil, for drizzling
## 
## For serving:
## 
##    • Chicken soup
##    • Fried onions
## 
## 
## 
## 
##                                              1
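
Because each page comes back as one string with embedded newlines, a common next step is to split it into lines. The following is a minimal base R sketch, not part of cpp11poppler; the page string is a hypothetical stand-in shaped like the output above, so the block runs without the package.

```r
# Hypothetical stand-in for one element of the pdf_text() result.
page <- "   Traditional Recipes\n\n\nKreplach\n\nYield\n\n90 dumplings\n\n\nIngredients\n"

# Split the page on newlines, trim the layout padding, drop empty lines.
lines <- strsplit(page, "\n", fixed = TRUE)[[1]]
lines <- trimws(lines)
lines <- lines[nzchar(lines)]

print(lines)
```

With a real document, the same steps would apply to text[1] or, via lapply, to every page.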

Extracting Metadata

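You can extract metadata from a PDF file with the pdf_info function. It returns a named list that includes the PDF version, the page count, the encryption status, document keys such as the title and author, and the creation and modification times.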

pdf_info(file)

## $version
## [1] "1.5"
## 
## $pages
## [1] 4
## 
## $encrypted
## [1] FALSE
## 
## $linearized
## [1] FALSE
## 
## $keys
## $keys$Creator
## [1] "LaTeX via pandoc"
## 
## $keys$Title
## [1] "Traditional Recipes"
## 
## $keys$Author
## [1] "jewishfoodsociety.org"
## 
## $keys$Producer
## [1] "xdvipdfmx (20240407)"
## 
## 
## $created
## [1] 1733797917
## 
## $modified
## [1] -1
## 
## $metadata
## integer(0)
## 
## $locked
## [1] FALSE
## 
## $attachments
## [1] FALSE
## 
## $layout
## [1] "no_layout"
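
The created and modified fields are printed as plain numbers. The sketch below assumes they are Unix timestamps (seconds since 1970-01-01 UTC) and that a negative value such as -1 means the field is not set; neither assumption is confirmed by the output above. The to_time helper is hypothetical, and the values are copied from the printed list so the block runs without the package.

```r
# Values copied from the pdf_info() output shown above.
info <- list(created = 1733797917, modified = -1)

# Hypothetical helper: treats negative values as "not set" (an assumption)
# and interprets the rest as seconds since the Unix epoch (also an assumption).
to_time <- function(x) {
  x[x < 0] <- NA_real_
  as.POSIXct(x, origin = "1970-01-01", tz = "UTC")
}

created <- to_time(info$created)
modified <- to_time(info$modified)

print(format(created, "%Y-%m-%d %H:%M:%S", tz = "UTC"))
print(is.na(modified))
```

Under these assumptions the creation time falls in December 2024, which is consistent with the 2024 date in the Producer string.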

Extracting Fonts

You can extract font information from a PDF file with the pdf_fonts function. It returns a data frame with one row per font and columns for the font's name, type, embedding status, and file.

pdf_fonts(file)

##                                  name       type embedded file
## 1     CONQID+LMSans10-Bold-Identity-H cid_type0c     TRUE     
## 2 LUAZIR+LMRoman12-Regular-Identity-H cid_type0c     TRUE     
## 3 CTWPJX+LMRoman10-Regular-Identity-H cid_type0c     TRUE     
## 4    PXMFVJ+LMRoman10-Bold-Identity-H cid_type0c     TRUE
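
Since the result is an ordinary data frame, it can be checked and tidied with base R. The sketch below rebuilds the table above by hand, so it runs without the package. It confirms that every font is embedded and strips the six-letter subset tag (such as CONQID+) that PDF producers prepend to the names of subsetted fonts. The base_name column is a hypothetical addition for illustration.

```r
# Data frame rebuilt from the pdf_fonts() output shown above.
fonts <- data.frame(
  name = c("CONQID+LMSans10-Bold-Identity-H",
           "LUAZIR+LMRoman12-Regular-Identity-H",
           "CTWPJX+LMRoman10-Regular-Identity-H",
           "PXMFVJ+LMRoman10-Bold-Identity-H"),
  type = "cid_type0c",
  embedded = TRUE,
  file = "",
  stringsAsFactors = FALSE
)

# TRUE when no font depends on what is installed on the reader's machine.
all_embedded <- all(fonts$embedded)

# Remove the subset tag: six uppercase letters followed by a plus sign.
fonts$base_name <- sub("^[A-Z]{6}\\+", "", fonts$name)

print(all_embedded)
print(fonts$base_name)
```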

Extracting Attachments

You can extract attachments from a PDF file with the pdf_attachments function. It returns a list describing the attachments found; the example file has none, so the list is empty.

pdf_attachments(file)

## list()

Rendering Pages

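You can render a PDF page to a bitmap array with the pdf_render_page function. The page and dpi arguments select the page and the rendering resolution, and the resulting array can be processed further in R or, as here, written to disk with png::writePNG.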

page1 <- pdf_render_page(file, page = 1, dpi = 300)
png::writePNG(page1, "page1.png")
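
To get a feel for how the dpi argument affects output size, the arithmetic can be sketched in base R. PDF page sizes are measured in points, with 72 points to the inch, so the pixel dimensions are roughly the size in inches times the dpi. The page size used here (US Letter, 612 by 792 points) is an assumption for illustration; the actual size of the example file is not stated above, and the renderer may round differently.

```r
# Hypothetical page size in PDF points (1 point = 1/72 inch); US Letter.
width_pt <- 612
height_pt <- 792
dpi <- 300

# Convert points to inches, then inches to pixels at the chosen resolution.
width_px <- width_pt / 72 * dpi
height_px <- height_pt / 72 * dpi

print(c(width = width_px, height = height_px))
```

Doubling the dpi doubles both dimensions and therefore quadruples the number of pixels, which is worth keeping in mind when rendering many pages.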

Converting Pages to Images

You can convert PDF pages to image files with the pdf_convert function. It renders the pages given in pages at the requested dpi and saves them as image files in the chosen format.

pdf_convert(file, format = "png", pages = 1:2, dpi = 300, verbose = FALSE)