Extract text from an image. Requires that you have training data for the language you are reading. Works best for images with high contrast, little noise, and horizontal text. See the tesseract wiki and our package vignette for image preprocessing tips.
Arguments
- image
file path, URL, or raw vector to image (png, tiff, jpeg, etc)
- engine
a tesseract engine created with tesseract(). Alternatively a language string which will be passed to tesseract().
- HOCR
if TRUE, return results as hOCR XML instead of plain text
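The two ways of supplying the engine argument described above can be sketched as follows. This is a minimal illustration, assuming the package is installed under the name used in the Examples section (cpp11tesseract) and that the English ("eng") training data is available; the variable names are only for illustration.

```r
# Sketch only: assumes cpp11tesseract and the "eng" training data are installed.
library(cpp11tesseract)

file <- system.file("examples", "testocr.png", package = "cpp11tesseract")

# Option 1: pass an engine object created with tesseract()
eng <- tesseract("eng")
text_engine <- ocr(file, engine = eng)

# Option 2: pass a language string, which is passed on to tesseract()
text_string <- ocr(file, engine = "eng")

# Compare the two results; both calls request the same language
identical(text_engine, text_string)
```

Creating the engine once and reusing it is likely the more convenient form when many images are read with the same settings.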
Details
The ocr() function returns plain text by default, or hOCR text if HOCR is set to TRUE.
The ocr_data() function returns a data frame with a confidence rate and bounding box for each word in the text.
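The difference between the output forms described above can be sketched as follows. This is a minimal illustration, assuming cpp11tesseract and the default training data are installed; the exact column names of the data frame are not stated here, so the sketch only inspects its structure.

```r
# Sketch only: assumes cpp11tesseract and its default training data are installed.
library(cpp11tesseract)

file <- system.file("examples", "testocr.png", package = "cpp11tesseract")

# Plain text (the default)
plain <- ocr(file)

# hOCR: markup that carries layout information alongside the recognised text
hocr <- ocr(file, HOCR = TRUE)
cat(substr(hocr, 1, 300))

# Word-level results: one row per word, with a confidence rate and bounding box
words <- ocr_data(file)
str(words)
head(words)
```

The data frame form is the one to reach for when low-confidence words need to be filtered or located on the page.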
See also
Other tesseract: tesseract(), tesseract_download()
Examples
# Simple example
file <- system.file("examples", "testocr.png", package = "cpp11tesseract")
text <- ocr(file)
cat(text)
#> This is a lot of 12 point text to test the
#> ocr code and see if it works on all types
#> of file format.
#>
#> The quick brown dog jumped over the
#> lazy fox. The quick brown dog jumped
#> over the lazy fox. The quick brown dog
#> jumped over the lazy fox. The quick
#> brown dog jumped over the lazy fox.