Bindings to Tesseract-OCR: a powerful optical character recognition (OCR) engine that supports over 100 languages. The engine is highly configurable in order to tune the detection algorithms and obtain the best possible results.
- Upstream Tesseract-OCR documentation: https://tesseract-ocr.github.io/tessdoc/
- Introduction: https://docs.ropensci.org/tesseract/articles/intro.html
- Reference: https://docs.ropensci.org/tesseract/reference/ocr.html
Differences with the original tesseract R package
This package started as a series of modifications to the original tesseract package to improve performance and add new features. Some of these changes were contributed back to the original package, including the functions to choose between the “best” and “fast” models.
However, some changes were not integrated, such as using the cpp11 package, which I need in order to comply with the Munk School IT standards. Using cpp11 allows me to vendor the C++ headers into the package, so that I can conduct an offline installation on the Niagara Cluster.
The documentation also differs slightly: I tried to expand it and to compare the results with Amazon Textract output.
This package includes some changes requested by CRAN, mostly concerning the package internals.
Installation
Installation from source on Linux or OSX requires the Tesseract library (see below).
Install from source
On Debian or Ubuntu install libtesseract-dev and libleptonica-dev. Also install tesseract-ocr-eng to run examples.
sudo apt-get install -y libtesseract-dev libleptonica-dev tesseract-ocr-eng
On Ubuntu you can optionally use this PPA to get the latest version of Tesseract:
sudo add-apt-repository ppa:alex-p/tesseract-ocr-devel
sudo apt-get install -y libtesseract-dev tesseract-ocr-eng
On Fedora you need tesseract-devel and leptonica-devel:
sudo yum install tesseract-devel leptonica-devel
On RHEL and CentOS you need tesseract-devel and leptonica-devel from EPEL:
sudo yum install epel-release
sudo yum install tesseract-devel leptonica-devel
On OSX use tesseract from Homebrew:
brew install tesseract
Tesseract uses training data to perform OCR. Most systems default to English training data. To improve OCR results for other languages you need to install the appropriate training data. On Windows and OSX you can do this in R using tesseract_download():
tesseract_download('fra')
On Linux you need to install the appropriate training data from your distribution. For example, to install the Spanish training data:
- tesseract-ocr-spa (Debian, Ubuntu)
- tesseract-langpack-spa (Fedora, EPEL)
Alternatively, you can manually download training data from GitHub and store it in a path on disk that you pass in the datapath parameter, or set a default path via the TESSDATA_PREFIX environment variable. Note that Tesseract 4 and Tesseract 3 use different training data formats. Make sure to download training data from the branch that matches your libtesseract version.
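The manual route can be sketched in the shell as follows. This is a minimal sketch, not a definitive recipe: the directory ~/tessdata and the file name spa.traineddata are assumptions chosen for illustration, and it assumes the .traineddata files sit directly inside the directory that TESSDATA_PREFIX points to. The version check only runs if the tesseract command-line tool happens to be installed.

```shell
# Report the installed Tesseract version, if the CLI is available,
# so you know which training data branch to download from.
if command -v tesseract >/dev/null 2>&1; then
  tesseract --version
else
  echo "tesseract CLI not found; check your libtesseract version another way"
fi

# Hypothetical directory for manually downloaded training data.
TESSDATA_DIR="${HOME}/tessdata"
mkdir -p "$TESSDATA_DIR"

# After downloading a training data file by hand, move it there.
# The source path and file name below are examples only.
# mv ~/Downloads/spa.traineddata "$TESSDATA_DIR/"

# Make this directory the default training data location for this session.
export TESSDATA_PREFIX="$TESSDATA_DIR"
echo "TESSDATA_PREFIX=$TESSDATA_PREFIX"
```

To keep the setting across sessions, the export line can go in your shell profile. Alternatively, pass the same directory through the datapath parameter from R instead of relying on the environment variable.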