Type: | Package |
Title: | Open Source OCR Engine |
Version: | 5.2.2 |
Description: | Bindings to 'Tesseract': a powerful optical character recognition (OCR) engine that supports over 100 languages. The engine is highly configurable in order to tune the detection algorithms and obtain the best possible results. |
License: | Apache License 2.0 |
URL: | https://docs.ropensci.org/tesseract/ https://ropensci.r-universe.dev/tesseract |
BugReports: | https://github.com/ropensci/tesseract/issues |
SystemRequirements: | Tesseract >= 3.03 (libtesseract-dev / tesseract-devel) and Leptonica (libleptonica-dev / leptonica-devel). On Debian you need to install the English training data separately (tesseract-ocr-eng) |
Imports: | Rcpp (≥ 0.12.12), pdftools (≥ 1.5), curl, rappdirs, digest |
LinkingTo: | Rcpp |
RoxygenNote: | 7.3.2 |
Suggests: | magick (≥ 1.7), spelling, knitr, tibble, rmarkdown |
Encoding: | UTF-8 |
VignetteBuilder: | knitr |
Language: | en-US |
NeedsCompilation: | yes |
Packaged: | 2024-10-04 14:33:38 UTC; jeroen |
Author: | Jeroen Ooms |
Maintainer: | Jeroen Ooms <jeroenooms@gmail.com> |
Repository: | CRAN |
Date/Publication: | 2024-10-04 15:20:15 UTC |
Tesseract OCR
Description
Extract text from an image. Requires that you have training data for the language you are reading. Works best for images with high contrast, little noise and horizontal text. See the Tesseract wiki and the package vignette for image preprocessing tips.
Usage
ocr(image, engine = tesseract("eng"), HOCR = FALSE)
ocr_data(image, engine = tesseract("eng"))
Arguments
image |
file path, URL, or raw vector of an image (png, tiff, jpeg, etc.) |
engine |
a tesseract engine created with tesseract() |
HOCR |
if TRUE, return the result as hOCR text instead of plain text |
Details
The ocr() function returns plain text by default, or hOCR text if HOCR is set to TRUE.
The ocr_data() function returns a data frame with a confidence rate and bounding box for each word in the text.
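The per-word output of ocr_data() lends itself to simple post-processing. The following is a minimal sketch, not part of the package: filter_words() is a hypothetical helper, and it assumes the data frame has columns named word, confidence and bbox, with bbox stored as a comma-separated string "x1,y1,x2,y2". The stand-in data frame is invented for illustration so that the block runs without an OCR engine.

```r
# Hypothetical helper (not part of the tesseract package): keep words at or
# above a confidence threshold and split the bounding box into numeric columns.
# Assumes columns `word`, `confidence` and `bbox` ("x1,y1,x2,y2").
filter_words <- function(df, min_conf = 80) {
  keep <- as.data.frame(df[df$confidence >= min_conf, , drop = FALSE])
  parts <- strsplit(as.character(keep$bbox), ",", fixed = TRUE)
  coords <- matrix(
    as.numeric(unlist(parts)),
    ncol = 4, byrow = TRUE,
    dimnames = list(NULL, c("x1", "y1", "x2", "y2"))
  )
  cbind(keep[c("word", "confidence")], coords)
}

# Stand-in for the result of ocr_data(); the values are made up.
words <- data.frame(
  word = c("This", "is", "n0ise"),
  confidence = c(96.5, 95.1, 41.2),
  bbox = c("36,92,96,116", "109,92,129,116", "141,98,156,116"),
  stringsAsFactors = FALSE
)

good <- filter_words(words)
print(good)
```

With the package installed, the same helper could be applied to real output, for example filter_words(ocr_data(image)), provided the column layout matches the assumption above.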
See Also
Other tesseract: tesseract(), tesseract_download()
Examples
# Simple example
text <- ocr("https://jeroen.github.io/images/testocr.png")
cat(text)
xml <- ocr("https://jeroen.github.io/images/testocr.png", HOCR = TRUE)
cat(xml)
df <- ocr_data("https://jeroen.github.io/images/testocr.png")
print(df)
# Full roundtrip test: render PDF to image and OCR it back to text
curl::curl_download("https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf", "R-intro.pdf")
orig <- pdftools::pdf_text("R-intro.pdf")[1]
# Render the first page of the PDF to a TIFF image
img_file <- pdftools::pdf_convert("R-intro.pdf", format = 'tiff', pages = 1, dpi = 400)
unlink("R-intro.pdf")
# Extract text from the TIFF image
text <- ocr(img_file)
unlink(img_file)
cat(text)
# Create an engine that only recognizes digits
engine <- tesseract(options = list(tessedit_char_whitelist = "0123456789"))
Tesseract Engine
Description
Create an OCR engine for a given language and control parameters. This can be used by the ocr and ocr_data functions to recognize text.
Usage
tesseract(
language = "eng",
datapath = NULL,
configs = NULL,
options = NULL,
cache = TRUE
)
tesseract_params(filter = "")
tesseract_info()
Arguments
language |
string with language for training data. Usually defaults to "eng". |
datapath |
path with the training data for this language. Default uses the system library. |
configs |
character vector with files, each containing one or more parameter values. These can be config files in the current directory or any of the standard tesseract config files that live in the tessdata directory. See details. |
options |
a named list with tesseract parameters. See details. |
cache |
speed things up by caching engines |
filter |
only list parameters containing a particular string |
Details
Tesseract control parameters can be set either via a named list in the options parameter, or in a config file: a text file which contains the parameter name followed by a space and then the value, one per line. Use tesseract_params() to list or find parameters. Note that some parameters are only supported in certain versions of libtesseract, and that invalid parameters can sometimes cause libtesseract to crash.
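The config file format described above can be produced from the same named list that would otherwise be passed as options. This is a minimal sketch using base R only: write_tesseract_config() and the file name digits.config are hypothetical, and tessedit_pageseg_mode is used purely as a second illustrative parameter whose availability may depend on the libtesseract version.

```r
# Hypothetical helper (not part of the tesseract package): write a named list
# of parameters as "name value" pairs, one per line.
write_tesseract_config <- function(params, path) {
  values <- vapply(params, as.character, character(1))
  writeLines(paste(names(params), values), path)
  invisible(path)
}

# The same list could be passed directly as the `options` argument.
params <- list(
  tessedit_char_whitelist = "0123456789",
  tessedit_pageseg_mode = 6   # illustrative value, an assumption
)

# Hypothetical file name; written to a temporary directory for this demo.
config_file <- file.path(tempdir(), "digits.config")
write_tesseract_config(params, config_file)

# Show the resulting file contents
config_lines <- readLines(config_file)
cat(config_lines, sep = "\n")
```

The file name would then be supplied through the configs argument. As the argument description notes, config files are looked up in the current directory or the tessdata directory, so in practice the file may need to be written to one of those locations rather than to a temporary directory.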
See Also
Other tesseract: ocr(), tesseract_download()
Examples
tesseract_params('debug')
Tesseract Training Data
Description
Helper function to download training data from the official tessdata repository. On Linux, the fast training data can be installed directly with yum or apt-get.
Usage
tesseract_download(
lang,
datapath = NULL,
model = c("fast", "best"),
progress = interactive()
)
Arguments
lang |
three letter code for language, see tessdata repository. |
datapath |
destination directory in which to store the downloaded file |
model |
either "fast" or "best" |
progress |
print progress while downloading |
Details
Tesseract uses training data to perform OCR. Most systems default to English training data. To improve OCR performance for other languages you can install the training data from your distribution. For example, to install the Spanish training data:
- tesseract-ocr-spa (Debian, Ubuntu)
- tesseract-langpack-spa (Fedora, EPEL)
On Windows and macOS you can install languages using the tesseract_download function, which downloads training data directly from GitHub and stores it in the path on disk given by the TESSDATA_PREFIX variable.
See Also
Other tesseract: ocr(), tesseract()
Examples
## Not run:
if (is.na(match("fra", tesseract_info()$available)))
  tesseract_download("fra", model = "best")
french <- tesseract("fra")
text <- ocr("https://jeroen.github.io/images/french_text.png", engine = french)
cat(text)
## End(Not run)