KGS Optical Character Recognition Plugin
About
Some KGS OSGi products can run optical character recognition (OCR) on documents. This plugin provides that capability.
Content
Introduction
The plugin is used by products such as KGS Migration and KGS scan server.
The bundle is designed as a multi-instance bundle, which means that several instances can be configured. By default, a local OCR recognizer instance is configured; it can be disabled in the configuration.
Since OCR is a CPU-intensive and time-consuming task, configuring additional remote instances is recommended for large document volumes.
This article describes how this plugin works and how to configure it.
Precondition
If you are installing on a Linux system, install Tesseract first. On Windows systems this is not required.
CentOS 8
sudo dnf config-manager --add-repo https://download.opensuse.org/repositories/home:/Alexander_Pozdnyakov/CentOS_8/
sudo rpm --import https://build.opensuse.org/projects/home:Alexander_Pozdnyakov/public_key
sudo dnf install tesseract
sudo dnf install tesseract-langpack-deu
RHEL 7
sudo yum-config-manager --add-repo https://download.opensuse.org/repositories/home:/Alexander_Pozdnyakov/RHEL_7/
sudo yum update
sudo yum install tesseract
sudo yum install tesseract-langpack-deu
CentOS 7
sudo yum-config-manager --add-repo https://download.opensuse.org/repositories/home:/Alexander_Pozdnyakov/CentOS_7/
sudo rpm --import https://build.opensuse.org/projects/home:Alexander_Pozdnyakov/public_key
sudo yum update
sudo yum install tesseract
sudo yum install tesseract-langpack-deu
Debian (run as root)
How it works
The plugin has a built-in recognizer. In addition, remote instances can be configured. How remote instances work internally is outside the scope of this article; only the built-in OCR recognizer is explained here.
When the recognizePage method of this plugin is called, it tries to obtain a free service. This can be a remote service or the local one.
Once a service has been selected, it is called.
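The selection step can be pictured with a minimal sketch. The class names `OcrService` and `ServiceSelector` are hypothetical and not part of the plugin, and the "first free service wins" order is an assumption, since this article does not state how a free service is chosen.

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical service handle; the plugin's real types are not documented here. */
final class OcrService {
    final String name;
    final boolean remote;
    private final AtomicBoolean busy = new AtomicBoolean(false);

    OcrService(String name, boolean remote) {
        this.name = name;
        this.remote = remote;
    }

    /** Atomically claims the service; returns false if it is already in use. */
    boolean tryClaim() {
        return busy.compareAndSet(false, true);
    }

    /** Marks the service as free again after a recognition call. */
    void release() {
        busy.set(false);
    }
}

final class ServiceSelector {
    /** Returns the first service that could be claimed (assumed order), or empty if all are busy. */
    static Optional<OcrService> selectFree(List<OcrService> services) {
        for (OcrService service : services) {
            if (service.tryClaim()) {
                return Optional.of(service);
            }
        }
        return Optional.empty();
    }
}
```

A caller would claim a service with `ServiceSelector.selectFree(...)`, run the recognition, and call `release()` in a `finally` block so the service becomes available again.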
How the local recognizer works
The local recognizer is based on Tesseract (https://en.wikipedia.org/wiki/Tesseract_(software) ). It maintains a pool of local instances. If no pool size is configured, the pool size defaults to the number of CPU cores.
When the local recognizer is called, an instance is taken from the pool. If no instance is available and the pool has not yet reached its configured maximum size, a new instance is created, added to the pool, and returned. The recognizer is then invoked.
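The pool behaviour described above can be sketched as follows. `RecognizerPool` is a hypothetical class, not the plugin's implementation. Treating a timeout of 0 or less as "wait indefinitely" is an assumption that mirrors the Local Recognizer Pool Timeout setting.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical bounded pool that grows lazily up to maxSize. */
final class RecognizerPool<T> {
    private final BlockingQueue<T> idle;
    private final AtomicInteger created = new AtomicInteger();
    private final int maxSize;
    private final long timeoutMs;
    private final Supplier<T> factory;

    /** maxSize must be at least 1, e.g. Runtime.getRuntime().availableProcessors(). */
    RecognizerPool(int maxSize, long timeoutMs, Supplier<T> factory) {
        this.maxSize = maxSize;
        this.timeoutMs = timeoutMs;
        this.factory = factory;
        this.idle = new ArrayBlockingQueue<>(maxSize);
    }

    T acquire() throws InterruptedException, TimeoutException {
        // 1. Reuse an idle instance if one is available.
        T instance = idle.poll();
        if (instance != null) {
            return instance;
        }
        // 2. Otherwise create a new instance while the pool is below its maximum.
        int count = created.get();
        while (count < maxSize) {
            if (created.compareAndSet(count, count + 1)) {
                return factory.get();
            }
            count = created.get();
        }
        // 3. Pool is full: wait for a released instance (assumption: <= 0 waits forever).
        instance = timeoutMs > 0 ? idle.poll(timeoutMs, TimeUnit.MILLISECONDS) : idle.take();
        if (instance == null) {
            throw new TimeoutException("No recognizer available within " + timeoutMs + " ms");
        }
        return instance;
    }

    /** Returns an instance to the pool so that other callers can reuse it. */
    void release(T instance) {
        idle.offer(instance);
    }

    /** Number of instances created so far. */
    int size() {
        return created.get();
    }
}
```

Callers should pair every `acquire()` with a `release()` in a `finally` block; otherwise the pool runs dry and later calls fail with a timeout.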
Configuration
The plugin is a multi-instance plugin. Therefore there is an instance configuration as well as a global one.
The following sections explain how to open the configuration and then describe the instance and global configuration.
Enter configuration
Use the OSGi main menu to open the KGS OCR Recognizer configuration.
Instance configuration
By default, a local OCR instance is configured.
Additional remote instances, which are called via HTTP, can be added with the new Instance button.
After the new instance has been named, it appears in the list.
To configure the remote instance, click the gear icon.
Then enter the remote instance URL.
The reachability of the remote instance can be tested with the information icon.
By default, the local instance is deactivated as soon as a remote instance is configured. See the config item Disable builtin reco.
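The interaction between remote instances and the Disable builtin reco flag can be written as a one-line rule. The class and method names below are hypothetical; the rule itself follows the description of that config item.

```java
/** Hypothetical helper expressing when the local recognizer stays active. */
final class LocalRecognizerRule {
    /**
     * The local recognizer is disabled only if the flag is set AND
     * at least one remote instance is configured.
     */
    static boolean localEnabled(boolean disableBuiltinReco, int remoteInstanceCount) {
        return !(disableBuiltinReco && remoteInstanceCount > 0);
    }
}
```

With the default flag value of true, the local recognizer therefore stays active until the first remote instance is added.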
Global configuration
Open the OCR plugin configuration.
Click on “Configuration Editor”.
There are two configuration tabs:
OCR
Common
Global configuration Items
OCR
config item | meaning | default |
---|---|---|
Preserve Interword Whitespaces | | |
User defined DPI | If defined, the recognizer processes images extracted from PDFs at the given resolution. | 300 |
OCR Whitelist | | * |
OCR Blacklist | | |
Optimize for Invoice | | false |
Local Recognizer Pool Size | Number of recognizer instances for the local recognizer. If not set, the number of CPUs is used. | |
Local Recognizer Pool Timeout | Timeout for the local recognizer pool. If no recognizer can be obtained within the given timeout (ms), a timeout exception is thrown. If no value is given, the call waits indefinitely. | 60000 |
Common
config item | meaning | default |
---|---|---|
Disable builtin reco | If other recognizer instances are configured and this flag is enabled, the local recognizer instance is disabled. | true |
Http Connection timeout (ms) | If other recognizer instances are configured, they are called via HTTP. This is the HTTP connection timeout. | 0 |
Query Thread Pool | | 5 |
Debug Level | | 2 |
Recognition Languages | The languages used for recognition. | eng+deu |
Working Directory | | |
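Several of these items have close counterparts in the Tesseract command-line tool, which may help when interpreting their values. The sketch below builds a `tesseract` command line from such settings. That the plugin maps its config items this way is an assumption (it may well use Tesseract through an API instead), and `TesseractArgs` is a hypothetical helper. The options `-l`, `--dpi` (Tesseract 4 and later) and `-c` and the variables `preserve_interword_spaces`, `tessedit_char_whitelist` and `tessedit_char_blacklist` are real Tesseract options. Treating a whitelist of `*` as "no restriction" is likewise an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: translates OCR settings into a tesseract command line. */
final class TesseractArgs {
    static List<String> build(String input, String outputBase, String languages,
                              int dpi, boolean preserveSpaces,
                              String whitelist, String blacklist) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");
        cmd.add(input);
        cmd.add(outputBase);
        // Languages are joined with '+', e.g. "eng+deu".
        cmd.add("-l");
        cmd.add(languages);
        if (dpi > 0) {
            cmd.add("--dpi");
            cmd.add(Integer.toString(dpi));
        }
        if (preserveSpaces) {
            cmd.add("-c");
            cmd.add("preserve_interword_spaces=1");
        }
        // Assumption: "*" or an empty value means that no character filter is applied.
        if (whitelist != null && !whitelist.isEmpty() && !"*".equals(whitelist)) {
            cmd.add("-c");
            cmd.add("tessedit_char_whitelist=" + whitelist);
        }
        if (blacklist != null && !blacklist.isEmpty()) {
            cmd.add("-c");
            cmd.add("tessedit_char_blacklist=" + blacklist);
        }
        return cmd;
    }
}
```

The resulting list can be passed to `java.lang.ProcessBuilder` to try a given combination of settings against a sample page outside the plugin.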