KGS Optical Character Recognition Plugin

About

Some KGS OSGi products can run optical character recognition (OCR) on documents. This plugin provides that capability.

Introduction

The plugin is used by products such as KGS Migration and KGS scan server.

The bundle is designed as a multi-instance bundle, which means that several instances can be configured. By default, a local OCR recognizer instance is configured; it can be disabled in the configuration.

Since OCR is a resource- and time-consuming task, configuring additional remote instances is recommended for large document volumes.

This article describes how this plugin works and how to configure it.

Precondition

On Linux systems, install Tesseract before installing the plugin. On Windows systems this is not required.

CentOS 8

sudo dnf config-manager --add-repo https://download.opensuse.org/repositories/home:/Alexander_Pozdnyakov/CentOS_8/
sudo rpm --import https://build.opensuse.org/projects/home:Alexander_Pozdnyakov/public_key
sudo dnf install tesseract
sudo dnf install tesseract-langpack-deu

RHEL 7

sudo yum-config-manager --add-repo https://download.opensuse.org/repositories/home:/Alexander_Pozdnyakov/RHEL_7/
sudo yum update
sudo yum install tesseract
sudo yum install tesseract-langpack-deu

CentOS 7

sudo yum-config-manager --add-repo https://download.opensuse.org/repositories/home:/Alexander_Pozdnyakov/CentOS_7/
sudo rpm --import https://build.opensuse.org/projects/home:Alexander_Pozdnyakov/public_key
sudo yum update
sudo yum install tesseract
sudo yum install tesseract-langpack-deu

Debian (run as root)

How it works

The plugin has a built-in recognizer. In addition, remote instances can be configured. How remote instances work internally is out of scope for this article; only the built-in OCR recognizer is explained here.

When the recognizePage method of this plugin is called, the plugin tries to obtain a free service. This can be a remote service or the local one.

Once a service has been selected, it is called.
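The selection step described above can be sketched roughly as follows. This is a minimal illustration in Python, not the plugin's actual implementation: the `RecognizerService` type, the `busy` flag, and the behaviour when no service is free (raising an error) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class RecognizerService:
    """Hypothetical stand-in for a local or remote recognizer service."""
    name: str
    busy: bool
    recognize: Callable[[bytes], str]


def select_free_service(services: List[RecognizerService]) -> Optional[RecognizerService]:
    # Return the first service that is not busy, whether remote or local.
    for service in services:
        if not service.busy:
            return service
    return None


def recognize_page(services: List[RecognizerService], page: bytes) -> str:
    service = select_free_service(services)
    if service is None:
        # Assumption: the source does not say what happens in this case.
        raise RuntimeError("no free recognizer service")
    service.busy = True
    try:
        # Call the selected service.
        return service.recognize(page)
    finally:
        service.busy = False
```

In this sketch a busy remote service is simply skipped, so the request falls through to the next free service in the list.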

How the local recognizer works

The local recognizer is based on Tesseract (see also: Tesseract (software)). It maintains a pool of local recognizer instances. If no pool size is configured, the pool size equals the number of CPU cores.

When the local recognizer is called, an instance is taken from the pool. If no instance is available and the pool has not yet reached its configured maximum size, a new instance is created, added to the pool and returned. The recognizer is then called on that instance.
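A pool with this behaviour could look roughly like the sketch below. It is written in Python for illustration only; the class name `RecognizerPool`, the `factory` callback, and the use of a condition variable are assumptions, not the plugin's real code. The timeout handling follows the description of the "Local Recognizer Pool Timeout" config item: wait up to the given number of milliseconds, or indefinitely if no value is set.

```python
import os
import threading
import time
from typing import Callable, Generic, List, Optional, TypeVar

T = TypeVar("T")


class RecognizerPool(Generic[T]):
    """Illustrative pool that grows lazily up to a maximum size."""

    def __init__(self, factory: Callable[[], T],
                 max_size: Optional[int] = None,
                 timeout_ms: Optional[int] = 60000) -> None:
        self._factory = factory
        # If no size is given, fall back to the number of CPUs.
        self._max_size = max_size or os.cpu_count() or 1
        self._timeout = None if timeout_ms is None else timeout_ms / 1000.0
        self._idle: List[T] = []
        self._created = 0
        self._cond = threading.Condition()

    def acquire(self) -> T:
        deadline = None if self._timeout is None else time.monotonic() + self._timeout
        with self._cond:
            while True:
                if self._idle:
                    return self._idle.pop()
                if self._created < self._max_size:
                    # Pool not full yet: create a new instance and hand it out.
                    self._created += 1
                    return self._factory()
                remaining = None if deadline is None else deadline - time.monotonic()
                if remaining is not None and remaining <= 0:
                    raise TimeoutError("no recognizer available within timeout")
                # Wait until an instance is released or the timeout elapses.
                self._cond.wait(remaining)

    def release(self, instance: T) -> None:
        with self._cond:
            self._idle.append(instance)
            self._cond.notify()
```

A caller would acquire an instance, run the recognition, and release the instance in a `finally` block so that it always returns to the pool.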

Configuration

The plugin is a multi-instance plugin. Therefore there is a configuration per instance as well as a global one.

The following sections first explain how to open the configuration, then describe the instance configuration and the global configuration.

Enter configuration

  • Use the OSGi main menu to open the KGS OCR Recognizer configuration.

Instance configuration

  • By default, a local OCR instance is configured.

  • Additional remote instances, which are called via HTTP, can be added with the "new Instance" button.

  • After the new instance has been named, it appears in the list.

  • To configure the remote instance, click the gear icon.

  • Then enter the remote instance URL.

  • The reachability of the remote instance can be tested with the information icon.
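Outside the configuration UI, a comparable reachability check can be approximated with a plain HTTP request. The sketch below is only an illustration in Python: the function name `is_reachable` is hypothetical, and it is an assumption that a successful HTTP response from the configured URL indicates a usable remote instance. What the information icon actually checks is not documented here.

```python
import urllib.error
import urllib.request


def is_reachable(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx/3xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # Covers HTTP errors, refused connections, timeouts and malformed URLs.
        return False
```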

By default, the local instance is deactivated as soon as a remote instance is configured. See the config item "Disable builtin reco".

Global configuration

  • Open the OCR plugin configuration.

  • Click "Configuration Editor".

  • There are 2 tabs for configuration:

    • OCR

    • Common

Global configuration Items

OCR

| config item | meaning | default |
|---|---|---|
| Preserve Interword Whitespaces | | |
| User defined DPI | If defined, the recognizer processes images extracted from PDFs at the given resolution. | 300 |
| OCR Whitelist | | * |
| OCR Blacklist | | |
| Optimize for Invoice | | false |
| Local Recognizer Pool Size | Number of recognizer instances for the local recognizer. If not given, the number of CPUs is used. | |
| Local Recognizer Pool Timeout | Timeout for the local recognizer pool. If no recognizer can be obtained within the given timeout (ms), a timeout exception is thrown. If no value is given, the call waits indefinitely. | 60000 |
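Several of the OCR items have names that match Tesseract parameters (`preserve_interword_spaces`, `tessedit_char_whitelist`, `tessedit_char_blacklist`, and the `--dpi` option). How the plugin passes these values to Tesseract is not documented here, so the following Python sketch is only an assumption about what an equivalent Tesseract command line might look like. The function name, its arguments, and the treatment of `*` as "no restriction" are hypothetical.

```python
from typing import List, Optional


def build_tesseract_command(image_path: str,
                            output_base: str,
                            languages: str = "eng+deu",
                            dpi: Optional[int] = 300,
                            preserve_interword_spaces: bool = False,
                            whitelist: Optional[str] = None,
                            blacklist: Optional[str] = None) -> List[str]:
    """Build an argument list for the tesseract command-line tool."""
    cmd = ["tesseract", image_path, output_base, "-l", languages]
    if dpi is not None:
        cmd += ["--dpi", str(dpi)]
    if preserve_interword_spaces:
        cmd += ["-c", "preserve_interword_spaces=1"]
    # Assumption: "*" (the documented default) means "no restriction".
    if whitelist and whitelist != "*":
        cmd += ["-c", f"tessedit_char_whitelist={whitelist}"]
    if blacklist:
        cmd += ["-c", f"tessedit_char_blacklist={blacklist}"]
    return cmd
```

For example, `build_tesseract_command("page.png", "out", whitelist="0123456789")` would restrict recognition to digits, which can be useful for numeric fields.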

Common

| config item | meaning | default |
|---|---|---|
| Disable builtin reco | If other recognizer instances are configured and this flag is enabled, the local recognizer instance is disabled. | true |
| Http Connection timeout (ms) | If other recognizer instances are configured, they are called via HTTP. This is the HTTP connection timeout. | 0 |
| Query Thread Pool | | 5 |
| Debug Level | | 2 |
| Recognition Languages | The languages used by the recognizer. | eng+deu |
| Working Directory | | |

Related