Prime Recognition praise Transym for accuracy

Large volume document scanning requires a high level of accuracy for it to prove cost effective when compared to manual data entry. Through their extensive research, Prime Recognition have identified that 66% of all lifetime imaging costs are attributable to correcting errors. For those processing a lot of documents, that could prove to be quite an expense over time.

To overcome this, Prime OCR uniquely incorporates six OCR engines which are automatically called upon to vote on documents that the primary engine has identified as difficult or of poor quality. Transym have partnered with Prime Recognition to integrate the TOCR engine into Prime OCR to add further accuracy and reliability to the solution.

Transym’s intensive testing, research and development processes have produced one of the best performing single engine solutions currently available.

“Prime Recognition continually evaluates OCR products to include with the six OCR engines we currently use in our voting system. As both a developer of OCR technology and a very sophisticated user of OCR technology we are in a good position to evaluate OCR products,” says Kenn Dahl, president of Prime Recognition Inc.

“Transym is the first new engine that we’ve tested in 14 years that is competitive with the leading engines. This is quite an accomplishment: the leading engines represent hundreds of man-years of development, and it is usually very difficult for relatively new companies to catch up. I look forward to what Transym can accomplish in the future, building from their strong position.”

Transym are committed to further increasing the accuracy and reliability of our solutions. By creating products that are optimised for ease of integration, we enable partners such as Prime Recognition to frequently enhance their own offerings, maintaining their competitive edge and driving their performance forward.

Prime Recognition was founded in 1993 in Silicon Valley and relocated to the Seattle area in 2000. Internationally recognised for their high-accuracy production systems, they have won numerous awards, including Best of AIIM and Product of the Year from Imaging Magazine.

For more information on Prime Recognition or their solutions please visit: http://www.primerecognition.com/

CCI rely on Transym to add more power

The digital age has increased the pace at which business is conducted. Companies need quick access to working documentation to maintain a competitive edge. Document management applications have proven to be an effective solution to this growing problem.

Capture Plus, a Transym OCR partner, is a leading provider of affordable document management solutions for a wide range of customers.

“As the value leader in capture solutions, we needed a cost-effective but reliable OCR engine that kept pace with the flexibility and power of our solution – especially for operator-unattended processes,” says Randall Kochis, CEO of Westchester-based CCI.

Capture Plus was designed in an “open architecture” environment to leverage free tools from Microsoft and Adobe. There was a strong requirement to maintain the integrity of their strategy and avoid any proprietary technology.

Transym OCR’s flexible, non-invasive technology was the ideal solution.

“We use TOCR to perform zone OCR and then perform lookups and extraction of data from any accounting system database,” explains Kochis. “Zone OCR is also very useful for creating other automated flows which preclude the need for any manual indexing.”

This is a very popular application for wholesale distributors who perform zone OCR on their invoice and order numbers in order to validate against the data within their accounting system. This can involve large numbers of documents being batch processed, so a reliable solution that can work uninterrupted is essential.
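The lookup-and-validate flow described above can be sketched in a few lines. This is purely an illustration of the pattern, not CCI's actual code: the normalisation rule, field names and data source are all hypothetical, and in practice the known invoice numbers would be queried from the accounting system database.

```python
# Hypothetical sketch of zone-OCR validation against an accounting system.
# All names here are illustrative, not part of any real product's API.

def normalise(value: str) -> str:
    """Strip whitespace and one common OCR confusion before lookup."""
    cleaned = value.strip().upper()
    # Example normalisation only: OCR often confuses 'O' with '0'
    # in numeric fields such as invoice numbers.
    return cleaned.replace("O", "0")

def validate_invoices(ocr_zones, known_invoice_numbers):
    """Split zone-OCR results into matched records and items needing review."""
    matched, needs_review = [], []
    for zone_text in ocr_zones:
        candidate = normalise(zone_text)
        if candidate in known_invoice_numbers:
            matched.append(candidate)
        else:
            needs_review.append(zone_text)  # queue for manual indexing
    return matched, needs_review

# In a real deployment these would come from the accounting database.
known = {"INV-10042", "INV-10043", "INV-10044"}
matched, review = validate_invoices(["INV-1OO42 ", "INV-10043", "INV-99999"], known)
```

Anything that fails the lookup is routed to a manual-review queue rather than stalling the batch, which is the behaviour the article describes as essential for unattended processing.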

TOCR’s ability to minimise the number of errors encountered, and to handle those that do occur efficiently and without stalling, was a key factor in CCI’s selection of the technology.

“We also use TOCR to split documents based on words and to create searchable PDFs for simple and fast document retrieval,” added Kochis.

The result is that CCI have been able to maintain their open architecture strategy and produce a great value product that sells around the world.

“Historically, I used another vendor for my OCR engine, but I found their solution to be inferior to TOCR in many ways. The main reason, of course, is reliability. No comparison,” says Kochis.

“Capture Plus would not be as powerful without the use of TOCR. It is the best value in the marketplace.”

For more information on how Capture Plus can help you capture and manage your essential documents contact Randall Kochis at CCI at rkochis @ lcor.com / 001-610-436-6002.


Speed versus Accuracy

In Version 3.2 we introduced a speed option facility, and this option has been carried over to Version 4.0. The speed tests below were performed on TOCR 3.3.

Speed options can be 0 (default), 1, 2, or 3, from slowest (0) to fastest (3). These options tell TOCR how exhaustive it should be in looking for improvements. There is a small loss in accuracy from slower to faster speed options.

Our testing on a large database shows the following changes with speed options. All % changes are relative to speed option 0.

Speed option    Time change    Score accuracy change
1               -10.6%         -0.0075%
2               -17.0%         -0.0177%
3               -22.1%         -0.0483%

The time changes (speed-ups) are fairly regular; it would be rare for a higher speed option to cause a slowdown in processing, though it is possible for the odd file. The accuracy changes are much more variable; they are simply the effect of less exhaustive processing. They are an average, and therefore a guide to what to expect.
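One practical way to use these averages is to pick the fastest speed option whose expected accuracy cost stays within a budget. The figures below are the averages from the table above; the selection helper itself is our own illustration, not part of the TOCR API.

```python
# Sketch: choose the fastest speed option within an accuracy budget.
# The figures are the published averages relative to speed option 0;
# the helper function is illustrative, not a TOCR API call.

# (option, time change %, score accuracy change %) relative to option 0
SPEED_PROFILES = [
    (0, 0.0, 0.0),
    (1, -10.6, -0.0075),
    (2, -17.0, -0.0177),
    (3, -22.1, -0.0483),
]

def fastest_within_budget(max_accuracy_loss: float) -> int:
    """Return the highest speed option whose average accuracy loss
    (in percentage points) does not exceed max_accuracy_loss."""
    best = 0
    for option, _time_change, acc_change in SPEED_PROFILES:
        if -acc_change <= max_accuracy_loss:
            best = max(best, option)
    return best
```

Under these averages, a batch job that can tolerate up to 0.02 points of accuracy loss would run at speed option 2; a job with no tolerance at all would stay at option 0.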

The following table shows accuracy and speedup variation across a range of different datasets (A to J). Only speed options 0 and 3 are shown for simplicity (they provide the widest range of values). Dataset maximum scores range from 951k to 11236k. The greatest speedups seem to us to come from the most difficult datasets (noisy, joined and broken characters, etc.)

Dataset   Option 0 Err %   Option 3 Err %   Err Difference   Err % Increase   % Time Change
A         0.4946           0.4948           0.0003           0.0530           -11.153
B         6.0570           6.0684           0.0114           0.1875           -53.375
C         0.1513           0.1637           0.0125           8.2329           -11.436
D         0.0822           0.0955           0.0133           16.1309          -13.873
E         0.0915           0.1093           0.0178           19.5039          -12.354
F         0.4809           0.5016           0.0206           4.2896           -22.727
G         0.8344           0.8683           0.0339           4.0682           -17.107
H         1.0005           1.0376           0.0371           3.7111           -16.336
I         1.0177           1.0695           0.0519           5.0967           -24.022
J         2.5081           2.6676           0.1595           6.3600           -39.376

Note that while the error % increases can in some cases look very high (D and E), those datasets also have very high accuracy, so the error difference looks much more reasonable. Conversely, where the error difference is high (J), the dataset has low accuracy, and the error % increase is much more reasonable.
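The relationship between the two error columns is simple arithmetic, sketched below with round illustrative numbers (not taken from the table) to show why a tiny absolute difference can still be a large relative increase on a very accurate dataset.

```python
# Sketch of the arithmetic behind the two error columns.
def error_metrics(err_slow: float, err_fast: float):
    """Given error rates (%) at speed options 0 and 3, return the
    absolute difference (points) and the relative increase (%)."""
    difference = err_fast - err_slow
    increase_pct = difference / err_slow * 100.0
    return difference, increase_pct

# Illustrative values: on a very accurate dataset, a difference of only
# 0.02 points is already a 20% relative increase in the error rate.
diff, pct = error_metrics(0.10, 0.12)
```

This is why the article recommends reading the "Err Difference" column for high-accuracy datasets and the "Err % Increase" column for low-accuracy ones.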

The table underestimates true TOCR accuracy, since the cells mix different processing options (Lex On and Lex Off, for example).

Please take note

The information in this article refers only to TOCR versions 3.2 and 3.3. TOCR 4.0 is a faster engine than either of these versions:

  • TOCR 4.0 64-bit is on average faster than TOCR 3.2 and 3.3 in all 4 speed options on our test data and hardware.
  • TOCR 4.0 64-bit is 33.8% faster than TOCR 3.2 and 3.3 using speed option 0, and 14.2% faster using speed option 3.
  • TOCR 4.0 32-bit is around 35% slower than TOCR 4.0 64-bit in each speed mode.

Testing for Greater Accuracy in OCR

This article was written in 2007 and covers TOCR Versions 1.4 and 2.0. These versions have since been superseded by Version 3.3 Pro. Our mission is to produce the most reliable and accurate OCR engine on the market, and TOCR Version 4.0 is our most accurate engine to date.

We thought some of our more technically minded readers might be interested to know more about how we train TOCR and to see some accuracy figures for the different versions of TOCR with the Lex option on and off, and on a variety of data groups.

Testing

OCR is a classic machine learning problem.

Within Transym, we have developed software which allows TOCR to “learn” how to be more accurate from the data it is provided with.

From this larger internal version we have created our production releases of TOCR, which, as a subset, have been through exactly the same lengthy and rigorous training. The results are products which are not only robust, but in which we have great confidence.

In order to make TOCR as accurate as possible, we have created or sourced a large set of images of different resolutions, sizes, sources and fonts, each with a text file containing what the image actually reads to a human. We call this verification data; it is also sometimes called ground truth data. Because of human error there is no certainty that the verification data is 100% accurate, and sometimes it is a matter of opinion as to what poor-quality text reads.
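Scoring OCR output against verification data is typically done with a character-level edit distance. The sketch below shows the standard technique; we are not claiming this is Transym's exact scoring method, only illustrating how an error rate can be computed against ground truth.

```python
# Minimal sketch: character error rate against verification (ground
# truth) data, using Levenshtein edit distance. A standard technique,
# not necessarily Transym's internal scoring method.

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and
    substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def char_error_rate(ocr_text: str, truth: str) -> float:
    """Character errors as a percentage of the ground-truth length."""
    return edit_distance(ocr_text, truth) / len(truth) * 100.0

# 'w' misread as 'vv' costs one substitution plus one insertion.
rate = char_error_rate("The quick brovvn fox", "The quick brown fox")
```

The "matter of opinion" caveat above matters here: when the ground truth itself is uncertain, the measured error rate has a floor below which comparisons between engines stop being meaningful.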

It is not possible for us to release a version of TOCR that “learns” in a working environment. Training can only take place over very large datasets (which takes considerable time), otherwise TOCR would adapt itself to “local” conditions but lose accuracy in other areas.

We do however welcome images from users to add to our database, especially if TOCR does not perform as well as hoped for.

Some of the images and verification data we have sourced ourselves, and some are from the ISRI database.

This data was sufficient for the development and production of TOCR Version 1.4 (the data is predominantly English).

Possibly uniquely, TOCR was and is trained by presenting it with real pages and then testing the accuracy against the known verification data and feeding back this information to the learning program.

For Version 2.0, to properly cover the European language characters and the other characters from the full Windows character set, we needed additional data. We did two things to achieve this:

  • We took non-English text from a variety of sources and manufactured images of that text in a variety of fonts.
  • Since some characters occur very infrequently, we manufactured verification data by randomising equal numbers of each character, and then manufactured images of the random characters in a variety of fonts and italic/bold combinations. Each page is of the same font, point size and style, but a wide variety of pages were created in different fonts and styles.
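The second approach, generating verification text with equal numbers of each character in randomised order, can be sketched as below. Only the text generation is shown; rendering the pages to images in different fonts and styles would be a separate step using a font rasteriser, and the character set here is a small illustrative sample.

```python
# Sketch: manufacture balanced verification data, i.e. a page of text
# containing each character equally often, in random order. The charset
# below is a small illustrative sample, not the full set used by Transym.
import random

def balanced_random_text(charset: str, copies_per_char: int, seed: int = 0) -> str:
    """Return a string containing each character in charset exactly
    copies_per_char times, shuffled into a random order."""
    chars = list(charset) * copies_per_char
    random.Random(seed).shuffle(chars)  # seeded for reproducible pages
    return "".join(chars)

page = balanced_random_text("ABCDEFÅÄÖ", copies_per_char=40)
```

Because the verification text is generated first and the images are rendered from it, the ground truth for these manufactured pages is exact, unlike scanned real-world data.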

Our total image set now consists of over 100,000 images. (*This is correct as of 25/06/2012 – the number of images is always growing. We will update this page occasionally to reflect this.)

The data is classified into training classes:

  • English real data: scanned images of magazines, scientific reports, letters etc. We have quite a lot of this data, so we split it into two parts:
      • The really difficult images, which we used for training.
      • The less difficult images, which we put to one side as a check that we were not overtraining.
  • European language data: real text from websites and other sources, but in the main the images are manufactured. This was used in training TOCR.
  • English manufactured data: randomised characters on a page in the same font, different pages having a wide variety of fonts, both regular and italic. We did not need to use this in training TOCR.
  • TOCR Version 2.0 recognisable character set (European for short) manufactured data: randomised characters on a page in the same font, different pages having a wide variety of fonts, both regular and italic. This was used in training TOCR.

These classes have different characteristics and present different problems for TOCR.

The real data presents TOCR with all the problems associated with scanning: skew, merged characters, broken characters, noise, and sometimes impossible-to-read images.

However, for English and European data we can use lexical knowledge of the language to help identify the text, except where it is randomised.

The manufactured data is in effect a “perfect scan”, but still presents TOCR with difficulties to be solved when characters are randomised:

  • No use can be made of any lexical knowledge.
  • Some glyphs within the same font are identical. For example in Arial font, the I (character code 73) and l (character code 108) appear identical.
  • Only the shape and size and line position of the character can really help identify it.

We now have 5 groups of data, to be tested under 2 conditions: Lex On and Lex Off, with Version 1.4 and Version 2.0 of TOCR. Lex On or Off is a TOCR processing option available to the end user and programmer.

As you might expect, using TOCR with Lex On for random data produces worse results than with Lex Off, so these accuracy figures have been omitted.

Additionally it’s not useful information to test Version 1.4 on characters it has not been trained to recognise, so these accuracy figures have also been omitted.

It is simpler to think in terms of percentage of character errors rather than percentage of characters correct.

In summary:

Version 2.0 is an accuracy improvement on Version 1.4 even though it has more scope to get it wrong, i.e. it recognises a wider range of characters.

Lex should be On where appropriate as this improves accuracy considerably.

We have tested some of the more popular competitor products on our data and have yet to find a more robust or accurate OCR engine under $1000.00. If there are any examples to the contrary that can be provided by our user and partner communities, we would welcome the chance to explore them as part of our continued commitment to improving the accuracy and reliability of our solutions.

The Selection Conundrum

Are all OCR engines created equal?

Choosing the right product in a market where everyone is supposedly the fastest and most accurate poses a dilemma: who to believe? You know that they can’t all be the best – or is OCR a rare example of an industry where all products are created equal?

Part of the problem is that OCR software providers have no defined benchmarks for testing, and no uniform code of standards to follow. This makes it difficult to compare performance statistics between different suppliers. The following may go some way to explaining why so many established integrators have turned to TOCR, even if they’ve used other products in the past.

The OCR Challenge

An OCR engine is faced with a difficult task – deciphering information quickly and accurately, while confronted with any number of problems, including:

  • Font changes, unusual fonts and broken characters
  • Characters in different orientations on the page
  • Creased, crumpled, stained and smudged pages
  • Foreign language and character sets
  • Pages with text obscured by annotations and diagrams
  • Poor quality scanning devices, or ink on the scanner glass

At the end of all this, OCR programmes are expected to extract accurate information from documents – at speed. Naturally, many are unable to cope with the demands, and no OCR is genuinely 100% accurate.

How accurate is accurate?

Here’s where the problem lies. Many OCR programs focus on speed – at the expense of truly accurate results. While they may claim high accuracy levels, when some engines are confronted with difficult tasks, such as the ones highlighted above, they give up (often after 30 seconds of processing).

In many applications, the ability to extract meaningful data from the most difficult of documents is key to a project’s success – so you need an OCR engine that works harder to maximise the data it can extract.

Our Solution

At Transym, we’re confident that we have one of the most accountable forms of testing on the market. Over more than a decade, we have “taught” our software how to read and convert difficult information. TOCR draws on a database of tens of thousands of images, the result of near-constant research and improvement.

This is why we believe that TOCR is the best solution for systems integrators, and offers the best value OCR engine available anywhere. You get fast results, but more importantly you get accurate information that you can rely on.