This article was written in 2007 and covers TOCR Versions 1.4 and 2.0. These versions have since been superseded, first by Version 3.3 Pro and now by Version 4.0. Our mission is to produce the most reliable and accurate OCR engine on the market, and TOCR Version 4.0 is our most accurate engine to date.
We thought some of our more technically minded readers might be interested to know more about how we train TOCR, and to see accuracy figures for the different versions of TOCR with the Lex option on and off, across a variety of data groups.
Testing
OCR is a classic machine learning problem.
Within Transym, we have developed a larger, internal build of TOCR containing software which allows it to "learn" how to be more accurate from the data it is provided with. From this larger version we create our production releases of TOCR which, as a subset of it, have been through exactly the same lengthy and rigorous training. The results are products which are not only robust but in which we have great confidence.
In order to make TOCR as accurate as possible, we have created or sourced a large set of images of different resolutions, sizes, sources and fonts, each with a text file containing what the image actually reads to a human. We call this verification data; it is also sometimes called ground truth data. Because of human error there is no certainty that the verification data is 100% accurate, and sometimes it is a matter of opinion what poor-quality text reads.
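As a rough illustration of how such a set might be organised, here is a minimal sketch that pairs each image with its verification text; the directory layout and file extensions are assumptions for the example, not TOCR's actual format.

```python
from pathlib import Path

def load_verification_pairs(root):
    """Yield (image_path, ground_truth_text) pairs.

    Assumes each scanned image page.tif sits beside a page.txt file
    holding what the page reads to a human (the verification data).
    """
    for image_path in sorted(Path(root).glob("*.tif")):
        text_path = image_path.with_suffix(".txt")
        if text_path.exists():
            yield image_path, text_path.read_text(encoding="utf-8")
```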
It is not possible for us to release a version of TOCR that "learns" in a working environment. Training can only take place over very large datasets (which takes considerable time); otherwise TOCR would adapt itself to "local" conditions but lose accuracy in other areas.
We do however welcome images from users to add to our database, especially if TOCR does not perform as well as hoped for.
Some of the images and verification data we have sourced ourselves, and some are from the ISRI database.
This data was sufficient for the development and production of TOCR Version 1.4 (the data is predominantly English).
Possibly uniquely, TOCR was, and is, trained by presenting it with real pages, testing its accuracy against the known verification data, and feeding the results back to the learning program.
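In outline, that feedback loop looks something like the sketch below; the function names (`recognise`, `score`, `adjust`) are illustrative placeholders, not Transym's actual training code.

```python
def train_on_real_pages(recognise, score, adjust, pairs, passes=10):
    """Illustrative train/test/feed-back loop over real pages.

    recognise(image) -> text runs OCR on a page,
    score(output, truth) -> error count compares it with the
    verification data, and adjust(output, truth, errors) feeds
    the result back into the learning program.
    """
    for _ in range(passes):
        for image, truth in pairs:
            output = recognise(image)       # OCR a real page
            errors = score(output, truth)   # test against ground truth
            adjust(output, truth, errors)   # feed back to the learner
```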
For Version 2.0 to properly cover the European language characters and the other characters from the full Windows character set, we needed additional data. We did two things to achieve this:
- We took non-English text from a variety of sources and manufactured images of that text in a variety of fonts.
- Since some characters occur very infrequently, we manufactured verification data by randomising equal numbers of each character, and then manufactured images of the random characters in a variety of fonts and italic/bold combinations (see the sketch below). Each page uses a single font, point size and style, but a wide variety of pages were created in different fonts and styles.
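A page of randomised characters might be manufactured along the following lines. This is a minimal Pillow sketch; the font path, page geometry and layout constants are assumptions for illustration rather than the values we actually use.

```python
import random
from PIL import Image, ImageDraw, ImageFont

def make_random_page(charset, font_path, point_size=12, size=(2480, 3508)):
    """Render equal numbers of each character, shuffled, onto one page.

    One font, point size and style per page; variety comes from
    generating many pages in different fonts and styles.
    """
    chars = list(charset) * 40                 # equal counts of every character
    random.shuffle(chars)
    page = Image.new("L", size, color=255)     # white A4 page at 300 dpi
    draw = ImageDraw.Draw(page)
    font = ImageFont.truetype(font_path, point_size * 4)  # ~1 pt = 4 px at 300 dpi
    x, y, step = 100, 100, point_size * 6
    for ch in chars:
        draw.text((x, y), ch, font=font, fill=0)
        x += step
        if x > size[0] - 200:                  # wrap to the next line
            x, y = 100, y + step
    return page, "".join(chars)                # the image and its verification text

# e.g. page, truth = make_random_page("ABCabc123", "arial.ttf", 12)
```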
Our total image set now consists of over 100,000 images. (Correct as of 25/06/2012; the number of images is always growing, and we will update this page occasionally to reflect this.)
The data is classified into training classes:
- English real data: scanned images of magazines, scientific reports, letters etc. We have quite a lot of this data, so we split it into two parts:
- The really difficult images to be used for training.
- The less difficult images, which we put to one side as a check that we were not overtraining.
- European language data: real text from websites and other sources, though in the main the images are manufactured. This was used in training TOCR.
- English manufactured data: randomised characters on a page in the same font, different pages having a wide variety of fonts both regular and italic. We did not need to use this in training TOCR.
- TOCR Version 2.0 recognisable character set (European for short) manufactured data: randomised characters on a page in the same font, different pages having a wide variety of fonts both regular and italic. This was used in training TOCR.
These classes have different characteristics and present different problems for TOCR.
The real data presents TOCR with all the problems associated with scanning: skew, merged characters, broken characters, noise, and images that are sometimes impossible to read.
However, for English and European data we can use lexical knowledge of the language to help identify the text, except where it is randomised.
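To illustrate the idea (this is generic lexical post-correction, not how TOCR's Lex option is actually implemented), a lexicon can resolve readings that are visually ambiguous:

```python
LEXICON = {"like", "little", "ill", "tile"}    # toy word list

def pick_reading(candidates):
    """Prefer the candidate reading that is a known word.

    In some fonts "Iike" and "like" render almost identically
    (I versus l); only lexical evidence tells them apart.
    """
    for word in candidates:
        if word.lower() in LEXICON:
            return word
    return candidates[0]       # no lexical evidence: keep the top guess

print(pick_reading(["Iike", "like"]))   # -> "like"
```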
The manufactured data is in effect a "perfect scan", but still presents TOCR with difficulties to be solved when characters are randomised:
- No use can be made of any lexical knowledge.
- Some glyphs within the same font are identical. For example, in Arial the I (character code 73) and the l (character code 108) appear identical (see the sketch after this list).
- Only the shape, size and line position of the character can really help identify it.
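You can see the identical-glyph problem directly by rendering both characters and comparing the bitmaps. A Pillow sketch follows; the Arial font path is platform-specific (e.g. C:/Windows/Fonts/arial.ttf on Windows).

```python
from PIL import Image, ImageDraw, ImageFont

def render(ch, font, size=(64, 96)):
    """Render one character as a small black-on-white bitmap."""
    img = Image.new("L", size, color=255)
    ImageDraw.Draw(img).text((8, 8), ch, font=font, fill=0)
    return img

font = ImageFont.truetype("arial.ttf", 48)
identical = render("I", font).tobytes() == render("l", font).tobytes()
# The two vertical bars are visually indistinguishable; depending on
# rendering they may or may not be bit-for-bit identical as well.
print("pixel-identical:", identical)
```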
We now have five groups of data, to be tested under two conditions (Lex On and Lex Off) with Version 1.4 and Version 2.0 of TOCR. Lex On/Off is a TOCR processing option available to both the end user and the programmer.
As you might expect, using TOCR with Lex On for random data produces worse results than with Lex Off, so these accuracy figures have been omitted.
Additionally, it is not useful to test Version 1.4 on characters it has not been trained to recognise, so those accuracy figures have also been omitted.
It is simpler to think in terms of percentage of character errors rather than percentage of characters correct.
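The error percentage is conveniently computed from the edit distance between the OCR output and the verification text; a minimal sketch:

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions + deletions + substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def char_error_percent(ocr_output, truth):
    """Character errors as a percentage of the ground-truth length."""
    return 100.0 * edit_distance(ocr_output, truth) / max(len(truth), 1)

print(char_error_percent("he1lo world", "hello world"))   # one substitution ≈ 9.1%
```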
In summary:
Version 2.0 is an accuracy improvement on Version 1.4, even though it has more scope to get things wrong, i.e. it recognises a wider range of characters.
Lex should be On where appropriate, as this improves accuracy considerably.
We have tested some of the more popular competitor products on our data and have yet to find a more robust or accurate OCR engine under $1000.00. If our user and partner communities can provide any examples to the contrary, we would welcome the chance to explore them as part of our continued commitment to improving the accuracy and reliability of our solutions.