OCR Tips – Transym

Use a good quality scanner

The higher the quality of the scanner, the better the image it will produce. Accurately scanned images create fewer errors, so you get faster and more accurate results.

Always check the image for potential scanning problems

If you’re processing a small number of documents, it’s always worth having a quick look to check for anything that might cause a problem, for example badly distorted images, correction fluid or folded pages. If you’re processing large batches, it’s essential that you check the scanner too: a small amount of correction fluid on the glass could cause an error on every single page that you process.

Use 300 or 400 DPI

This is the optimum resolution range for representing a normal-sized character. It provides the right balance between accuracy and efficiency. If the resolution is too low, the characters will be difficult to recognize. If it’s too high, processing time will increase and you’ll use more storage space.
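To see what the resolution means for an individual character, a rough calculation helps. The sketch below assumes that the point size of a font (1 point = 1/72 inch) approximates the height of a line of type. Actual glyphs are smaller than this, so treat the results as upper bounds. The function name is illustrative only.

```python
POINTS_PER_INCH = 72  # typographic points per inch

def type_height_px(point_size: float, dpi: int) -> float:
    """Approximate height in pixels of type set at point_size when scanned at dpi."""
    return point_size / POINTS_PER_INCH * dpi

for dpi in (150, 300, 400, 600):
    print(f"{dpi} DPI: 10 pt type is about {type_height_px(10, dpi):.0f} px tall")
```

The pixel count per character grows linearly with DPI, while the number of pixels on the page (and so the data to store and process) grows with its square.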

Scan to TIFF or BMP, not JPEG

JPEG in particular, like some other image formats, uses a lossy form of compression which is optimised for photos and is not good for OCR. Scan to TIFF or BMP instead.

Use the “scan for OCR” feature if available

Some scanners have built-in filters to handle photo images and text differently. Using these filters will help produce a more accurate and readable image.

Scan in black and white

Using colour or greyscale can increase the image file size by 10 to 50 times. To keep the amount of data being processed and stored to a minimum, always scan in black and white where possible.
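The size difference can be estimated from the bit depth alone. The sketch below computes raw, uncompressed sizes for an A4 page at 300 DPI, assuming 1 bit per pixel for black and white, 8 for greyscale and 24 for colour, and ignoring file headers and row padding. On these assumptions the ratios come out at 8x and 24x. Compressed files will differ, and black and white text pages usually compress very well, which likely accounts for the larger ratios seen in practice.

```python
def uncompressed_bytes(width_in: float, height_in: float, dpi: int, bits_per_pixel: int) -> int:
    """Raw size in bytes of a scanned page, ignoring headers and compression."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8

# A4 page (8.27 x 11.69 inches) at 300 DPI
modes = {"black and white": 1, "greyscale": 8, "colour": 24}
base = uncompressed_bytes(8.27, 11.69, 300, 1)
for name, bits in modes.items():
    size = uncompressed_bytes(8.27, 11.69, 300, bits)
    print(f"{name}: {size / 1e6:.1f} MB ({size / base:.0f}x)")
```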

Keep sectioning turned off unless you need it

Sectioning allows columns in the text to be recognized and read as columns. For example, if you have three columns next to each other, the top lines of the three columns are not read as a single sentence broken into three parts. Instead, each one is treated as the top line of its own column, and that column is read downwards before moving on to the next. Tables, however, need to be read left to right, and the sectioning feature will sometimes read a table as columns unless it is turned off.
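The difference in reading order can be shown with a toy model. This is only an illustration of the idea, not how TOCR implements sectioning; the `words` data and the `reading_order` function are hypothetical.

```python
# Each entry: (text, column index, line index). Hypothetical layout data.
words = [
    ("A1", 0, 0), ("B1", 1, 0), ("C1", 2, 0),
    ("A2", 0, 1), ("B2", 1, 1), ("C2", 2, 1),
]

def reading_order(words, sectioning: bool):
    """Return the texts in the order they would be read."""
    if sectioning:
        key = lambda w: (w[1], w[2])  # column first, then line
    else:
        key = lambda w: (w[2], w[1])  # line first, then column
    return [w[0] for w in sorted(words, key=key)]

print(reading_order(words, sectioning=True))   # ['A1', 'A2', 'B1', 'B2', 'C1', 'C2']
print(reading_order(words, sectioning=False))  # ['A1', 'B1', 'C1', 'A2', 'B2', 'C2']
```

The first order suits newspaper-style columns; the second is the one a table needs.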

Deal with zones (segments of an image) all together rather than separately

TOCR does not currently support OCRing of zones (regions of a page); it will always OCR the whole image. If you wish to OCR zones then you should either:

a) OCR the whole image and extract characters based on the positional information returned, or
b) Copy the zones to a new image and OCR the new image.
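Option a) might be sketched as follows. The `Char` record is a hypothetical stand-in for whatever positional information the OCR engine returns; it is not TOCR's actual result structure. A character is counted as belonging to a zone when the centre of its bounding box lies inside the zone rectangle.

```python
from typing import NamedTuple

class Char(NamedTuple):
    """One recognised character with its bounding box in pixels (hypothetical format)."""
    text: str
    x: int
    y: int
    width: int
    height: int

def chars_in_zone(chars, left, top, right, bottom):
    """Keep characters whose box centre falls inside the zone rectangle."""
    selected = []
    for c in chars:
        cx = c.x + c.width / 2
        cy = c.y + c.height / 2
        if left <= cx < right and top <= cy < bottom:
            selected.append(c)
    return selected

page = [Char("I", 10, 10, 8, 12), Char("D", 20, 10, 8, 12), Char("7", 200, 300, 8, 12)]
zone_text = "".join(c.text for c in chars_in_zone(page, 0, 0, 100, 50))
print(zone_text)  # ID
```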

Accuracy is usually much improved if all the zones are OCR’d at once rather than individually. To do this, copy the zones to a clean image and separate them vertically with plenty of white space.
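Copying zones onto a clean image can be sketched without any imaging library by treating the page as a list of pixel rows. This representation, the `stack_zones` name and the default gap are assumptions made for illustration; real code would normally use an image library and a gap chosen for the scan resolution.

```python
WHITE, BLACK = 0, 1  # pixel values used in this sketch

def stack_zones(image, zones, gap=40):
    """Copy each (left, top, right, bottom) zone into a new white image,
    one below the other, separated by `gap` blank rows."""
    crops = [[row[left:right] for row in image[top:bottom]]
             for (left, top, right, bottom) in zones]
    width = max(len(crop[0]) for crop in crops)
    blank = [WHITE] * width
    out = [blank[:] for _ in range(gap)]            # top margin
    for crop in crops:
        for row in crop:
            out.append(row + [WHITE] * (width - len(row)))  # pad right edge
        out.extend(blank[:] for _ in range(gap))    # white space below the zone
    return out

page = [[BLACK] * 100 for _ in range(60)]           # stand-in for a scanned page
sheet = stack_zones(page, [(0, 0, 30, 10), (50, 20, 90, 35)], gap=5)
print(len(sheet), len(sheet[0]))
```

The resulting image is then passed to the OCR engine as a single page.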

Post-process the results

Some OCR engines will suggest alternatives for each error discovered (TOCR will return up to 4 alternatives for each character found). If you know that certain areas of the page can only contain certain characters (digits, for example), then post-process to ensure the correct output.
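A minimal sketch of this kind of post-processing, assuming the engine's output has already been converted into a list of alternatives per character position with the best guess first. That list format and the `constrain` function are hypothetical, not part of any engine's API.

```python
DIGITS = set("0123456789")

def constrain(candidates, allowed):
    """For each character position, return the first alternative that is
    in `allowed`, or '?' when none of the alternatives is allowed."""
    result = []
    for alternatives in candidates:
        choice = next((a for a in alternatives if a in allowed), "?")
        result.append(choice)
    return "".join(result)

# Hypothetical OCR output for a numeric field that should read "2015"
field = [["2", "Z"], ["O", "0", "Q"], ["l", "1", "I"], ["5", "S"]]
print(constrain(field, DIGITS))  # 2015
```

Positions that come back as '?' can be flagged for manual review rather than silently accepted.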