Training Tesseract

I’ve used several OCR packages over the years, and the one I keep coming back to is Tesseract, but until now I had never figured out how to train it. This weekend, I spent some time going through the documentation (and some helpful forums and blogs) to work out how to properly train Tesseract to improve its OCR accuracy.

First, try reading the documentation to get a general overview – although it’s not an easy read.

Step 1:
Decide what your font is going to be called; let’s call ours test_font. Also decide what your language is going to be called. Let’s call ours test_language and give it a 3-letter code of tla. Finally, create a working directory; ours is c:\temp.

Step 2:
Download and install tesseract, and this QT-based box editor, which you’ll need. Other box editors are available at the Tesseract Add-ons site.

Step 3:
Get a bunch of .tif files of your text that you’re going to use to do the training with (if they’re not in .tif format, use ImageMagick or some other program to create .tif files of them), and put them in your working directory. The above box editor doesn’t support multi-page .tif files, but some other box editors might. In this example, we’ve got 32 different files.

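Take these .tif files and rename them using the following naming convention:

[lang].[fontname].exp[num].tif

For our example, that gives tla.test_font.exp0.tif through tla.test_font.exp31.tif.
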
(I think there might be a limit on how many files you can use).
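
If you have a lot of files, the renaming can be scripted. Here is a minimal Python sketch, assuming the [lang].[fontname].exp[num].tif pattern that the commands later in this post use. The function name plan_renames and the scan file names are made up for illustration. It only builds the list of renames, so you can check it before touching any files.

```python
def plan_renames(tif_names, lang="tla", font="test_font"):
    """Pair each source .tif name with its training name, numbered from 0."""
    plan = []
    for num, name in enumerate(sorted(tif_names)):
        plan.append((name, f"{lang}.{font}.exp{num}.tif"))
    return plan


# Hypothetical source file names, purely for illustration.
for old, new in plan_renames(["scan_b.tif", "scan_a.tif"]):
    print(f"{old} -> {new}")
```

Once the plan looks right, each pair can be passed to os.rename (or Path.rename) inside the working directory.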

Step 4:
The Box Editor linked above will create the .box files directly from the .tif files, so you can just load the .tif files in the editor and start editing boxes. But if you want to create .box files manually and then edit them, then the command line to create the box files would be:

tesseract tla.test_font.exp0.tif tla.test_font.exp0 batch.nochop makebox
tesseract tla.test_font.exp31.tif tla.test_font.exp31 batch.nochop makebox
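
Typing that line 32 times gets old quickly. As a rough sketch (assuming the files are numbered 0 through 31 as in this example; makebox_command is just a name I picked), the commands can be generated like this:

```python
def makebox_command(num, lang="tla", font="test_font"):
    """Build the makebox command line for one numbered training image."""
    base = f"{lang}.{font}.exp{num}"
    return ["tesseract", f"{base}.tif", base, "batch.nochop", "makebox"]


# One command per training image, exp0 through exp31.
commands = [makebox_command(n) for n in range(32)]
for cmd in commands:
    print(" ".join(cmd))
```

To actually run them rather than print them, each list can be handed to subprocess.run(cmd, check=True) from inside the working directory.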

Also – see Tip #1 at the end before going any further.

Step 5:
Run tesseract for each of the .tif/.box files you’ve created, e.g.

tesseract tla.test_font.exp0.tif tla.test_font.exp0 nobatch box.train
tesseract tla.test_font.exp31.tif tla.test_font.exp31 nobatch box.train
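
Before kicking off this step, it helps to confirm that every .tif actually has its .box file next to it. The sketch below is one way to do that, assuming the naming convention from Step 3; training_commands is a hypothetical helper name. It returns the box.train commands for the complete pairs and a list of images that are still missing a box file.

```python
from pathlib import Path


def training_commands(work_dir, lang="tla", font="test_font"):
    """Return (commands, missing): box.train commands for each .tif that has
    a matching .box file, and the names of the .tif files that do not."""
    work_dir = Path(work_dir)
    commands, missing = [], []
    for tif in sorted(work_dir.glob(f"{lang}.{font}.exp*.tif")):
        base = tif.name[: -len(".tif")]
        if (work_dir / f"{base}.box").exists():
            commands.append(["tesseract", tif.name, base, "nobatch", "box.train"])
        else:
            missing.append(tif.name)
    return commands, missing
```

If the missing list is not empty, go back to Step 4 for those files before training.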

Step 6:
Use the unicharset_extractor utility to generate the unicharset file – the syntax goes like:

unicharset_extractor ...

(the ellipsis above just means that you pass each .box filename as an argument to the unicharset_extractor program)

Step 7:
Make a font_properties file in our working directory. This is just a simple text file (I used Notepad++), and is called ‘font_properties’ (no extension). The format of the text file has to be like:

<fontname> <italic> <bold> <fixed> <serif> <fraktur>

“where <fontname> is a string naming the font (no spaces allowed!), and <italic>, <bold>, <fixed>, <serif>, and <fraktur> are all simple 0 or 1 flags indicating whether the font has the named property.” (from the tesseract documentation site)

For this example, assuming our font is just a regular sans-serif font, not bold/italic/etc, the font_properties file would be a single line:

test_font 0 0 0 0 0
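
If you train more than one font, it is easy to get the flag order wrong. Here is a small sketch that builds a line in the format quoted above; the function name font_properties_line is my own, and the flag order simply follows that quoted format.

```python
def font_properties_line(fontname, italic=False, bold=False, fixed=False,
                         serif=False, fraktur=False):
    """Format one font_properties line: the font name plus five 0/1 flags."""
    if not fontname or any(ch.isspace() for ch in fontname):
        raise ValueError("font name must be non-empty and contain no spaces")
    flags = (italic, bold, fixed, serif, fraktur)
    return " ".join([fontname] + [str(int(flag)) for flag in flags])


print(font_properties_line("test_font"))  # prints: test_font 0 0 0 0 0
```

Write one such line per font into the font_properties file.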

Step 8:
Use the shapeclustering program to create the master shape table:

shapeclustering -F font_properties -U unicharset ...

Step 9:
Use the mftraining program:

mftraining -F font_properties -U unicharset -O tla.unicharset ...

Step 10:
Use the cntraining program to create the character normalization sensitivity prototypes:

cntraining ...

Step 11:
Rename the shapetable, normproto, inttemp, and pffmtable files to use your language prefix (tla in our example), e.g. in Windows, I’d use:

move inttemp tla.inttemp
move normproto tla.normproto
move pffmtable tla.pffmtable
move shapetable tla.shapetable
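
The same renaming can be done from a script, which also works outside Windows. This is only a sketch: prefix_outputs is a made-up name, and it assumes the four files listed above are sitting in the working directory.

```python
from pathlib import Path

# The four training outputs named in this step.
TRAINING_OUTPUTS = ("inttemp", "normproto", "pffmtable", "shapetable")


def prefix_outputs(work_dir, lang="tla"):
    """Rename each training output to <lang>.<name>; return the new names."""
    work_dir = Path(work_dir)
    renamed = []
    for name in TRAINING_OUTPUTS:
        src = work_dir / name
        if src.exists():
            # Path.replace overwrites an existing target on Windows and POSIX.
            src.replace(work_dir / f"{lang}.{name}")
            renamed.append(f"{lang}.{name}")
    return renamed
```

If the returned list has fewer than four entries, one of the earlier steps did not produce its output file.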

Step 12:
Use combine_tessdata to finish the process:

combine_tessdata tla.

(don’t forget the period at the end of the line)

This is going to generate your tla.traineddata file, which you then copy into your tessdata directory, usually located at c:\Program Files\Tesseract-OCR\tessdata.

Now you can use Tesseract to recognize your new font/language! The command line is slightly different: it just adds -l tla (the 3-letter code for test_language) at the end:

tesseract input_file.tif output -l tla

This will create a file called output.txt which will hopefully have everything in it that you’re looking for!

Tip #1:
If you’ve got several files to do, instead of going through the work of boxing all of them with a completely foreign character set, just box one file that contains most of the characters and train on it using the steps above. You can then create the box files for the rest of the training files with:

tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] -l yournewlanguage batch.nochop makebox

Because Tesseract already knows most of your characters at this point, the generated box files will be mostly filled in correctly, which saves a significant amount of time boxing the rest of your files.

Also, if you’re going to turn off column/formatting recognition (the -psm 6 flag, see Tip #2 below), use the same flag when producing the box files so that the boxes are created properly for your text.

Tip #2:
If your text has columns and tesseract is interpreting them incorrectly, you can turn off column detection by using the -psm 6 command line switch, e.g.:

tesseract input_file.tif output -l tla -psm 6
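
If you call Tesseract from a script, the optional switch can be added only when needed. A minimal sketch follows; recognize_command is a hypothetical helper, and the -psm spelling is the one used in this post (newer Tesseract releases spell it --psm, so adjust for your version).

```python
def recognize_command(image, output_base, lang="tla", psm=None):
    """Build a tesseract recognition command, adding -psm only when given."""
    cmd = ["tesseract", image, output_base, "-l", lang]
    if psm is not None:
        cmd += ["-psm", str(psm)]
    return cmd


print(" ".join(recognize_command("input_file.tif", "output", psm=6)))
# prints: tesseract input_file.tif output -l tla -psm 6
```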

9 thoughts on “Training Tesseract”

  1. TY for the guide . . it really helped me out 🙂 🙂 . . . . Just one suggestions . . . QT Box Editor has too many dependencies suggest the use if the .NET based Box File Editor in the add-ons list.

  2. hi. i have a problem with unicharset file. When i create unicharset file, glyph_metrics is always: 0,255,0,255,0,32767,0,32767,0,32767 and script is always NULL. And any idea why?

  3. Good day guys, I have a problem; I am trying to do training on tesseracrt 3.03 with leptonica 1.71 install in my system, Ubuntu 14.04. is there anyone who succesfully train 3.03, I had problem also in compiling “text3image” and also “make training”
