I’ve used several OCR packages over the years; the one I keep coming back to is Tesseract, but until now I had never figured out how to train it. This weekend, I spent some time going through the documentation (and some other helpful forums and blogs) to work out how to train Tesseract properly and improve its OCR accuracy.
First, try reading the documentation to get a general overview – although it’s not an easy read.
Figure out what your font name is going to be; let’s call our font test_font. Also, figure out what your language is going to be called. Let’s call our language test_language, and give it a 3-letter code of tla. Finally, create a working directory; in this example it is c:\temp.
Get a bunch of .tif files of your text that you’re going to use to do the training with (if they’re not in .tif format, use ImageMagick or some other program to create .tif files of them), and put them in your working directory. The above box editor doesn’t support multi-page .tif files, but some other box editors might. In this example, we’ve got 32 different files.
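Take these .tif files and rename them using the following naming convention:
[lang].[fontname].exp[num].tif
For this example, that gives tla.test_font.exp0.tif through tla.test_font.exp31.tif.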
(I think there might be a limit on how many files you can use).
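If you have a lot of files, renaming them by hand gets tedious. Here is a rough Python sketch of how the renaming could be scripted, assuming the [lang].[fontname].exp[num].tif pattern that the tesseract commands in this post use. The function names and the input file names are made up for illustration; the script only prints the plan and does not touch any files.

```python
from pathlib import Path


def training_name(lang, font, index, ext=".tif"):
    """Build a file name of the form [lang].[fontname].exp[num] plus an extension."""
    return f"{lang}.{font}.exp{index}{ext}"


def plan_renames(files, lang, font):
    """Pair each source path (in sorted order) with its new name in the same folder."""
    return [
        (Path(src), Path(src).with_name(training_name(lang, font, i)))
        for i, src in enumerate(sorted(files))
    ]


# Hypothetical scan names; note that plain string sorting puts page_10 before page_2.
for src, dst in plan_renames(["page_b.tif", "page_a.tif"], "tla", "test_font"):
    print(src, "->", dst)
```

Once the printed plan looks right, calling src.rename(dst) inside the loop would perform the actual rename.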
The Box Editor linked above will create the .box files directly from the .tif files, so you can just load the .tif files in the editor and start editing boxes. But if you want to create .box files manually and then edit them, then the command line to create the box files would be:
tesseract tla.test_font.exp0.tif tla.test_font.exp0 batch.nochop makebox
tesseract tla.test_font.exp31.tif tla.test_font.exp31 batch.nochop makebox
Also – see Tip #1 at the end before going any further.
Run tesseract for each of the .tif/.box files you’ve created, e.g.
tesseract tla.test_font.exp0.tif tla.test_font.exp0 nobatch box.train
tesseract tla.test_font.exp31.tif tla.test_font.exp31 nobatch box.train
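Typing the same command 32 times is error-prone, so here is a small Python sketch that generates the makebox and box.train command lines for every file. The arguments are copied from the commands in this post; the helper names are my own invention, and the script only prints the commands rather than running them.

```python
def base_name(lang, font, index):
    """Shared stem of the .tif, .box and .tr files for one training image."""
    return f"{lang}.{font}.exp{index}"


def makebox_command(lang, font, index):
    """Command that asks tesseract to produce a .box file for one image."""
    base = base_name(lang, font, index)
    return ["tesseract", f"{base}.tif", base, "batch.nochop", "makebox"]


def train_command(lang, font, index):
    """Command that runs tesseract in training mode on one .tif/.box pair."""
    base = base_name(lang, font, index)
    return ["tesseract", f"{base}.tif", base, "nobatch", "box.train"]


# Print the training command for each of the 32 example files (exp0 to exp31).
for i in range(32):
    print(" ".join(train_command("tla", "test_font", i)))
```

To actually run a command, something like subprocess.run(train_command("tla", "test_font", 0), check=True) from the working directory should do it, assuming tesseract is on your PATH.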
Use the unicharset_extractor utility to generate the unicharset file – the syntax goes like:
unicharset_extractor tla.test_font.exp0.box tla.test_font.exp1.box ... tla.test_font.exp31.box
(the ellipsis above just means that you pass every .box filename as an argument to the unicharset_extractor program)
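Rather than typing out all 32 .box filenames, the argument list can be generated. This is a minimal Python sketch under the assumption that the files are numbered consecutively from exp0; the function name is hypothetical and nothing is executed here.

```python
def unicharset_extractor_command(lang, font, count):
    """Build the unicharset_extractor call with one .box argument per training file."""
    boxes = [f"{lang}.{font}.exp{i}.box" for i in range(count)]
    return ["unicharset_extractor", *boxes]


cmd = unicharset_extractor_command("tla", "test_font", 32)
print(" ".join(cmd))
```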
Make a font_properties file in our working directory. This is just a simple text file (I used Notepad++), and is called ‘font_properties’ (no extension). The format of the text file has to be like:
<fontname> <italic> <bold> <fixed> <serif> <fraktur>
“where <fontname> is a string naming the font (no spaces allowed!), and <italic>, <bold>, <fixed>, <serif>, and <fraktur> are all simple 0 or 1 flags indicating whether the font has the named property.” (from the tesseract documentation site)
For this example, assuming our font is just a regular sans-serif font, not bold/italic/etc, the font_properties file would be a single line:
test_font 0 0 0 0 0
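If you have more than one font, it is easy to get the flag order wrong. Here is a small Python sketch that builds a font_properties line in the order quoted above (italic, bold, fixed, serif, fraktur) and rejects font names containing spaces. The function name and keyword arguments are my own, not part of Tesseract.

```python
def font_properties_line(name, italic=False, bold=False, fixed=False,
                         serif=False, fraktur=False):
    """Return one font_properties line: the font name followed by five 0/1 flags."""
    if not name or any(ch.isspace() for ch in name):
        raise ValueError("font name must be non-empty and contain no spaces")
    flags = (italic, bold, fixed, serif, fraktur)
    return " ".join([name, *(str(int(flag)) for flag in flags)])


print(font_properties_line("test_font"))
```

Writing the result to a file called font_properties (no extension) in the working directory, one line per font, gives the file the later steps expect.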
Use the shapeclustering program to create the master shape table:
shapeclustering -F font_properties -U unicharset tla.test_font.exp0.tr tla.test_font.exp1.tr ... tla.test_font.exp31.tr
Use the mftraining program:
mftraining -F font_properties -U unicharset -O tla.unicharset tla.test_font.exp0.tr tla.test_font.exp1.tr ... tla.test_font.exp31.tr
Use the cntraining program to create the character normalization sensitivity prototypes:
cntraining tla.test_font.exp0.tr tla.test_font.exp1.tr ... tla.test_font.exp31.tr
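The three clustering steps above all take the same list of .tr files, so it is handy to build that list once. This Python sketch assembles the three command lines with the same options shown above; the helper names are hypothetical, and the commands are only printed, not run.

```python
def tr_files(lang, font, count):
    """Names of the .tr files produced by the box.train step, exp0 upward."""
    return [f"{lang}.{font}.exp{i}.tr" for i in range(count)]


def clustering_commands(lang, font, count):
    """Return the shapeclustering, mftraining and cntraining calls, in order."""
    trs = tr_files(lang, font, count)
    return [
        ["shapeclustering", "-F", "font_properties", "-U", "unicharset", *trs],
        ["mftraining", "-F", "font_properties", "-U", "unicharset",
         "-O", f"{lang}.unicharset", *trs],
        ["cntraining", *trs],
    ]


for cmd in clustering_commands("tla", "test_font", 32):
    print(" ".join(cmd))
```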
Rename the shapetable, normproto, inttemp and pffmtable files so that they carry your language prefix (tla. in this example), e.g. in Windows, I’d use:
move inttemp tla.inttemp
move normproto tla.normproto
move pffmtable tla.pffmtable
move shapetable tla.shapetable
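If you are not on Windows, or you retrain often, the same renaming can be done with a few lines of Python. This is only a sketch: the function name is made up, and it assumes the four output files sit in the directory you pass in. Files that are missing are simply skipped.

```python
from pathlib import Path

# The four files named in the rename step above.
TRAINING_OUTPUTS = ("inttemp", "normproto", "pffmtable", "shapetable")


def add_language_prefix(workdir, lang):
    """Rename each training output found in workdir to <lang>.<name>.

    Returns the list of new file names, in the order they were renamed.
    """
    renamed = []
    for name in TRAINING_OUTPUTS:
        src = Path(workdir) / name
        if src.exists():
            src.rename(src.with_name(f"{lang}.{name}"))
            renamed.append(f"{lang}.{name}")
    return renamed


# Example call (commented out so that nothing is renamed by accident):
# print(add_language_prefix(r"c:\temp", "tla"))
```

On Windows, Path.rename raises an error if the target name already exists, so delete any old tla.* files from a previous run first.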
Use combine_tessdata to finish the process:
combine_tessdata tla.
(don’t forget the period at the end of the line)
This is going to generate your tla.traineddata file, which you’re then going to copy into your tessdata directory, usually located at c:\Program Files\Tesseract-OCR\tessdata.
Now you want to use tesseract to recognize your new font/language! The command line is slightly different: it just adds -l tla (the 3-letter code for test_language) at the end:
tesseract input_file.tif output -l tla
This will create a file called output.txt which will hopefully have everything in it that you’re looking for!
Tip #1: If you’ve got several files to do, instead of going through the work of boxing all of them with a completely foreign character set, just box one file that contains most of the characters and train on it. You can then create the box files for the rest of the training files with:
tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] -l yournewlanguage batch.nochop makebox
Because Tesseract already knows most of your characters from that first pass, the generated boxes need far fewer corrections, which will save you a significant amount of time boxing the rest of your files.
As well, if you’re going to be turning off column/formatting recognition (see Tip #2 below, i.e. using the -psm 6 flag), then use that flag for the box production as well, so that the boxes are created properly for your text.
Tip #2: If your text has columns and tesseract is interpreting them incorrectly, you can turn off column detection by using the -psm 6 command line switch, e.g.:
tesseract input_file.tif output -l tla -psm 6
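For batch recognition, the command can be built in Python with the page segmentation mode as an optional extra. This is a sketch with a hypothetical function name; it uses the -psm spelling from this post. As far as I know, Tesseract 4 and later spell the switch --psm, so check which one your version accepts.

```python
def recognize_command(image, output_base, lang, psm=None):
    """Build a tesseract recognition call; psm is added only when given."""
    cmd = ["tesseract", image, output_base, "-l", lang]
    if psm is not None:
        # Single-dash spelling as used in this post (Tesseract 3.x era).
        cmd += ["-psm", str(psm)]
    return cmd


print(" ".join(recognize_command("input_file.tif", "output", "tla")))
print(" ".join(recognize_command("input_file.tif", "output", "tla", psm=6)))
```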