
tesseract

Tesseract is probably the most accurate open-source OCR engine available. It differs from OpenCV in that OpenCV is a general-purpose image library, one you could use to build something like Tesseract.

How well does it work? I downloaded the latest portable version to try it out. The ReadMe is very helpful.

IMG_43111

out of the box

Out of the box, just point it at an image.


>tesseract.exe IMG.JPG out
>cat out.txt
*1
*3
'?
i

v

.‘-go.-—: qv..»- . v_—
r : -.; 1

‘LT WT

Lp LMT 223500
62500

_ ' -r~,.\'4-5-p-—-A-. 4.
ts‘ "3' ”'

2 INCH HF COMP SHOES

..,. ...¢...-..

,, ..--4
..r.....

Pretty impressive that it correctly read the small text at the bottom, but pretty sad that it missed the giant text in the middle.

focus on digits

Maybe if I focus on digits I can get the big “352264”.

In Tesseract-OCR\
>cat tessdata\configs\digits
tessedit_char_whitelist 0123456789
>tesseract.exe IMG_4311.JPG out digits
>cat out.txt
21
23
3
0

0

20522024 0101022 2 222
1 2 252 0

17 0011

00 1001 223500
02500

2 0 0554112402 42
2200 11 000

2 10168 512 60009 551053

2402 0242402

202 0224
2222232

My guess is that the letters are too big relative to the size of the picture. Tesseract is really geared towards reading a page of text, so it would make sense for it to ignore larger patterns and focus on smaller ones.

general approach for best results from tesseract

According to this Stack Overflow post:

  1. fix DPI (if needed) 300 DPI is minimum
  2. fix text size (e.g. 12 pt should be ok)
  3. try to fix text lines (deskew and dewarp text)
  4. try to fix illumination of image (e.g. no dark parts of the image)
  5. binarize and de-noise image
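Step 5 can be sketched without any image library. The snippet below is only an illustration, not what the Stack Overflow post prescribes: it assumes a global threshold chosen with Otsu's method (one common choice, picked here for the example) over a flat list of 8-bit grayscale values, and maps each pixel to black or white. The function names are hypothetical, and a real pipeline would read the pixels from the photo with an imaging library.

```python
def otsu_threshold(pixels):
    """Pick the 0-255 threshold that maximizes between-class variance (Otsu)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0      # number of pixels at or below the candidate threshold
    sum_dark = 0    # sum of their gray levels
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        w_light = total - w_dark
        if w_dark == 0:
            continue            # no dark class yet
        if w_light == 0:
            break               # everything is already in the dark class
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        between_var = w_dark * w_light * (mean_dark - mean_light) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(pixels, threshold):
    """Map every pixel to pure black (0) or pure white (255)."""
    return [255 if p > threshold else 0 for p in pixels]


# Toy "image": four dark pixels (10-30) and four bright ones (200-220).
gray = [10, 20, 30, 200, 210, 220, 15, 205]
t = otsu_threshold(gray)
bw = binarize(gray, t)
```

A single global threshold like this only behaves well once the illumination from step 4 is reasonably even, which is presumably why binarizing comes last in the list.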

fix DPI

The original DPI of the image was 72. That is probably a camera setting that could be changed, or it could be corrected automatically in pre-processing.

As a quick test, I changed the DPI in GIMP to the recommended 300.
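For a sense of scale, the 300 DPI and 12 pt recommendations can be turned into pixels: a point is 1/72 inch, so 12 pt text is nominally 50 pixels tall at 300 DPI but only 12 pixels at 72 DPI. The sketch below does that arithmetic and derives a resize factor for text measured in pixels. The function names and the 400-pixel letter height are illustrative assumptions, not measurements from my photo.

```python
POINTS_PER_INCH = 72  # typographic points per inch


def text_height_px(point_size, dpi):
    """Nominal height in pixels of text set at point_size when sampled at dpi."""
    return point_size * dpi / POINTS_PER_INCH


def scale_factor(measured_px, point_size=12, dpi=300):
    """Factor to resize an image so measured_px-tall text lands at the target size."""
    return text_height_px(point_size, dpi) / measured_px


target = text_height_px(12, 300)   # 50.0
at_72 = text_height_px(12, 72)     # 12.0
factor = scale_factor(400)         # 0.125, i.e. shrink the image to 1/8
```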


>tesseract.exe 300dpi.JPG out
>cat out.txt

_`__ ___`___,.. ....-u-9.`-""

._ , ,~...,.--... ....

..- ...-..-.......

..,\.-.,~. -
,, .. ....~\-.»..v
, ., ._..x-. o-

CHTT mi
«~352264

«:4.»

`$7

a
I
.

`a
n

V.`-w-v -2
..
'''-_§_``'.: '. `

QLD LMT 223500
`L1 WT 62500

co.
..
_ . ' ".".."4-1`
.:...--"!CT::~ ts`

2 INCH HF COMP SHOES

Success! I got the string CHTT 352264 I was looking for. But there is still a bunch of junk.

only allow alphanumeric

In Tesseract-OCR\
>cat tessdata\configs\alphanumeric
tessedit_char_whitelist 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ
>tesseract.exe IMG_4311.JPG out alphanumeric
>cat out.txt

V1 HFR4 9P

A 4 N NW

H 4 N

V A
H A V
F N

CHTT M
U352264

V4

37

I
1

V

V 4

V1 2

3 LD LMT 223500
LT WT 62500

A
P P 41
A IQ17 M

2 INCH HF COMP SHOES

what is next?

  • use regex to limit the pattern I want
  • pre-process the image more, possibly only look at largest text
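The regex idea might look like the sketch below. It assumes the target is always four capital letters followed, within a few junk characters, by exactly six digits. That is only a guess based on the single CHTT 352264 example, and both the pattern and the function name `find_id` are hypothetical. The sample strings are copied from the outputs above.

```python
import re

# Assumed shape: 4 capitals, up to 10 non-digit junk characters, then exactly 6 digits.
ID_PATTERN = re.compile(r"\b([A-Z]{4})\b\D{0,10}?(\d{6})(?!\d)")


def find_id(ocr_text):
    """Return (letters, digits) for the first match, or None if nothing matches."""
    match = ID_PATTERN.search(ocr_text)
    return match.groups() if match else None


# Fragments of the 300 DPI run and the alphanumeric run.
dpi_run = "CHTT mi\n«~352264\n\nQLD LMT 223500\n`L1 WT 62500"
alnum_run = "CHTT M\nU352264\n\n3 LD LMT 223500\nLT WT 62500"

print(find_id(dpi_run))    # ('CHTT', '352264')
print(find_id(alnum_run))  # ('CHTT', '352264')
```

The junk allowance between the letters and the digits is what lets the stray "mi", "«~" and "U" slip through; tightening it would need more sample images than the one I have.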
