How to do OCR on a PDF document?

Question

Possible Duplicate:
How to extract text with OCR from a PDF on Linux?

I have a few documents in English and Hebrew that I scanned in and converted to PDF format.

Is there some free or cheap utility that can process a scanned PDF and do OCR, at least in English, preferably also in Hebrew?

Thanks!

A couple of similar questions. http://superuser.com/questions/28426/how-to-extract-text-with-ocr-from-a-pdf-on-linux/33203#33203 http://superuser.com/questions/64124/extracting-text-from-a-pdf-scanned-book http://superuser.com/questions/97470/scan-a4-doc-pdf-ocr-translate-to-english — heavyd, Feb 16 '10 at 16:47
The author of this question did not specify that he is running Linux. The so-called possible duplicate question is too localized, and may not apply at all to the author of this question. — eleven81, Feb 16 '10 at 17:03
Not only is this not a duplicate, it's still unanswered. All 3 answers only yield extracted text, not a text-selectable PDF document. — cregox, Jun 28 '13 at 16:05

score 1 · Answer 1 · answered Feb 16 '10 at 16:47


I found an interesting idea that lets Google do all the work of OCR'ing the PDF files for you.


eleven81


Rather than what's at that link, it's simpler now to just use http://docs.google.com/viewer. – ShreevatsaR Aug 29 '10 at 02:37

score 1 · Accepted Answer · answered Feb 16 '10 at 16:54

I found a list of free OCR software for Windows.

However, these programs need an image input, not a PDF input. For this, try a PDF-to-JPG converter.

score 0 · Answer 3 · answered Feb 16 '10 at 16:47

Personally, I would use Ghostview (a front end for Ghostscript) to convert the PDF pages to images, then Tesseract to convert the images to text. This is a totally free, open-source, cross-platform solution that has given me very good results on plain text. I don't use it for complex documents with tables and such, but for plain text you can't beat the price.
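For anyone who wants to script this, here is a minimal sketch of the two steps in Python. It is not the exact procedure I used by hand, and it makes several assumptions: the `gs` and `tesseract` command-line tools are installed and on your PATH (on Windows the Ghostscript executable is usually named `gswin64c` instead), the `eng` and `heb` Tesseract language data are installed, and the function names, the `pages` output folder and the 300 dpi default are all just illustrative choices.

```python
import subprocess
from pathlib import Path


def gs_command(pdf_path, out_dir, dpi=300):
    """Build a Ghostscript command that renders each PDF page to a PNG file."""
    # %03d is expanded by Ghostscript to the page number: page-001.png, ...
    pattern = str(Path(out_dir) / "page-%03d.png")
    return [
        "gs", "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-sDEVICE=png16m",          # 24-bit colour PNG output
        f"-r{dpi}",                 # rendering resolution in dpi
        f"-sOutputFile={pattern}",
        str(pdf_path),
    ]


def tesseract_command(image_path, langs=("eng", "heb")):
    """Build a Tesseract command that writes <image name>.txt next to the image."""
    out_base = str(Path(image_path).with_suffix(""))  # Tesseract appends .txt
    return ["tesseract", str(image_path), out_base, "-l", "+".join(langs)]


def ocr_pdf(pdf_path, out_dir="pages"):
    """Render the PDF to images, OCR each page, and return the combined text."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    subprocess.run(gs_command(pdf_path, out), check=True)

    texts = []
    for image in sorted(out.glob("page-*.png")):
        subprocess.run(tesseract_command(image), check=True)
        texts.append(image.with_suffix(".txt").read_text(encoding="utf-8"))
    return "\n".join(texts)
```

Calling `ocr_pdf("scan.pdf")` should leave one PNG and one text file per page in the `pages` folder and return all the text as a single string. If you need a text-selectable PDF rather than plain text, Tesseract can also write one per page when you append `pdf` to the end of its command line, although I have not tested how well that handles mixed English and Hebrew pages.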
