OCR Image based PDF

Question

Possible Duplicate:
Extracting text from a .PDF scanned book
How to do OCR on a PDF document?

I've got a >200 page PDF manual that was produced by scanning hard copy. I'd like to convert it to a searchable text format, but I'm not having any success finding a tool to do so. Google's search results are highly polluted with crippleware trial software that can only do the first few pages of the file. The only truly free application I found, FreeOCR's PDF renderer, fails to handle anything beyond the first few pages of the file.

Google's PDF viewer does OCR, but doesn't appear to provide any export option other than copy/paste. In addition to being very tedious, what it puts on the clipboard is only plaintext, which means I'd lose all of the line art and the significant formatting that depends on horizontal placement.

@DanielAndersson Unfortunately, none of those were helpful. Blowing the file apart into hundreds of image files and then gluing them back together would be a massive waste of time (1st and 3rd link). I've already got plenty of tools that claim they'd do the job if I gave them money, but which I can't verify the claims of because the problematic parts of the file are beyond what they'd do for free (2nd link) — Dan Is Fiddling By Firelight, May 20 '12 at 17:46
Then put that info in your question as well so people know what you have tried and not. People aren't at this site because they like guessing :-) — Daniel Andersson, May 20 '12 at 19:05

score 2 · Answer 1 · answered May 20 '12 at 16:19


If you upload your PDF to Google Drive (Docs) with your upload conversion settings set to convert images to text and to convert the document to a Google Doc (this can all be done at upload), you should then be able to open the doc, click File > Download as, and select the format you want.

I just did this with a magazine page and it worked okay, though not all of the fonts were recognised.


sgtbeano


The upload converter maxes out at a 2 MB file size. If I import it by emailing it to myself (what I tried originally), I don't run into the limitation, but I don't get the conversion options either. – Dan Is Fiddling By Firelight May 20 '12 at 16:27
How about this service? It says it doesn't have any upload limits: http://www.newocr.com/ – sgtbeano May 20 '12 at 16:45
That service sort of works, but by trashing everything that's not a letter it breaks a moderate amount of formatting (most seriously some structural formulas for chemicals). – Dan Is Fiddling By Firelight May 20 '12 at 17:31
I used a PDF splitter to cut the file down below the upload limit, but the Google Docs converter didn't OCR the text at all, unlike their PDF viewer. – Dan Is Fiddling By Firelight May 20 '12 at 18:16
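
For anyone who wants to reproduce the splitting step from the last comment without a separate tool, here is a minimal sketch in Python. It is not the splitter used in the thread (that tool is not named). It assumes the third-party pypdf package is installed, and the function names, the output file naming, and the default of 10 pages per chunk are all made up for illustration. It splits by page count rather than by file size, so the chunk size would have to be tuned by trial to stay under a given upload limit. It does no OCR itself.

```python
from pathlib import Path


def page_ranges(total_pages: int, pages_per_chunk: int) -> list[tuple[int, int]]:
    """Return half-open (start, end) page index ranges covering the document."""
    if pages_per_chunk < 1:
        raise ValueError("pages_per_chunk must be at least 1")
    return [
        (start, min(start + pages_per_chunk, total_pages))
        for start in range(0, total_pages, pages_per_chunk)
    ]


def split_pdf(source: str, out_dir: str, pages_per_chunk: int = 10) -> list[Path]:
    """Write consecutive chunks of `source` into `out_dir` and return their paths."""
    # Imported here so page_ranges() stays usable without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(source)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    written = []
    for start, end in page_ranges(len(reader.pages), pages_per_chunk):
        writer = PdfWriter()
        for index in range(start, end):
            writer.add_page(reader.pages[index])
        # Hypothetical naming scheme: 1-based first and last page of the chunk.
        target = out / f"chunk_{start + 1:04d}-{end:04d}.pdf"
        with open(target, "wb") as handle:
            writer.write(handle)
        written.append(target)
    return written
```

Calling split_pdf("manual.pdf", "chunks", pages_per_chunk=5) would write one file per five pages into a chunks directory; "manual.pdf" is a placeholder name. Whether a smaller chunk actually gets OCRed on upload is a separate question, as the comment above shows.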
