0

I have installed everything and used an online tool to convert a PDF file to JPG. The problem is that the tool put every page of the PDF into a separate image, so there are now about 500 of them. Is there a way to just choose a folder and have Tesseract put the text of all the images into one text or Word file?

As I understand it, Tesseract doesn't work with PDF. Is the easiest way just to convert the PDF to JPEG, or is there a better workaround?

I'm using Tesseract on a Windows PC.

Neil Meyer
  • Does https://superuser.com/questions/426690/ocr-image-based-pdf or any of the linked duplicates help? There are many possible products amongst the answers on both duplicates and their respective duplicates as well – Mokubai Jul 12 '21 at 13:27
  • You're converting the PDF text to images and then trying to OCR the text back? This doesn't look like the best method. – harrymc Jul 12 '21 at 13:40
  • I'm converting the PDF image to a JPG image to rip the text. I have read that Tesseract cannot read the text from PDF, I may be mistaken, I'm no expert – Neil Meyer Jul 12 '21 at 13:44
  • @Mokubai that does not mention Tesseract. Can I ask for a solution that uses Tesseract? I believe it to be the best OCR software around (at least the best open-source solution) – Neil Meyer Jul 12 '21 at 13:47
  • You can ask if you want, but having a lot of source images means you are likely to need some kind of script or wrapper to work through them. At that point you might as well be Googling for programs that do it, and then you might as well look at complete solutions rather than bashing another set of disparate circular tools into your square hole. Given that Tesseract's [command format](https://superuser.com/a/235022/19943) is relatively simple, a batch file `FOR` loop that appends the output (`>>output.txt`) might be enough. – Mokubai Jul 12 '21 at 13:53
  • You've also not specified operating system or what you have tried so far... – Mokubai Jul 12 '21 at 13:53
  • @Mokubai that sounds like a neat solution, could you post it as an answer – Neil Meyer Jul 12 '21 at 16:36
  • The part about ripping from PDF or JPEG might be superfluous. I just need to find a way to batch process the text of a bunch of images into one big text file. – Neil Meyer Jul 12 '21 at 16:39
  • It would be tomorrow at the earliest that I could be at a machine where I could have a look and remember the batch syntax. – Mokubai Jul 12 '21 at 17:31
  • @NeilMeyer There's a good answer over on Stack Overflow: https://stackoverflow.com/a/45425708/278545 Basically replace the `*.tif` with whatever extension your files have. Not an answer because it's not my work, but feel free to post it as an answer yourself to close the loop on your question if it is a solution, just remember to give attribution to the original answer there. – Mokubai Jul 13 '21 at 09:19
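
The loop-and-append idea from the comments can also be sketched in Python rather than a batch file. This is only a rough sketch, not a tested recipe for your setup: it assumes `tesseract` is on your PATH, relies on Tesseract's documented behaviour of printing the recognized text when the output base is `stdout`, and uses made-up helper names (`natural_key`, `list_images`, `ocr_folder`). It sorts the files so that `page-2.jpg` comes before `page-10.jpg`, which a plain alphabetical sort would get wrong.

```python
import re
import subprocess
import sys
from pathlib import Path

# Extensions treated as page images (an assumption; adjust to your files).
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def natural_key(path):
    """Sort key that compares embedded numbers numerically (2 before 10)."""
    parts = re.split(r"(\d+)", Path(path).name)
    return [int(p) if p.isdigit() else p.lower() for p in parts]


def list_images(folder):
    """Return the image files in `folder`, in natural page order."""
    files = (p for p in Path(folder).iterdir()
             if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES)
    return sorted(files, key=natural_key)


def ocr_image(image, tesseract="tesseract"):
    """Run the tesseract CLI on one image and return the recognized text."""
    result = subprocess.run(
        [tesseract, str(image), "stdout"],  # "stdout" = print instead of writing a file
        capture_output=True, encoding="utf-8", check=True,
    )
    return result.stdout


def ocr_folder(folder, output_file, ocr=ocr_image):
    """OCR every image in `folder` and write all text to one file."""
    images = list_images(folder)
    with open(output_file, "w", encoding="utf-8") as out:
        for image in images:
            out.write(ocr(image))
            out.write("\n")
    return len(images)


if __name__ == "__main__":
    if len(sys.argv) != 3:
        sys.exit("usage: python ocr_folder.py IMAGE_FOLDER OUTPUT.txt")
    count = ocr_folder(sys.argv[1], sys.argv[2])
    print(f"Processed {count} images")
```

Saved under a hypothetical name such as `ocr_folder.py`, it would be run as `python ocr_folder.py C:\scans\pages all-pages.txt`. The result is a plain text file, which Word can open directly.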

2 Answers

1

It depends on how the PDF was put together. If it incorporates a text layer, harrymc's answer is your best bet, but if the PDF contains only image files, then extracting the images and using an OCR app like Tesseract is your only option.

Open-source (free) software gives you much greater resources than any pre-packaged solution to your problem. The only drawback is that these are command-line tools, which require a heavy investment of personal study and practice before you begin to realize their benefits. There is no "user-friendly" app that will do what you want. If you are interested in learning command-line approaches to this problem, then as an absolute minimum start with pdftotext, pdfimages, and an image manipulation suite like ImageMagick to support Tesseract.
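
To make the "it depends" part concrete, here is a rough sketch of how the two cases might be told apart with those tools. It assumes the Poppler utilities `pdftotext` and `pdfimages` are installed and on your PATH. The function names and the 20-character threshold are my own assumptions for illustration, not part of either tool.

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf, txt):
    """Command that dumps the PDF's text layer (if any) to a text file."""
    return ["pdftotext", str(pdf), str(txt)]


def pdfimages_command(pdf, prefix):
    """Command that extracts embedded images as PNG files named <prefix>-NNN.png."""
    return ["pdfimages", "-png", str(pdf), str(prefix)]


def has_text(text, min_chars=20):
    """Heuristic: treat the PDF as having a text layer if enough
    non-whitespace characters came out (threshold is an arbitrary assumption)."""
    return len("".join(text.split())) >= min_chars


def extract(pdf, workdir):
    """Try the text layer first; fall back to extracting images for OCR."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    txt = workdir / "text-layer.txt"
    subprocess.run(pdftotext_command(pdf, txt), check=True)
    if has_text(txt.read_text(encoding="utf-8", errors="replace")):
        return "text", txt          # no OCR needed
    subprocess.run(pdfimages_command(pdf, workdir / "page"), check=True)
    return "images", workdir        # feed these images to Tesseract
```

Calling `extract("book.pdf", "work")` (both names hypothetical) would return either the path of the extracted text or the folder of page images that still need OCR.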

user985675
0

I would suggest using a PDF viewer to convert the original PDF to text.

For example, Foxit PDF Reader can open the PDF. Use the menu File > Save As and save it in the format "TXT Files (*.txt)". The result would be much more precise than OCR (no scan errors).

harrymc