I'm trying to do OCR on a PDF with a two-page layout: in each landscape-orientation page of the PDF, the left half is one (portrait-orientation) page and the right half is the next (portrait-orientation) page. Sometimes this layout confuses Tesseract. Can I tell it about the layout, or efficiently split the original PDF before running it through Tesseract?
-
Based on the information given, why does this involve Ubuntu? – David Feb 23 '21 at 07:59
-
It doesn't necessarily, but I've seen some good discussions of general features of OCR/tesseract on askubuntu. I've reposted here: https://ebooks.stackexchange.com/questions/8781/ocr-with-two-page-layout – Raffi Feb 23 '21 at 17:32
-
In the past I have used pdfsandwich, which today is giving me a security policy error. It might work for you for OCR. For splitting I have used pdfshuffler, pdftk (to split odd and even pages), and pdfquench to trim pages. Some tips for you to look at, good luck. Also ImageMagick (comes with pdfsandwich?) as 'convert in.pdf -crop 50%x0 +repage out.pdf', which is also giving me the error. – pierrely Mar 18 '21 at 00:27
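The 50% crop in the comment above amounts to cutting each landscape sheet down the middle. A minimal sketch of that geometry in Python, assuming the gutter sits exactly at the horizontal midpoint and that coordinates are in PDF points with the origin at the lower-left corner; `split_spread` and `Box` are hypothetical names used only for illustration:

```python
from typing import List, Tuple

# (left, bottom, right, top) in PDF points; origin assumed at lower-left.
Box = Tuple[float, float, float, float]


def split_spread(width: float, height: float) -> List[Box]:
    """Return crop boxes for the left and right halves of a two-page spread.

    Assumes the two portrait pages meet exactly at the horizontal midpoint.
    """
    mid = width / 2.0
    left_page = (0.0, 0.0, mid, height)
    right_page = (mid, 0.0, width, height)
    # Left half first, so the output keeps reading order.
    return [left_page, right_page]


if __name__ == "__main__":
    # A landscape A4 sheet is roughly 842 x 595 points.
    for box in split_spread(842.0, 595.0):
        print(box)
```

Each box could then be used as the crop region for one output page in whichever PDF tool or library does the actual splitting. If the scans are off-center, the midpoint would need adjusting per document.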
-
Working on it here: https://stackoverflow.com/questions/52998331/imagemagick-security-policy-pdf-blocking-conversion. On my Ubuntu 20.04, editing the ImageMagick policy for PDF to read | write (no need to restart the service) got me going. – pierrely Mar 18 '21 at 00:42
-
Working on it here: https://stackoverflow.com/questions/52998331/imagemagick-security-policy-pdf-blocking-conversion. On my Ubuntu 20.04, editing the ImageMagick policy for PDF to read | write (no need to restart the service) got me going. Also, a tip: use nice and renice to slow the process down. (That did not seem to work this time; still, the OCR did.) – pierrely Mar 18 '21 at 05:07