
I'm trying to OCR a PDF with a two-page layout: each landscape-orientation page of the PDF holds two portrait-orientation pages, the left half being one page and the right half the next. Sometimes the layout confuses tesseract. Can I tell it about the layout, or efficiently split the original PDF before running it through tesseract?

Raffi
  • Based on the information given, why does this involve Ubuntu? – David Feb 23 '21 at 07:59
  • It doesn't necessarily, but I've seen some good discussions of general features of OCR/tesseract on askubuntu. I've reposted here: https://ebooks.stackexchange.com/questions/8781/ocr-with-two-page-layout – Raffi Feb 23 '21 at 17:32
  • I have used pdfsandwich in the past, though today it is failing with a security error; it might work for you for OCR. For splitting I have used pdfshuffler, pdftk (to split odd and even pages), and pdfquench to trim pages. Some tips for you to look at, good luck. Also ImageMagick (comes with pdfsandwich?) as 'convert in.pdf -crop 50%x0 +repage out.pdf', which is also giving me the error. – pierrely Mar 18 '21 at 00:27
  • Working on it here: https://stackoverflow.com/questions/52998331/imagemagick-security-policy-pdf-blocking-conversion. On my Ubuntu 20.04, editing the policy to read | write (no need to restart the service) got me going. – pierrely Mar 18 '21 at 00:42
  • Working on it here: https://stackoverflow.com/questions/52998331/imagemagick-security-policy-pdf-blocking-conversion. On my Ubuntu 20.04, editing the policy to read | write (no need to restart the service) got me going. Also a tip: use nice and renice to slow the process down (that did not seem to work this time); still, the OCR did. – pierrely Mar 18 '21 at 05:07
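
The split suggested in the comments comes down to cutting each landscape sheet at its vertical midline and emitting the left half before the right half. Below is a minimal Python sketch of just that geometry, assuming each sheet has already been rendered to an image of known pixel size. The names `half_boxes` and `split_plan` are made up for illustration and are not part of tesseract, ImageMagick, or any tool mentioned here.

```python
from typing import List, Tuple

# (left, upper, right, lower) in pixels, origin at the top-left corner
Box = Tuple[int, int, int, int]


def half_boxes(width: int, height: int) -> Tuple[Box, Box]:
    """Return crop boxes for the left and right halves of a two-page sheet."""
    mid = width // 2
    left = (0, 0, mid, height)
    right = (mid, 0, width, height)  # the right half keeps the odd pixel, if any
    return left, right


def split_plan(sheet_sizes: List[Tuple[int, int]]) -> List[Tuple[int, Box]]:
    """List (sheet index, crop box) pairs in reading order: left, then right."""
    plan = []
    for index, (width, height) in enumerate(sheet_sizes):
        left, right = half_boxes(width, height)
        plan.append((index, left))
        plan.append((index, right))
    return plan
```

Each `(sheet index, box)` pair describes one single-page image to crop and hand to OCR. The boxes use the same `(left, upper, right, lower)` order that Pillow's `Image.crop` expects, so they could be passed to it directly if the sheets are loaded that way.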

0 Answers