If anyone could help, I'd appreciate it.

I'm trying to extract text with pdftotext from a number of PDF files. Unfortunately my output keeps ending up like this: "* * * $ * # 2 %"

Initially I thought the problem was that the font is Arial, so I installed the Arial font, but that did not change anything. Using different encoding options does not give any better result either. Before installing the Arial fonts, Evince could not show the text in the PDF file, but after installation the PDF is displayed fine, so I thought that was the main problem, but apparently not.

I'm using CentOS 6.7.

Thank you in advance for any feedback.

looser

1 Answer


Unsure if this is the case here, but a PDF file may use an arbitrary character encoding, referencing embedded glyphs simply by their index (0, 1, ...). This is enough to render the document correctly (that is, it looks right on screen), but the text itself is lost for practical purposes.

In that case, running OCR on the PDF is almost the only way to recover the original text. The alternative is guessing the monoalphabetic substitution for each PDF, if it's a really important document.
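If you go the substitution-guessing route, something along these lines might help. It is only a minimal sketch: the function names and the example mapping are made up for illustration, and it assumes each garbled symbol stands for exactly one real character throughout the document, which may not hold if the PDF uses several subsetted fonts.

```python
from collections import Counter


def symbol_frequencies(garbled):
    """Count non-whitespace symbols, most frequent first, as a hint for guessing."""
    return Counter(ch for ch in garbled if not ch.isspace()).most_common()


def apply_substitution(garbled, mapping, unknown="?"):
    """Replace every symbol using a hand-built mapping; keep whitespace as-is.

    Symbols that are not in the mapping yet are shown as `unknown`.
    """
    return "".join(
        ch if ch.isspace() else mapping.get(ch, unknown) for ch in garbled
    )


if __name__ == "__main__":
    # Hypothetical sample and mapping, NOT taken from a real document.
    sample = "$*# $**"
    guessed = {"*": "e", "$": "t", "#": "a"}
    print(symbol_frequencies(sample))
    print(apply_substitution(sample, guessed))
```

You would build the mapping by hand, comparing what Evince shows against what pdftotext emits, and refine it until no "?" is left. That only pays off for a handful of documents.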

jvb
  • Hi jvb, thank you for your feedback. I'm hoping OCR is not going to be my solution, as I'm trying to automate things as much as possible. I don't know if it matters, but if the issue was not just missing fonts, why could I not see my PDF properly in Evince before, while now I can? Is it possible that pdftotext for some reason does not have access to the fonts and Evince does? – looser Nov 01 '21 at 04:44
  • Hi @looser, you are right, arbitrary encoding usually goes with embedded fonts (in particular: partial fonts, with only the glyphs needed), and in that case installing fonts wouldn't change the appearance. So it might be a problem in `pdftotext`. The `podofo` tools contain a similar command line tool named `podofotxtextract`, which is based on a different library; maybe you could give that a try? – jvb Nov 01 '21 at 06:51
  • Hi @jvb. Thank you for your suggestion. At the moment I'm trying to install podofo, but that opened another can of worms with lots of dependency issues. So I'm trying to resolve that now :-/ – looser Nov 02 '21 at 05:54