1

pdftotext from Xpdf works fine on PDFs with normally embedded fonts, but fails on PDFs that embed font subsets. Is there any workaround for this issue?

Oliver Salzburg

2 Answers

2

The issue is probably that the characters rendered with the subset font use a custom encoding: the numeric codes for the characters do not correspond to ASCII, Latin-1, or any other common encoding.

See

This means there isn't an easy workaround.
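To make the encoding problem concrete, here is a toy sketch in Python. It is not how any real PDF writer or pdftotext works internally; the numbering scheme and the function names are assumptions for illustration only. It shows why codes assigned by a subsetter are meaningless without a lookup table (in a PDF, that role is played by a ToUnicode map, which is not mentioned in the answer above and is often missing from such files).

```python
# Toy model of a subset font (hypothetical scheme, for illustration only):
# only the glyphs actually used are kept, and each gets an arbitrary code.

def build_subset_encoding(text):
    """Assign codes 1, 2, 3, ... in order of first appearance."""
    encoding = {}
    for ch in text:
        if ch not in encoding:
            encoding[ch] = len(encoding) + 1
    return encoding

def encode(text, encoding):
    """Turn text into the byte codes stored in the page content."""
    return bytes(encoding[ch] for ch in text)

def naive_extract(codes):
    """What an extractor gets if it assumes the codes are Latin-1."""
    return codes.decode("latin-1")

def mapped_extract(codes, encoding):
    """Extraction when a code-to-character table is available."""
    reverse = {code: ch for ch, code in encoding.items()}
    return "".join(reverse[c] for c in codes)

text = "Hello"
enc = build_subset_encoding(text)
codes = encode(text, enc)

print(list(codes))                 # [1, 2, 3, 3, 4]
print(repr(naive_extract(codes)))  # control characters, not "Hello"
print(mapped_extract(codes, enc))  # Hello
```

Without the table, the only information left is the glyph shapes themselves, which is why recognizing the rendered page image is the usual fallback.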

RedGrittyBrick
2

In this situation, I have printed the PDFs through the Adobe PDF printer as high-quality images at a high resolution (1200 dpi or more; turn up any settings you can). Then I OCR the resulting image PDF, which leaves me with a searchable and workable PDF.

When I have many PDFs totaling thousands of pages, I open multiple PDF windows at once so that several PDFs are processed simultaneously on multiple cores. It is a PITA, but it works.
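The same rasterize-then-OCR idea can be scripted instead of done by hand. The sketch below is not the Adobe-based workflow described above: it swaps in two command-line tools, `pdftoppm` (from Poppler) to render pages to PNG and `tesseract` to produce a searchable PDF per page, and assumes both are installed and on the PATH. The function names, the `ocr_out` directory, the 300 dpi default, and the worker count are all assumptions to adjust.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def rasterize_cmd(pdf, out_prefix, dpi):
    # pdftoppm writes one PNG per page: <out_prefix>-<page number>.png
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(out_prefix)]

def ocr_cmd(image, out_base):
    # The trailing "pdf" asks tesseract to write <out_base>.pdf with a text layer
    return ["tesseract", str(image), str(out_base), "pdf"]

def process(pdf, work_dir, dpi=300):
    """Render one PDF to page images, then OCR each page."""
    pdf = Path(pdf)
    out_dir = Path(work_dir) / pdf.stem
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(rasterize_cmd(pdf, out_dir / "page", dpi), check=True)
    for image in sorted(out_dir.glob("page-*.png")):
        subprocess.run(ocr_cmd(image, image.with_suffix("")), check=True)
    return out_dir

def process_all(pdfs, work_dir, workers=4):
    """Handle several PDFs at once; the heavy work runs in child processes."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: process(p, work_dir), pdfs))
```

Calling `process_all(["a.pdf", "b.pdf"], "ocr_out")` would leave one single-page searchable PDF per page under `ocr_out/a/` and `ocr_out/b/`; merging them back into one file is left out of this sketch. Threads are enough here because each worker mostly waits on an external process.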

Hopefully your files are small! I've done this to upwards of 10,000 pages once (building code books). Not fun.

Damon