How to extract text from a PDF with embedded subset fonts

Question

pdftotext from Xpdf works fine on files with normal embedded fonts, but fails where embedded subset fonts are used. Is there any workaround for this issue?

score 2 · Accepted Answer · edited May 23 '17 at 12:41


The issue is probably that the characters which are rendered using the subset font have a custom encoding - the numeric representation of the characters does not correspond to ASCII, Latin-1 or any other common encoding.

See

This means there isn't an easy workaround.
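To make the encoding problem concrete, here is a toy sketch in Python. It is not how Xpdf works internally and it does not parse a real PDF; the function names are made up for illustration. It assumes a subset font that numbers its glyphs in order of first use, and it shows that the resulting codes only become text again if a code-to-character map is available (in a real PDF this role is played by the font's optional ToUnicode CMap).

```python
# Toy model of a subset font with a custom encoding (illustrative only,
# not a PDF parser). All function names here are hypothetical.

def build_subset_encoding(text):
    """Assign codes 1, 2, 3, ... to characters in order of first appearance."""
    char_to_code = {}
    for ch in text:
        if ch not in char_to_code:
            char_to_code[ch] = len(char_to_code) + 1
    return char_to_code


def encode(text, char_to_code):
    """Produce the byte string that would sit in the page content."""
    return bytes(char_to_code[ch] for ch in text)


def naive_extract(raw):
    """What an extractor gets if it assumes the codes are Latin-1."""
    return raw.decode("latin-1")


def extract_with_map(raw, code_to_char):
    """Extraction when a code-to-character map is available."""
    return "".join(code_to_char[b] for b in raw)


text = "Hello"
char_to_code = build_subset_encoding(text)
raw = encode(text, char_to_code)
code_to_char = {code: ch for ch, code in char_to_code.items()}

print(repr(naive_extract(raw)))             # control characters, not "Hello"
print(extract_with_map(raw, code_to_char))  # the original text
```

A viewer can still draw the page correctly because it only needs to map each code to a glyph shape in the embedded font, not to a character.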

edited by Community

answered Oct 08 '13 at 09:23 by RedGrittyBrick

score 2 · Answer 2 · answered Oct 08 '13 at 09:45


In this situation, I have printed the PDFs with the Adobe PDF printer at high resolution (1200 dpi or more) and high image quality (turn up every setting you can). I then OCR the resulting image PDF, which leaves me with a searchable, workable PDF.

When I have many PDFs running to thousands of pages, I open multiple PDF windows at once so that several PDFs are processed simultaneously on multiple cores. It is a PITA, but it works.

Hopefully your files are small! I've done this to upwards of 10,000 pages once (building code books). Not fun.
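For large batches, the same rasterize-then-OCR idea could be scripted instead of driven by hand. The sketch below is not the Adobe workflow described above: it swaps in the command-line tools `pdftoppm` and `tesseract`, and assumes both are installed and on the PATH. The helper names, the directory layout, and the default of 300 dpi are illustrative assumptions; raise `dpi` if you want something closer to the 1200 dpi mentioned above.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def rasterize_cmd(pdf, out_prefix, dpi=300):
    """Command that renders every page of a PDF to PNG images."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(out_prefix)]


def ocr_cmd(image, out_base):
    """Command that OCRs one image into a searchable PDF (out_base.pdf)."""
    return ["tesseract", str(image), str(out_base), "pdf"]


def ocr_pdf(pdf, work_dir, dpi=300):
    """Rasterize one PDF and OCR each page; returns the number of pages."""
    pdf = Path(pdf)
    out_dir = Path(work_dir) / pdf.stem
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(rasterize_cmd(pdf, out_dir / "page", dpi), check=True)
    pages = sorted(out_dir.glob("page-*.png"))
    for image in pages:
        subprocess.run(ocr_cmd(image, image.with_suffix("")), check=True)
    return len(pages)


def ocr_many(pdfs, work_dir, workers=4):
    """Process several PDFs at once; the external tools do the heavy work,
    so threads are enough to keep multiple cores busy."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: ocr_pdf(p, work_dir), pdfs))
```

Calling `ocr_many(["a.pdf", "b.pdf"], "ocr_out")` would leave one searchable PDF per page under `ocr_out/a/` and `ocr_out/b/`; merging the pages back into one file is left out of this sketch.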

answered by Damon

Thanks for the answer. But how come the PDF viewer is able to interpret it correctly? – Nishanth Lawrence Reginold Oct 08 '13 at 12:52
Probably because the encoding is embedded in the PDF, not the program. – Damon Oct 08 '13 at 16:53
