We're a small group promoting the adoption of Unicode in India, where legacy encodings are deeply entrenched. But I run into a problem when I convert a Unicode text document in any Indic language to PDF. The text displays as intended, but when it is copied and pasted, the content partially turns into gibberish.

I am using InDesign CC for typesetting on Windows 7. I can export to EPUB just fine, but the exported PDF has this problem. I also tried printing to the Adobe PDF printer and to PrimoPDF, which only made it worse. Checking PDFs on the internet, it turns out this problem seems to exist in all such Unicode-encoded Indic PDFs (and probably in all East Asian complex scripts). Is this a problem in the PDF spec?

Check out the PDF here: http://www.rajbhasha.nic.in/pdf/dolebook-4.pdf

Copy any text and compare it with the original: you'll see that characters are replaced by other characters and that unnecessary white space has crept in.

Now, we're promoting Unicode on the grounds that it makes copy-pasting and searching/indexing easier. This problem totally undermines that. Any ideas?

coldbreeze16
  • I can confirm that copy/pasting your document on a Mac also alters the characters. I can't read it, but there are a few noticeable differences. That might suggest the source conversion is at fault. Maybe have a look at [Calibre](http://calibre-ebook.com) (freeware) to do the conversion instead. It might at least tell you where the issue starts. – Tetsujin Sep 15 '16 at 10:18
  • Can confirm the copy/paste problem on Linux with `xpdf`. I looked at the PDF with `mutool`; it uses special fonts that don't use Unicode encoding. You need some other program for your typesetting (instead of InDesign CC), one that produces PDFs with Unicode encoding (no, I don't know of any option for Windows 7). [This question](http://stackoverflow.com/questions/128162/unicode-in-pdf) has technical details about Unicode in PDF; it seems possible, but not easy to do. – dirkt Sep 15 '16 at 11:35
  • Correction: the example PDF actually uses `/ToUnicode` mappings, but they don't seem to work for some reason. Don't know yet what goes wrong. – dirkt Sep 15 '16 at 11:51
  • See also: http://stackoverflow.com/questions/12703387/pdf-font-encoding-why-cant-i-copy-text-from-a-pdf – u1686_grawity Sep 15 '16 at 13:35
  • @Tetsujin: I did try converting the resulting EPUB to PDF using various tools, including Calibre. The problem actually got worse. – coldbreeze16 Sep 17 '16 at 17:27

1 Answer

I decompressed the PDF with `mutool clean` and had a look at it. The problem seems to be that, as described in this Stack Overflow question, it's difficult to use Unicode encoding for the fonts. For this reason, the fonts the PDF contains use a different encoding. However, the PDF also contains `/ToUnicode` objects for each font, with a complicated mapping from the font glyphs to the Unicode characters.

Now, many PDF viewers (e.g. xpdf on Linux) don't seem to pay attention to this complicated mapping (or at least not to a mapping of this complexity, though they may handle simpler ones), which is why you get garbage when you copy and paste. With other PDF viewers (like mupdf), however, it works, as I've confirmed.

So the problem lies in the PDF viewer, not in the document. Also, PDF and Unicode don't go together that well, as you can see from the complicated machinery needed to do the translation.

Possible solutions: (1) Pressure the developers of PDF viewers to fully support `/ToUnicode` mappings, or fix the open-source ones yourself. (2) Promote a particular PDF viewer that works with the mappings. (3) Try to use fonts inside the PDF whose glyph encoding matches the Unicode encoding. This seems possible with 16-bit Unicode code points (and the Indian characters seem to be 16-bit, as far as I can tell), but I don't know how well this will work, or which application you should use to produce such PDFs.
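To illustrate what a `/ToUnicode` mapping does when text is copied, here is a minimal Python sketch. It assumes a simplified CMap that contains only a `beginbfchar` section; the glyph codes (`0201` to `0204`) are made up for illustration and are not taken from the example PDF. The point is that one glyph code may map to several Unicode code points, which is what a viewer has to handle for conjuncts and ligatures.

```python
import re

# Hypothetical /ToUnicode fragment. The glyph codes on the left are invented;
# the right-hand sides are UTF-16BE hex strings for Devanagari characters.
TOUNICODE = """
4 beginbfchar
<0201> <0938>
<0202> <0902>
<0203> <0938094D>
<0204> <0915>
endbfchar
"""

def parse_bfchar(cmap_text):
    """Return {glyph_code: unicode_string} for every bfchar entry."""
    mapping = {}
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", block):
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping

def extract_text(glyph_codes, mapping):
    """Translate glyph codes to text; unmapped codes become U+FFFD."""
    return "".join(mapping.get(code, "\ufffd") for code in glyph_codes)

mapping = parse_bfchar(TOUNICODE)
# Glyph 0203 stands for a single "half SA" glyph that maps to SA + VIRAMA.
text = extract_text([0x0201, 0x0202, 0x0203, 0x0204], mapping)
print([f"U+{ord(ch):04X}" for ch in text])
```

A viewer that only honours one-to-one entries, or ignores the table altogether, would lose or replace the two code points behind glyph `0203`.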

dirkt
  • This seems to be correct, because with different PDF readers I get different output on copying, which is only possible if each one interprets the ToUnicode table differently. As for your proposed solutions, 1 and 2 are not feasible because people won't switch their OS and PDF reader just for this. And all the PDF readers I've tested have problems (Adobe Acrobat, PDF X, Foxit, Google PDF viewer). As for 3, I didn't get it. This document uses the standard Unicode Hindi font Mangal, supplied with Windows Vista and above. – coldbreeze16 Sep 17 '16 at 17:41
  • [Mupdf](http://mupdf.com/) also works on Windows, so you can try that too. qpdfview on Linux also works. I'll try to make an example file for (3) to see if it works even without ToUnicode tables, but that might take some time. – dirkt Sep 17 '16 at 19:31
  • It looks like `xpdf` just ignores any "complex" characters besides ASCII for cut and paste, while `mupdf` produces a UTF-8 encoded paste. That means I can't properly test here on Linux. I've created a decompressed [PDF file](https://www.dropbox.com/s/h28u9yrth85xta2/indian.pdf?dl=0) with xetex that is not as complex as your example. You can inspect it with a text editor. Glyphs are in the 0200-0400 range; the corresponding Unicode is 09xx. Test your viewers with it: if you can paste Unicode chars in the 0200-0400 range, creating a special font should work with that viewer. – dirkt Sep 19 '16 at 09:25
  • I was away from home, just returned, and tested mupdf on Win 7 and Ubuntu 14.04. The same problem persists on copying. I am not sure what I am doing wrong. I tried your PDF in all my viewers as well. No luck. – coldbreeze16 Sep 19 '16 at 18:54
  • Huh. I'm on Debian, which is very close to Ubuntu, and mupdf works fine (shift-right button to select). Where were you pasting it into? Can you do a `xclip -o | hexdump -C` from the commandline on the selection and post the results? (Packages `xclip`, `bsdmainutils` if not installed). Also, can you post what exactly are the results for my PDF with the various viewers? A tool like [inside clipboard](http://www.nirsoft.net/utils/inside_clipboard.html) helps, IIRC it also shows hex. – dirkt Sep 20 '16 at 06:24
  • For comparison: I get e.g. `e0 a4 b8 e0 a4 82 e0 a4 83 e0 a4 95`, which is what the UTF-8 encoded characters "संःक" look like. – dirkt Sep 20 '16 at 06:32
  • Here is the output for the first line: `00000000 e0 a4 b8 e0 a4 82 e0 a4 83 e0 a4 95 e0 a5 83 e0 |................|` `00000010 a4 a4 e0 a4 ae e0 a5 8d 0a |.........|` `00000019` – coldbreeze16 Sep 21 '16 at 07:53
  • Yeah, that's the thing! Look in your PDF. It shows संस्कृतम् but on copying it becomes संःकृतम्. The third letter, स्, has been replaced by ः. – coldbreeze16 Sep 21 '16 at 07:57
  • I also noticed another strange effect. On Win 7, when I use any software to produce the PDF (Acrobat, PDF printer, Primo PDF, inDesign, MS Word, Libre Office), it doesn't matter what font I use: the copied text is gibberish. But when I use Lyx with XeTex and change fonts, weird things happen. With almost any font, "looks fine, but copies shit" happens. But when I use the NirmalaUI font (comes with MS Word 2013) and use XeTex to output the PDF, the PDF now "looks shit but copies fine" (almost... the copied text has some spaces deleted, but all the text is intact). TBH, Nirmala is an incomplete, half-baked font. – coldbreeze16 Sep 21 '16 at 08:04
  • I thought by "gibberish" you meant "something completely incomprehensible"? If you just mean "not accurate", and this happens for all viewers, the problem is essentially solved: copy and paste worked correctly given the information in the PDF; it's just that the ligature SA+VIRAMA was rendered in a way that didn't preserve the information that it was originally SA+VIRAMA. (Sorry, I had no idea how Devanagari works; I have to figure that all out on the way.) So if that is the only problem, you "just" need a way to produce PDFs that keep this information in the `/ToUnicode` table. – dirkt Sep 21 '16 at 15:53
  • It very much looks like you need to put together your own font with correct tables that also cover ligatures, to prevent the "संः" effect. And then find or write some program that works with that font. Xetex/Xelatex might be a good candidate. – dirkt Sep 21 '16 at 15:56
  • Have a look at [this](https://www.dropbox.com/s/8p0j3yrlsr8a0kn/indian2a.pdf?dl=0) hand-modified PDF to see what the cmap should look like. – dirkt Sep 21 '16 at 16:19
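The hexdump exchanged in the comments can also be checked without `xclip`. The following Python sketch decodes the pasted bytes as UTF-8 and compares them, code point by code point, with the word the PDF displays (संस्कृतम्). The helper names are hypothetical and only the standard library is used; treat it as one possible way to pin down which characters a viewer substituted, not as a definitive tool.

```python
import difflib
import unicodedata

# Bytes from the hexdump posted in the comments (the trailing 0a is a newline).
PASTED_HEX = ("e0 a4 b8 e0 a4 82 e0 a4 83 e0 a4 95 e0 a5 83 e0 "
              "a4 a4 e0 a4 ae e0 a5 8d 0a")
# The word as displayed in the PDF, written as explicit code points.
EXPECTED = "\u0938\u0902\u0938\u094d\u0915\u0943\u0924\u092e\u094d"

def decode_hexdump(hex_bytes):
    """Decode space-separated hex bytes as UTF-8 and drop the final newline."""
    return bytes.fromhex(hex_bytes).decode("utf-8").rstrip("\n")

def describe(text):
    """List each character as 'U+XXXX NAME' for readable reporting."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

def diff_codepoints(expected, pasted):
    """Return (tag, expected_part, pasted_part) for every differing span."""
    matcher = difflib.SequenceMatcher(None, expected, pasted, autojunk=False)
    return [(tag, expected[i1:i2], pasted[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]

pasted = decode_hexdump(PASTED_HEX)
for tag, want, got in diff_codepoints(EXPECTED, pasted):
    print(tag, describe(want), "->", describe(got))
```

For these bytes the only difference is the SA + VIRAMA pair (U+0938 U+094D) being replaced by a single VISARGA (U+0903), which matches the संः effect described above.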