
I have a Microsoft Word document with Hebrew, and some of the vowel marks seem to be separate from the letters they are supposed to be under.

Example:

[Screenshot: Hebrew text in which a vowel mark is displaced from its letter]

Using a string analyzer, I determined that the affected letters are encoded as characters from the "Alphabetic Presentation Forms" block rather than as regular Hebrew letters. (In the example above, the gimel with dagesh was encoded as the single character U+FB32, rather than as U+05D2 followed by U+05BC.)

Is there any way to convert everything to standard Hebrew Unicode characters, so the vowels will display properly?

Thanks!

Dave

3 Answers


Try this niqqud add-on; something may have gone wrong in the way the niqqud was added.

matan129
  • As far as I can tell, this add-on just streamlines the process of adding vowels, but will not fix the entire file at once. I am dealing with a very large file and don't want to redo the whole thing! – Dave Jul 04 '13 at 16:20
  • I don't think there is such a fix, but can you upload the file (or part of it, if it's personal or something) and let me check? Also, did you write the file? If so, how did you add the ניקוד (niqqud)? – matan129 Jul 04 '13 at 16:22
  • It's actually someone else's file, which I don't have permission to upload. I don't know how the nikkud was added, but I suspect that it was done in a recent version of Word that handles it in a way that is not recognized by my Word 2003. – Dave Jul 04 '13 at 16:24
  • It would seem that find/replace could solve the problem, but when I enter the Unicode values for the offending letters in Word's "Find" box, it selects not only the letter but also the nekudah that follows it. – Dave Jul 04 '13 at 16:26
  • Alright, if someone else wrote the file, the problem is on their side, meaning the niqqud tool they used messed up the order of the characters. Try (1) opening the file in another version of Word and (2) changing the font. Can you upload even a meaningless sentence made of assorted words from the file? – matan129 Jul 04 '13 at 16:27
  • I don't have any other version of Word. Changing the font doesn't help. [Here](https://dl.dropboxusercontent.com/u/3563246/temp.doc) is a link to a doc with a word from the file, so you can see what I mean. – Dave Jul 04 '13 at 16:33
  • btw, I realized that the "Find" box does work properly if "match diacritics" is checked. So theoretically this could be solved using find/replace, but that would be quite tedious... – Dave Jul 04 '13 at 16:34
  • Interesting, the word is displayed correctly in Word 2010: [Screenshot](http://i.imgur.com/U2yoSy3.png). – matan129 Jul 04 '13 at 16:36
  • I couldn't find a fix for that problem in Word 2003. But it might be only a display issue, meaning the printout will be OK. Try printing 2-3 lines from the document to test this (more than one word, to check whether the line spacing intrudes on the niqqud's space and moves it). – matan129 Jul 04 '13 at 16:39
  • Hmm. I wonder what would happen if you saved it as a really old format, maybe it would convert the letters back to standard characters so that it can display properly in Word 2003? – Dave Jul 04 '13 at 16:40
  • It's not just a display problem; there is also an issue that when I try to open the file with a Hebrew word processor (Davkawriter), it doesn't recognize those letters at all. So I really need to revert those letters to the earlier standard. – Dave Jul 04 '13 at 16:43
  • Try saving it as Rich Text Format – matan129 Jul 04 '13 at 16:45
  • RTF didn't help. Maybe using Word 2010 to save as RTF would accomplish something, but I doubt it. – Dave Jul 04 '13 at 16:48
  • Well, I suppose those are all the options available, and if none of them works I guess there is no solution (that I can think of) :( – matan129 Jul 04 '13 at 16:50

Your test document seems to display OK in Word 2007, but when I copy and paste the text from it into the BabelPad editor, it is displayed wrongly, in the same way as in your picture. When I use the BabelPad command Convert → Normalization Form → To NFC, the display gets fixed.

It seems that the problem is not with precomposed characters like U+FB32 HEBREW LETTER GIMEL WITH DAGESH as such, but in conjunction with an additional combining mark like U+05B7 HEBREW POINT PATAH after it. Some programs cannot deal with such combinations, even though they can handle a fully decomposed form (base letter followed by two combining marks).

It is impossible (and probably irrelevant) to know how the character combinations got into the file. They are valid Unicode data, but unnormalized, and normalization would presumably fix the problem. It seems that you could really use any of the Unicode normalization forms here, but NFC is often favored for general reasons.
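
As a rough illustration of what normalization does to the sequence described above, here is a minimal Python sketch using the standard `unicodedata` module. The sample string (U+FB32 followed by U+05B7) is taken from the example in this thread; the helper name `describe` is made up for the illustration. U+FB32 is excluded from recomposition, so NFC replaces it with the base letter U+05D2 plus the combining dagesh U+05BC instead of keeping the presentation form.

```python
import unicodedata

# U+FB32 (precomposed gimel with dagesh) followed by U+05B7 (patah),
# i.e. the kind of sequence that some programs render incorrectly.
problem = "\uFB32\u05B7"


def describe(text):
    """Return one 'U+XXXX NAME' string per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


# NFC decomposes the presentation form and does not recompose it.
fixed = unicodedata.normalize("NFC", problem)

for label, text in (("before", problem), ("after NFC", fixed)):
    print(label)
    for line in describe(text):
        print("  " + line)
```

After normalization the string consists of the base letter followed only by combining marks, which is the fully decomposed form that the affected programs can handle.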

As far as I know, Word has no tools for normalization, so you would need to use external tools for it. BabelPad would be suitable for plain text, but I don’t know how well it handles large files, and you probably have some formatting you need to preserve. So maybe you could save the file as HTML, normalize the data to NFC in BabelPad, and then open the modified HTML file in Word. (I first thought of using RTF instead of HTML, but Word seems to generate RTF that does not contain the actual Hebrew characters, only escape notations for them.)
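
If BabelPad is not at hand, the normalization step of the HTML route could also be scripted. The following is only a sketch under stated assumptions: the function name `normalize_html_file` and the file names in the usage comment are hypothetical, and it assumes that the HTML file exported from Word is encoded as UTF-8 and contains the Hebrew characters literally. If the export uses another encoding, or writes the characters as numeric references such as `&#64306;`, this sketch will not change them.

```python
import unicodedata
from pathlib import Path


def normalize_html_file(src, dst, encoding="utf-8"):
    """Read src, apply NFC to its whole text, and write the result to dst.

    Returns True if normalization changed anything, False otherwise.
    The encoding default is an assumption; check how the file was saved.
    """
    text = Path(src).read_text(encoding=encoding)
    fixed = unicodedata.normalize("NFC", text)
    Path(dst).write_text(fixed, encoding=encoding)
    return fixed != text


# Hypothetical usage (file names are placeholders):
# changed = normalize_html_file("document.htm", "document-nfc.htm")
```

Writing to a separate output file keeps the original export untouched, so the result can be checked in Word before anything is overwritten.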

Jukka K. Korpela
  • Thanks, but this is somewhat over my head. I'd rather not start changing file types, as the file is heavily formatted. Do you suppose that using Word's "Find / Replace" (with ^u to target Unicode) would work? There are only about 30 affected characters, and changing them to their separate components (e.g. U+FB32 to U+05D2 and U+05BC) would seem to solve the problem. – Dave Jul 05 '13 at 15:11
  • I tried opening an html version with BabelPad, as you suggested, but the convert to NFC option was greyed out. – Dave Jul 05 '13 at 15:22
  • Just realized that the option is available by selecting the text and using the context menu. Unfortunately, converting it to NFC didn't help. – Dave Jul 05 '13 at 15:28
  • When you say that NFC didn’t help, do you mean it didn’t fix the rendering in BabelPad (it did in my simple test with your data) or that the fix did not transfer to Word when the HTML file was opened in it (it did in my test, on Word 2007)? – Jukka K. Korpela Jul 05 '13 at 19:39
  • It seems that you could make the change using Find and Replace in Word, but it gets awkward, and you would need to write the code points in decimal rather than hexadecimal notation, e.g. `^u64306` for U+FB32. – Jukka K. Korpela Jul 05 '13 at 19:48
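
For the Find and Replace route discussed in the comments above, the decimal codes do not have to be worked out by hand. The sketch below lists, for each Hebrew presentation form that has a canonical decomposition, the `^u` search string in decimal and the decomposed characters to use as the replacement. The function names are made up for the illustration, and the range U+FB1D to U+FB4E is an assumption about which characters are relevant. The replacement is given as literal text to paste, because this sketch does not assume that Word accepts `^u` codes in the "Replace with" box.

```python
import unicodedata


def word_find_code(text):
    """Word 'Find what' notation: ^u followed by the DECIMAL code point."""
    return "".join(f"^u{ord(ch)}" for ch in text)


def replacement_table(first=0xFB1D, last=0xFB4E):
    """Pair each presentation form with its canonical decomposition.

    Code points without a canonical decomposition are skipped.
    """
    table = []
    for cp in range(first, last + 1):
        ch = chr(cp)
        decomposed = unicodedata.normalize("NFD", ch)
        if decomposed != ch:
            table.append((word_find_code(ch), decomposed))
    return table


for find, replacement in replacement_table():
    parts = " ".join(f"U+{ord(ch):04X}" for ch in replacement)
    print(f"{find} -> {replacement} ({parts})")
```

For U+FB32 this gives the search string `^u64306` and the replacement U+05D2 followed by U+05BC. As noted earlier in the thread, "match diacritics" may need to be checked so that Find selects only the letter and not the vowel mark after it.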

I couldn't get this in as a comment, so I'll submit it as an answer. Based on @Jukka K. Korpela's suggestion, I composed a Word macro that converts the precomposed characters into 'normal' ones. It can be downloaded here.

Zeke