
There is a JSON file of unit test vectors for Japanese mnemonics which I want to validate using Python, specifically with this fork of pybitcointools, which has BIP39 functionality.

Unit tests against Trezor's python-mnemonic test vectors work fine (in Python 2.7, in my experience), but that case is straightforward: the mnemonics are all lower-case English, so there's no normalization of Unicode diacritics and the like to worry about.

The Japanese fields are:

  1. Entropy (hex)
  2. Mnemonic (Japanese)
  3. Password (Japanese, appears to be the same for all tests)
  4. Seed (hex, 64 bytes)
  5. xprv

So entropy seeds mnemonic (bip39?), then mnemonic | password hashes to Seed; Seed then acts as the master key for the bip32 xprv? (correct me if I'm wrong!?)

So, assuming it's that straightforward...

  1. how is the Japanese unicode text "normalized"? (Is it just NFKD Unicode normalization, which Electrum 2.0 does?)
  2. what does "normal" mean for Japanese?
Wizard Of Ozzie

1 Answer


So entropy seeds mnemonic (bip39?), then mnemonic | password hashes to Seed; Seed then acts as the master key for the bip32 xprv? (correct me if I'm wrong!?)

That sounds about right. Most of the process is well detailed in BIP-39.

  1. A SHA-256 hash is taken of the entropy, and the first entropy_len_in_bits / 32 bits of this hash are appended to the end of the entropy. The resulting bit string divides evenly into 11-bit chunks (it's no longer an integral number of bytes).
  2. Each 11-bit chunk is converted into one of 2^11 = 2048 mnemonic words.
  3. The words are joined by spaces. For display purposes in Japanese, these should be Unicode IDEOGRAPHIC SPACEs, '\u3000'. If there's no need to display the mnemonic to the user, they can be "normal" SPACEs ('\u0020').
  4. The mnemonic sentence is Unicode normalized in NFKD form. This converts any IDEOGRAPHIC SPACEs into SPACEs. It also changes some characters in some of the mnemonic words, therefore this step cannot be skipped. (The question What is NFKD normalization? is a whole separate topic that's probably best asked elsewhere IMO....)
  5. The mnemonic sentence is converted into bytes via UTF-8 encoding.
  6. The binary seed is computed as PBKDF2-HMAC-SHA512(password=utf8_mnemonic, salt="mnemonic" | passphrase, iterations=2048, output_length=64 bytes). The passphrase can be the empty string; it must first go through the same steps 4 and 5 as the mnemonic.
  7. (BIP-39 itself doesn't cover this part; it's BIP-32's master key generation.) The binary seed is fed through HMAC-SHA512 with the key "Bitcoin seed"; the first 32 bytes of the result become the master private key, and the last 32 bytes become the chain code of the master extended private key. Rough Python sketches of these steps follow the list.
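
For steps 1–3, a minimal Python 3 sketch. `wordlist` is assumed to hold the 2048 official words for the chosen language, in order; the function name and the `japanese` flag are mine, purely for illustration:

```python
import hashlib

def entropy_to_mnemonic(entropy_bytes, wordlist, japanese=True):
    # Step 1: append the first (entropy_bits / 32) bits of SHA-256(entropy)
    # as a checksum, giving a bit string divisible into 11-bit chunks.
    ent_bits = len(entropy_bytes) * 8
    checksum_bits = ent_bits // 32
    checksum = hashlib.sha256(entropy_bytes).digest()

    bits = bin(int.from_bytes(entropy_bytes, 'big'))[2:].zfill(ent_bits)
    bits += bin(int.from_bytes(checksum, 'big'))[2:].zfill(256)[:checksum_bits]

    # Step 2: each 11-bit chunk indexes one of the 2**11 = 2048 words.
    words = [wordlist[int(bits[i:i + 11], 2)] for i in range(0, len(bits), 11)]

    # Step 3: join with IDEOGRAPHIC SPACE for Japanese display, an ordinary
    # space otherwise.
    return (u'\u3000' if japanese else u' ').join(words)
```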
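
And a sketch of steps 4–7 (again Python 3). `mnemonic_to_seed` and `seed_to_master_key` are illustrative names of my own, not pybitcointools functions:

```python
import hashlib
import hmac
import unicodedata

def mnemonic_to_seed(mnemonic, passphrase=u''):
    # Steps 4-5: NFKD-normalize (this turns IDEOGRAPHIC SPACEs into ordinary
    # spaces and decomposes some Japanese characters), then UTF-8 encode.
    # The passphrase gets the same treatment before being appended to the salt.
    password = unicodedata.normalize('NFKD', mnemonic).encode('utf-8')
    salt = b'mnemonic' + unicodedata.normalize('NFKD', passphrase).encode('utf-8')
    # Step 6: PBKDF2-HMAC-SHA512, 2048 iterations, 64-byte output.
    return hashlib.pbkdf2_hmac('sha512', password, salt, 2048, dklen=64)

def seed_to_master_key(seed):
    # Step 7 (BIP-32 master key generation): HMAC-SHA512 the 64-byte seed
    # with the key "Bitcoin seed"; the left half is the master private key,
    # the right half is the chain code.
    digest = hmac.new(b'Bitcoin seed', seed, hashlib.sha512).digest()
    return digest[:32], digest[32:]
```

Serializing that (key, chain code) pair with the usual version bytes, depth, fingerprint and child number, then Base58Check-encoding it, is what produces the xprv string in field 5.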

Is it just NFKD Unicode normalization, which Electrum 2.0 does?

Electrum 2.x does use NFKD normalization, but it performs additional steps as well, such as removing the spaces between Japanese words after step 4. It also uses a different salt string in step 6, and a completely different process prior to step 4. See this answer for an implementation of Electrum 2.x's mnemonic-words-to-seed procedure in Python.
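
For reference, here is a rough sketch of the kind of normalization Electrum 2.x applies before hashing, paraphrased from memory of its mnemonic code; the function name and the simplified CJK check are mine, so treat it as illustrative rather than as Electrum's actual source:

```python
import string
import unicodedata

def is_cjk(c):
    # Simplified stand-in: Electrum checks a whole list of CJK Unicode ranges;
    # this kana + CJK Unified Ideographs test is only an approximation.
    return 0x3040 <= ord(c) <= 0x30FF or 0x4E00 <= ord(c) <= 0x9FFF

def electrum_normalize(text):
    text = unicodedata.normalize('NFKD', text)   # NFKD, as in BIP-39 step 4
    text = text.lower()                          # lower-case everything
    # strip combining marks (accents, dakuten, ...)
    text = u''.join(c for c in text if not unicodedata.combining(c))
    # collapse all whitespace runs into single ordinary spaces
    text = u' '.join(text.split())
    # drop the spaces *between* CJK characters -- the step BIP-39 doesn't have;
    # the previous step guarantees any remaining space has a neighbour on each side
    return u''.join(c for i, c in enumerate(text)
                    if not (c in string.whitespace
                            and is_cjk(text[i - 1]) and is_cjk(text[i + 1])))
```

The different salt string mentioned above is, if I recall correctly, "electrum" | passphrase in place of "mnemonic" | passphrase.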

Christopher Gurnee
  • Thanks for the response. I've got gaps in how UTF-8 fits in, though I understand pretty well what NFKD does. **Why not just encode Unicode directly?** Also, I do understand the ideographic space, I think (\u3000 is the Japanese "equivalent" of \u0020). It was actually [Electrum 2.0's implementation of the seed preparation](https://github.com/simcity4242/pybitcointools/blob/master/bitcoin/deterministic.py#L201) which was confusing me. That really is strange that Electrum deviated in such obscure places – Wizard Of Ozzie Jun 04 '15 at 11:18
  • Scratch that. You can't encode Unicode 4 byte code points as well as UTF8, right? – Wizard Of Ozzie Jun 04 '15 at 11:22
  • @WizardOfOzzie Unicode strings are simply sequences of integers in the range [0,0x10FFFF]. Before you can hash it, you need to convert it into a sequence of bytes. A simple way is to just take each 4-byte int, and use those bytes as-is (UTF-32LE encoding), but this is inefficient from a space point of view (considering that English only needs 1 byte per character). UTF-8 is more complicated, but more space efficient most of the time (see the short sketch after this comment thread). – Christopher Gurnee Jun 04 '15 at 11:27
  • @WizardOfOzzie Agreed that Electrum 2.x's normalization is much more complex, but it's helpful to minimize the chance of loss from mistyped mnemonics given the lack of any specific wordlists. (And BIP-39's requirement for specific wordlists was something Electrum 2.x's dev really disagreed with.) – Christopher Gurnee Jun 04 '15 at 11:34
  • Does this look right? `norm = lambda d: (' '.join(unicodedata.('NFKD', unicode(d)).split('\u3000'))).encode('utf-8')` – Wizard Of Ozzie Jun 04 '15 at 22:26
  • Assuming `d` is a mnemonic sentence of Python2 type string or a unicode? I'd think something like this for most languages (different for Chinese, which might not have spaces in `d`): `norm = lambda d: (u' '.join(unicodedata.normalize('NFKD', unicode(d)).split())).encode('utf-8')` (I added the missing "normalize" and changed split to split on all whitespace). Note that if you're accepting input from a user, BIP-39 requires that you verify the checksum. – Christopher Gurnee Jun 04 '15 at 22:47
  • Thx, I'll try it now. Hmmm, assuming py 2/3 so let's say both Unicode and str. Yep, I'm pretty well versed with bip39 so my mn2hex function asserts check_bip39 – Wizard Of Ozzie Jun 04 '15 at 23:25
  • I've gotten all the unit tests to work except for the comparison of `bip39_hex_to_mn` with `VECTOR['mnemonic']`, as the function returns a standard space (i.e. `\u0020`) whereas the test vectors use `\u3000`. Electrum (2.x) solves the ideographic space (`\u3000`) problem by concatenating all CJK words without spaces. See https://github.com/simcity4242/pybitcointools/blob/master/test_bip39.py#L123 (I've only got it to work with `u' '.join(v['mnemonic'].split())`, whereas it should just be `v['mnemonic']`). – Wizard Of Ozzie Jun 09 '15 at 04:35
  • Tangentially related, why do the word lists for English and Japanese have the correct format (each word is a unicode object consisting of unicode characters), whereas the Spanish, Chinese and French (https://github.com/simcity4242/pybitcointools/blob/master/bitcoin/_bip39.py#L2349 and below) wordlists are of this format: `'\xe7\x9a\x84'`? – Wizard Of Ozzie Jun 09 '15 at 04:42
  • @WizardOfOzzie By my reading of the standard, bip39_hex_to_mn() should return mnemonics with ideographic spaces for Japanese, so that the result can be used for both display purposes and for calculating the binary seed. As it's currently written, the returned value is only suitable for the latter purpose. Given this change, you could remove the "hack" you made to TestBIP39JAP. – Christopher Gurnee Jun 09 '15 at 14:04
  • @WizardOfOzzie Regarding your word lists, I agree with you that I'd prefer they all be unicode objects for consistency. Where did you get your word lists? Perhaps you should download the official BIP-39 word list text files directly from [the repo](https://github.com/bitcoin/bips/tree/master/bip-0039), and load them with something like `with io.open(language+'.txt', encoding='utf_8_sig') as words_file: words[language] = tuple(word.strip() for word in words_file)`. – Christopher Gurnee Jun 09 '15 at 14:22
  • I loaded them using iPython, then wrote the variable to a file with `%store var >> file.py`. I'll update both recommendations, thanks! – Wizard Of Ozzie Jun 10 '15 at 03:24
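
To tie the thread together, here's a small sketch (it runs under Python 2.7 or 3) of two of the points above: the corrected normalization lambda, and why the Unicode string has to be encoded into bytes before hashing. The sample string is just the first Japanese wordlist entry repeated, not anything from the test vector file:

```python
# -*- coding: utf-8 -*-
import unicodedata

# Display form: two words joined by an IDEOGRAPHIC SPACE (U+3000).
display_form = u'あいこくしん\u3000あいこくしん'

# The corrected lambda from the comments (assuming `d` is already a unicode
# string): NFKD-normalize, re-join on single ordinary spaces, UTF-8 encode.
norm = lambda d: u' '.join(unicodedata.normalize('NFKD', d).split()).encode('utf-8')

print(norm(display_form))  # the words now separated by U+0020, as UTF-8 bytes

# A Unicode string is a sequence of code points; PBKDF2 needs bytes.
# UTF-32LE spends 4 bytes per code point, UTF-8 spends 1-4 (3 for these kana);
# BIP-39 specifies UTF-8.
print(len(display_form.encode('utf-32-le')))  # 52 bytes (13 code points * 4)
print(len(display_form.encode('utf-8')))      # 39 bytes (3 per character here)
```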