How to handle incomplete blocks when reading blk*.dat files?

Question

I'm trying to write a parser for blk*.dat files from scratch. Right now, I can read blocks, extract the header fields and calculate the correct block hash based on the header fields.

I just tried to run the parser through all blk-files and it failed after successfully reading the first ~300 blk-files. When inspecting the failing file with a hex-editor, I found two "magic byte fields", only separated by 4 other bytes.

Since the mandatory block-header is supposed to be 80 bytes in length, I'm now quite confused.

I found three probably related questions:

My guess, based on these other questions, is the following:

1. This is an 'incomplete block'.
2. This happens when a block can only be partially downloaded or written to disk.
3. Deleting the affected blk files and re-downloading parts of the blockchain doesn't make sense, since this can be expected to happen again.
4. The parser should be able to skip these blocks.
5. The overall blk file is not missing actual data and is not corrupt.
6. The 'incomplete block' will be downloaded and written again further down in the blk file (or to another blk file).

Which of my assumptions are correct, and which are not? Am I missing something?

If you go on 0xd0 bytes further, is there another magic 4-bytes? Theoretically the f9beb4d9 could occur within the data of the block itself, so you can't be sure that any occurrence marks the start of a block. You should be using the lengths. Though the first bytes of the block should be the version number, I think, and I am not sure whether f9beb4d9 would be valid there. — Nate Eldredge, Feb 15 '18 at 02:58
@eponymous I deleted and re-downloaded all blk files. This time, I made sure that the node didn't stop a single time. After that, I had no issues reading all of the blk files. This supports your theory. — forgemo, Feb 15 '18 at 18:06
Yeah. My files are in a strange state due to many forced node restarts during IBD. The software is fine with this because there’s an external database of pointers to the blocks on disk which can just skip the corruption. — Claris, Feb 15 '18 at 18:07
@NateEldredge No, there is not. It's also very unlikely for those 4 bytes to occur within the data. https://bitcoin.stackexchange.com/questions/2337/how-was-the-magic-network-id-value-chosen — forgemo, Feb 15 '18 at 18:10
Unlikely to occur by chance, perhaps, but there is nothing preventing anyone from intentionally inserting those 4 bytes in a transaction (via OP_RETURN, for instance). A miner who was feeling ornery could also twiddle the coinbase transaction until those bytes appeared in the Merkle root; it would take something on the order of 100 million tries, which is not much. In that case you'd see those bytes in the block header. — Nate Eldredge, Feb 15 '18 at 18:30
Given your additional description, I agree that this is probably not the issue here, but it shows that your code definitely needs to handle this case properly. — Nate Eldredge, Feb 15 '18 at 18:31
Thanks @NateEldredge, I didn't think of somebody generating those bytes on purpose. I wonder if this has already been tried? You are absolutely right, though. My parser has to handle this properly. — forgemo, Feb 15 '18 at 19:36
@forgemo: I'd be kind of surprised if it hasn't. Maybe I'll search the blockchain sometime to see. If nobody has done it then maybe I will :-) — Nate Eldredge, Feb 15 '18 at 19:39
@NateEldredge I couldn't resist searching for it. :) As you already guessed, somebody did it. The following block from 2012 seems to contain such a case. https://blockchain.info/block/00000000000005D7E684BEB913BD73FDC33BFD06C1FDF247E599F4D9D6061B91?format=hex This further underlines your suggestion to handle these cases properly. — forgemo, Feb 15 '18 at 20:17

forgemo · Accepted Answer · 2018-02-17T18:20:17.867

I'm going to answer my own question. These are my conclusions based on the comments from Nate Eldredge and eponymous.

Assumption 1: This is an 'incomplete block'

Yes, this is an incomplete block. More specifically, the leftover bytes are not a valid record of the blk file format.

Assumption 2: This happens when a block can only be partially downloaded, or written to disk.

Yes, the evidence strongly suggests that this happens when bitcoind is interrupted or killed while writing to disk. When restarted, it just continues to download and write new blocks to the blk file. It doesn't clean up incomplete blocks.

Assumption 3: Deleting the affected blk files and re-downloading parts of the blockchain doesn't make sense, since this can be expected to happen again.

Yes and no. Deleting and re-downloading the blk files without interruptions does solve the issue for that copy of the data. A blk parser, however, should be able to handle these incomplete blocks regardless, for example by skipping to the next valid block.
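Skipping to the next valid block could look roughly like the sketch below. It assumes each record is laid out as 4 magic bytes, a 4-byte little-endian size, and then the block itself. The proof-of-work check used to decide whether a record is plausible is an assumption added here for illustration, not something taken from the discussion above, and the names `has_valid_pow` and `iter_blocks` are hypothetical.

```python
import hashlib
import struct

MAGIC = bytes.fromhex("f9beb4d9")  # mainnet magic bytes, as seen in the question
HEADER_LEN = 80                    # fixed size of a block header


def has_valid_pow(header: bytes) -> bool:
    """Heuristic: does the header hash meet the target encoded in its nBits field?"""
    (bits,) = struct.unpack_from("<I", header, 72)  # nBits sits at byte offset 72
    exponent, mantissa = bits >> 24, bits & 0x007FFFFF
    if exponent >= 3:
        target = mantissa << (8 * (exponent - 3))
    else:
        target = mantissa >> (8 * (3 - exponent))
    digest = hashlib.sha256(hashlib.sha256(header).digest()).digest()
    return int.from_bytes(digest, "little") <= target


def iter_blocks(buf: bytes):
    """Yield (offset, block_bytes) for each plausible record; skip everything else."""
    pos = 0
    while True:
        pos = buf.find(MAGIC, pos)
        if pos < 0 or pos + 8 > len(buf):
            return
        (size,) = struct.unpack_from("<I", buf, pos + 4)
        start, end = pos + 8, pos + 8 + size
        if (size >= HEADER_LEN and end <= len(buf)
                and has_valid_pow(buf[start:start + HEADER_LEN])):
            yield pos, buf[start:end]
            pos = end   # jump by length, so magic bytes inside the block are never scanned
        else:
            pos += 1    # not a usable record: resume the search one byte later
```

Because the parser advances by the declared size after every accepted record, magic bytes that were deliberately placed inside a transaction or header are never mistaken for a record start. Only after a rejected record does it fall back to scanning byte by byte.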

Assumption 4: The parser should be able to skip these blocks.

Yes. See #3

Assumption 5: The overall blk file is not missing actual data and is not corrupt.

I'm not sure, but since bitcoind just continues downloading blocks and works without any issues, one can assume that the incomplete block is simply re-downloaded afterwards.

Assumption 6: The 'incomplete block' will be downloaded and written again further down in the blk file (or to another blk file).

It seems so. See #5
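One way to test assumptions 5 and 6 rather than take them on faith is to check that every parsed block's parent is also present somewhere in the parsed data. The following is a minimal sketch, assuming the 80-byte headers from all blk files have already been extracted. The function names are hypothetical. It relies on the previous-block field occupying bytes 4 to 36 of the header, in the same byte order as the raw double-SHA256 digest.

```python
import hashlib


def block_hash(header: bytes) -> bytes:
    """Raw double-SHA256 digest of an 80-byte header (internal byte order)."""
    return hashlib.sha256(hashlib.sha256(header[:80]).digest()).digest()


def find_missing_parents(headers):
    """Return the set of previous-block hashes that no parsed header hashes to."""
    seen = set()
    parents = set()
    for header in headers:
        seen.add(block_hash(header))
        parents.add(header[4:36])   # previous-block hash field
    parents.discard(bytes(32))      # the genesis block has an all-zero parent
    return parents - seen
```

An empty result suggests that no block in the chain was lost to an incomplete write. The check uses sets, so it does not depend on the order in which blocks appear on disk.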