
I am recreating some RAID5 disks as RAID6 with mdadm. There is no data on the drives that I care about.

Setting up RAID takes a while to build out the array - I accept that when there is data that needs to be striped and parity calculated, but these drives are empty - or at least I want them to be considered empty.

So is there a way to skip the parity calculation and tell mdadm to just set up the superblocks and be done? If not, what exactly is it spending all this time on when there is no data to move around?

md3 : active raid6 sdf3[5] sde3[4] sdd3[3] sdc7[2] sdb3[1] sda3[0]
      1953114112 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/6] [UUUUUU]
      [>....................]  resync =  1.3% (6790144/488278528) finish=409.3min speed=19604K/sec

Note that I am not talking about --assume-clean where you are rebuilding an array from a pre-existing set of disks that contain a RAID array that you know is correct. I am talking about an array that should be considered empty, not considered correctly striped.

So let's say for the sake of this question that the devices have been pre-populated with zeros.
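
For concreteness, the zeroing could be done with something along these lines (device names taken from the mdstat output above; adjust to your layout):

    # overwrite each future member with zeros before creating the array
    for dev in /dev/sda3 /dev/sdb3 /dev/sdc7 /dev/sdd3 /dev/sde3 /dev/sdf3; do
        dd if=/dev/zero of="$dev" bs=1M status=progress
    done
    # on SSDs, "blkdiscard -z /dev/sdXn" is a faster way to get zeroed blocks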

Paul
  • The proper use is to just ignore that the disks are updating the checksums. You can still use the RAID while it does that. You could even lower the priority of that checksum writing so the load on the IO buses and disk subsystem is lower. You have to look that up, as I don't remember how I did that. :-) – Anders Feb 07 '22 at 15:58

3 Answers


You can use --assume-clean, but unless you are using raid5 (not raid6) and the disks actually are full of zeros, the first time it runs a parity check it will come up with errors that will need to be corrected, so you should not do this. You don't need to wait for the resync to finish before you can start using the array; it will chug along in the background until it is done.
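
As a rough sketch, the creation step would look something like this (device names taken from the question; the options are illustrative, not a recommendation):

    # create the raid6 as already clean, so no initial resync is scheduled
    mdadm --create /dev/md3 --level=6 --raid-devices=6 --chunk=512 \
          --assume-clean /dev/sda3 /dev/sdb3 /dev/sdc7 /dev/sdd3 /dev/sde3 /dev/sdf3
    # a later scrub will still flag (and fix) any stripe whose parity does not
    # actually match what is on the member disks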

psusi
  • Thanks psusi - this doesn't address the question. – Paul Jun 18 '12 at 23:13
  • @Paul, umm... yes, it does. There is no such thing as "empty"; there is only in sync, or not in sync. – psusi Jun 19 '12 at 02:17
  • It doesn't address the question I am asking. The xor of a block of zeros is zero. So if I could tell mdadm that the space can be considered empty, it could create parity very quickly. The parity would only need to be calculated properly when a specific block is written to, at which point it doesn't matter that it was "wrong" previously. I could even actually zero the disk. – Paul Jun 19 '12 at 03:09
  • @Paul, again, there is no such thing as "empty". Generally yes, it won't matter that the parity on stripes that have never been written is wrong, which is why you can get away with --assume-clean, even when the disks are not full of zeroes, but the wrong parity will be detected and corrected when mdadm does a parity check. – psusi Jun 19 '12 at 03:18
  • Is this an issue of semantics I am missing? When I say "empty" I mean without data, and more specifically, with zeros. So parity of zeros is zero. If I can assert to mdadm that something contains zeros and it actually does contain zeros, there is no wrong parity to correct during a parity check. – Paul Jun 19 '12 at 03:27
  • @Paul, the meaning of "empty" is not well defined. Most people use it to mean they have not put any files or a filesystem on the device, and do not know or care what it currently contains. If you know the disks are all full of zeros, then --assume-clean is what you want. It will take your word for it that the disks are all zeroed and not recompute the parity, and as you mention, a parity of zero will be correct for data disks of all zeros, at least for raid5 (not for raid6). – psusi Jun 19 '12 at 03:35
  • If you update your answer with this clarification I'll accept it. Btw, I reckon even with raid6 it wouldn't take much to do the same - while the parity wouldn't be zeros, it would be the same for each block. So mdadm could have a feature where you say "I promise there is nothing but zeros" and mdadm would do reed-solomon codes based on that promise. I imagine there isn't much demand for this though. – Paul Jun 19 '12 at 04:28
  • There is a good reason to let the RAID write all checksums, and that is to check early for bad sectors on the disks. I recommend that you look into blogs written by the engineers and computer scientists who wrote ZFS. There is a lot of useful info about large disk systems and the unique problems they tried to solve. And it was a lot of fun reading too. :-) – Anders Jun 19 '12 at 11:34
  • @Paul, edited... – psusi Jun 19 '12 at 13:13
  • @psusi, Paul did define the meaning of "empty" by stating the disks to just contain sectors set to all zero. – PoC Nov 26 '21 at 10:58
  • @PoC that is JUST Paul's definition of "empty", not `mdadm`'s definition of empty. So no, that definition isn't valid when it comes to `md` raids. As Paul said, there is probably not much demand for that functionality, as the checksums can be calculated WHILE you use the raid. No need to wait for them to be calculated. They will ALSO be calculated when you write to the disk. – Anders Jan 26 '22 at 17:47
  • @Anders: Since the mdadm bitmap seems to be optional, I assert that mdadm has no concept of "empty", only of block checksum (mis)matches. Checksum calculation also means heavy write load on SSDs, which is not exactly desirable. And online checksum validation gives a notable performance hit. – PoC Jan 27 '22 at 18:28
  • Sorry @PoC, I mixed up user @Paul with @psusi. You are right in your statements. As I understand it we do agree with @psusi :-) `mdadm` has no concept of empty records, just whether the checksum on a record is correct or not, as you write. Sorry again. – Anders Feb 07 '22 at 15:55

In general, a newly created array to enable device redundancy on zeroed disks would not need any prior syncing, as long as the checksum (or copy, for RAID1) of those zeroed input blocks is also zero. There is no functional difference in how a block is zeroed: prior to RAID creation, or through the process of the RAID sync. So, indeed, --assume-clean is what can safely be used to skip the time-consuming (and, on SSDs, wear-inducing and thus undesired) (re)write of blocks from zero to zero.

To my understanding, the mdadm write-intent bitmap is a device-local (not array-local) indicator of the consistency of the individual devices with each other. I'm not sure whether the bitmap itself is used as an indicator of inconsistency at the array level, i.e. if all bitmap bytes are zero, the array can be assumed in sync; if not, checksums must be rewritten (or data copied, for RAID1).

Within the constraints of the assumptions outlined above, the safest approach to create an array without needing a prior sync for full redundancy seems to me to create it on guaranteed-zeroed disks with --assume-clean --bitmap=none, and, if desired, add a bitmap in a second step. This provides consistency without a sync in any case, is also safe in degraded mode, and also gives a clean result from a checkarray run. Again, this is true only for RAID levels where the calculated checksum of zeroes is also zero, or for RAID1 where a copy of a zero also yields a zero.
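
A minimal sketch of that two-step approach, assuming guaranteed-zeroed members and hypothetical device names:

    # step 1: create the array as already clean and without a write-intent bitmap
    mdadm --create /dev/md0 --level=6 --raid-devices=6 \
          --assume-clean --bitmap=none /dev/sd[b-g]1
    # step 2 (optional, later): add an internal bitmap to the running array
    mdadm --grow /dev/md0 --bitmap=internal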

Here comes some speculation. I don't know enough about the inner workings of mdadm to know for sure what happens if non-zeroed disks are used with --assume-clean --bitmap=none. So take the following statements with caution.

Assuming checksum calculation for reads is done in degraded mode only (very likely, for performance reasons), it's even safe not to zero the disks before bundling them into an array: checksums of blocks will be corrected lazily, after each write to the array. Data blocks that have never been written through the array (and thus have a mismatched checksum) can be considered unimportant: from a file system's PoV, they are "free space". And because reads of unallocated blocks do not trigger a checksum fault, there should be no functional difference from reading unallocated blocks off a single disk, for whatever reason.

The same holds for RAID1: data that has already been written is consistent on all mirror members; never-written data giving inconsistent reads doesn't matter.

If a partially written array is used in degraded mode, already written data has correct checksums/copies and can thus be reconstructed correctly. All free blocks still don't matter: if mdadm returns garbage when recalculating the checksum of never-written blocks, it's just different garbage, and still irrelevant because it is not in use by the file system.

In short: the filesystem keeps track of allocated blocks. Since these blocks are written to the array before they eventually need to be re-read, the data stays consistent.

Regarding checkarray, it cannot know which blocks have ever been written, so it will need to correct all not-yet-written blocks, be it via checksums or just a copy as with RAID1. Unless the write-intent bitmap plays a more important role than I anticipate, that is.
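
For reference, such a scrub can be triggered and inspected through sysfs (md0 being a placeholder; Debian-based systems wrap the same mechanism in /usr/share/mdadm/checkarray):

    # start an online consistency check of the whole array
    echo check > /sys/block/md0/md/sync_action
    # once it finishes, this counter holds the number of mismatched sectors
    cat /sys/block/md0/md/mismatch_cnt
    # "echo repair" instead of "check" would also rewrite the bad parity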

What I did not yet mention is the problem of software bugs, corrupted file systems through power outages, and faulty disk sectors. Possible scenarios and effective mitigations (such as the data=ordered mount option for ext4) are left as an exercise to the reader.

PoC
  • Some assumptions are being made here with other people's data at stake. But as long as that is clear, then they are on their own if someone wants to try this, I guess. The only way to really know is to ask the `md` developers, I guess. – Anders Jan 26 '22 at 17:57

You can't do this with a software or hardware RAID. All checksums need to be written to the disks, which takes time. You can do it later, but then the parts of the disk that haven't been written to will have to get that done before you can use them.

This is basically because the RAID system and the file system don't know a thing about each other. ZFS has a solution for this, but there the RAID layer is deeply integrated with the file system, so the RAID subsystem actually knows which parts of the disks are used to store data and which can be used later, and only then writes the checksums to them.

You can raise the resync throughput of the software RAID, or you can start using the RAID before all checksums are written and let the software RAID handle this for you later, which is what @psusi wrote.
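
For the throughput part, the usual knobs are the kernel's md resync speed limits, roughly like this (values are in KiB/s and apply system-wide):

    # show the current background resync bandwidth limits
    cat /proc/sys/dev/raid/speed_limit_min /proc/sys/dev/raid/speed_limit_max
    # raise the minimum so the initial sync finishes sooner...
    echo 100000 > /proc/sys/dev/raid/speed_limit_min
    # ...or lower the maximum so the array stays responsive while it syncs
    echo 10000 > /proc/sys/dev/raid/speed_limit_max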

Anders
  • I understand how RAID works, I am trying to understand what takes place with checksums on an empty disk. Checksums need to be calculated every time something is written to disk, so calculating them before any data is present doesn't make much sense - "doing it later" is sensible when there is nothing there. There isn't a file system to consider in this scenario. – Paul Jun 19 '12 at 01:10
  • There's *always* data present. It might not be important or meaningful, but every sector always has a value. (Zero is a value too!) The RAID implementation doesn't know which sectors contain important data, so it has to treat them all as important and calculate their parity. – Wyzard Jun 19 '12 at 04:42
  • @Wyzard Paul was trying to explain that from a human viewpoint, it's just dumb to read zero bytes from each RAID disk, calculate the checksum (zero), and overwrite zeros in a parity strip with that calculated zero value. – PoC Nov 26 '21 at 11:02
  • @PoC and the software RAID system `md` DOES NOT CARE what the human put on the RAID; it doesn't know. That is not the case with ZFS, which KNOWS about both the RAID and the file system, so it can distinguish between data and no-data sectors on the RAID. That is, Paul's definition and assumptions are plain wrong when it comes to `md` software RAIDs. – Anders Jan 26 '22 at 17:51
  • @Anders: This entire discussion revolves around mdadm. I don't understand why you suddenly introduce ZFS. – PoC Jan 27 '22 at 18:29
  • @PoC because that is a good example where the system actually KNOWS which sectors in the RAID are used and which aren't used to store data. `mdadm` does NOT know which sectors hold valid data and which have never been written to and don't contain valid data. That is why I used ZFS as an example. You can look into BTRFS, which has similar features to ZFS (which `mdadm` doesn't have). – Anders Feb 07 '22 at 15:43
  • @Anders "The system" is a really blurry term for comparison purposes. Any file system also knows which blocks/sectors in the backing store are in use, no matter if a single disk is employed, or a RAID. The only difference which matters is that ZFS is integrating RAID and File System functionality as one entity. As far as I've understood, that is. Still, I assert that there's no need for mdadm to know about truly "occupied" blocks for the sake of the whole discussion. – PoC Feb 08 '22 at 12:53
  • Yes, FILE systems know which sectors they are using. RAID doesn't know this. File systems like ZFS and BtrFS have merged file system and RAID into one system, so there the file system and RAID know about each other, and the concept of an "empty sector" actually works, because that is how they are designed. So that is why the original question doesn't work here, but it works with ZFS and BtrFS. And yes, it IS a difference in how MD is designed. – Anders Feb 09 '22 at 14:23
  • @PoC but I can agree that we don't agree on this one. ;-) – Anders Feb 09 '22 at 14:23