
I have a 9TB XFS partition on an mdadm RAID-5 array of four 3TB disks with a chunk size of 256KB.

When I created the partition, the optimal stripe unit and width values (64 and 192 blocks) were detected and set automatically, which xfs_info confirms:

# xfs_info /dev/md3
meta-data=/dev/md3               isize=256    agcount=32, agsize=68675072 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=2197600704, imaxpct=5
         =                       sunit=64     swidth=192 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=64 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

However, I was experiencing slow transfer speeds, and while investigating I noticed that unless I specifically mount the partition with -o sunit=64,swidth=192, the stripe unit is always set to 512 and the width to 1536. For instance:

# umount /dev/md3
# mount -t xfs -o rw,inode64 /dev/md3 /data
# grep xfs /proc/mounts
/dev/md3 /data xfs rw,relatime,attr2,delaylog,inode64,logbsize=256k,sunit=512,swidth=1536,noquota 0 0

Is this intended behavior? I suppose that I could just start mounting it with sunit=64,swidth=192 every time, but wouldn't that make the current data (which was written while mounted with sunit=512,swidth=1536) misaligned?

The operating system is Debian Wheezy with kernel 3.2.51. All four hard disks are Advanced Format disks (smartctl reports 512 bytes logical, 4096 bytes physical sector size). The fact that the values differ by a factor of 8 makes me wonder whether that is related, since it matches the ratio between the 512-byte logical and 4096-byte physical sector sizes.
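
(For completeness, the chunk size and the sector sizes above came from commands along these lines, with /dev/sda standing in for each member disk:)

# mdadm --detail /dev/md3 | grep -i chunk
# smartctl -i /dev/sda | grep -i 'sector size'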

Can anyone shed some light on this? :-)

Sauron
  • mount options can't move existing data relative to the underlying block device stripe geometry. Either your data on disk is aligned, or it isn't. Fortunately, alignment matters a lot more for writes on RAID5 than for reads. So it's a non-issue except for files like VM images, swap files, or other stuff that might be rewritten in-place (WITHOUT truncation, e.g. dd `conv=notrunc`). – Peter Cordes Mar 02 '15 at 07:00
  • See https://raid.wiki.kernel.org/index.php/RAID_setup#XFS for how to create XFS on RAID, if auto-detection of the underlying stripe geometry didn't work. – Peter Cordes Mar 02 '15 at 07:20
  • Large stripe sizes are appropriate for most stuff these days. 512k stripe width is reasonable. I/O commands sent to hardware can be done in fairly large units, so smaller stripe sizes would tend to lead to smaller hardware commands than would be optimal. There is some old stuff on https://raid.wiki.kernel.org/index.php/Performance, and some of the links are dead. Small chunks for RAID5 could be justified if you have a write-heavy workload that can manage to batch requests into sequential chunks up to a certain size, but not larger. Set chunk size to make a write cover a full stripe. – Peter Cordes Mar 02 '15 at 07:29

1 Answer


Your mystery multiply-by-8 is a units issue: xfs_info shows sunit/swidth in filesystem blocks (bsize, typically 4096 bytes), while the sunit/swidth mount options (with -o or in fstab) are specified in 512-byte units. Note the "blks" string after the sunit/swidth numbers in your xfs_info output. 4096/512 = 8, hence the mystery multiplier.
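
With your numbers (256KB md chunk, four-disk RAID-5, so three data disks), the same geometry expressed in both sets of units is:

sunit  = 256KB     = 262144 bytes = 64 * 4096-byte blocks (xfs_info)  = 512 * 512-byte units (mount)
swidth = 3 * sunit = 786432 bytes = 192 * 4096-byte blocks (xfs_info) = 1536 * 512-byte units (mount)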

man 5 xfs spells this out in its sunit entry, and man mkfs.xfs likewise notes that these values are given in 512-byte units.

man xfs_growfs, which doubles as the manpage for xfs_info, spells out that xfs_info reports these values in filesystem blocks of bsize bytes.

Confusing, yes. Very bad design choice from a UI perspective, yes.
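
You can check that the two sets of numbers describe the same physical geometry without doing the arithmetic by hand; something along these lines (using the device and mountpoint from your question) should come out to 262144 bytes both ways:

# xfs_info /dev/md3 | grep sunit        # 64 and 192, in 4096-byte blocks
# grep xfs /proc/mounts                 # 512 and 1536, in 512-byte units
# echo $((64 * 4096)) $((512 * 512))    # both print 262144 = your 256KB chunk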

Specifying "-o sunit=64,swidth=192" was probably a bad idea, as really you wanted 64/8=8 and 192/8=24. You may have "hardcoded" the 8-times-greater values into the FS now having mounted them with the larger numbers. The man page is pretty explicit about never being able to switch to a lower sunit. However, you could probably try, and see if you get mount errors. Mount for XFS should (but no guarantees) be robust enough to not eat your data: it should just spit out an error and refuse to mount, or mount with sane options ignoring what you specify. Make backups first.

That said, there may actually be nothing wrong with an 8-times-greater sunit/swidth, as this is all about alignment, and a value that is a whole multiple of the real stripe unit is still aligned to it; at worst you might see some extra fragmentation, or inefficiency if most of your files are tiny. An 8-times-smaller value is the more worrying direction, since alignment to a smaller unit does not imply alignment to the real 256KB chunk.

Aside: what I am working on now, and finding intriguing, is what to change the sunit/swidth values to when you grow/reshape an md RAID by adding one disk. From the man page it appears you cannot change sunit unless you literally double the number of disks, but changing swidth does seem possible. Whether this results in proper alignment in most cases remains to be seen; information from people actually doing this seems scarce.
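
For what it is worth, my rough plan for the add-one-disk case looks like this. It is only a sketch, assuming your 256KB chunk stays the same and a hypothetical new member /dev/sde1; swidth would go from three to four data disks, i.e. from 1536 to 2048 in mount units:

# mdadm /dev/md3 --add /dev/sde1
# mdadm --grow /dev/md3 --raid-devices=5
# cat /proc/mdstat                       # wait for the reshape to finish first
# xfs_growfs /data
# umount /data
# mount -t xfs -o inode64,sunit=512,swidth=2048 /dev/md3 /data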

  • See http://xfs.org/index.php/XFS_FAQ#Q:_How_to_calculate_the_correct_sunit.2Cswidth_values_for_optimal_performance. The mount options are in units of 512B, so the correct setup for his HW is `sunit=256*1024/512=512`, and `swidth=sunit*3=1536` (three data disks in a four-disk RAID-5). – Peter Cordes Mar 02 '15 at 07:07
  • re: reshaping after adding a raid disk. Correct, sunit doesn't change, just the swidth. sunit only changes if you `mdadm --grow --chunk something_new`. And don't worry. If you get it wrong, data and metadata will write slower while you have a FS mounted with geometry that doesn't match the underlying storage, but there is no chance of that causing data loss. And little chance of causing lower read performance when you use the data later. – Peter Cordes Mar 02 '15 at 07:08
  • Also, hey, a Cordes. I know comments aren't the right place for a discussion, but I pretty much never actually run into anyone else with the same last name, even online. – Peter Cordes Mar 02 '15 at 07:10