
I have a quite old mail server (CentOS 6) with 4+ TB of mail in maildirs (so you can imagine the number of files). It is a VM (VMware 7) on SSD storage.

I wonder if I should care about FS fragmentation at all? I can afford to rsync the whole storage to a new VM disk (effectively reducing fragmentation to almost zero), or run an online defragmentation tool (since this is XFS), but should I care at all? SSD-backed storage seems to forgive anything?

Alexander

4 Answers


How does defragmenting help?

When storing data on an HDD with a spinning disk, defragmentation significantly reduces the mechanical seek time penalty. Highly fragmented filesystems on those devices can become extremely slow. Solid state devices do not have any seek penalty. However, modern high-speed SSDs are often IOPS limited, meaning that the rate at which commands can be sent over the bus is the bottleneck, not the device's ability to access the actual data and service those requests.

Defragmenting a filesystem increases the number of contiguous logical sectors in a given file. This can improve performance, even for an SSD, because reading a small portion of a file can be done with a single I/O request. If that same portion were fragmented, it might take multiple requests. This could easily saturate the bus when using an extremely fast SSD with a heavily fragmented filesystem.
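
For a concrete illustration, `filefrag` (from e2fsprogs) can report how many extents, i.e. contiguous runs of logical sectors, a file occupies; the path below is only a placeholder:

    # One extent means the file's logical blocks are fully contiguous
    # and can be fetched with a single large read request.
    filefrag /path/to/some/message

    # -v additionally lists each extent's logical and physical ranges.
    filefrag -v /path/to/some/message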

However, there are better ways to improve performance.

Use TRIM instead

If you want to improve performance on an SSD, you have to help the SSD know which blocks can be erased by using TRIM. A TRIM command allows the filesystem to tell the SSD which sectors are empty and can be handed over to the garbage collector, even if they still have (unallocated) data on them. This greatly improves the device's ability to manage storage. On Linux, you can execute the command `fstrim /var/mail` (or wherever your partition is) to give the SSD a list of unallocated sectors that can be discarded. See `fstrim(8)` for more details on command usage.
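
As a minimal sketch of one-off and periodic trimming (the mount point is the one from the question; the systemd timer assumes a distro newer than the CentOS 6 mentioned there):

    # One-off TRIM of the filesystem mounted at /var/mail;
    # -v reports how many bytes were discarded.
    fstrim -v /var/mail

    # On systemd-based distros, util-linux usually ships a weekly timer:
    systemctl enable --now fstrim.timer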

TRIM greatly improves the ability of the SSD to counter write amplification and allows for better wear leveling. This will not speed up reads from the device, but it will speed up writes. If your drive has very little free space, there will be less benefit (after all, you wouldn't have many free sectors).

Solid state drives have their own form of defragmentation

Because the FTL (flash translation layer, a part of the flash controller) of an SSD is responsible for mapping logical sectors to physical sectors, there's no serious need to defragment a filesystem in most situations. The time it takes to defragment and the wear is generally not worth it.

SSDs will automatically attempt to optimize the storage of files. This optimization is subject to the limitations of NAND flash. To understand the limitations, we have to describe some terminology:

Cells are the smallest physical unit which stores data in a flash device. In SLC devices, each cell stores a single bit. In MLC devices, each cell stores two (by holding four different voltage levels). TLC devices store three (by holding eight), and QLC devices store four (by holding sixteen). Multiple cells, plus some extra to hold error-correcting codes that help recover from failures of individual cells, make up a page.

Pages are a collection of cells that can be read from or written to at once. A page is how NAND flash devices implement physical sectors, and most pages are between 512 bytes and 4 KiB in size. Pages can only be written to if they have been erased (all bits initialized to 1). If a page has data on it, it cannot be overwritten without first being erased. This is due to the fact that a single bit can only be changed from a 1 to a 0. The other way around can only be done by erasing an entire page. However, it's not possible to erase individual pages. To erase a page, you must erase the entire block.

Blocks are a collection of pages and are measured in tens or hundreds of kilobytes. 128 individual 4 KiB pages may make up a single 512 KiB block. The only action that can be taken on an individual block is erasure. This will erase all pages that it contains. Any data in any of those pages that needs to be preserved must be moved to another block that has some erased pages already. This is done by the flash controller transparently. If you have very few free blocks, the controller may need to move some data around in order to write anything, which slows down writes continuously.

This poses a problem: In order to most effectively optimize writes, the SSD needs to know which sectors the filesystem no longer needs. However, block storage devices don't even have a concept of filesystems. They can't distinguish between a sector that is no longer allocated to any file but still contains leftover data and a sector that is still part of a file and must be preserved. Because making SSDs aware of filesystems would be a Sisyphean task, it's better to simply tell the SSD which sectors the filesystem no longer needs. Modern SSDs support this via TRIM.
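
Before relying on TRIM, it's worth checking that every layer, the virtual disk included, actually passes discards through. A quick check, assuming the device is /dev/sda:

    # Non-zero DISC-GRAN and DISC-MAX columns mean the block device
    # advertises discard (TRIM) support.
    lsblk --discard /dev/sda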

forest
  • +1. Exciting that science has finally figured out the FTL drive, but tbh it's a bit underwhelming – gronostaj Dec 21 '22 at 15:25
  • You quite literally can't defragment an SSD unless you're using some weird drive that doesn't have its own controller, as the OS has no control over where on the drive the data is stored. – ScottishTapWater Dec 21 '22 at 15:57
  • `If a page has data on it, it cannot be overwritten without first being erased. This is due to the fact that a single bit can only be changed from a 1 to a 0.` My understanding (and correct me if I'm wrong) is this means that a page can be written to sequentially in several separate operations, so long as one only tries to change bits that have not been written yet (or more accurately, the write only tries to set bits that have not been cleared yet). – Michael Dec 21 '22 at 18:01
  • @ScottishTapWater: You can defragment the *filesystem* (XFS, EXT4 or whatever) stored on the logical block device exposed by the SSD firmware. As forest said, reading heavily fragmented files will require the OS to send more smaller requests to the SSD over SATA, SAS, or NVMe, and take more space for metadata for the extent list. You're correct that you can't ask the SSD to defrag its internal flash translation layer, but that's a separate thing and more or less orthogonal. The SSD controller internally has direct access to individual blocks and doesn't have to batch things into read cmds. – Peter Cordes Dec 21 '22 at 22:13
  • Not that it would usually be worthwhile to do it; most filesystems resist fragmentation well enough not to be a problem when used on SSDs. – Peter Cordes Dec 21 '22 at 22:14
  • @Michael, that is true, and I have seen applications that store small amounts of state sequentially into EEPROM like that to reduce the number of erase cycles (the latest save state is the last one that isn't all 1s). – Toby Speight Dec 22 '22 at 13:17
  • @TobySpeight That works for EEPROM but is very unlikely to work with typical SSDs. The reason is that each page has multiple extra cells used for error correction, and it is highly unlikely that a change to even a single bit would require no ECC bits to change from a 0 to a 1. Also, modern SSDs transparently encrypt all data with a feature called SED to make secure erasure easier (it just wipes the encryption key). This means that a single change to any bit in a page will result in an average of 50% of the other bits being toggled on SED-enabled drives. – forest Dec 22 '22 at 23:44
  • Yes @forest, that's why I mentioned that it was bare EEPROM. It's important that it doesn't have a translation layer, too. – Toby Speight Dec 23 '22 at 12:12
  • "Modern SSDs support this via TRIM": just to clarify, Trim was introduced by pretty much every operating system and SSD more than 10 years ago, so you can simply assume that Trim is available. – Fabio says Reinstate Monica Jan 09 '23 at 14:26
  • @FabiosaysReinstateMonica Many external USB SSDs lack TRIM because the USB controller doesn't pass the commands along. The same is true with most SD cards. – forest Jan 09 '23 at 21:38

You should not attempt to defragment a file or folder on an SSD. Defragmenting will not do anything useful for solid state storage and should not be used. "Fragmentation" does not really apply to the organization of cells in an SSD.

If you need to shrink (reduce the space occupied by the machine), you can do this in VMware using the shrink command.

I use shrink a lot, but you would have to try it with the older machine. It should work.

Open a terminal session, from there run:

sudo vmware-toolbox-cmd disk shrink /

Allow the shrink to finish. There is a terminal component that will reach 100%, and then a VMware GUI window will shrink the machine.

Shrink will recover unused space in the machine, but it will have little if any impact on performance.
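
As a sketch, assuming a reasonably recent VMware Tools / open-vm-tools build (check `vmware-toolbox-cmd help disk` on an older guest), you can also target a specific mount point rather than the root:

    # List the mount points the tools are able to shrink.
    sudo vmware-toolbox-cmd disk list

    # Shrink a specific mount point, e.g. a dedicated mail partition.
    sudo vmware-toolbox-cmd disk shrink /var/spool/mail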

John
  • I haven't specified: I care about the partition that holds the mail (like /var/spool/mail), not the root dir. That is, I simply wonder if storing millions of files may be faster if the FS is not fragmented on SSD. Seems not worth it at all. – Alexander Dec 19 '22 at 20:42
  • Solid state drives can store data in any of the cells and will not benefit from any particular file organization. – John Dec 19 '22 at 20:45
  • Each file being contiguous still helps a bit. – Deduplicator Dec 20 '22 at 15:45
  • @Deduplicator That is correct. Modern SSDs are frequently IOPS limited and not throughput limited. A defragmented file will require fewer IOs to read and defragmented free space will require fewer IOs to write. Whether that makes it worth defragmenting is another question. – David Schwartz Dec 20 '22 at 23:55
  • @Deduplicator although the statement is true, you have to go out of your way to create those scenarios, and the OP is talking about mail server archives. There's functionally no real "contiguous" data when dealing with that kind of data. If it's a large media collection, sure, but email archives? – Nelson Dec 21 '22 at 03:32
  • this is a common ***MYTH*** - sequential reads are still significantly faster than random reads, even on SSDs. The difference is much smaller than on traditional rotating hard drives, but sequential reads are much faster than random reads. – hanshenrik Dec 21 '22 at 10:50
  • The point is how the disk is provided by VMware: if the disk is just an abstraction of the real SSD, it's not guaranteed that a disk defragmented inside a VM is actually defragmented on the physical device. Also, frequent defragmentation will reduce the life of the SSD. Probably the best strategy is not to fill the filesystem to more than 80%, assuming the disk seen by the VM is fully provisioned; otherwise it will fragment anyway. – U. Windl Dec 21 '22 at 21:04
  • @hanshenrik what you said is true, but it does not translate to "go defrag SSD drives". Defrag **CANNOT** create the contiguous reads that give you the performance boost, because the SSD's sectors have a logical layer: a sector isn't necessarily at the actual physical location on the hardware, and there is a system in place to automatically move sectors around to distribute reads and writes. The bottom line is **defrag still doesn't help an SSD**, so don't do it. – Nelson Mar 17 '23 at 03:41
  • @Nelson actually it goes all the way down to the protocol level - the message to the drive "read 2048 bytes from logical sector 1" is faster than "read 512 bytes from logical sector 1, and 512 bytes from logical sector 3, and 512 bytes from logical sector 6, and 512 bytes from logical sector 9" - both approaches read 2048 bytes, but the first is faster than the second~ – hanshenrik Apr 05 '23 at 10:23

SSDs don't cost extra time to seek between distant parts of the disk, so the potential gains are very small. Like at best just a tiny reduction in CPU usage for the server, maybe fractions of a percent as a wild guess.

But it's a valid question; if reads for multiple small files can be coalesced at the block level into one larger read, that's fewer commands sent to the SSD, so you'll get more of its available bandwidth. At least that's a reasonable thing to hope for or wonder about. Same for reading one larger file.
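
If you'd rather measure than guess, a rough `fio` sketch along these lines shows the command-rate effect (the scratch file and sizes are arbitrary; don't point it at the live mail spool):

    # 4 KiB random reads at queue depth 1: bounded by per-command latency.
    fio --name=qd1 --filename=/tmp/fio.test --size=1G --rw=randread \
        --bs=4k --iodepth=1 --ioengine=libaio --direct=1 \
        --runtime=30 --time_based

    # 128 KiB sequential reads: the same bytes in far fewer commands.
    fio --name=seq --filename=/tmp/fio.test --size=1G --rw=read \
        --bs=128k --iodepth=1 --ioengine=libaio --direct=1 \
        --runtime=30 --time_based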

But it still probably doesn't justify actually doing anything, especially because XFS can only defrag individual files, not AFAIK group multiple small files into contiguous storage (in order of filename or otherwise).

Your files are mostly small and each individual file was probably written all at once, not appended later. You say you have maildir, one file per email message. Not mbox like traditional /var/mail, where it's one file per mailbox with messages appended as they arrive. In that case there might be fragmentation, although deleting a message involves rewriting at least the tail of the file from that point.

So that being the case, you probably don't have much fragmentation of the kind XFS can do anything about; most files probably only have one extent. You can check with `filefrag /var/mail/**/* | grep -v '1 extent found'` (in a shell where `**` recurses into subdirectories, e.g. bash with globstar enabled).

For more detail you can use `filefrag -e` to see the individual extents of a file, or since you're on XFS you can use `xfs_bmap` to get the same info. See a unix.SE answer for example output.
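
With millions of files, the glob above can overflow the kernel's argument-length limit; a `find`-based variant scales better, and XFS also offers a whole-filesystem fragmentation report (the device name is an example):

    # Per-file extent counts without hitting ARG_MAX.
    find /var/mail -type f -exec filefrag {} + | grep -v ': 1 extent found'

    # Read-only whole-filesystem fragmentation factor.
    xfs_db -r -c frag /dev/sdb1

    # If you do decide to defragment, xfs_fsr(8) works online, e.g.:
    # xfs_fsr -v /var/mail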


As @Roger Lipscombe comments, the physical flash cells that store a file's data might not be contiguous even if the logical block addresses are; the flash remapping layer inside the SSD might fragment at that level. But that's fine; the SSD firmware should be able to read the necessary parts of flash with minimal overhead; unlike the kernel having to send commands in some protocol (SATA, SAS, or NVMe), the firmware is directly connected to the raw flash and it's probably not slower to read 2 separate regions than one longer region.

Cristian Ciupitu
Peter Cordes
  • If SSDs don't cost extra time to seek between distant parts of the disk, why does the [Samsung 980](https://semiconductor.samsung.com/consumer-storage/internal-ssd/980/) 250GB SSD advertise a sequential read speed of up to 2900 MB/s and a random read (4KB, QD1) speed of 17,000 IOPS which is just 68 MB/s? – Cristian Ciupitu Aug 19 '23 at 18:55
  • @CristianCiupitu: Command overhead and latency. For QD1 (queue depth = one), the computer doesn't even send the request for the next block until after it gets the data for the previous block. This means it's not keeping the "pipeline" full, leaving gaps when the command-processing part of the SSD is idle, and other gaps when the flash-read part is idle. Higher queue depths can keep all parts working in parallel on different blocks, if it supports enough parallelism to bottleneck on the throughput of some part. – Peter Cordes Aug 19 '23 at 19:02
  • @CristianCiupitu: If one command requests only a 4K block, instead of say 32K or 128K, it takes 8x to 32x more command-processing throughput to keep up with the same MB/s throughput. And a greater queue depth of requests to not bottleneck on latency. (Throughput vs. latency considerations for a pipeline are similar to instruction processing inside a CPU (https://www.lighterra.com/papers/modernmicroprocessors/), or to requests over a network.) – Peter Cordes Aug 19 '23 at 19:04

Defragmenting a solid state drive does nothing to increase disk performance, as the seek/head movement times of drives with moving parts do not exist. It will also increase wear on the chips, which have a limited number of write cycles, resulting in reduced lifetime.

If you need to shrink a partition, you can do that with the drive unmounted.

SHawarden
  • SSD disks don't even necessarily represent the blocks linearly. Just because the blocks are next to each other in the filesystem doesn't mean they're next to each other on the SSD. – Roger Lipscombe Dec 20 '22 at 09:39
  • This is not true. An 8KB file that is contiguous will require one I/O to read. An 8KB file that is fragmented into two 4KB blocks will require two I/Os to read. Modern SSDs frequently hit their IOPS limit before they hit their data rate limit. – David Schwartz Dec 20 '22 at 23:56
  • "*If you need to to shrink a partition, you can do that with the drive unmounted.*" XFS doesn't shrink, it only supports online growing. – Wes Toleman Dec 23 '22 at 02:37