
There are two sparse files that diff proves identical, but the comparison took 20 minutes, which is far too long. I am thinking of tarring them into tiny archives to speed up the comparison, but they tar into different outputs.

They are huge 512 GB sparse files containing only around 40 KiB of meaningful data.

% ls -l sparse_file_one/
total 40
-rw-r--r-- 1 midnite midnite 512711720960 Mar  4 23:12 sdd.img
% ls -l sparse_file_two/
total 48
-rw-r--r-- 1 midnite midnite 512711720960 Mar  4 23:13 sdd.img

% du sparse_file_one/sdd.img
40      sparse_file_one/sdd.img
% du sparse_file_two/sdd.img 
48      sparse_file_two/sdd.img
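(Note that GNU du reports 1 KiB blocks by default, so the 40 and 48 above mean roughly 40 KiB and 48 KiB actually allocated on disk. A quick way to see apparent versus allocated size side by side, sketched on a small stand-in file with a hypothetical name, assuming GNU coreutils:)

```shell
# Create a small sparse file: truncate makes a hole-only file, then
# write a few bytes somewhere in the middle (conv=notrunc keeps the size).
truncate -s 1M demo.img
printf 'data' | dd of=demo.img bs=4096 seek=10 conv=notrunc status=none

du --block-size=1 --apparent-size demo.img   # apparent size: 1048576 bytes
du --block-size=1 demo.img                   # allocated size: far smaller
```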

The diff comparison takes 20 minutes and proves them identical.

% diff -qs --speed-large-files sparse_file_one/sdd.img sparse_file_two/sdd.img | pv
68.0 B 0:20:57 [55.4miB/s] [     <=>                                                     ]
Files sparse_file_one/sdd.img and sparse_file_two/sdd.img are identical

As their du disk usages differ, I looked into filefrag and confirmed that their on-disk representations differ.

% filefrag -v sparse_file_one/sdd.img
Filesystem type is: ef53
File size of sparse_file_one/sdd.img is 512711720960 (125173760 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..       0:    6866944..   6866944:      1:            
   1:     8192..    8194:    6852608..   6852610:      3:    6875136:
   2:    12288..   12288:    6854656..   6854656:      1:    6856704:
   3:    16384..   16384:    6868992..   6868992:      1:    6858752:
   4:    16448..   16449:    6869056..   6869057:      2:            
   5:    16512..   16512:    6869120..   6869120:      1:             last
sparse_file_one/sdd.img: 4 extents found

% filefrag -v sparse_file_two/sdd.img
Filesystem type is: ef53
File size of sparse_file_two/sdd.img is 512711720960 (125173760 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..       0:    6871040..   6871040:      1:            
   1:     8192..    8195:    6856704..   6856707:      4:    6879232:
   2:    12288..   12288:    6858752..   6858752:      1:    6860800:
   3:    16384..   16384:    6860800..   6860800:      1:    6862848:
   4:    16448..   16449:    6860864..   6860865:      2:            
   5:    16512..   16512:    6860928..   6860928:      1:            
   6: 125173759..125173759:  132128862.. 132128862:      1:  132018175: last,eof
sparse_file_two/sdd.img: 5 extents found

tar completes promptly; it takes practically no time. But the tar outputs differ in size, so naturally they do not compare identical.

% cd ../sparse_file_one/

sparse_file_one % tar -cvSf sdd.img.tar --mtime=@0 sdd.img | pv
tar: Option --mtime: Treating date '@0' as 1970-01-01 08:00:00
sdd.img                                                  
8.00 B 0:00:00 [26.2KiB/s] [  <=>                                              ]

sparse_file_one % ls -l
total 80
-rw-r--r-- 1 midnite midnite 512711720960 Mar  4 23:12 sdd.img
-rw-r--r-- 1 midnite midnite        40960 Mar  5 00:22 sdd.img.tar

% cd ../sparse_file_two 

sparse_file_two % tar -cvSf sdd.img.tar --mtime=@0 sdd.img | pv
tar: Option --mtime: Treating date '@0' as 1970-01-01 08:00:00
sdd.img
8.00 B 0:00:00 [ 520KiB/s] [  <=>                                              ]

sparse_file_two % ls -l
total 100
-rw-r--r-- 1 midnite midnite 512711720960 Mar  4 23:13 sdd.img
-rw-r--r-- 1 midnite midnite        51200 Mar  5 00:23 sdd.img.tar

(With reference to this post, zeroing out the mtime makes the tar archives identical. I could make identical archives from other identical sparse or non-sparse files, but this behaviour is apparently not guaranteed.)
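Beyond mtime, a few more GNU tar flags help normalize archive metadata (a sketch on a small stand-in file; `--sort=name` needs GNU tar >= 1.28, and byte-identical output is still not formally guaranteed across tar versions):

```shell
# Sketch: normalize owner, group, ordering, and mtime so two archives of
# the same content come out identical even when the file's timestamps differ.
truncate -s 1M sdd.img
printf 'data' | dd of=sdd.img bs=4096 seek=2 conv=notrunc status=none

touch -d '2020-01-01' sdd.img
tar --sort=name --owner=0 --group=0 --numeric-owner --mtime='@0' \
    -cSf first.tar sdd.img

touch -d '2021-06-15' sdd.img
tar --sort=name --owner=0 --group=0 --numeric-owner --mtime='@0' \
    -cSf second.tar sdd.img

cmp first.tar second.tar && echo 'archives identical'
```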

(According to this post, if I could extract the content of a sparse file in under 10 minutes, it would be faster to verify they are identical. But I do not know Python; it would be nice if some native Linux program could do it.)

PS - I would prefer diff over cmp for the possibility of recursive directory comparison.

midnite
    What if you `fallocate --dig-holes` each file prior to using `tar`? – Kamil Maciorowski Mar 04 '22 at 17:48
  • @KamilMaciorowski - Bingo! You are a genius. They are identical now. Things get more interesting: I have a **third** file. It originally tarred into 71680 bytes. After running `fallocate -d` **twice**, it tars into an identical archive. If `fallocate -d` is run only **once** on the third file, its tar output does not match any of the previous 4 tar outputs (first file, second file, before/after `fallocate -d`). FYI, the first file's archives before and after `fallocate -d` are identical. Apparently there were extra holes in the second and third files. **But why do we need to run dig-holes twice for a certain file?** – midnite Mar 04 '22 at 19:02
  • 1
    No idea so far. There is another potential problem (although possibly not in your case): the result of `--dig-holes` depends on the block size of the filesystem. – Kamil Maciorowski Mar 04 '22 at 19:07
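Kamil Maciorowski's `--dig-holes` suggestion can be sketched end-to-end on small stand-in files (hypothetical names and sizes; assumes a filesystem that supports hole punching, e.g. ext4 or XFS, and GNU tar, whose byte-identical output is not formally guaranteed):

```shell
set -e
mkdir -p one two
truncate -s 1M one/sdd.img
truncate -s 1M two/sdd.img
# Same logical content in both files...
printf 'data' | dd of=one/sdd.img bs=4096 seek=100 conv=notrunc status=none
printf 'data' | dd of=two/sdd.img bs=4096 seek=100 conv=notrunc status=none
# ...but give the second file extra allocated-but-zero blocks, mimicking
# the differing extent layouts shown by filefrag in the question.
dd if=/dev/zero of=two/sdd.img bs=4096 seek=150 count=8 conv=notrunc status=none

# Punch holes wherever blocks are entirely zero, normalizing both layouts.
fallocate --dig-holes one/sdd.img
fallocate --dig-holes two/sdd.img

(cd one && tar -cSf ../one.tar --mtime=@0 sdd.img)
(cd two && tar -cSf ../two.tar --mtime=@0 sdd.img)
cmp one.tar two.tar && echo 'tars identical'
```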

1 Answer


I think this tool I made might be useful to you: https://github.com/ArthurMLago/sparsediff

I had to compare huge sparse files too: 60 GB apparent size, a couple hundred KiB of real disk usage. I never found a good solution online, so I ended up writing my own application, which uses lseek with SEEK_HOLE and SEEK_DATA to efficiently locate the relevant sections of the first file and compare them with the second. The output was inspired by hexdump -C and is intended for binary files.

  • You should make the declaration that it's your application a little more prominent. [Self Promotion](https://superuser.com/help/behavior) – Rohit Gupta Oct 19 '22 at 03:57