
I would like to find the size difference, in bytes, between the files in two directories. However, du with -a (and therefore the diff of its output) also lists directories and subdirectories. I want only the files inside those directories, not the directories themselves.

I know about the --exclude option, but I don't know how to use it for this. Thanks.

My OS is Debian Linux.

My command is:

dira=/mnt/hdd_a/;
dirb=/mnt/hdd_b/;
diff -u <(cd $dira && du -ab | sort -k2) <(cd $dirb && du -ab | sort -k2)

I also cannot fully understand the output. Directories are marked with + or - for several reasons, I suppose (e.g. attributes); I don't care about those. However, among hundreds of files, diff prints some files without + or -. Why? Might they differ in some attribute other than size?

--- /dev/fd/63  2023-08-22 01:38:15.775099368 +0300
+++ /dev/fd/62  2023-08-22 01:38:15.775099368 +0300
@@ -1,6 +1,6 @@
-364123856483   .
+364123860579   .
 435823780  ./vid_01.mkv
-33781164566    ./news_a
+33781168662    ./news_b
 19110023   ./news_c/covers_09.rar
 161634304  ./news_d/video_d7.avi
 17080320   ./news_e/video_d17.avi

As I understand it, the -u option prints 3 lines of context. I want all the lines that differ, and only those, not the identical lines (identical file sizes).

Using diff --changed-group-format='%<' --unchanged-group-format='' <(cd "$dira" && du -ab | sort -k2) <(cd "$dirb" && du -ab | sort -k2)

prints some files with their sizes but without any "+/-" indication, so I cannot tell whether a line comes from the source or from the destination. Note that some files are missing entirely from the destination.

Estatistics

1 Answer


du/diff command with -a list also directories and subdirectories. I want only the files in subdirectories and directories, not these ones.

I know about --exclude option, but i dont know how to manipulate it to do that. thanks.

If I understand you correctly, you only want to see the sizes of all files in the directory tree, not the total size of the contents of any directories themselves. Unfortunately, the --exclude option of du doesn't appear to support using something like / to indicate directories, e.g. du --exclude='*/' will still output the sizes of directories.

Instead of filtering out directories with du's own options, you can use find to produce a list of regular files only (its -type f option) and pass that list to du with the aid of xargs. By default, find prints each filename on its own line, and xargs splits its input on whitespace (spaces, tabs, newlines) and also interprets quotes and backslashes, so filenames containing any of those characters would be mangled. To avoid this, we tell find to terminate each filename with a NUL character using -print0, and tell xargs to expect such input with -0:

find . -type f -print0 | xargs -0 du -b
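
As a possible alternative, assuming GNU find (the default on Debian), the sizes can be printed by find itself with -printf, which avoids xargs and du altogether. The function name list_file_sizes below is made up for illustration, and this is only a sketch:

```shell
# list_file_sizes DIR  (hypothetical helper name)
# Prints "size<TAB>relative/path" for every regular file under DIR,
# sorted by path. Relies on GNU find's -printf:
#   %s = size in bytes, %P = path relative to DIR.
list_file_sizes() {
    tab=$(printf '\t')
    # -t sets the field separator to a tab, -k2 sorts on the path field;
    # LC_ALL=C gives a plain byte-wise, locale-independent order.
    find "$1" -type f -printf '%s\t%P\n' | LC_ALL=C sort -t "$tab" -k2
}
```

Two such listings can then be compared as in the question, e.g. diff <(list_file_sizes "$dira") <(list_file_sizes "$dirb"). A filename containing a newline would still span several lines of this output.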

I would like to find difference in bytes in files. [...] diff prints some files without + or -. Why? They may differ in some other attribute except size?

To get the difference in bytes, you need to compare the sizes of each pair of files directly. The diff command does not do this. Rather, diff compares the contents of two files line by line, e.g. if file a.txt contains the following...

a
b
c

... and file b.txt contains the following...

a
b
d

then diff a.txt b.txt outputs this:

3c3
< c
---
> d

This tells you that the two files differ as follows: line 3 of a.txt was changed into line 3 of b.txt (that is what 3c3 means); the line marked < (c) comes from a.txt and the line marked > (d) comes from b.txt.

Using diff with the -u option causes it to print its output in the "unified" format, which is what the patch command consumes and is similar in style to the patches produced by other tools, such as Git. That is, diff -u a.txt b.txt gets you this instead:

--- a.txt   2023-08-22 00:38:07.477617454 +0100
+++ b.txt   2023-08-22 00:38:12.533616240 +0100
@@ -1,3 +1,3 @@
 a
 b
-c
+d

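This should help you understand why you are seeing + and - in the output of the command you have run. Specifically, cd $dira && du -ab | sort -k2 outputs the sizes of the contents of $dira, sorted by item name, and thus diff -u <(...) <(...) takes two such outputs and shows you the differences between those outputs. Lines preceded by - appear only in the listing for $dira, and lines preceded by + appear only in the listing for $dirb. Because each line contains both a size and a name, an item that exists in both trees with different sizes shows up as a -/+ pair, while an item missing from one tree shows up as a single - or + line. Lines preceded by a space are context: they are identical in both listings and are printed only to show where the changes sit (3 lines around each change by default with -u). That is why some files appear without + or -.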
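
If the goal is to print only the lines that differ, each tagged with the side it came from, GNU diff's line-format options are one way to do it. The following is a minimal sketch assuming GNU diffutils; the function name show_changed_lines is hypothetical:

```shell
# show_changed_lines OLD NEW  (hypothetical helper name)
# Prints lines found only in OLD prefixed with "- " and lines found only
# in NEW prefixed with "+ ". Lines common to both files are suppressed.
# %L stands for the line's contents including its trailing newline.
show_changed_lines() {
    diff --old-line-format='- %L' \
         --new-line-format='+ %L' \
         --unchanged-line-format='' \
         "$1" "$2"
}
```

With the a.txt and b.txt from above, show_changed_lines a.txt b.txt prints "- c" followed by "+ d". As with plain diff, the exit status is 1 when the inputs differ.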


The diff command does not do anything more intelligent, such as directly showing you the difference in file sizes between specific pairs of files in $dira and $dirb. For that, you need to somehow specify which pairs of files you'd like to compare the sizes of.

For example, if you want to compare the sizes of $dira/news_a and $dirb/news_b, then you should do so directly. If you only want to compare pairs of files whose relative paths are exactly the same in both trees, e.g. $dira/news_a and $dirb/news_a, then this can be done programmatically, as in the following Bash script, which takes the two directories as its arguments:

#!/bin/bash

script_location="$( dirname "$(readlink -f "${BASH_SOURCE:-$0}")" )"

dir_a="$1"
dir_b="$2"

cd "$dir_a"
dir_a_filenames="$(find . -type f)"

cd "$script_location"
cd "$dir_b"
dir_b_filenames="$(find . -type f)"

# Combine filename lists
all_filenames="$( sort -u <(echo "$dir_a_filenames") <(echo "$dir_b_filenames") )"

# For each filename in $all_filenames, compare the size of that file in $dir_a with the same file in $dir_b
IFS=$'\n'
cd "$script_location"
for file in $(echo "$all_filenames"); do
    file_a="$dir_a/$file"
    file_b="$dir_b/$file"

    file_a_size="$(if [ -f "$file_a" ]; then stat --format='%s' "$file_a"; else echo 0; fi)"
    file_b_size="$(if [ -f "$file_b" ]; then stat --format='%s' "$file_b"; else echo 0; fi)"
    size_diff=$(($file_b_size - $file_a_size))

    echo -e "$file\tA size = $file_a_size\tB size = $file_b_size\tSize difference = $size_diff"
done

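The IFS shell variable defines which characters the shell uses to split the results of unquoted expansions into words, such as the $(...) in the for loop above. Here, we set it to the newline character, $'\n', so that filenames containing spaces or tabs stay intact, for a similar reason as we used NUL delimiters with xargs earlier. Filenames that themselves contain newlines would still be split.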
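
The effect of IFS on word splitting can be seen in isolation with a small sketch like the following; count_items is a hypothetical helper written only for this demonstration:

```shell
# count_items DELIMITERS STRING  (hypothetical helper name)
# Prints how many words STRING is split into when IFS is DELIMITERS.
count_items() {
    local IFS="$1"   # local, so the caller's IFS is left untouched
    set -f           # disable globbing so * or ? in the data are not expanded
    set -- $2        # unquoted on purpose: word splitting happens here
    set +f
    echo "$#"
}
```

For a string holding the two names "my file.txt" and "other.txt" separated by a newline, splitting on the default space, tab and newline yields 3 words, whereas splitting on newline alone yields the intended 2.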

We use stat instead of du to get the file sizes, since it is a bit quicker, and we treat the sizes of non-existent files as being zero for the purposes of reporting their size and calculating the size differences; the test [ -f filename ] checks whether filename exists and is a regular file. Note that this makes a missing file indistinguishable from an empty one in the output.

The Bash syntax $((...)) performs integer arithmetic, e.g. $((2+3)) expands to 5; here we are just subtracting one file size from the other.
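
For filenames that may contain any character, including newlines, the same comparison can be sketched with NUL-delimited streams instead of IFS-based splitting. This is a variant under stated assumptions, not a replacement for the script above: it assumes Bash plus GNU find, sort and stat, the function names are hypothetical, and a file absent from one tree is reported as "missing" rather than as size 0:

```shell
#!/bin/bash
# size_or_missing FILE  (hypothetical helper name)
# Prints FILE's size in bytes, or the word "missing" if it is not a regular file.
size_or_missing() {
    if [ -f "$1" ]; then
        stat --format='%s' "$1"
    else
        echo missing
    fi
}

# compare_sizes DIR_A DIR_B  (hypothetical helper name)
# Prints "path<TAB>size in A<TAB>size in B<TAB>B minus A" for every regular
# file found in either tree.
compare_sizes() {
    local dir_a=$1 dir_b=$2 file size_a size_b delta
    {
        # %P is the path relative to the starting directory; \0 ends each
        # name with a NUL, the only byte that cannot occur in a filename.
        find "$dir_a" -type f -printf '%P\0'
        find "$dir_b" -type f -printf '%P\0'
    } | LC_ALL=C sort -zu |   # -z: NUL-delimited records, -u: drop duplicates
    while IFS= read -r -d '' file; do
        size_a=$(size_or_missing "$dir_a/$file")
        size_b=$(size_or_missing "$dir_b/$file")
        if [ "$size_a" = missing ] || [ "$size_b" = missing ]; then
            delta=n/a
        else
            delta=$((size_b - size_a))
        fi
        printf '%s\t%s\t%s\t%s\n' "$file" "$size_a" "$size_b" "$delta"
    done
}
```

It would be called as compare_sizes "$dira" "$dirb". Because no cd is involved, relative directory arguments are resolved against the caller's working directory.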

Jivan Pal
  • There are many valid points in this answer, but the shell code near the end is very poor. You're storing the output of `find`s in "scalar" (non-array) variables, then you `echo` them while giving to `sort`, then you `echo` the result to build the `for` loop. [These `echo`s are unnecessary](https://superuser.com/q/1352850/432690), transparent at best, otherwise possibly harmful. Finally you're relying on word splitting from the unquoted `$()` which is the [Bash pitfall number one](https://mywiki.wooledge.org/BashPitfalls#for_f_in_.24.28ls_.2A.mp3.29), just obfuscated with variables and `echo`s. – Kamil Maciorowski Aug 22 '23 at 04:19
  • It works, thanks! `cd "$script_location"` was removed. I created an empty file in the destination, e.g. a.txt (with all lines deleted), but not in the source. It reports `./a.txt A size = 0 B size = 0 Size difference = 0`. How can I report which files are missing in one of the dirs? – Estatistics Aug 22 '23 at 07:41
  • After `size_diff=$(($file_b_size - $file_a_size))` I added: `zerov=0; if [ "$size_diff" -ne "$zerov" ]; then echo -e "$file, $file_a_size, $size_diff, not equal"; else echo -e "$file, $file_a_size, $size_diff, equal"; fi` in order to make it easy to grep for equal or not equal results. – Estatistics Aug 22 '23 at 09:09
  • @Kamil, on Debian, I have filenames with spaces, trailing spaces, newlines, special characters, Chinese characters, underscores, etc. The above script works. Maybe under other circumstances you would be right. – Estatistics Aug 22 '23 at 09:19
  • @KamilMaciorowski I know enough Bash to get complex stuff done, and haven't had a need to use arrays in my 13 years of using Linux. Frankly, I also just really don't like the array syntax. Combine that with exposure to (and preference of) Zsh, which does arrays completely differently, and I prefer to just steer clear of arrays in shell contexts entirely. In general I get what you're saying about redundant `echo`es, but I don't think there are any in the code I wrote above, e.g. if `$all_filenames` were an array, then you could do `for f in $all_filenames`; but it isn't. – Jivan Pal Aug 22 '23 at 19:16
  • @KamilMaciorowski As for word-splitting being a pitfall: As with anything of the sort, it's only a pitfall if you don't know what you're doing; is using C strings in C a pitfall? Only if your strings might contain `'\0'`. Likewise, is using word-splitting here a pitfall? Only if your filenames might contain `'\n'`, and I think it's a perfectly safe assumption to say that they don't. – Jivan Pal Aug 22 '23 at 19:17
  • @KamilMaciorowski I would also like to reference [the Unix philosophy](http://www.catb.org/~esr/writings/taoup/html/ch01s06.html): "Write programs to handle text streams, because that is a universal interface." `sort` is such a program. When `sort` learns how to handle array input, let me know. – Jivan Pal Aug 22 '23 at 19:21
  • @Estatistics "How to report what files missing in one of dirs?" Rework the usage of `[ -f ... ]` to inform you that the file doesn't exist rather than just treating its size as 0. If you just want to see what files do/don't exist in the destination compared to the source, you may as well just use a different tool, like rsync, e.g. `rsync -anv --delete-during src/ dest/` (`-n` flag crucial! If you omit it, files will be copied and deleted! `-v` makes rsync list what it would do). – Jivan Pal Aug 22 '23 at 19:33
  • Some enjoy writing robust code, others are satisfied with mediocrity. The assumption that filenames don't contain newlines is wishful thinking; I've seen users doing surprising things ([deliberately!](https://linux.codidact.com/posts/288401)). When you assume, at least state it clearly. There is more weirdness in the code. Why do you `cd "$script_location"`? An invocation like `~/bin/script foo bar` will execute `cd bar` in `~/bin/`, thus it will try to process files in `~/bin/bar/`; but `cd bar` will probably fail and `find .` will run in `~/bin/`. It's good the script only prints things. – Kamil Maciorowski Aug 22 '23 at 20:16