
I'm building on the answer here to write a backup script. The script I have is roughly

backup_files="/etc /home"
excludes="--exclude-vcs --exclude-ignore-recursive=.tarignore"

#(Skip irrelevant details)

total_size=$(du -csb $backup_files | awk '{print $1}' | tail -n 1)

tar cf - $excludes $backup_files -P | pv -s $total_size | gzip > "$target_file"

Only, the computation for total_size ends up overestimating the total, so pv's time estimate is off. I've been fiddling with the script to tighten the estimate, but I'm running into problems. For instance, I have tried

all_files=$(tar cvf /dev/null $excludes $backup_files -P |grep -v -e /$)
total_size=$(du -csb $all_files)

This runs into the problem of too many arguments (roughly a million files). Iterating over the list with a for loop runs into issues with filenames: among other things, spaces break the loop, and some odd Unicode filenames break things further. I also timed the loop, and it would have taken hours.
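
For illustration, the loop I tried was roughly of this shape (reconstructed, not the exact code):

# Reconstructed sketch only. Unquoted $all_files word-splits on
# whitespace, so a name like "Yle Radio.xml" becomes two arguments,
# and a million separate du calls is what made this take hours.
total_size=0
for f in $all_files; do
  size=$(du -sb "$f" | awk '{print $1}')
  total_size=$((total_size + size))
done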

With a few pointers from comments and a now deleted answer, I've gotten as far as

run_tar () {
  # Word splitting on $excludes and $backup_files is intentional:
  # --files-from expects one option or path per line
  printf '%s\n' $excludes $backup_files | tar -cSPf - --files-from -
}

list_files () {
  printf '%s\n' $excludes $backup_files | tar -cvPf /dev/null --files-from - | grep -v -e '/$'
}

compute_size () {
  # Convert the newline-separated list into a NUL-separated list for du
  list_files | while read -r f
  do
    echo -ne "$f\0"
  done | du -csb --files0-from - | awk '{print $1}' | tail -n 1
}

This fixes the overhead from the for loop and the problems with spaces. Currently, it takes about a minute or two to process a million or so files.
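
For context, these functions replace the one-liners from the first version and are wired together roughly like this ($target_file as before):

total_size=$(compute_size)
run_tar | pv -s "$total_size" | gzip > "$target_file"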

Where I'm still stuck is with the Unicode errors. The filenames are rendered as e.g. Yle P\344\344uutiset.xml. Redirecting the errors to /dev/null hides the problem, and it's only a handful of files anyway. An ls of one of the misbehaving directories shows that there is a file called 'Yle P'$'\344\344''uutiset.xml'. I think this particular case is just a mangled filename, but the issue remains that these are still valid filenames. For that matter, newline is also a valid character in a filename, so a newline-separated list isn't safe either.
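
If I understand it right, the octal escapes come from tar's verbose listing quoting non-printable bytes, not from du or the filenames themselves. Assuming GNU tar, something like --quoting-style=literal should keep the raw bytes in the list (untested sketch below), though it does nothing about newlines in filenames:

list_files () {
  # Assumption: GNU tar. --quoting-style=literal stops the verbose
  # listing from rendering bytes like 0xE4 as \344 escapes.
  printf '%s\n' $excludes $backup_files | tar -cvPf /dev/null --quoting-style=literal --files-from - | grep -v -e '/$'
}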

How do I include the few files that I'm missing from the total?

Haem
  • Instead of storing multiple file names in variables and using a `for` loop it might be better to pipe the `tar` output to `while IFS= read -r line`. See https://mywiki.wooledge.org/BashFAQ/001 – Bodo Aug 10 '23 at 08:19
  • Your comment under my (now deleted) answer was right. My next idea is something like `total_size="$(tar -cvvPf /dev/null $excludes $backup_files | awk '{s+=$3-($3+511)%512+1023} END{print s+1024}')"`. Close enough if you are not going to use `-S`, totally flawed if you are. – Kamil Maciorowski Aug 10 '23 at 08:27

0 Answers