
Given: A generic binary file and a block size

Desired output: A copy of the binary file where all blocks that contain only zero bits/bytes have been removed/stripped from the file

I really wonder why I cannot find a tool that does this simple job. I created a small script, but its performance is ridiculous. Surely there must be existing software able to do this?!

Maybe the trouble with finding it is that there are so many different terms that could describe this need...

Edit: The sed thread you mention replaces every zero byte; I only want to remove zero bytes when there are at least blocksize of them in a row.

I want to investigate a very large, very sparse file (not sparse as in a file-system sparse file), and for this analysis I want to cut out the irrelevant parts.

EDIT 2: The file size is in the order of 10 to 1000 GB. For small sizes my own slow tool is alright, but for files that large ...
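For reference, a naive block-stripper along these lines can be sketched in plain shell (strip_zero_blocks and the argument layout are illustrative, not my actual script); it makes one dd call per block, which is exactly the kind of thing that crawls on files this size:

```shell
# Hedged sketch of a naive block-stripper; strip_zero_blocks is an
# illustrative name, not an existing tool.
strip_zero_blocks() {
    # $1 = input file, $2 = output file, $3 = block size in bytes
    in=$1; out=$2; bs=$3
    zeros=$(mktemp); blk=$(mktemp)
    # Reference block of $bs zero bytes to compare against.
    head -c "$bs" /dev/zero > "$zeros"
    : > "$out"
    skip=0
    # One dd invocation per block: correct, but painfully slow.
    while dd if="$in" of="$blk" bs="$bs" skip="$skip" count=1 2>/dev/null
          [ -s "$blk" ]
    do
        # Keep the block only if it differs from the all-zero reference.
        cmp -s "$blk" "$zeros" || cat "$blk" >> "$out"
        skip=$((skip + 1))
    done
    rm -f "$zeros" "$blk"
}
```

Note that a trailing partial block is always kept (it can never equal the full-size zero block), and the per-block process spawning is what makes this approach hopeless at 10–1000 GB.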

  • 3
    Why should there be an extant tool that does this? What's the general use case (i.e., why would people generally want this)? – Jeff Zeitlin Dec 17 '19 at 13:20
  • Write it by yourself. ~10 codelines on VBS. – Akina Dec 17 '19 at 13:43
  • @JeffZeitlin because this is nothing but the replacement of a binary string – dsfsdfdsf23 Dec 17 '19 at 13:56
  • @Akina I already wrote my own implementation, as stated before, and it is dead slow. I doubt that this is a 10-liner in VBS if you don't use the naive "read everything into RAM" approach, which would not work for me, as the file is waaaay too big. – dsfsdfdsf23 Dec 17 '19 at 13:58
  • 1
    _Binary_ string replacement is not exactly - in my experience - a commonly-needed task. _Text_ is far more common, and there _are_ tools for _that_. You'll need to own-code this; I wouldn't recommend PowerShell or VBScript; you'll probably do better using a native-code compiled language. – Jeff Zeitlin Dec 17 '19 at 14:05
  • @dsfsdfdsf23 so use sed? If it is a disk image be aware that you might be removing structures that are important for it to work as a disk image. – Seth Dec 17 '19 at 14:05
  • 1
    Does this answer your question? [How to use sed to remove null bytes?](https://superuser.com/questions/287997/how-to-use-sed-to-remove-null-bytes) – Seth Dec 17 '19 at 14:07
  • 1
    @Seth - I wouldn't use a text-processing tool like `sed` or `awk`; they tend to treat a binary zero as having a special meaning (usually end-of-string). – Jeff Zeitlin Dec 17 '19 at 14:07
  • @JeffZeitlin I probably wouldn't either. But he really hasn't specified much. My guess would be it's a hard disk image with empty space in which case creating a smaller image and just running a copy might be easier. I'm guessing based on it not fitting in RAM. – Seth Dec 17 '19 at 14:10
  • @Seth - that's why I asked about the 'use case', which was carefully ignored. – Jeff Zeitlin Dec 17 '19 at 14:13
  • @JeffZeitlin you asked about the "general use case" and why I expect an existing tool - you did not ask for MY use case – dsfsdfdsf23 Dec 17 '19 at 15:38
  • @Seth thank you for the warning, I am aware of that and it is not a disk image anyway. The sed thread you mentions replaces every byte, I just want to replace 0-bytes if there are at least blocksize many in a row. I want to investigate a very large very sparse file (not sparse as in sparse file in the file system) and for this analysis I want to cut out irrelevant parts – dsfsdfdsf23 Dec 17 '19 at 15:38
  • *the file is waaaay too big* The chunk to load may be huge (and double-buffering may help), but from practice (for x86; I think x64 will differ slightly) a 4 MB chunk is enough, and increasing it further barely improves performance. – Akina Dec 17 '19 at 15:43
  • A "use case" is not a general statement of what you want the tool to do; it is a statement of why someone would want to perform the operation. Your explanation of why the `sed` solution is inadequate contains a valid use case; the "general" version of that would be something like "to reduce file size for analysis of sparse arrays implemented as files". – Jeff Zeitlin Dec 17 '19 at 16:11
  • Or rather, "to reduce memory utilization when analyzing...." – Jeff Zeitlin Dec 17 '19 at 16:12
  • @dsfsdfdsf23 so look for a sequence of block size null bytes and just replace them? – Seth Dec 18 '19 at 06:28
  • @JeffZeitlin: Use case is splitting files into individual/separate files, which have file boundaries set with null bytes ie. concatenated into one larger file (similar to grouping files in folders) – Zimba Mar 23 '21 at 08:04

1 Answer


bbe is "a sed-like editor for binary files". In Debian it's in the bbe package.

It would be best if you could do s/^\0*$// to identify blocks full of null bytes and remove them, but my tests indicate such regex-like expressions don't work in bbe. You can still use (almost) as many literal \0 as you need:

s/\0\0…\0\0//

where … denotes the right number of \0 substrings. If you choose a large block size, it may be problematic to pass an accordingly long string via the command line. Fortunately bbe supports reading a script from a file. Proceed like this:

# The following function uses non-POSIX 'for' loop. Rewrite if necessary.
gen_script() {
   printf 's/'
   for ((i=0;i<"$1";i++)); do
      printf '\\0'
   done
   printf '//\n'
}

# This needs to be a plain decimal number:
blocksize=512

gen_script "$blocksize" > bbe-script
<binary_file_in bbe -b ":$blocksize" -f bbe-script >binary_file_out

Problems:

  1. The above implementation of gen_script is pretty slow and rather impractical for a large blocksize.
  2. In my tests bbe misbehaved for blocksize greater than 16384 (i.e. blocks larger than 16 KiB). This makes the first problem irrelevant.
  3. In this role bbe itself seems not very fast either. I don't know how large your "very large file" is. If I were you I would try

    pv binary_file_in | bbe -b ":$blocksize" -f bbe-script >binary_file_out
    

    and after a few seconds I would be able to tell whether the ETA is acceptable.
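As for problem 1, the shell loop in gen_script can be avoided. One possible faster variant is sketched below (gen_script_fast is my name for it; it assumes GNU tr and sed, which handle NUL bytes and preserve the missing trailing newline):

```shell
# Sketch of a loop-free gen_script; assumes GNU tr/sed. The idea: take
# $1 zero bytes from /dev/zero, turn each one into the two characters
# '\0', and wrap the result in 's/' ... '//'.
gen_script_fast() {
   printf 's/'
   head -c "$1" /dev/zero | tr '\0' 'x' | sed 's/x/\\0/g'
   printf '//\n'
}
```

For example, gen_script_fast 512 emits the same bbe script as gen_script 512, without forking per byte of blocksize.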

Kamil Maciorowski