
I work with large image datasets containing millions of images, and I often need to compress the results of each processing step so they can be uploaded as a backup.

I have seen that some datasets can be downloaded as a set of .zip files, which can be unzipped independently into the same folder as one consistent dataset. This can be pretty convenient as it enables me to pipeline the download -> decompress -> delete archive process, which is more efficient in terms of both time and storage space, as explained below with arbitrary time/sizes:

  • When decompressing a single 100GB .zip, let's say downloading takes 5 minutes and decompressing takes 10 minutes. I need 15 minutes to get all my data. Assuming the .zip had a 50% compression ratio, I need to use 100+200 = 300GB disk space.
  • When decompressing two 50GB .zip files, let's say downloading each takes 2.5 minutes and decompressing each takes 5 minutes. I can do: 2.5 minutes downloading zip1, then 5 minutes decompressing zip1 while simultaneously downloading zip2 for 2.5 minutes, delete zip1, then decompress zip2 in 5 minutes, for a total of 2.5+5+5 = 12.5 minutes. Meanwhile, I only need to have at most zip2, folder1 and folder2 on disk at the same time, so 50+100+100 = 250GB of disk space.

These time and space savings increase as we increase the number of separate zip files. I am therefore looking for a way to do this.
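To make the overlap concrete, here is a minimal sketch of such a pipeline in Python. The `download` and `extract` callables, the URL list, and the destination are placeholders for whatever tools are actually used; the point is only the scheduling (download part i+1 while extracting part i, deleting each archive as soon as it is unpacked):

```python
import concurrent.futures as cf
import os

def pipeline(urls, download, extract, dest):
    """Download archive i+1 in the background while archive i is being
    extracted. `download(url)` must return a local file path and
    `extract(path, dest)` must unpack it. Each archive is deleted as
    soon as it has been extracted, to free disk space immediately."""
    if not urls:
        return
    with cf.ThreadPoolExecutor(max_workers=1) as pool:
        next_dl = pool.submit(download, urls[0])
        for i in range(len(urls)):
            path = next_dl.result()          # wait for the current download
            if i + 1 < len(urls):
                next_dl = pool.submit(download, urls[i + 1])  # overlap
            extract(path, dest)              # runs while next part downloads
            os.remove(path)                  # drop the archive right away
```

This only overlaps one download with one extraction, matching the two-archive example above; more workers would overlap more stages at the cost of more peak disk usage.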

My requirements are as follows:

  • The method can work on any folder structure, no matter how deep
  • Compression results in .zip files of roughly equal size
  • All resulting archives can be decompressed independently to reconstruct part of the folder (sometimes I may want to use only part of the dataset for tests, in which case I don't want to have to decompress the entire dataset)
  • Optional:
    • The method should be able to show a progress bar
    • The method is fast and efficient

I think I would be able to write a bash or python script that fits the first few requirements, but I doubt it would be fast enough.

I am aware of the -s switch in zip and the -v switch in 7z, but both produce split archives that require the user to have all parts in order to decompress any of them, which is much less desirable.
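For the splitting side, a rough stdlib-only sketch of what such a script could look like (the function name, part-naming scheme, and default chunk size are my own; a real version would want a progress bar, e.g. tqdm, and error handling). Each part preserves paths relative to the dataset root, so every .zip can be unzipped on its own into the same target folder:

```python
import os
import zipfile

def split_zip(root, out_prefix, chunk_bytes=50 * 2**30):
    """Walk `root` and pack files into independent .zip parts of roughly
    `chunk_bytes` uncompressed size each. Paths are stored relative to
    `root`, so any part can be extracted on its own into one folder."""
    files = []
    for dirpath, _, names in os.walk(root):
        for name in names:
            p = os.path.join(dirpath, name)
            files.append((p, os.path.getsize(p)))

    part, size, idx = [], 0, 0

    def flush():
        nonlocal part, size, idx
        if not part:
            return
        with zipfile.ZipFile(f"{out_prefix}{idx:04d}.zip", "w",
                             zipfile.ZIP_DEFLATED) as zf:
            for p in part:
                zf.write(p, os.path.relpath(p, root))
        part, size, idx = [], 0, idx + 1

    for p, s in files:
        part.append(p)
        size += s
        if size >= chunk_bytes:   # close this part once it is "big enough"
            flush()
    flush()                       # write the final, possibly smaller, part
```

Parts are only roughly equal in (uncompressed) size, since files are never split across archives; a bin-packing pass over the size list would tighten this if needed.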

  • Two remarks: (1) Super User is not a script writing service. "These are my requirements, gimme script and lemme test that" is not a good question. What is *your* script (or at least a stub) so far? Where are you stuck? Helping you with [Python](https://superuser.com/tags/python/info) may be considered off-topic, shell scripts are mostly OK. (2) If you choose an archiver and a compressor ([see the difference](https://superuser.com/a/1559701/432690)) able to run in a pipe then it won't be 10 minutes + 5 minutes; it can be max(10 minutes, 5 minutes), so 10 minutes. – Kamil Maciorowski Nov 27 '20 at 07:16
  • Additionally: decompressing from a pipe does not require disk space for compressed data. OTOH resuming a broken download can be problematic (if possible at all). Still you may want to redesign your approach, especially if your network connection is reliable. Is the ZIP format a constraint or your choice? – Kamil Maciorowski Nov 27 '20 at 07:35
  • @KamilMaciorowski 1) Sorry if I gave off the impression of asking people to write me a script. Ideally I am just looking for a pointer to a tool I have overlooked or some one-liner that would do the trick. 2) I was not aware I could pipe a download to decompress directly. That sounds pretty ideal. Any pointers? Additionally) I chose zip format based on the possibility of extracting only part of the data, and because it is widely used so that my teammates do not have to install extra tools. – LemmeTestThat Nov 27 '20 at 10:35

1 Answer


The ZIP file format is really just a container (basically a folder) holding individually compressed files. This is in contrast with the .tar.gz format frequently used on Linux platforms. The advantage of ZIP is that, because each entry is compressed separately, the contents can be extracted individually, exactly as you are hoping to do, without extracting the entire archive.

Indeed, most operating systems, including Windows, natively support opening a ZIP archive to browse file names and metadata without extracting it. It also isn't difficult to extract just a subset of a large directory structure (in Windows you merely copy-paste a selection of files).
7-Zip can do this as well, but you have to press the "Copy" button and then specify the destination.
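The same partial extraction can be scripted with Python's standard zipfile module, pulling out only the members under a given path (the names here are illustrative):

```python
import zipfile

def extract_subset(archive, prefix, dest):
    """Extract only the members of a zip archive whose stored path
    starts with `prefix`, leaving the rest of the archive untouched."""
    with zipfile.ZipFile(archive) as zf:
        members = [n for n in zf.namelist() if n.startswith(prefix)]
        zf.extractall(dest, members=members)
        return members
```

Because ZIP keeps a central directory, this reads and inflates only the selected entries rather than the whole archive.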

There are issues with nested .zip files: generally, the parent .zip has to be fully extracted before the children can be inspected.

As a side note, the .tar.gz format I mentioned uses the same DEFLATE algorithm as ZIP, but it can sometimes compress better since the file names and metadata are compressed as well. The cost is that the archive usually has to be decompressed from the start to review any of its contents.