
Here are my OS details:

$ uname -a
AIX xxyy 1 6 000145364C00

I've tried the following command to get the uncompressed size of a file in a gzip archive:

$ gzip -l mycontent.DAT.Gz
compressed  uncompr.   ratio   uncompressed_name
-1223644243 1751372002 -75.3%  mycontent.DAT.Gz

I'm not sure how to interpret the uncompressed size from this output; the compressed file itself is close to 4 GB.

So I tried the following instead, in order to capture the correct value:

$ zcat mycontent.DAT.Gz | wc -c

It gives me this error:

mycontent.DAT.Gz.Z:A file or directory in the path name does not exist.
0

Can you please tell me how to capture this value from a shell script without decompressing the source file?

Journeyman Geek
user238010
  • Are you sure about the integrity of the archive? It reports its own compressed size as ~1.7G. If it is really ~4GB I would guess there is a problem. – terdon Jul 14 '13 at 14:41

6 Answers


To answer the question title:

How can I get the uncompressed size of gzip file without actually decompressing it?

As you obviously know, the -l (--list) option usually shows the uncompressed size.
What it shows is not calculated from the data; it is read from a size field stored at the end of the compressed file (the ISIZE field in the gzip trailer).
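
For files smaller than 4 GiB you can even read that stored field yourself. Below is a minimal sketch of a hypothetical helper, assuming a single-member gzip file and POSIX tail and od; the last four bytes hold the size, least significant byte first:

# Hypothetical helper: read the 4-byte little-endian ISIZE field from
# the end of a gzip file. Only meaningful for single-member archives
# whose uncompressed size is below 4 GiB.
gzip_isize() {
    # Dump the last four bytes as unsigned decimals, one per field,
    # then combine them little-endian so host byte order does not matter.
    set -- $(tail -c 4 "$1" | od -An -t u1)
    echo $(( $1 + ($2 << 8) + ($3 << 16) + ($4 << 24) ))
}

gzip_isize mycontent.DAT.Gz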

In your case, the -l option does not work for some reason.
But it is not possible to 'measure' the uncompressed size from the raw compressed data - the compressed stream simply carries no information beyond the data itself, which is not surprising, as the whole point of compression is to leave out anything that is not needed.

You do not need to store the uncompressed data on disk: zcat file.gz | wc -c is the right approach - but as @OleTange answered, your zcat seems not to be the one from gzip.
The alternative is to use the gzip options -d (--decompress) and -c (--to-stdout), combined with the wc option -c (--bytes):

gzip -dc file.gz | wc -c
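
Since the question asks how to capture the value from a shell script, a minimal sketch building on the pipeline above:

size=$(gzip -dc mycontent.DAT.Gz | wc -c)
echo "Uncompressed size: $size bytes"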
Volker Siegel
    The `-l` option has a bug for files bigger than 4GB: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=149775 – Flimm Jun 29 '15 at 11:29

Your zcat is not GNU zcat but the zcat from compress, which is why it appends .Z to the name and fails. Try:

gzcat mycontent.DAT.Gz | LC_ALL=C wc -c
gzip -dc mycontent.DAT.Gz | LC_ALL=C wc -c
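
To check which zcat you actually have, a quick sanity check (command -v is POSIX; GNU gzip understands --version):

command -v zcat            # which zcat is being picked up
gzip --version | head -1   # a GNU gzip banner here means gzip -dc will work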
Ole Tange

I like using pv, as it shows more human-readable information along with progress:

zcat file.gz | pv > /dev/null

Outputs:

7,65GiB 0:00:44 [ 174MiB/s]
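
If you want to capture the number rather than watch the progress bar, pv can also emit a plain byte count on stderr; a sketch, with flags as documented in the pv manual (-n for numeric output, -b for bytes):

# The final numeric line on stderr is the total byte count
size=$(zcat file.gz | pv -n -b 2>&1 >/dev/null | tail -n 1)
echo "$size"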
phuclv
Eduardo
    this also decompresses the source file like [this answer](https://superuser.com/a/619659/241386) so it's not a solution to this problem – phuclv Mar 27 '20 at 02:02
    It depends on how you interpret the question. It seems he doesn't want to generate a decompressed file as he complains only about the error he is getting from `wc`. There is no way of getting the size without reading the contents. @phuclv – Eduardo Mar 30 '20 at 19:46
  • No, decompressing the content is never necessary to get file metadata. The header **always** contains the original size of each file and their hashes to compare after decompressing – phuclv Mar 31 '20 at 01:21
    @phuclv no. Headers give incorrect information many times. I came here because of this. My 1.7GB file was showing 4GB decompressed when actually it was almost 8GB. Check this issue: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=149775#10 – Eduardo Mar 31 '20 at 17:04

Unfortunately, the only way to know for sure is to extract the file and count the bytes. gzip files do not correctly report uncompressed data larger than 4 GB. See RFC 1952, which defines the gzip file format:

ISIZE (Input SIZE)
    This contains the size of the original (uncompressed) input
    data modulo 2^32.

This discrepancy would be a little more obvious if the version of gzip you are using didn't have a bug: it treats the ISIZE value as a signed 32-bit integer (resulting in -1223644243) rather than as an unsigned 32-bit integer (which would give 3071323053).

The most you can determine from the stored ISIZE field alone is that the real size of the uncompressed data is

(n * 4,294,967,296) + 3,071,323,053

where n is some whole number.
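
For reference, converting the signed value that gzip printed back to the unsigned ISIZE is plain shell arithmetic (assuming a shell with 64-bit arithmetic, such as bash):

echo $(( -1223644243 & 0xFFFFFFFF ))    # prints 3071323053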

James Luke

I searched many sites on the web, and none of them solved the problem of getting the size when the file is bigger than 4 GB.

My solution is this:

[oracle@base tmp]$ timeout --signal=SIGINT 1s tar -tvf oracle.20180303.030001.dmp.tar.gz
-rw-r--r-- oracle/oinstall     111828 2018-03-03 03:05 oracle.20180303.030001.log
-rw-r----- oracle/oinstall 6666911744 2018-03-03 03:05 oracle.20180303.030001.dmp

To get the total size from the .gz file:

[oracle@base tmp]$ echo $(timeout --signal=SIGINT 1s tar -tvf oracle.20180303.030001.dmp.tar.gz | awk '{print $3}') | grep -o '[[:digit:]]*' | awk '{ sum += $1 } END { print sum }'
6667023572
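
As the comment below notes, this only works for tarballs. A simplified sketch without timeout or grep, assuming GNU tar's -tv column layout (size in column 3), would be:

tar -tzvf oracle.20180303.030001.dmp.tar.gz | awk '{ sum += $3 } END { print sum }'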
    This would be a better answer if you explained that it only works for tarballs and you cleaned it up (timeout is not necessary, and neither is grep). – kbolino Nov 05 '18 at 21:42

gzip -l did not work for me, it just gave -1 ... but this did:

unzip -l file.zip
phuclv
grosser