
Here are my OS details:

$ uname -a
AIX xxyy 1 6 000145364C00

I've tried the following command to get the uncompressed size of a file in a gzip archive:

$ gzip -l mycontent.DAT.Gz
compressed  uncompr.   ratio   uncompressed_name
-1223644243 1751372002 -75.3%  mycontent.DAT.Gz

I'm not sure how to interpret the uncompressed size from this output; the compressed file itself is close to 4 GB.

So I tried the following instead, in order to capture the correct value:

$ zcat mycontent.DAT.Gz | wc -c

It gives me this error:

mycontent.DAT.Gz.Z:A file or directory in the path name does not exist.
0

Can you please tell me how to capture this value from a shell script without decompressing the source file?

Journeyman Geek
user238010
  • Are you sure about the integrity of the archive? It reports its own compressed size as ~1.7G. If it is really ~4GB I would guess there is a problem. – terdon Jul 14 '13 at 14:41

6 Answers


To answer the question title:

How can I get the uncompressed size of gzip file without actually decompressing it?

As you obviously know, the -l (--list) option usually shows the uncompressed size.
What it shows is not calculated from the data; it is read from a size field stored at the end of the compressed file (the ISIZE field in the gzip trailer).
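
For files smaller than 4 GiB you can even read that stored field yourself. Below is a minimal sketch of a hypothetical helper, assuming a single-member gzip file and POSIX tail and od; the last four bytes hold the size, least significant byte first:

# Hypothetical helper: read the 4-byte little-endian ISIZE field from
# the end of a gzip file. Only meaningful for single-member archives
# whose uncompressed size is below 4 GiB.
gzip_isize() {
    # Dump the last four bytes as unsigned decimals, one per field,
    # then combine them little-endian so host byte order does not matter.
    set -- $(tail -c 4 "$1" | od -An -t u1)
    echo $(( $1 + ($2 << 8) + ($3 << 16) + ($4 << 24) ))
}

gzip_isize mycontent.DAT.Gz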

In your case, the -l option does not work for some reason.
But it is not possible to 'measure' the uncompressed size from the raw compressed data - the compressed stream simply carries no information beyond the data itself, which is not surprising, as the whole point of compression is to leave out anything that is not needed.

You do not need to store the uncompressed data on disk: zcat file.gz | wc -c is the right approach - but as @OleTange answered, your zcat seems not to be the one from gzip.
The alternative is to use the gzip options -d (--decompress) and -c (--to-stdout), combined with the wc option -c (--bytes):

gzip -dc file.gz | wc -c
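
Since the question asks how to capture the value from a shell script, a minimal sketch building on the pipeline above:

size=$(gzip -dc mycontent.DAT.Gz | wc -c)
echo "Uncompressed size: $size bytes"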
Volker Siegel
    The `-l` option has a bug for files bigger than 4GB: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=149775 – Flimm Jun 29 '15 at 11:29

Your zcat is not GNU zcat but the zcat from compress, which is why it appends .Z to the name and fails. Try:

gzcat mycontent.DAT.Gz | LC_ALL=C wc -c
gzip -dc mycontent.DAT.Gz | LC_ALL=C wc -c
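
To check which zcat you actually have, a quick sanity check (command -v is POSIX; GNU gzip understands --version):

command -v zcat            # which zcat is being picked up
gzip --version | head -1   # a GNU gzip banner here means gzip -dc will work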
Ole Tange

I like using pv, as it shows more human-readable information along with progress:

zcat file.gz | pv > /dev/null

Outputs:

7,65GiB 0:00:44 [ 174MiB/s]
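
If you want to capture the number rather than watch the progress bar, pv can also emit a plain byte count on stderr; a sketch, with flags as documented in the pv manual (-n for numeric output, -b for bytes):

# The final numeric line on stderr is the total byte count
size=$(zcat file.gz | pv -n -b 2>&1 >/dev/null | tail -n 1)
echo "$size"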
phuclv
Eduardo
    this also decompresses the source file like [this answer](https://superuser.com/a/619659/241386) so it's not a solution to this problem – phuclv Mar 27 '20 at 02:02
    It depends on how you interpret the question. It seems he doesn't want to generate a decompressed file as he complains only about the error he is getting from `wc`. There is no way of getting the size without reading the contents. @phuclv – Eduardo Mar 30 '20 at 19:46
  • No, decompressing the content is never necessary to get file metadata. The header **always** contains the original size of each file and their hashes to compare after decompressing – phuclv Mar 31 '20 at 01:21
    @phuclv no. Headers give incorrect information many times. I came here because of this. My 1.7GB file was showing 4GB decompressed when actually it was almost 8GB. Check this issue: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=149775#10 – Eduardo Mar 31 '20 at 17:04

Unfortunately, the only way to know for sure is to extract the file and count the bytes. gzip files do not correctly report uncompressed data larger than 4 GB. See RFC 1952, which defines the gzip file format:

ISIZE (Input SIZE)
    This contains the size of the original (uncompressed) input
    data modulo 2^32.

This discrepancy would be a little more obvious if the version of gzip you are using didn't have a bug: it treats the ISIZE value as a signed 32-bit integer (resulting in -1223644243) rather than as an unsigned 32-bit integer (which would give 3071323053).

The most you can determine from the stored ISIZE field alone is that the real size of the uncompressed data is

(n * 4,294,967,296) + 3,071,323,053

where n is some whole number.
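
For reference, converting the signed value that gzip printed back to the unsigned ISIZE is plain shell arithmetic (assuming a shell with 64-bit arithmetic, such as bash):

echo $(( -1223644243 & 0xFFFFFFFF ))    # prints 3071323053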

James Luke

I searched many sites on the web, and none of them solved the problem of getting the size when the file is bigger than 4 GB.

My solution is this:

[oracle@base tmp]$ timeout --signal=SIGINT 1s tar -tvf oracle.20180303.030001.dmp.tar.gz
-rw-r--r-- oracle/oinstall     111828 2018-03-03 03:05 oracle.20180303.030001.log
-rw-r----- oracle/oinstall 6666911744 2018-03-03 03:05 oracle.20180303.030001.dmp

To get the total size from the .gz file:

[oracle@base tmp]$ echo $(timeout --signal=SIGINT 1s tar -tvf oracle.20180303.030001.dmp.tar.gz | awk '{print $3}') | grep -o '[[:digit:]]*' | awk '{ sum += $1 } END { print sum }'
6667023572
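
As the comment below notes, this only works for tarballs. A simplified sketch without timeout or grep, assuming GNU tar's -tv column layout (size in column 3), would be:

tar -tzvf oracle.20180303.030001.dmp.tar.gz | awk '{ sum += $3 } END { print sum }'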
    This would be a better answer if you explained that it only works for tarballs and you cleaned it up (timeout is not necessary, and neither is grep). – kbolino Nov 05 '18 at 21:42

gzip -l did not work for me, it just gave -1 ... but this did:

unzip -l file.zip
phuclv
grosser