41

I have ZIP files containing files whose names are in a legacy (non-Unicode) encoding. Suppose I know the encoding of those filenames, but I still don't know how to extract them properly.

Here is an example archive; it contains one file, "【SSK字幕组】The Vampire Diaries 吸血鬼日记S06E12.ass".

The encoding used is GB18030 (Chinese).

The question is: how do I unpack that file on FreeBSD, using unzip or another CLI utility, so that the filename comes out correctly encoded? I tried everything I could, but the result was never right. Please help.

I tried on OSX:

MBP1:test 2ge$ bsdtar xf gb18030.zip
MBP1:test 2ge$ ls
%A1%BESSK%D7%D6Ļ%D7顿The Vampire Diaries %CE%FCѪ%B9%ED%C8ռ%C7S06E12/      gb18030.zip
MBP1:test 2ge$ cd %A1%BESSK%D7%D6Ļ%D7顿The\ Vampire\ Diaries\ %CE%FCѪ%B9%ED%C8ռ%C7S06E12/
MBP1:%A1%BESSK%D7%D6Ļ%D7顿The Vampire Diaries %CE%FCѪ%B9%ED%C8ռ%C7S06E12 2ge$ ls
%A1%BESSK%D7%D6Ļ%D7顿The Vampire Diaries %CE%FCѪ%B9%ED%C8ռ%C7S06E12.ass*
MBP1:%A1%BESSK%D7%D6Ļ%D7顿The Vampire Diaries %CE%FCѪ%B9%ED%C8ռ%C7S06E12 2ge$ find . | iconv -f gb18030 -t utf-8
.
./%A1%BESSK%D7%D6L抬%D7椤縏he Vampire Diaries %CE%FC血%B9%ED%C8占%C7S06E12.ass 
MBP1:%A1%BESSK%D7%D6Ļ%D7顿The Vampire Diaries %CE%FCѪ%B9%ED%C8ռ%C7S06E12 2ge$ convmv -r -f gb18030 -t utf-8 --notest .
Skipping, already UTF-8: ./%A1%BESSK%D7%D6Ļ%D7顿The Vampire Diaries %CE%FCѪ%B9%ED%C8ռ%C7S06E12.ass
Ready!

I tried something similar with unzip, but got the same kind of problem.

Thanks. I am now trying on FreeBSD, which I connect to over SSH from OS X (Terminal):

# locale
LANG=
LC_CTYPE="C"
LC_COLLATE="C"
LC_TIME="C"
LC_NUMERIC="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=C

The first thing I would like to do is display Chinese names properly. I changed the locale:

setenv LC_ALL zh_CN.GB18030
setenv LANG zh_CN.GB18030

Then I downloaded the file and ran "ls" to see the proper characters, but no luck. So I think I first have to get the Chinese locale working, so that I can verify the result once I get it right. Can you also help me with this, please?

Giacomo1968
2ge

14 Answers

35

Here's what I do on Ubuntu 16.04 to unzip a zip in any encoding, as long as I know what that encoding is. The same method should work on FreeBSD because it relies only on the widely available unzip tool.

  1. I double-check the exact name of the encoding, so as not to misspell it: https://www.iana.org/assignments/character-sets/character-sets.xhtml

  2. I simply run

    $ unzip -O <encoding> <filename> -d <target_dir>
    

    or

    $ unzip -I <encoding> <filename> -d <target_dir>
    

    choosing between -O or -I according to instructions here:

    $ unzip -h
    UnZip 6.00 of 20 April 2009, by Debian. Original by Info-ZIP.
      ...
      -O CHARSET  specify a character encoding for DOS, Windows and OS/2 archives
      -I CHARSET  specify a character encoding for UNIX and other archives
      ...
    

    which means that I simply try -O first and it should work, because not many people create a .zip file on Unix...


So, for your specific example:

  1. The exact encoding name is GB18030.

  2. I use the -O flag and:

    $ unzip -O GB18030 gb18030.zip -d target_dir
    Archive:  gb18030.zip
       creating: target_dir/【SSK字幕组】The Vampire Diaries 吸血鬼日记S06E12/
      inflating: target_dir/【SSK字幕组】The Vampire Diaries 吸血鬼日记S06E12/【SSK字幕组】The Vampire Diaries 吸血鬼日记S06E12.ass
    

    ... it works.
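
Whether a charset option is needed at all depends on how the names were stored: entries that carry the ZIP UTF-8 flag (general-purpose bit 11) need no conversion. As a rough sketch in Python, not something unzip itself provides, the entries without that flag can be listed together with how they would read under a guessed encoding. The helper name `legacy_names` is made up for illustration, and the sketch assumes Python's `zipfile` default of decoding unflagged names as cp437.

```python
import io
import zipfile

def legacy_names(zip_file, encoding):
    """Return (name_as_stored, redecoded_name) pairs for entries that
    are non-ASCII and lack the UTF-8 flag (bit 11)."""
    result = []
    with zipfile.ZipFile(zip_file) as archive:
        for info in archive.infolist():
            if info.flag_bits & 0x800 or info.filename.isascii():
                continue  # already UTF-8 or plain ASCII: nothing to fix
            # zipfile decoded the raw bytes as cp437; undo that first
            raw = info.filename.encode("cp437")
            result.append((info.filename, raw.decode(encoding)))
    return result

# Usage (archive name taken from the question):
# for stored, fixed in legacy_names("gb18030.zip", "gb18030"):
#     print(stored, "->", fixed)
```

An empty result suggests the names are already ASCII or flagged as UTF-8, in which case a charset option is probably unnecessary.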

mbdevpl
  • 3
    For zips created by Greek Windows I had success with this method and encoding CP737 – ndemou Sep 21 '17 at 09:01
  • Bravo! I double-checked the man page: it actually *works* but is totally undocumented, and none of the zsh completions have this parameter. – ttimasdf Mar 29 '18 at 06:46
  • 8
    `unzip` does not have this option on Mac OS X and always creates percent-encoded filenames. @javacom's `unar` suggestion worked like a charm. – Phil Krylov Apr 10 '18 at 18:48
  • 1
    Looks like a Debian-specific functionality. My `unzip` tells it's `UnZip 6.00 of 20 April 2009, by Info-ZIP. Maintained by C. Spieler` and doesn't provide such options. – L29Ah Apr 11 '19 at 19:06
  • 3
    @L29Ah My `unzip` in Debian 9 is exactly the same version and has no such options. Probably Ubuntu specific? – Arnie97 Apr 16 '19 at 14:20
  • @Arnie97 and L29Ah: The unzip on CentOS 7.6.1810 (not Debian family) reports itself as `UnZip 6.00 of 20 April 2009, by Info-ZIP. Maintained by C. Spieler.` and it has these options. – mbdevpl Apr 18 '19 at 01:31
  • Why is this not the accepted answer? – Wang Apr 25 '19 at 10:33
  • 1
    You can use the `-O` option on any distribution. First, download the source with `apt source unzip` on Ubuntu (a live environment is enough). Second, copy the `unzip-6.0` directory to your system. Third, `cd` into the directory. Finally, execute `sudo make --file=unix/Makefile generic && sudo make --file=unix/Makefile install` to compile and install. The default `prefix` is `/usr/local` (not just `/usr`). For a detailed explanation, read `README` and `INSTALL`. This procedure has been confirmed on Arch Linux, whose original `unzip` doesn't supply the `-O` option. – ynn Sep 19 '19 at 16:33
  • @ynn Or you can pick only `unzip-6.0/debian/patches/20-unzip60-alt-iconv-utf8.patch` and apply it to an official source by [Info-ZIP](http://infozip.sourceforge.net/UnZip.html) and then compile and install. This procedure is also confirmed on Arch Linux. (On Arch, you can `asp checkout unzip` and then `makepkg -o` and then apply the patch and then `makepkg -ei`.) – ynn Sep 19 '19 at 18:04
  • `unzip -O utf-8 archive.zip` worked just fine for me. – Mitali Cyrus Aug 10 '20 at 15:51
30

Method 1: use the unar utility

sudo apt-get install unar

unar -e gb18030 gb18030.zip

Method 2: use a Python script to unzip the file (adapted for Python 3 from https://gist.github.com/usunyu/dfc6e56af6e6caab8018bef4c3f3d452#file-gbk-unzip-py )

#!/usr/bin/env python3
# unzip-gbk.py: extract a ZIP whose filenames use a legacy encoding

import os
import zipfile
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--encoding", default="gbk", help="encoding for filename, default gbk")
parser.add_argument("-l", help="list filenames in zipfile, do not unzip", action="store_true")
parser.add_argument("file", help="process file.zip")
args = parser.parse_args()
print("Processing File " + args.file)
print("Encoding " + args.encoding)

with zipfile.ZipFile(args.file, "r") as archive:
    for info in archive.infolist():
        if info.flag_bits & 0x800:
            # UTF-8 flag set: the name was already decoded correctly
            name = info.filename
        else:
            # zipfile decoded the raw bytes as cp437; undo that and
            # decode them with the requested encoding instead
            name = info.filename.encode("cp437").decode(args.encoding)
        if args.l:
            print("Filename " + name)
            continue
        print("Extracting " + name)
        if info.is_dir():
            os.makedirs(name, exist_ok=True)
            continue
        pathname = os.path.dirname(name)
        if pathname:
            os.makedirs(pathname, exist_ok=True)
        if not os.path.exists(name):
            # binary mode, so the file contents are written unchanged
            with open(name, "wb") as fo:
                fo.write(archive.read(info))

The example gb18030.zip extracts the following files:

【SSK字幕组】The Vampire Diaries 吸血鬼日记S06E12
【SSK字幕组】The Vampire Diaries 吸血鬼日记S06E12/【SSK字幕组】The Vampire Diaries 吸血鬼日记S06E12.ass
muru
javacom
13

On most POSIX filesystems the filename is just a series of bytes and it's up to userspace to make any sense of it. You can use this to your advantage.

  1. First, extract the archive using bsdtar, since the unzip tool seems to mangle the file names, while bsdtar will extract them raw. (I'm testing this on Linux. I guess FreeBSD just calls it tar.)

    $ bsdtar xf gb18030.zip
    
  2. Verify that tools like iconv can successfully decode the names:

    $ find . | iconv -f gb18030 -t utf-8
    

    (Note that this only affects the find output, not files themselves.)

  3. Finally use convmv to convert the file names to UTF-8:

    $ convmv -r -f gb18030 -t utf-8 --notest .
    

    (Note: I had to install Encode::HanExtra from CPAN for the GB18030 support, and manually add use Encode::HanExtra; to /usr/bin/convmv even though it's supposed to

  4. In case convmv is unavailable, script it:

    $ find . -depth | while read -r old; do
        old=./$old;
        head=${old%/*};
        tail=${old##*/};
        new=$head/$(echo "$tail" | iconv -f gb18030 -t utf-8);
        [ "$old" = "$new" ] || mv "$old" "$new";
    done
    

    (At least on Linux, this has an advantage in that iconv is almost always available, and it always supports gb18030.)
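
The rename loop in step 4 can also be sketched in Python, assuming a filesystem that stores names as raw bytes (as on Linux). The function name `rename_tree` is hypothetical, and this is only an illustration of the same idea, not a replacement for convmv: it skips names that do not decode, but it cannot tell whether a name has already been converted, so it should be run only once on freshly extracted files.

```python
import os

def rename_tree(root, src_enc, dst_enc="utf-8"):
    """Re-encode every file and directory name below root.

    Returns the list of (old_path, new_path) pairs that were renamed,
    as bytes paths.
    """
    renamed = []
    # Passing a bytes path makes os.walk yield raw, undecoded names.
    # Walk bottom-up so children are renamed before their parents.
    for dirpath, dirnames, filenames in os.walk(os.fsencode(root), topdown=False):
        for name in dirnames + filenames:
            try:
                new = name.decode(src_enc).encode(dst_enc)
            except UnicodeError:
                continue  # not valid in src_enc: leave the name untouched
            if new != name:
                old_path = os.path.join(dirpath, name)
                new_path = os.path.join(dirpath, new)
                os.rename(old_path, new_path)
                renamed.append((old_path, new_path))
    return renamed

# Usage, after extracting with bsdtar into the current directory:
# rename_tree(".", "gb18030")
```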

u1686_grawity
  • Thanks, grawity, for looking into this. I am testing right now on OS X (which is really close to FreeBSD, so I think the result will be similar). I'm adding a comment to my question, as I cannot edit here... – 2ge Feb 02 '15 at 17:08
  • 1
    @2ge: Ah, OSX might actually be quite different, as HFS+ internally forces file names into NFD UTF-16 rather than storing bytestrings, so there's a possibility that it'll corrupt the GB18030 names before you get a chance to convert them. – u1686_grawity Feb 02 '15 at 17:25
  • I edited the original question and added some more comments. – 2ge Feb 05 '15 at 17:21
  • Yeah, I tried it on macOS Sierra and bsdtar reported lots of "Failed to create xxx" errors (because the parent directory names are corrupt). I had to copy my archive to a Linux VPS, use unzip -O to extract it, and copy the result back to my Mac using ssh -C. – Chang Qian Sep 25 '17 at 09:57
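
For names that macOS has already mangled as in the question's listing, the damage looks reversible in this particular case: bytes that were not valid UTF-8 appear as `%XX` escapes, while byte pairs that happened to form valid UTF-8 appear as unrelated characters. The following Python sketch rests on that reading of the listing, which is an assumption rather than documented behaviour, and the helper name `recover_name` is invented for illustration. It will misread names that contain a literal `%` followed by two hex digits.

```python
import unicodedata
from urllib.parse import unquote_to_bytes

def recover_name(mangled, encoding="gb18030"):
    """Rebuild the original name from a percent-escaped, mis-decoded one."""
    # HFS+ stores names decomposed (NFD); recompose before re-encoding.
    composed = unicodedata.normalize("NFC", mangled)
    # %XX escapes become single bytes; every other character is
    # turned back into its UTF-8 bytes.
    raw = unquote_to_bytes(composed)
    return raw.decode(encoding)

# File name copied from the listing in the question.
mangled = "%A1%BESSK%D7%D6Ļ%D7顿The Vampire Diaries %CE%FCѪ%B9%ED%C8ռ%C7S06E12.ass"
print(recover_name(mangled))
```

If the recovered name looks right, the file can then be renamed with `os.rename`.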
7

On OS X, you can use a GUI application called The Unarchiver. It can be installed from the Mac App Store or with Homebrew Cask:

brew install --cask the-unarchiver

When you open a ZIP file with it, the application lets you choose the appropriate encoding, using a preview of a filename from the archive.

Melebius
4

7z accepts a charset ID via the -scs switch, e.g.:

7z x -scs903 some.zip

where 903 is the Simplified Chinese (中文簡體) charset. A longer list of charset IDs can be found here.

L29Ah
ohho
  • 3
    `7z` `-scs` switch chooses only the encoding of the `@`-defined file list. – Phil Krylov Apr 10 '18 at 18:51
  • 1
    According to the [documentation](https://sevenzip.osdn.jp/chm/cmdline/switches/charset.htm), it works with `add` and `update` only... I tried it with extract, but in vain. My current approach on Windows is to use Ubuntu in WSL2 to unzip a legacy-encoded zip... – Louis Go Jun 30 '20 at 06:40
2

unar has never let me down:

brew install unar

unar -e GBK *.zip
igonejack
1

I just used 7zip and it managed to pick the right encoding – something that standard zip couldn't do.

However, I used it on Windows, with the GUI tool. Maybe the command line 7z will work for you, too.

Melebius
Berry Tsakala
  • 3
    Yes, there is ***now*** another answer recommending 7z.  You can hardly expect Berry’s answer to “add more” to an answer that was posted almost five months later. – Scott - Слава Україні Jun 01 '18 at 07:15
  • @Scott My apologies, I failed to read the English month abbreviations correctly. – Melebius Jun 04 '18 at 14:30
  • OK. You might want to know that, if you put your mouse pointer over any date on the page (and “hover” there), it will show you the date as numbers. (At least this works on computers; people say it doesn’t work well on phones.) Also, below the bottom right corner of the question, you will see “active  oldest  votes”. This is answer sort order. If you click on “oldest”, then you will get the answers in order from oldest to newest. – Scott - Слава Україні Jun 04 '18 at 15:59
1

Use 7z to extract the file

7z x yourfile.zip

After that, convert the encoding of those filenames yourself:

convmv --notest -f from_encoding -t utf-8 -r your_extracted_folder/

This works for me. In my case from_encoding is tis-620 (a Thai encoding); you need to find the appropriate encoding for your language. A popular one usually solves the problem, but if the file names are still unreadable, try changing from_encoding to something else, such as windows-1252 or shift-jis (Japanese). You can list the available encodings with these commands:

convmv --list
iconv --list

For me, this is a very simple way to solve the problem.
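
When the encoding is not known for certain, it can help to see how the same raw name reads under several candidates before renaming anything. A small Python sketch of that trial-and-error step follows; the function name `decode_candidates` and the list of encodings are just examples. Bear in mind that single-byte encodings accept almost any input, so a successful decode does not prove the guess is right and the result still has to be judged by eye.

```python
def decode_candidates(raw, encodings):
    """Map each encoding name to the decoded text, or None on failure."""
    results = {}
    for enc in encodings:
        try:
            results[enc] = raw.decode(enc)
        except (UnicodeDecodeError, LookupError):
            # invalid bytes for this encoding, or an unknown encoding name
            results[enc] = None
    return results

# Example: the bytes of a name as a legacy archive might store them.
raw_name = "吸血鬼日记".encode("gb18030")
candidates = ["utf-8", "shift_jis", "tis-620", "windows-1252", "gb18030"]
for enc, text in decode_candidates(raw_name, candidates).items():
    print(f"{enc:14} {text!r}")
```

On a real directory, the raw names can be obtained with `os.listdir(b".")`, which returns bytes when given a bytes path.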

off99555
0

A one-line sh script using iconv:

for f in /path/*.txt; do mv "$f" "$(echo "$f" | iconv -f 866 -t UTF-8)"; done

The script loops over the files matched by the wildcard and renames each one, converting its name from one code page (866) to another (UTF-8).

The same, but reading the wildcard expansion from a pipeline (note that this variant splits names on whitespace):

echo * | for f in `read f&&echo $f`; do mv "$f" "$(echo "$f" | iconv -f 866 -t UTF-8)"; done

There is no output, apart from permission-denied errors if any occur. A warning is also possible when a filename is identical in both code pages, because that amounts to moving a file onto its own path.

oklas
  • Please edit your answer to include an explanation of what your script is doing, and maybe a possible output of what may be the outcome. – vssher Mar 12 '20 at 02:52
0

Wrote a patch for unzip fixing this issue: https://sourceforge.net/p/infozip/patches/29/

The same patch for p7zip: https://sourceforge.net/p/p7zip/bugs/187/

unxed
0

A Python 3 script to unpack a cp866 archive:

#!/usr/bin/python3
from zipfile import ZipFile
import os
import sys

def extract(filepath, directory='', listonly=False):
  with ZipFile(filepath, 'r') as archive:
    for info in archive.infolist():
      # zipfile decodes names without the UTF-8 flag as cp437;
      # recover the raw bytes and decode them as cp866 instead
      unicode_name = info.filename.encode('cp437').decode('cp866')
      kind = "DIR" if info.is_dir() else "FILE"

      print(kind, unicode_name)
      if listonly or info.is_dir():
        continue

      # os.path.join keeps the path relative when no directory is given
      target = os.path.join(directory, unicode_name)
      dirpath = os.path.dirname(target)
      if dirpath:
        os.makedirs(dirpath, exist_ok=True)
      with open(target, 'wb') as f:
        f.write(archive.read(info))
  return 0

kwargs = {}
usage = False
i = 1
while i < len(sys.argv):
  arg = sys.argv[i]
  if arg[0] != '-':
    kwargs['filepath'] = arg
  elif arg == '-l':
    kwargs['listonly'] = True
  elif arg == '-h':
    usage = True
  elif arg == '-d':
    i += 1
    kwargs['directory'] = sys.argv[i]
  i += 1

print("Arguments given:", kwargs)

# show usage on -h, or when no archive was named
if usage or 'filepath' not in kwargs:
  print("""
Usage: %s [OPTIONS] FILEPATH
Options:
  -l - list files only
  -d - output directory
""" % sys.argv[0])
  sys.exit(1)

sys.exit(extract(**kwargs))

Example:

❯ ./unzip Budget_2020.zip -d dir
Arguments given: {'filepath': 'Budget_2020.zip', 'directory': 'dir'}
FILE Исполнение бюджета 2020 г/Исполнение бюджета 2020 года.pdf
DIR Исполнение бюджета 2020 г/Приложения к Заключению/
FILE Исполнение бюджета 2020 г/Приложения к Заключению/01_Прил_к Заключению Доходы.xls
FILE Исполнение бюджета 2020 г/Приложения к Заключению/02_Прил_к Заключению ГП.xlsx
FILE Исполнение бюджета 2020 г/Приложения к Заключению/03_Прил_к Заключению ГП ГРБС.xlsx
FILE Исполнение бюджета 2020 г/Приложения к Заключению/04_Прил_к Заключению ГП ИНД.pdf
legale
0

With 7-Zip, you can specify the encoding to use with the -mcp switch.

To extract Simplified Chinese zip files with GB18030 encoding (code page 54936):

7z e -mcp=54936 zipname.zip
qris
0

If the zip archive was created with a non-Unicode codec, you can specify the character encoding with unzip -O <encoding> <filename> -d <target_dir>; see @mbdevpl's answer.

But if a zip file was created with a non-Unicode codec and is also encrypted with a password containing non-ASCII characters, the password you pass to the unzip command must also be encoded as bytes in that codec. On Linux, the argument you pass to unzip is typically read as UTF-8, so with a password like 吸血鬼日记 this won't work: unzip -O GB18030 -P '吸血鬼日记' compressed.zip.

So you need a way to give unzip the password as GB18030-encoded bytes. There's no simple way to do this with the unzip command, but it can be done with a Python script:

from zipfile import ZipFile


def extract_zip(archive_name, out_path, pwd, codec):
    # password also needs to be encoded with codec
    password = pwd.encode(codec) if pwd else None
    # the metadata_encoding argument requires Python 3.11 or later
    with ZipFile(archive_name, "r", metadata_encoding=codec) as myzip:
        myzip.extractall(out_path, pwd=password)


extract_zip("compressed.zip", "output_dir", "吸血鬼日记", "GB18030")
oeter
-1

Since unzip is mangling the encoding of non-ASCII file names, the simplest workaround, as mentioned in other answers, is to switch to 7z, and specifically to 7za, which worked as expected on macOS:

7za x '*.zip'

Note the use of quotes — this prevents expansion by the shell (bash, zsh, etc) and delegates the expansion to 7za.

Also, depending on your use case, there may be no need to specify the encoding explicitly with 7za: unlike unzip, it managed to infer the correct encoding.

ccpizza
  • This won't work if the archive is created with some windows code page, there's no way for `7z` (`p7z` on linux/unix) to specify the code page as `GB18030`. – oeter Apr 01 '23 at 11:48