
I have file names in several encodings on a reiserfs-mounted hard drive: CP1251, KOI-8, UTF-8, and ASCII. I really need to convert all of them to UTF-8, recursively. Is there any utility that will detect the source encoding and convert the names to UTF-8, or do I have to write a Python script?

Pablo
  • In the general case, it is not possible to automatically "guess" the name encoding: most byte sequences are valid both as KOI-8 and as CP1251 file names, but decode differently. Do you have any extra clue that would help identify the encoding? –  Jan 04 '15 at 01:13
  • No other clue :( – Pablo Jan 04 '15 at 09:24
  • Do you have both lowercase and uppercase filenames? –  Jan 05 '15 at 00:16
  • Yes, I have both lowercase and (all) uppercase file names. – Pablo Jan 05 '15 at 00:20
  • Anyone in need? Check out `detox`. It worked for me between ISO-8859-1 and UTF-8 using `-s iso8859_1-only`; see the sketch just below. – Alwin Kesler Feb 18 '19 at 13:59
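A hedged sketch of that detox approach: the sequence name `iso8859_1-only` comes from the comment above, `-r` (recurse) and `-n` (dry run) are standard detox flags, and the path is a placeholder.

# Preview the renames first with -n, then run again without it to apply.
detox -r -n -s iso8859_1-only /path/to/files
detox -r -s iso8859_1-only /path/to/files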

3 Answers


Use convmv, a CLI tool that converts file names between encodings (for KOI-8, the name convmv and iconv actually expect is typically KOI8-R, or KOI8-U for Ukrainian). To convert from (-f) these encodings to (-t) UTF-8, do the following:

convmv -f CP1251 -t UTF-8 inputfile
convmv -f KOI8-R -t UTF-8 inputfile
convmv -f ASCII  -t UTF-8 inputfile
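Two caveats worth knowing: convmv only prints what it would do unless you pass --notest, and it has -r for the recursive conversion the question asks about (it also skips names that are already valid UTF-8 by default). A sketch, with /path/to/dir as a placeholder:

# Dry run: shows the planned renames without touching anything.
convmv -r -f CP1251 -t UTF-8 /path/to/dir
# Apply the renames for real.
convmv -r --notest -f CP1251 -t UTF-8 /path/to/dir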

In addition, if you want to convert the file contents, use iconv, a CLI tool that converts file content between encodings. To convert from (-f) these encodings to (-t) UTF-8, do the following:

iconv -f CP1251 -t UTF-8 inputfile > outputfile
iconv -f KOI8-R -t UTF-8 inputfile > outputfile
iconv -f ASCII  -t UTF-8 inputfile > outputfile
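For many files at once, a hedged loop sketch (the glob and paths are assumptions; iconv cannot write in place, hence the temporary file):

for f in /path/to/dir/*.txt; do
    iconv -f CP1251 -t UTF-8 "$f" > "$f.utf8" && mv "$f.utf8" "$f"
done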
Marcos Roriz Junior

Nope. One of the big downsides to the old code page system is that there is no way to detect which one is being used; you simply have to know a priori. If you do know which files use which encoding, then you can convert the names with something like:

mv somefile "$(echo somefile | iconv -f CP1251 -t UTF-8)"
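To apply that to a whole directory of names known to be CP1251, a hedged loop sketch (the path is a placeholder; the quoting guards against spaces in names):

for f in /path/to/dir/*; do
    mv "$f" "$(printf '%s' "$f" | iconv -f CP1251 -t UTF-8)"
done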
psusi
  • Too many files to rename manually... I thought the codepages have distinct character code ranges. – Pablo Jan 04 '15 at 09:22
  • @Pablo, no, that is the entire point: with an 8 bit byte you only have 256 possible character codes. After subtracting the normal set of ASCII characters and control codes, that leaves 128 for additional codes, which isn't enough to represent the full range of characters in all languages. Each code page makes its own use of those upper 128 codes to represent characters important to the user. The only way to figure out which is in use is to try displaying the name in each possible code page and see whether it appears to make sense, and that isn't something a computer can decide. – psusi Jan 04 '15 at 17:44
  • well, python `chardet` is somehow detecting it... – Pablo Jan 04 '15 at 18:25
  • @Pablo, neat... it looks like it makes an educated guess based on the prevalence of different characters in written language. In other words, it assumes that certain characters, like goofy glyphs, are less popular than, say, an accented 'a', then tries interpreting the name in each code page and picks the one whose codes best match the more popular characters. It likely isn't very accurate though, especially over a small number of characters, such as a file name (a minimal sketch of this approach follows below). – psusi Jan 04 '15 at 18:44
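For completeness, a minimal sketch of that chardet approach. Since the guesses can be wrong, it only prints a guess per name rather than renaming anything; it assumes python3 with the chardet package installed, run from the directory in question:

for f in *; do
    # Feed the raw bytes of each name to chardet and print its guess.
    guess=$(printf '%s' "$f" | python3 -c 'import sys, chardet; print(chardet.detect(sys.stdin.buffer.read())["encoding"])')
    echo "$f -> $guess"
done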

The same iconv-based solution that @psusi suggests, but with a loop and a wildcard.

As a one-line sh script:

for f in /path/*.txt; do mv "$f" "$(echo "$f" | iconv -f 866 -t UTF-8)"; done

Reading the file names from a pipe instead:

printf '%s\n' * | while read -r f; do mv "$f" "$(echo "$f" | iconv -f 866 -t UTF-8)"; done
oklas