
Is there a simple way to print all non-ASCII characters and the line numbers on which they occur in a file using a command line utility such as grep, awk, perl, etc?

I want to change the encoding of a text file from UTF-8 to ASCII, but before doing so I wish to replace all non-ASCII characters manually, to avoid unexpected substitutions introduced by the conversion routine.

Oliver Salzburg
user001

2 Answers

$ perl -ne 'print "$. $_" if m/[\x80-\xFF]/'  utf8.txt
2 Pour être ou ne pas être
4 Byť či nebyť
5 是或不

or

$ grep -n -P '[\x80-\xFF]' utf8.txt
2:Pour être ou ne pas être
4:Byť či nebyť
5:是或不

where utf8.txt is

$ cat utf8.txt
To be or not to be.
Pour être ou ne pas être
Om of niet zijn
Byť či nebyť
是或不
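The one-liners above print the whole offending line. If you also want to see the non-ASCII characters themselves with their positions, a variant of the same idea works (a sketch: `-CSD` tells perl to decode the input as UTF-8, so each multi-byte character is matched and reported once; `pos()` then gives the character column):

```shell
perl -CSD -ne 'while (/([^\x00-\x7F])/g) { printf "%d:%d %s\n", $., pos(), $1 }' utf8.txt
```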
RedGrittyBrick
    Thanks. The perl snippet works directly, but the grep version doesn't work with GNU grep 2.16. I was able to make it work via: `LC_ALL=C grep -n -P [$'\x80'-$'\xFF']`, where the first bit turns off collation. – Joe Corneli Sep 18 '14 at 12:23
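As the comment notes, `grep -P` with `\x80-\xFF` can fail under a UTF-8 locale, where those escapes do not name valid characters. A byte-oriented variant that avoids `-P` entirely (a sketch, assuming GNU grep and bash's `$'...'` quoting to embed the raw bytes):

```shell
LC_ALL=C grep -n $'[\x80-\xFF]' utf8.txt
```

Setting `LC_ALL=C` makes grep treat the input as single bytes, so the bracket expression matches any byte outside the 7-bit ASCII range.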

I want to change the encoding of a text file from UTF-8 to ASCII ...

... replace all instances of non-ASCII characters ...

Then tell your conversion tool to do so.

$ iconv -c -f UTF-8 -t ASCII <<< 'Look at 私.'
Look at .

$ iconv -c -f UTF-8 -t ASCII//translit <<< 'áēìöų'
aeiou
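A related point worth knowing (a sketch, assuming GNU iconv): without `-c`, iconv exits non-zero at the first character it cannot convert, so it doubles as a quick check that a file is already pure ASCII before you commit to the conversion:

```shell
iconv -f UTF-8 -t ASCII utf8.txt > /dev/null 2>&1 && echo 'pure ASCII' || echo 'contains non-ASCII'
```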
Ignacio Vazquez-Abrams
  • He said he wanted to do that replacement manually. Perhaps the most appropriate replacement is context-dependent. – mark4o Apr 27 '12 at 00:39