For the past few months, I've been learning about the command line with the help of William E. Shotts' The Linux Command Line, a popular book for newcomers who want to learn more about the Linux command line.

In one of the chapters, it introduces the tr command. The book says that character sets can be constructed in one of three ways: an enumerated list such as ABCDEFGHIJKLMNOPQRSTUVWXYZ; a character range, such as A-Z; and POSIX character classes, such as [:upper:].

The part that I don't understand is where the book tells the reader to be wary of using character ranges for the character set because of the locale's collation order, and suggests using POSIX character classes instead.

I've personally never encountered a problem using character ranges such as A-Z with

echo "lowercase letters" | tr a-z A-Z

so why should I refrain from using character ranges in favor of POSIX character classes?

In case you are wondering, my locale is en_US.UTF-8.

muru

1 Answer

You're using UTF-8. Yay! ASCII, and by extension UTF-8 (which was designed to be a superset of ASCII), has the letters of the alphabet in order with no gaps, so a-z contains all the ordinary lowercase letters and nothing else, and likewise for A-Z.

However, that need not be true in some other encoding. The classic example is EBCDIC:

The gaps between letters made simple code that worked in ASCII fail on EBCDIC. For example, for (c='A';c<='Z';++c) would set c to the 26 letters in the ASCII alphabet, but 41 characters, including a number of unassigned ones, in EBCDIC. Fixing this required complicating the code with function calls, which was greatly resisted by programmers.

I'd like to think nobody uses weird stuff like this anymore, but who knows?
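
To make the gap concrete, here is a small sketch that counts how many code points an inclusive range covers. range_size is a hypothetical helper of mine, not a standard command; the end points are the usual code points for A and Z in ASCII (0x41, 0x5A) and in EBCDIC (0xC1, 0xE9).

```shell
#!/bin/sh
# range_size FIRST LAST
# Hypothetical helper: prints how many code points lie in the
# inclusive range FIRST..LAST (both given as numeric code points).
range_size() {
    echo $(( $2 - $1 + 1 ))
}

range_size 0x41 0x5A   # ASCII  'A'..'Z' -> 26
range_size 0xC1 0xE9   # EBCDIC 'A'..'Z' -> 41
```

The second result is 41 rather than 26 because EBCDIC leaves gaps after I and after R, and a naive A-to-Z range sweeps those in as well.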


GNU tr doesn't support Unicode, AFAIK, but for programs that do, [[:upper:]] would also match Unicode characters that are considered uppercase letters, for example, a full-width "Ａ", or an A with an accent: À.

$ printf "%s\n" A a Ａ À | grep '[[:upper:]]'
A
Ａ
À
$ printf "%s\n" A a Ａ À | grep '[A-Z]'   # I'm also using Unicode, so grep tries to be friendly
A
À
$ printf "%s\n" A a Ａ À | LC_ALL=C grep '[A-Z]'
A
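
For tr itself, the class form the book recommends would look roughly like this. It's only a sketch: to_upper is a made-up wrapper, and I pin LC_ALL=C purely so the output is reproducible; drop it to let tr use your current locale's idea of lowercase and uppercase.

```shell
#!/bin/sh
# to_upper STRING
# Hypothetical wrapper: uppercases STRING using POSIX character
# classes instead of the a-z / A-Z ranges.
to_upper() {
    # LC_ALL=C is an assumption made for reproducibility only.
    printf '%s\n' "$1" | LC_ALL=C tr '[:lower:]' '[:upper:]'
}

to_upper "lowercase letters"   # prints: LOWERCASE LETTERS
```

With plain ASCII input, tr a-z A-Z gives the same result; the class form just doesn't rely on where the letters happen to sit in the encoding or collation order.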
muru
  • Question: Is the uppercase A with an accent that you used in your example unicode? Or is it just a regular capital A with an accent? – John_Patrick_Mason Nov 09 '17 at 02:23
  • @John_Patrick_Mason https://paste.ubuntu.com/25922185/ – muru Nov 09 '17 at 02:25
  • So for systems that use UTF-8, À is included in the character set [A-Z], but not full-width "Ａ"? Sorry, I'm still trying to understand what encoding means. – John_Patrick_Mason Nov 09 '17 at 02:40
  • That's partly a matter of locale as well. In whichever locale I'm currently on (probably en_GB), grep treats the collation rules as such that À compares equal to A. May get different results on different locales. – muru Nov 09 '17 at 02:52
  • @John_Patrick_Mason encoding is an incredibly complex topic. I wouldn't expect you to understand it from half a dozen answers and comments. Learn as you go. – muru Nov 09 '17 at 02:53
  • So the real reason why grep prints À on my system is not because À is physically present in the character set [A-Z] (which would make sense since the English language does not have accents) but rather because grep treats À as A? Is that correct? – John_Patrick_Mason Nov 09 '17 at 03:15
  • @John_Patrick_Mason yes, for your particular locale (and as it happens, mine too). May not be the case with some other locale. – muru Nov 09 '17 at 03:16
  • And the reason full-width capital Ａ does not appear in the output of `grep [A-Z]`, even though it is part of Unicode just like À, is that 1) Ａ is not treated as a regular A and 2) the character set [A-Z] only includes "regular" capital letters. Is that right? – John_Patrick_Mason Nov 09 '17 at 04:01
  • @John_Patrick_Mason dunno. As I said, the collation rules depend on the locale. I don't know why the rules are the way they are for a given locale. I'd advise you to avoid thinking of some characters as "regular" and some as not. I doubt the standards or the people implementing these things do so. – muru Nov 09 '17 at 04:05
  • OK, I guess I got an answer to my question. If the encoding were different from UTF-8, some programs would behave unexpectedly. I'm going to mark this as solved. – John_Patrick_Mason Nov 09 '17 at 04:22