20

I'm developing a bash script and came across the following strange behaviour!

$ echo £ |cut -c 1
�

The £ sign is piped to the cut command, whose filter picks only one character.

When I change the filter in the cut command to pick 2 characters, the £ is passed through!

$ echo £ |cut -c 1-2
£

It's not a severe problem, and I have a workaround in the script, but why does the filter in the cut command need 2 positions instead of 1 to pick a £ sign?

αғsнιη
Peter Schramm
    Potential duplicate of [this Unix.SE question](https://unix.stackexchange.com/questions/163721/can-not-use-cut-c-characters-with-utf-8). – marcelm Nov 03 '20 at 23:18

2 Answers

43

The cut command in Ubuntu is not multi-byte character aware. For this version of cut, characters are the same as bytes, so -c 1 selects the first byte rather than the first character.

The pound sign (£) is encoded in UTF-8 as two bytes (c2 and a3):

$ echo £ | od -t x1
0000000 c2 a3 0a
0000003

Note: The 0a byte is the newline (the ASCII "Line Feed" character) that echo appends.

When you cut the first character from the line, you select only the c2 byte of £, which is not a valid UTF-8 sequence on its own. As a result, you get the strange question mark (the replacement character) on screen:

$ echo £ | cut -c 1 | od -t x1
0000000 c2 0a
0000002

Note: The above was tested with the latest version of cut in Ubuntu 20.10 (GNU coreutils version 8.32).
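
Since the behaviour depends on how cut was built, here is a minimal sketch for checking your own system. It feeds the two bytes of £ (written as octal escapes) to cut -c 1 and counts the bytes that come out. The variable names and messages are only illustrative, and a character-aware cut will only report 3 when run in a UTF-8 locale.

```shell
#!/bin/sh
# Feed the two bytes of £ (0xc2 0xa3) plus a newline to cut -c 1
# and count how many bytes come back.
n=$(printf '\302\243\n' | cut -c 1 | wc -c | tr -d ' ')

case $n in
    2) verdict="byte-based: -c 1 kept only 0xc2 (plus the newline)" ;;
    3) verdict="character-aware: -c 1 kept both bytes of the character" ;;
    *) verdict="giving an unexpected output length: $n" ;;
esac
echo "cut -c is $verdict"
```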

If you want to select multi-byte characters, you can use grep in a UTF-8 locale (tested with GNU grep version 3.4) like this:

$ echo x£β | grep -o '^.'
x
$ echo x£β | grep -o '^..'
x£
$ echo x£β | grep -o '^...'
x£β
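
If grep is not an option, the same selection can be sketched by looking at the raw bytes directly. The function below is a hypothetical helper (utf8_head is my own name, not a standard tool) and assumes the input is valid UTF-8. It relies on the fact that every byte in the range 0x80-0xBF continues the previous character, so only the other bytes are counted as character starts.

```shell
#!/bin/sh
# utf8_head N STRING: print the first N UTF-8 characters of STRING.
# Hypothetical helper (not part of coreutils). It inspects raw bytes, so the
# result does not depend on the locale or on how cut was built.
# The body runs in a subshell so its variables stay private.
utf8_head() (
    n=$1
    chars=0
    out=
    # od prints each byte as three octal digits; -v disables "*" line folding.
    for b in $(printf '%s' "$2" | od -An -v -t o1); do
        case $b in
            2[0-7][0-7]) ;;               # octal 200-277 = 0x80-0xBF: continuation byte
            *) chars=$((chars + 1)) ;;    # any other byte starts a new character
        esac
        [ "$chars" -gt "$n" ] && break
        out="$out\\0$b"                   # rebuild the byte as a \0ooo escape
    done
    printf '%b\n' "$out"
)

# x£β written as octal escapes: x, then c2 a3, then ce b2
s=$(printf 'x\302\243\316\262')
utf8_head 1 "$s"
utf8_head 2 "$s"
```

With the sample string, the two calls print x and x£, matching the first two grep examples.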

This answer was improved with the help of the comments.

FedKad
  • 3
    _"The `cut` command is not multi-byte character aware."_ - Interestingly, (GNU) cut has both options for selecting bytes (`-b`), and for selecting characters (`-c`). One would hope it would know how to deal with multi-byte characters then... – marcelm Nov 03 '20 at 23:16
  • You might want to change `echo` to `echo -n` in your first example, so that there's no extra `0a` – Grzegorz Oledzki Nov 04 '20 at 06:03
  • 2
    Initially I did that way @GrzegorzOledzki . However, since the second example with `cut` had it already, I removed the `-n` in the first example, for consistency. – FedKad Nov 04 '20 at 09:09
  • 8
    @marcelm Some `cut`s do actually make a distinction between `-b` and `-c`. My `cut (GNU coreutils) 8.32` does the right thing with `-c` in a UTF-8 locale, but it turns out that it's due to a downstream Fedora patch. Upstream coreutils still handles `-b` and `-c` as aliases of the same thing at the moment. – TooTea Nov 04 '20 at 09:42
  • 3
    Note, that strange question mark is known in Unicode as the replacement character. It’s officially supposed to be used when a character or byte cannot be translated to a Unicode code point in the currently selected encoding (and in some cases it may also be used to represent characters that the current font does not include glyphs for). – Austin Hemmelgarn Nov 04 '20 at 12:11
  • 2
    @marcelm, [`cut` is specified to have both `-b` and `-c`](https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/utilities/cut.html). The GNU implementation [just treats them as identical](https://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=blob;f=src/cut.c;h=0f6ba602c207018721414459e0c2df18d15dd190;hb=8d13292a73ecf1f265f77731d3ace29866e3d616#l503)... – ilkkachu Nov 04 '20 at 13:58
19

In UTF-8 encoding, £ is the byte sequence 0xC2 0xA3 (c2a3), which is 11000010 10100011 in binary.

So it's two bytes (like two characters). cut -c treats each byte as a character, so selecting only the first byte produces �.


$ echo -n £ | xxd
00000000: c2a3                                     ..

$ echo -n £ | wc --bytes
2
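
To see how those two bytes relate to a single character, the bit patterns can be decoded by hand. In a two-byte UTF-8 sequence the lead byte has the form 110xxxxx and the continuation byte has the form 10xxxxxx. Joining the x bits gives the Unicode code point. This is only a sketch for the two-byte case, and the variable names are illustrative.

```shell
#!/bin/sh
b1=0xC2   # 110 00010 : lead byte of a two-byte sequence
b2=0xA3   # 10 100011 : continuation byte

# Keep the 5 payload bits of the lead byte and the 6 payload bits of the
# continuation byte, then join them into one number.
code_point=$(( (($b1 & 0x1F) << 6) | ($b2 & 0x3F) ))
result=$(printf 'U+%04X' "$code_point")
echo "$result"
```

This prints U+00A3, the code point of the pound sign: one character, stored as two bytes.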
Ravexina
  • 1
    Characters starting from U+0080 (Latin-1 Supplement) usually show similar behaviour. You can find Unicode table on https://unicode-table.com/ – Kulfy Nov 03 '20 at 11:37
  • 3
    A UTF-8 character can take up to 4 bytes, which is not very intuitive. It's a gotcha: UTF-8 includes 7-bit ASCII but extends it. – mckenzm Nov 04 '20 at 08:12
  • Curiously, `echo -n £ | wc --char` returns `1` so wc knows a different definition of char than cut. – Criggie Nov 04 '20 at 19:38
  • 2
    To be clear GNU cut considers each byte to be a character with -c — other versions of cut will treat characters correctly. – Tim Nov 04 '20 at 22:09