
My apologies. I must have made some mistakes when performing the initial tests: after putting everything into a single script, the xxd output does indeed always match the stdout output.

The entire script is on pastebin (https://pastebin.pl/view/454913ec). I'm updating my question and leaving the original (but wrong) question below.

The output of the script that I'm getting is the following:

$ ./test.sh
# Case 1A: echo -n $TEST1
hello world
00000000: 6865 6c6c 6f20 776f 726c 64              hello world

# Case 1B: echo -n -e $TEST1
hello world
00000000: 6865 6c6c 6f20 776f 726c 64              hello world

# Case 1C: echo -n "$TEST1"
hello
world
00000000: 6865 6c6c 6f0a 776f 726c 64              hello.world

# Case 1D: echo -n -e "$TEST1"
hello
world
00000000: 6865 6c6c 6f0a 776f 726c 64              hello.world

# Case 1E: printf "%s" $TEST1
helloworld
00000000: 6865 6c6c 6f77 6f72 6c64                 helloworld

# Case 1F: $ printf "%s" "$TEST1"
hello
world
00000000: 6865 6c6c 6f0a 776f 726c 64              hello.world

# --------------------------------
# Case 2A: $ echo -n $TEST2
hello\nworld
00000000: 6865 6c6c 6f5c 6e77 6f72 6c64            hello\nworld

# Case 2B: echo -n -e $TEST2
hello
world
00000000: 6865 6c6c 6f0a 776f 726c 64              hello.world

# Case 2C: echo -n "$TEST2"
hello\nworld
00000000: 6865 6c6c 6f5c 6e77 6f72 6c64            hello\nworld

# Case 2D: echo -n -e "$TEST2"
hello
world
00000000: 6865 6c6c 6f0a 776f 726c 64              hello.world

# Case 2E: printf "%s" $TEST2
hello\nworld
00000000: 6865 6c6c 6f5c 6e77 6f72 6c64            hello\nworld

# Case 2F: printf "%s" "$TEST2"
hello\nworld
00000000: 6865 6c6c 6f5c 6e77 6f72 6c64            hello\nworld

So at least the xxd output now matches the corresponding stdout output. Again, apologies for that!

So the remaining questions for me are:

  1. Why does Case 1E result in the output helloworld?

  2. Which byte sequences are REALLY contained in TEST1 and TEST2, and what is the proper, portable way to figure that out?

  3. How can I make printf interpret the type of newline encoded in TEST2?

  4. Is the following assignment portable (in the sense that it will always result in the same binary content in the variables)?

$ TEST1="hello
> world"
$ TEST2="hello\nworld"

In another question I read that the locale only applies at expansion time, so the assignment itself should be portable, right?


Original (but wrong) question:

I performed the following tests using git bash:

$ TEST1="hello
> world"
$ TEST2="hello\nworld"


# Case 1A:
$ echo -n $TEST1
hello world
$ echo -n $TEST1 | xxd
00000000: 6865 6c6c 6f20 776f 726c 64              hello world

# Case 1B:
$ echo -n -e $TEST1
hello world
$ echo -n -e $TEST1 | xxd
00000000: 6865 6c6c 6f20 776f 726c 64              hello world

# Case 1C:
$ echo -n "$TEST1"
hello
world
$ echo -n "$TEST1" | xxd
00000000: 6865 6c6c 6f0a 776f 726c 64              hello.world

# Case 1D:
$ echo -n -e "$TEST1"
hello
world
$ echo -n -e "$TEST1" | xxd
00000000: 6865 6c6c 6f0a 776f 726c 64              hello.world

# Case 1E:
$ printf "%s" $TEST1
helloworld
$ printf "%s" $TEST1 | xxd
00000000: 6865 6c6c 6f77 6f72 6c64                 helloworld

# Case 1F:
$ printf "%s" "$TEST1"
hello
world
$ printf "%s" "$TEST1" | xxd
00000000: 6865 6c6c 6f0a 776f 726c 64              hello.world
$

# --------------------------------

# Case 2A:
$ echo -n $TEST2
hello\nworld
$ echo -n $TEST2 | xxd
00000000: 6865 6c6c 6f20 776f 726c 64              hello world

# Case 2B:
$ echo -n -e $TEST2
hello
world
$ echo -n -e $TEST2 | xxd
00000000: 6865 6c6c 6f20 776f 726c 64              hello world

# Case 2C:
$ echo -n "$TEST2"
hello\nworld
$ echo -n "$TEST2" | xxd
00000000: 6865 6c6c 6f0a 776f 726c 64              hello.world

# Case 2D:
$ echo -n -e "$TEST2"
hello
world
$ echo -n -e "$TEST2" | xxd
00000000: 6865 6c6c 6f0a 776f 726c 64              hello.world

# Case 2E:
$ printf "%s" $TEST2
hello\nworld
$ printf "%s" $TEST2 | xxd
00000000: 6865 6c6c 6f77 6f72 6c64                 helloworld

# Case 2F:
$ printf "%s" "$TEST2"
hello\nworld
$ printf "%s" "$TEST2" | xxd
00000000: 6865 6c6c 6f0a 776f 726c 64              hello.world
$

first: I find this frustrating. Also I wish I could add some custom colors to codeblocks on stackoverflow to visualize the problem better (like, color equal outputs in equal colors).

second: With that off my chest - can someone help me make sense of these outputs by explaining the base rules that affect these results?

So, some things that confuse me for example:

  1. Even though the printed stdout outputs are different for TEST1 and TEST2 (for example Case 1A results in different output than case 2A), it seems that the actual bytes that xxd receives as input are identical in all respective TEST1 and TEST2 cases (with all respective cases I mean Case 1x has always the same xxd output as Case 2x, even though the corresponding stdout outputs of the same command are not equal). How is that possible?

  2. Obviously the contents of TEST1 and TEST2 must differ somehow, otherwise it wouldn't be possible that echoing/printing them could result in different stdout outputs. So, how can I properly output the ACTUAL bits (as hex or whatever, doesn't matter so long as it's a clear representation of the actual variable content) contained in those variables?

  3. The TEST1 cases would indicate that xxd receives a 0A newline ASCII character exactly when the printout also shows a linebreak. However, in the TEST2 cases, Case 2B prints a linebreak but doesn't result in a 0A character, and Case 2F does not print a linebreak yet does result in a 0A character.

I kiiinda get that it seems the linebreak is differently encoded in the TEST1 and TEST2 variables, and that when echoing double-quoting seems to expand (is that the right terminology?) the kind of linebreak contained in TEST1, while the -e flag to echo seems to interpret the kind of linebreak encoded in TEST2, but that doesn't explain the xxd outputs as well as the printf cases.

  1. why does Case 1E result in

     $ printf "%s" $TEST1
     helloworld
    
  2. How can printf be made to apply the kind of linebreaks encoded in the TEST2 variable?

  3. What should be the most important lesson learned here?

notes: I refrained from adding

$ TEST3="hello\n
world"

to keep the question short.

I also tested using single quotes ' ' instead of double quotes " " when defining the variables, which does not seem to affect the results.

matthias_buehlmann
  • I can't reproduce the problem in bash on a Linux machine. – choroba Mar 01 '21 at 15:06
  • hmm - maybe that's a bug in git bash then? could you maybe post your exact outputs in an answer? – matthias_buehlmann Mar 01 '21 at 15:07
  • I've fixed your code formatting. You need double indentation for code in lists. Single indentation attaches the paragraph to preceding list item. – gronostaj Mar 01 '21 at 15:10
  • @gronostaj ah that makes sense, thank you – matthias_buehlmann Mar 01 '21 at 15:11
  • I don't know the answer, so I'm hesitating to post an answer. – choroba Mar 01 '21 at 15:35
  • Does `set | grep TEST` show anything on Git Bash? On WSL Bash it shows the values as `TEST1=$'hello\nworld'` and `TEST2='hello\nworld'` – DuncG Mar 01 '21 at 16:32
  • @DuncG yes, I get the same output as you on Git Bash. What does that mean for the binary contents of the variables? – matthias_buehlmann Mar 01 '21 at 16:36
  • It looks like your TEST1 has encoded one character for newline whereas TEST2 is encoded sequence of backslash followed by `n`. You should get a clue if you print the length of each with `echo ${#TEST1} ${#TEST2}` which should print `11 12` showing the difference. – DuncG Mar 01 '21 at 17:09
  • @DuncG you are correct about the printed lengths! So is it always going to be an 0A for an actual line break (as in TEST1), or is this platform dependent? – matthias_buehlmann Mar 01 '21 at 17:21
  • I would expect it would be `0A` character on MacOSX and GNU/Linux (including on Windows via WSL or Cygwin), but I'm not sure about old Macs (might be `0D`) or other native Windows builds. – DuncG Mar 01 '21 at 17:48

2 Answers

1

Even though the printed stdout outputs are different for TEST1 and TEST2 (for example Case 1A results in different output than case 2A), it seems that the actual bytes that xxd receives as input are identical in all respective TEST1 and TEST2 cases (with all respective cases I mean Case 1x has always the same xxd output as Case 2x, even though the corresponding stdout outputs of the same command are not equal). How is that possible?

They're not identical. I cannot reproduce your results using Bash on Linux nor using Git's MSYS Bash on Windows.

when echoing double-quoting seems to expand (is that the right terminology?) the kind of linebreak contained in TEST1

If you quote a variable expansion, its value will remain as-is. If you don't quote a variable expansion, its value will be split into multiple parameters at whitespace (more precisely, at the characters listed in IFS, which by default are space, tab, and newline). This is done by the shell itself, and happens regardless of which command you're using.

(Exception: Expansions done as part of string variable assignments aren't split. For example, foo=$TEST1 will preserve the original value.

However, expansions done as part of array assignments are split. For example, foo=($TEST1) will result in a two-element array containing hello and world.)

Later on, when the echo command receives multiple arguments, it always joins them using a single space.
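
The splitting described above can be observed directly by counting the arguments a command receives. This is a minimal sketch, assuming a POSIX-style shell with the default IFS (zsh, for instance, does not split unquoted expansions by default); `count_args` is a hypothetical helper name used only for illustration:

```shell
# TEST1 holds a real newline; TEST2 holds a backslash followed by the letter n.
TEST1='hello
world'
TEST2='hello\nworld'

# Hypothetical helper: prints how many arguments it was called with.
count_args() { printf '%s\n' "$#"; }

count_args $TEST1     # 2: unquoted, so the value is split at the newline
count_args "$TEST1"   # 1: quoted, so the newline stays inside one argument
count_args $TEST2     # 1: the value contains no whitespace to split at
```

With two arguments, echo prints them joined by a space, which is why Case 1A shows `hello world`.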

Obviously the contents of TEST1 and TEST2 must differ somehow, otherwise it wouldn't be possible that echoing/printing them could result in different stdout outputs. So, how can I properly output the ACTUAL bits (as hex or whatever, doesn't matter so long as it's a clear representation of the actual variable content) contained in those variables?

Use typeset -p TEST1 or declare -p TEST2. (I think Ksh/Zsh prefer typeset, Bash prefers declare, both do the same thing.)

Using printf %s "$TEST1" works for strings, though the above two also handle arrays. You can also use the %q expansion which will backslash-escape any special characters in the printed value (using $''-style quoting, which can then be used in a shell script again).

> printf %q "$TEST1"
$'hello\nworld'

> printf %q "$TEST2"
hello\\nworld
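
Since `%q`, `typeset -p`, and `declare -p` are shell extensions rather than POSIX features, a more portable way to see the raw bytes is to pipe a quoted expansion through `od`. This is a minimal sketch, assuming `od` and `tr` are available; `hex_of` is a hypothetical helper name, not something from the question:

```shell
TEST1='hello
world'
TEST2='hello\nworld'

# Hypothetical helper: print the bytes of its first argument as one hex string.
# printf '%s' adds no trailing newline and does not interpret backslashes.
hex_of() { printf '%s' "$1" | od -An -tx1 | tr -d ' \n'; printf '\n'; }

hex_of "$TEST1"   # 68656c6c6f0a776f726c64    (0a is the newline)
hex_of "$TEST2"   # 68656c6c6f5c6e776f726c64  (5c 6e is backslash, n)

# The lengths differ accordingly.
printf '%s %s\n' "${#TEST1}" "${#TEST2}"   # 11 12
```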

why does Case 1E result in helloworld

As mentioned before, an unquoted variable expansion causes its value to be split at whitespace and provided as multiple parameters. So the command in Case 1E is equivalent to:

printf "%s" "hello" "world"

and while it might seem nonsensical compared to printf() in most other languages, the printf command in Bash reuses the format until it runs out of arguments, meaning that the above is actually equivalent to:

printf %s "hello"
printf %s "world"
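
The format reuse is easier to see when the format contains visible literal text. A small sketch (the variable names `brackets` and `pairs` are only for illustration):

```shell
# The format string is reused until every argument has been consumed.
brackets=$(printf '[%s]' hello world)
printf '%s\n' "$brackets"        # [hello][world]

# With two conversions per pass, arguments are consumed in pairs.
pairs=$(printf '%s=%s;' a 1 b 2)
printf '%s\n' "$pairs"           # a=1;b=2;
```

Because the format `%s` contains no separator of its own, Case 1E prints the two words back to back.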

How can printf be made to apply the kind of linebreaks encoded in the TEST2 variable?

The %b expansion works like %s but additionally expands the backslash-escapes found in the argument.

$ printf %b 'Hello\t,\nworld\t!'
Hello   ,
world   !
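
Applied to the variable from the question, the difference between `%s` and `%b` shows up in the bytes. This is a minimal sketch, assuming `od` and `tr` are available; `plain` and `expanded` are illustrative names:

```shell
TEST2='hello\nworld'

# %s prints the value verbatim; %b expands the backslash escapes in it.
plain=$(printf '%s' "$TEST2" | od -An -tx1 | tr -d ' \n')
expanded=$(printf '%b' "$TEST2" | od -An -tx1 | tr -d ' \n')

printf '%s\n' "$plain"      # 68656c6c6f5c6e776f726c64  (5c 6e kept)
printf '%s\n' "$expanded"   # 68656c6c6f0a776f726c64    (0a produced)
```

Passing the variable as the format itself (`printf "$TEST2"`) would also expand `\n`, but it would treat any `%` in the value as a conversion too, so `%b` with the value as an argument is the safer choice.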

What should be the most important lesson learned here?

Don't write shell scripts.

Quote variables in shell scripts, unless you know exactly when not to.

u1686_grawity
  • "They're not identical. I cannot reproduce your results using Bash on Linux nor using Git's MSYS Bash on Windows." Can you tell me in which cases your xxd output differ between TEST1 and TEST2? – matthias_buehlmann Mar 01 '21 at 15:55
  • I must apologize, I must indeed have made an error, after putting everything into one script (https://pastebin.pl/view/454913ec) I do get different outputs for the xxd lines too. That resolves some, but not all of my questions. I will update my question accordingly, thank you! – matthias_buehlmann Mar 01 '21 at 16:20
  • also, in regards to the `%b` argument to printf, is there any difference in using `printf %b` `printf "%b"` or `printf '%b'`? – matthias_buehlmann Mar 01 '21 at 16:47
  • The shell interprets all three as the exact same string, and the printf command is passed the two characters `%b` in all three cases. Unquoting is done by the shell, not by the command. – u1686_grawity Mar 01 '21 at 16:50
0

I'm not sure that this accounts for all the differences, but I believe that the difference is that TEST1 contains a carriage-return (\r) and not a newline (\n).

In addition, this carriage-return is part of the string as binary character and needs no interpretation to be issued.

You may see the differences by the following code:

$ echo $TEST1 | od -w32 -t x1c
0000000  68  65  6c  6c  6f  20  3e  20  77  6f  72  6c  64  0a
          h   e   l   l   o       >       w   o   r   l   d  \n

$ echo $TEST2 | od -w32 -t x1c
0000000  68  65  6c  6c  6f  5c  6e  77  6f  72  6c  64  0a
          h   e   l   l   o   \   n   w   o   r   l   d  \n

One should also remember that \r and \n are interpreted by the terminal, not by Bash. This means that mixing up their handling by Bash and the terminal can come up with various results according to the order in which the operations were done.

harrymc
  • I get a different result than you. For TEST2 i get the same result as you, but for TEST1 i get `68 65 6c 6c 6f 20 77 6f 72 6c 64 0a` – matthias_buehlmann Mar 01 '21 at 16:04
  • It's possible, as I'm using Ubuntu on Windows by WSL. Different Linux distributions may give different results. However, this still shows that the contents are different between the two. – harrymc Mar 01 '21 at 16:07
  • How to know whether that difference is in the od command or in the actual content of the variables? – matthias_buehlmann Mar 01 '21 at 16:34
  • I don't trust any utility with binary characters, especially ones that were designed for text. – harrymc Mar 01 '21 at 16:40