34

Suppose I want to remove all files in a directory except for one named "notes.txt". I would do this with the pipeline, ls | grep -v "notes.txt" | xargs rm. Why do I need xargs if the output of the second pipe is the input that rm should use?

For the sake of comparison, the pipeline, echo "#include <knowledge.h>" | cat > foo.c inserts the echoed text into the file without the use of xargs. What is the difference between these two pipelines?

Sathyajith Bhat
seewalker
  • You should not use `ls | grep -v "notes.txt" | xargs rm` to remove everything except for `notes.txt`, or in general, [never parse `ls` output](http://mywiki.wooledge.org/ParsingLs). Your command would break if a single file contained a space, for example. The safer way would be `rm !(notes.txt)` in Bash (with `shopt -s extglob` set), or `rm ^notes.txt` in Zsh (with `EXTENDED_GLOB`) etc. – slhck May 27 '13 at 06:10
  • To avoid spaces you could do `find . -maxdepth 1 -mindepth 1 -print0 | xargs -0` instead of `ls | xargs` :-) – flob Jun 27 '14 at 09:41
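The suggestion in the comments above can be tried out safely in a scratch directory. The following is a minimal sketch, not a definitive recipe: the directory and file names are made up for illustration, and it assumes a `find` and `xargs` that support `-mindepth`, `-maxdepth`, `-print0` and `-0` (GNU and BSD versions do).

```shell
# Work in a throwaway directory so nothing real is deleted (names are hypothetical).
demo_dir=$(mktemp -d)
touch "$demo_dir/notes.txt" "$demo_dir/a.txt" "$demo_dir/name with spaces.txt"

# -print0 and -0 separate file names with NUL bytes, so names containing
# spaces or newlines reach rm intact. "--" stops rm from treating a name
# that starts with "-" as an option.
find "$demo_dir" -mindepth 1 -maxdepth 1 -type f ! -name 'notes.txt' -print0 |
  xargs -0 rm --

# List what is left using a glob rather than parsing ls output.
remaining=$(cd "$demo_dir" && printf '%s\n' *)
echo "$remaining"

# Clean up the scratch directory.
rm -rf "$demo_dir"
```

Only `notes.txt` should survive, including in the case where another file name contains a space.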

3 Answers3

49

You are confusing two very different kinds of input: STDIN and arguments. Arguments are a list of strings provided to the command as it starts, usually by specifying them after the command name (e.g. echo these are some arguments or rm file1 file2). STDIN, on the other hand, is a stream of bytes (sometimes text, sometimes not) that the command can (optionally) read after it starts. Here are some examples (note that cat can take either arguments or STDIN, but it does different things with them):

echo file1 file2 | cat    # Prints "file1 file2", since that's the stream of
                          # bytes that echo passed to cat's STDIN
cat file1 file2    # Prints the CONTENTS of file1 and file2
echo file1 file2 | rm    # Prints an error message, since rm expects arguments
                         # and doesn't read from STDIN

xargs can be thought of as converting STDIN-style input to arguments:

echo file1 file2 | cat    # Prints "file1 file2"
echo file1 file2 | xargs cat    # Prints the CONTENTS of file1 and file2
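To make the distinction visible, here is a small sketch using a hypothetical helper function, `show_input`, that reports how many arguments it received and what arrived on its STDIN. The helper is an illustration only and is not part of any standard tool.

```shell
# Hypothetical helper: prints the argument count and whatever arrives on STDIN.
show_input() {
  printf 'args=%d stdin=%s\n' "$#" "$(cat)"
}

# The text arrives on STDIN only; no arguments are passed.
via_pipe=$(echo file1 file2 | show_input)

# The text is passed as arguments; STDIN is empty.
via_args=$(show_input file1 file2 < /dev/null)

# xargs reads the words from STDIN and hands them to the command as arguments.
# The trailing "sh" fills $0 so that file1 and file2 become $1 and $2.
via_xargs=$(echo file1 file2 | xargs sh -c 'echo "args=$#"' sh)

echo "$via_pipe"    # args=0 stdin=file1 file2
echo "$via_args"    # args=2 stdin=
echo "$via_xargs"   # args=2
```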

echo actually does more-or-less the opposite: it converts its arguments to STDOUT (which can be piped to some other command's STDIN):

echo file1 file2 | echo    # Prints a blank line, since echo doesn't read from STDIN
echo file1 file2 | xargs echo    # Prints "file1 file2" -- the first echo turns
                                 # them from arguments into STDOUT, xargs turns
                                 # them back into arguments, and the second echo
                                 # turns them back into STDOUT
echo file1 file2 | xargs echo | xargs echo | xargs echo | xargs echo    # Similar,
                                 # except that it converts back and forth between
                                 # args and STDOUT several times before finally
                                 # printing "file1 file2" to STDOUT.
Gordon Davisson
  • nice explanation! Just like to point out that `printf` is the "true" inverse of xargs – CervEd Dec 05 '22 at 11:02
  • @CervEd `printf` isn't quite the inverse either; for one thing, you'd need to add a format string (like `xargs printf '%s '` or something). But also, `xargs` itself parses its input in ways that irreversibly lose information (treating all whitespace as equivalent, removing quotes and escapes, etc), so a true inverse of it is not possible. – Gordon Davisson Dec 05 '22 at 23:17
  • ofc you have to add the appropriate formatting string, whether that be space, tab, newline or some other control character. This is the crucial difference to `echo`, which always prints out arguments separated with a newline. `xargs` actually doesn't treat all white-space equally. Newlines are treated differently, for example `cat <(seq -s $'\t' 1 4) <(seq -s $'\t' 5 6) <(seq -s $'\t' 7 7) | xargs -I% echo %` is different from `seq 1 7 | xargs -I% echo %`. The differences between whitespace are often subtle and inconsequential, but not always. In the 1st example, echo is invoked 3 times, 2nd 7 – CervEd Dec 06 '22 at 10:01
  • @CervEd The `-I` mode does make it a little better, but it still messes with quotes and escapes (and some versions with spaces and tabs too). You can get even closer to a direct inverse with `xargs -0 printf '%s\0'`, though that'll append a null byte to the end of anything that doesn't already end with one. – Gordon Davisson Dec 07 '22 at 02:05
  • It complicates things but at the same time it's very powerful. Anyhow, my main point was to point out that `printf` is a "truer" inverse of `xargs` since `xargs` doesn't quite treat all whitespace the same. For example `(printf '1 2 3 4 \n'; printf '5 6 7\n') | xargs -L 1 echo` versus `(printf '1\n' ; printf '2 3 4 \n'; printf '5 6 7\n') | xargs -L 1 echo` – CervEd Dec 07 '22 at 10:00
10

cat takes input from STDIN and rm does not. For commands like rm, you need xargs to read the items from STDIN and run the command with those items as command-line arguments.

Alex P.
2

Understand xargs with a minimal example

Before looking into why xargs is useful, let's first make sure that we understand what xargs does with some minimal examples.

When you do either of:

printf '1 2 3 4' | xargs rm
printf '1\n2\n3\n4' | xargs rm

xargs parses the input string coming from stdin, and separates arguments by whitespace, somewhat like Bash, though the details are a bit different. In particular, spaces and newlines are treated differently if you use xargs -L instead of -n: https://stackoverflow.com/questions/6527004/why-does-xargs-l-yield-the-right-format-while-xargs-n-doesnt/6527308#6527308

Because we are not using -L however, both of the above calls are equivalent, and xargs would parse out four arguments: 1, 2, 3 and 4.

Then, xargs takes the arguments it parsed out, and feeds them to the program we are calling. In our case, it is the executable /usr/bin/rm.

By default, xargs does not specify how many arguments it is going to pass at a time; unless we pass some flags, it could be more than one. So the above xargs calls could be equivalent to either:

rm 1 2 3 4

or:

rm 1 2
rm 3 4

or:

rm 1
rm 2
rm 3
rm 4

and we generally don't know which one of the above happened. For rm the end result is the same either way: files 1, 2, 3, and 4 are removed. So we don't care much which one xargs picks, and we just let it do its thing.

It could make a difference for other programs, e.g. /usr/bin/echo however, where a newline is added for every call.

Control how many arguments are passed at a time

We can control how many arguments xargs passes to the command at once with certain flags.

The simplest one is -n, which limits the maximum number of arguments to be passed at a time.

Then, we can try to observe what is going on by using /usr/bin/echo instead of /usr/bin/rm, because echo, unlike rm, treats echo 1 2 differently than echo 1; echo 2: it adds a newline for each call.

With this in mind, if we run:

printf '1 2 3 4' | xargs -n2 echo

it supplies 2 arguments at a time to echo and is equivalent to:

echo 1 2
echo 3 4

which produces:

1 2
3 4

And if we instead run:

printf '1 2 3 4' | xargs -n1 echo

it supplies 1 argument at a time to echo and is equivalent to:

echo 1
echo 2
echo 3
echo 4

which produces:

1
2
3
4

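Another way is to use -L instead of -n. -L is like -n, but it counts input lines rather than arguments: each call receives the arguments found on at most that many lines, so calls are split at newlines, not at spaces: https://stackoverflow.com/questions/6527004/why-does-xargs-l-yield-the-right-format-while-xargs-n-doesnt/6527308#6527308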
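A short sketch of how -L and -n differ, using the same input for both. The variable names are only for illustration.

```shell
# Two lines of input with two words on each line.
by_lines=$(printf '1 2\n3 4\n' | xargs -L1 echo)   # one echo call per input line
by_args=$(printf '1 2\n3 4\n' | xargs -n1 echo)    # one echo call per argument

echo "With -L1:"
echo "$by_lines"   # 1 2
                   # 3 4
echo "With -n1:"
echo "$by_args"    # 1
                   # 2
                   # 3
                   # 4
```

With -L1, echo is called twice with two arguments each; with -n1, it is called four times with one argument each.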

And another common way to control the number of arguments is -I which implies -L1, e.g.:

printf '1\n2\n3\n4\n' | xargs -I% echo a % b

is equivalent to:

echo a 1 b
echo a 2 b
echo a 3 b
echo a 4 b

and so produces:

a 1 b
a 2 b
a 3 b
a 4 b

Alternative approaches and why xargs is superior

Now that we understand what xargs does, let's consider the alternatives and why xargs is better.

Suppose we have a file:

notes.txt

1
2
3
4

Instead of:

xargs rm < notes.txt

we might want to use:

rm $(cat notes.txt)

which expands to:

rm 1 2 3 4

However, this is problematic because there is a maximum size for the command line arguments of a Linux program so it could fail if there were too many arguments in notes.txt.

xargs knows about this, and automatically splits arguments intelligently to avoid having too many at a time.

And there is no maximum size to streams like stdin, so things can work to arbitrary sizes like this. The reason why it works is that streams can be read little by little with the read() system call while CLI arguments must be loaded all at once into virtual memory, so there is no need for a hard maximum on stream sizes.
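The automatic splitting can be observed with a harmless command. This is a rough sketch: the exact limit and the number of calls depend on the system and on the xargs implementation, so no specific count is claimed here. It assumes `seq` and `getconf` are available.

```shell
# The system limit on the combined size of arguments and environment.
getconf ARG_MAX

# Feed 500000 numbers (a few megabytes of text) through xargs.
# Each echo call prints exactly one line, so the line count equals
# the number of times xargs had to invoke echo.
invocations=$(seq 1 500000 | xargs echo | wc -l | tr -d ' ')
echo "echo was invoked $invocations times"

# No argument is lost along the way: every number still comes out once.
total=$(seq 1 500000 | xargs echo | wc -w | tr -d ' ')
echo "$total arguments were passed in total"
```

On typical systems the first count is greater than one, which shows xargs breaking the input into several command lines that each fit under the limit.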

Another simple approach you could try would be:

while IFS="" read -r p || [ -n "$p" ]
do
  rm "$p"
done < notes.txt

from: https://stackoverflow.com/questions/1521462/looping-through-the-content-of-a-file-in-bash but this requires a lot of typing, and could be slower because:

  • it calls the /usr/bin/rm executable once for every argument, rather than fewer times with a bunch of arguments
  • more time is spent on the bash while loop, as opposed to the C-coded xargs code

To make xargs even more interesting, the GNU version has a -P option for parallel operation!
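A minimal sketch of -P, assuming an xargs that supports it (GNU and BSD versions do). The one-second sleep stands in for real work, and the output is sorted because parallel jobs may finish in any order.

```shell
# -n 1 gives each job a single argument; -P 3 lets up to three jobs run at once.
# The trailing "sh" fills $0 so that the argument from xargs becomes $1.
parallel_out=$(printf '1\n2\n3\n' |
  xargs -P 3 -n 1 sh -c 'sleep 1; echo "done $1"' sh | sort)

echo "$parallel_out"
```

Run sequentially, the three jobs would take about three seconds; with -P 3 they can overlap.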

Related: https://unix.stackexchange.com/questions/24954/when-is-xargs-needed

  • "This is exactly the same as if you had a file", not exactly. One is space delimited and one is newline delimited. There are subtle differences – CervEd Dec 05 '22 at 11:03
  • @CervEd are you sure they are different? I think `xargs` is just setting `argv[1]`, `argv[2]` of the called program directly without the delimiter spaces, I don't think the called programs can see any difference. Let me know if you can produce a minimal example that demonstrates otherwise. – Ciro Santilli OurBigBook.com Dec 11 '22 at 12:33
  • in posix xargs, there are subtle differences between newline and other whitespaces. See my comments on this answer for details https://superuser.com/questions/600253/why-is-xargs-necessary/1622535?noredirect=1#comment2728281_600273 – CervEd Dec 11 '22 at 12:45