Why does my "grep" stop filtering a non-ASCII file it thinks is "binary"?

Question

I'm working on a Windows 10 computer, using WSL.

I'm investigating a log file produced by NLog in a C# application. I expect matching log entries throughout the file, but this is what I see:

Linux prompt> grep "geen mengcontainer" logfile.log
2023-03-07 07:25:08.7971 | Warn | ... | geen mengcontainer.
2023-03-07 07:25:09.8285 | Warn | ... | geen mengcontainer.
2023-03-07 07:25:10.8754 | Warn | ... | geen mengcontainer.
Binary file logfile.log matches

As you can see, grep stops after 07:25:10, even though the file continues for the rest of the day. Some character seems to be telling grep that the file is binary rather than text, and grep then stops printing matches.

Some more information about the file:

Linux prompt> file logfile.log
logfile.log: ASCII text, with CRLF line terminators

Some more information about my Linux WSL installation:

Linux prompt> uname -a
Linux ComputerName 4.4.0-19041-Microsoft #2311-Microsoft Tue Nov 08 17:09:00 PST 2022 x86_64 x86_64 x86_64 GNU/Linux

Linux prompt> cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.2 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.2 LTS"
VERSION_ID="20.04"
...
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

Some more information about my grep installation:

Linux prompt> grep --version
grep (GNU grep) 3.4

What can I do?

Does anybody know how to find and replace the character that makes grep stop filtering?
Does anybody know which extra parameter or switch I can pass to grep so that it doesn't stop filtering?
Does anybody know of a grep version that does not behave like this? (Please take into account that apt update and similar commands don't work in my environment.)

Thanks in advance

You've already gotten the answer (use `-a`), but consider that with a different approach you would easily find the answer yourself. You mention you're looking for a grep flag to change handling of binary files; the obvious place to look for that is `man grep`. But it's a long man page, and nobody wants to read it all. So, try searching it for "inary" with `/inary` (this catches both "Binary" and "binary"). This turns up the answer straight away. — amalloy, Mar 09 '23 at 07:54
@Greenonline: I disagree with the rejection of my edit: the answer starts with the idea to always use the `-a` switch, which is indeed a good way to handle my particular problem, hence my acceptance and upvote of the answer. To that, I have added an instruction on **how** to make sure that the `-a` switch is **always** used. Apparently some people don't agree with that, but you can leave it up to the answer author to decide about rejecting or approving the edit. — Dominique, Mar 09 '23 at 11:55
@amalloy: if you configure your pager (typically `less`) to make `/` searches case insensitive, you can just search for `binary`. The less option is `-i`, which you can put in a `LESS=-i` environment variable, or your `~/.lesskey`. I put `LESS = iMRj5X` in that startup file to use a more verbose prompt, and make searching put the target line at line 5 instead of the top so I can see context, etc. I also bind `,` and `.` to prev/next file (less used to compile that init file to a binary to minimize startup time, but that's no longer a thing.) — Peter Cordes, Mar 10 '23 at 09:39

u1686_grawity · Accepted Answer · 2023-03-10T11:42:19.233

33

Use grep -a to force a file to always be treated as text.

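The "binary file" detection is codepage-sensitive: grep treats a file as binary when it contains NUL bytes or bytes that are not validly encoded for the current locale. If grep expects UTF-8 input, as is usual on Linux, it will therefore end up detecting "ANSI" (Windows-125x, ISO 8859-x) encoded text files as binary files as soon as it reaches a non-ASCII character. Running grep under the "C" locale, with LC_CTYPE=C grep or LC_ALL=C grep, may also avoid this problem.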
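
As a rough illustration of the behaviour described above, the following sketch builds a small throwaway file and compares the three invocations. The file name, its contents, and the C.UTF-8 locale name are assumptions made for the example; they are not taken from the question.

```shell
# Sketch: reproduce the "Binary file ... matches" behaviour on a throwaway file.
# The file name and contents are made up for illustration.
tmpdir=$(mktemp -d)
sample="$tmpdir/sample.log"

# Three CRLF-terminated lines; the second contains byte 0xE9 (octal 351),
# which is "é" in Windows-1252 but an invalid sequence in UTF-8.
printf 'line 1 | geen mengcontainer.\r\n'           >  "$sample"
printf 'line 2 | caf\351 | geen mengcontainer.\r\n' >> "$sample"
printf 'line 3 | geen mengcontainer.\r\n'           >> "$sample"

# Under a UTF-8 locale, grep typically prints the matches before the bad byte
# and then reports a binary file. C.UTF-8 is an assumption; use any UTF-8
# locale that `locale -a` lists on your system.
LC_ALL=C.UTF-8 grep 'geen mengcontainer' "$sample" || true

# Either of these should print all three matching lines instead.
with_a=$(LC_ALL=C.UTF-8 grep -a 'geen mengcontainer' "$sample" | wc -l)
with_c=$(LC_ALL=C grep 'geen mengcontainer' "$sample" | wc -l)
echo "lines with -a: $with_a, lines under LC_ALL=C: $with_c"

rm -r "$tmpdir"
```

The exact wording and destination of the "binary file" message differ between grep versions, so the sketch only counts the matching lines.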

(Also, what 'file' says about the input being "ASCII" is based entirely on a quick look at the initial bytes within the file; it doesn't actually scan the entire thing, whereas 'grep' of course does.)
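
Because 'file' only samples the input, one possible way to check the whole file is to let a tool decode every byte. The sketch below assumes iconv and tr are available and uses a made-up sample file in place of logfile.log; it is one option, not the only one.

```shell
# Sketch: check a whole file rather than only its beginning.
# The sample file is hypothetical and stands in for logfile.log.
tmpdir=$(mktemp -d)
sample="$tmpdir/sample.log"
printf 'plain ASCII line\r\n'          >  "$sample"
printf 'late non-ASCII byte: \351\r\n' >> "$sample"
printf 'late NUL byte: \000\r\n'       >> "$sample"

# iconv has to decode every byte, so converting UTF-8 to UTF-8 and discarding
# the output acts as a validity check. A non-zero exit status means some
# sequence in the file is not valid UTF-8.
if iconv -f UTF-8 -t UTF-8 "$sample" > /dev/null 2>&1; then
    utf8_ok=yes
else
    utf8_ok=no
fi

# NUL bytes are valid UTF-8, so count them separately:
# delete everything except NUL, then count the remaining bytes.
nul_count=$(LC_ALL=C tr -cd '\000' < "$sample" | wc -c)

echo "valid UTF-8: $utf8_ok, NUL bytes: $nul_count"
rm -r "$tmpdir"
```

Without the redirection of stderr, iconv also reports the position of the first sequence it could not decode, which can help locate the problem.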

Usually the entire file is in the same encoding (i.e. all of it is likely to be non-UTF-8), so an easy way to find the problematic characters is to search for non-ASCII bytes (LC_ALL=C may be needed):

grep -a -P -n --color '[^\x00-\x7F]' logfile.log

perl -ne 'print "Line $.:\t$_" if /[^\0-\177]/' < logfile.log

This would also highlight the bytes in question:

perl -ne 'print "Line $.:\t$_" if s/[^\0-\177]/sprintf"\e[41m<%02X>\e[m",ord$&/ge' < logfile.log

If the file is valid UTF-8 except for a few odd lines, a similar approach prints the lines that fail UTF-8 decoding:

perl -MEncode -ne 'print "Line $.:\t$_" if !eval{decode("UTF-8", $_, Encode::FB_CROAK)}' < logfile.log
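
Once the offending bytes have been found and the file's real encoding is known, converting the file to UTF-8 is one way to make grep treat it as text again. The sketch below assumes the encoding is Windows-1252, which is only a guess for this log, and uses hypothetical file names.

```shell
# Sketch: convert a Windows-1252 file to UTF-8. The source encoding is an
# assumption; inspect a few of the flagged bytes before relying on it.
tmpdir=$(mktemp -d)
sample="$tmpdir/sample.log"            # hypothetical stand-in for logfile.log
converted="$tmpdir/sample.utf8.log"    # hypothetical output name
printf 'caf\351 | geen mengcontainer.\r\n' > "$sample"

# Write the converted copy to a new file; the original stays untouched.
iconv -f WINDOWS-1252 -t UTF-8 "$sample" > "$converted"

# After conversion, the copy should pass a UTF-8 validity check.
if iconv -f UTF-8 -t UTF-8 "$converted" > /dev/null 2>&1; then
    converted_ok=yes
else
    converted_ok=no
fi
echo "converted file is valid UTF-8: $converted_ok"
rm -r "$tmpdir"
```

This does not help if the file contains NUL bytes, since those remain after conversion.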

edited Mar 10 '23 at 11:42

answered Mar 08 '23 at 08:46


2

Thanks a lot for your quick reply: using `grep -a` I can indeed make `grep` do the full filtering. Once my analysis is done, I'll have a look at the `Perl` commands you mentioned in order to find out what might be going wrong with my file. – Dominique Mar 08 '23 at 09:03
3

NUL bytes might also make it detect a binary file, even in the C locale, e.g. `printf 'xyz\0' |LC_ALL=C grep xyz` gives "Binary file (standard input) matches" at least with the grep I have. – ilkkachu Mar 08 '23 at 21:30
Sorry, but I can't find the character causing the issue, so I've decided to alter my "grep" so that it **always** treats files as text files. I've edited your answer to explain how I did this. – Dominique Mar 09 '23 at 07:32
2

@Dominique That may bite you when you (accidentally) grep actual binary files. The resulting binary output may mess with your terminal and have unexpected side effects. That's a big part of the reason grep behaves like this by default. I suspect this may also be why people rejected your edit. – marcelm Mar 09 '23 at 17:07
1

It's more that instructions for customizing shell aliases are quite out of scope for the question or answer, I think. – u1686_grawity Mar 09 '23 at 17:41
2

It's a terrible idea to _alter my "grep" in order for it to treat files always as textfiles_. It means if you accidentally `grep a CHROME.EXE`, you'll be spammed with megabytes and megabytes of binary noise, a chunk for every byte of the Chrome binary that happens to have the value 0x61 (lowercase 'a'). If you have text files that contain binary data, they should be RARE (IOW, that default is expecting zebras), and you're better off figuring out _why_ binary data is creeping into the log (it's always a log), then fixing whatever's doing the logging so it doesn't happen anymore. – FeRD Mar 10 '23 at 04:32
@FeRD: thanks for the concern, but in my mind `grep` is only to be used for textfiles, never for binary files (I didn't even know that `grep` works on binary files). If ever I need to do a `grep` on a binary file, I always do `strings input_file | grep`, so replacing `grep` by `grep -a` in an alias is for my particular situation not such a bad idea :-) – Dominique Mar 10 '23 at 07:31
1

@Dominique: Everything is just a file. `grep` *doesn't* work *well* or usefully on binary files, that's why it stops itself by default. But in terms of binary, it looks for ranges separated by `0xa` newline bytes, and writes them to stdout if they match the regex, same as always. A plausible scenario for accidentally running on a binary file is `grep symbol foo*` where you intended to grep some `.c` and `.h` files, but there are also some `.o` binary object files of compiler output you forgot about. Or `grep foo -r directory/` to recursively grep, counting on grep to filter out binaries. – Peter Cordes Mar 10 '23 at 09:51
1

I adjusted the commands to include \0 as well, in case that's what is making grep consider the file "binary". – u1686_grawity Mar 10 '23 at 11:44
