5

Is there a tool that can scan a small text file and look for any character not in the simple ASCII character set?

A simple Java or Groovy script would also do.

Pops
Marcus Leon
  • It can be moved there, though I would think this would be of direct interest to programmers in the middle of certain programming tasks (such as where I am right now). – Marcus Leon Aug 31 '11 at 00:47
  • It's not a programming question, and therefore is off-topic. You've been here long enough to know that, but if not please read the [FAQ](http://stackoverflow.com/faq) for info on what questions are on-topic here. :) – Ken White Aug 31 '11 at 00:49
  • You could of course use `grep` with a negated character class. – Tom Zych Aug 31 '11 at 00:59
  • Anything that isn’t going to go the route of `grep '[^\x00-\xFF]'` or its moral equivalent **using existing tools not writing a new program** is nothing but insane overkill. – tchrist Aug 31 '11 at 02:17
  • @tchrist, good point. Though I'm having an issue with that - http://stackoverflow.com/questions/7258299/grep-regex-doesnt-work-with-cygwin-on-windows – Marcus Leon Aug 31 '11 at 14:47
  • Use `grep -P '[^\x00-\xFF]'` or `perl -ne 'print if /[^\x00-\xFF]/'`. Note that grep’s `-P` option doesn’t actually accept real Perl regexes. – tchrist Aug 31 '11 at 18:45
  • @tchrist: Doesn't ASCII run from 00 to 7F? – Tom Zych Sep 02 '11 at 00:49
  • @Tom: Yup. I was just mimicking what the OP did, which I later realized didn't make sense. – tchrist Sep 02 '11 at 01:31

5 Answers

2

Well, it's still here after an hour, so I may as well answer it. Here's a simple filter that prints only the non-ASCII characters from its input, and exits with code 0 if there weren't any and 1 if there were. It reads from standard input only.

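#include <stdio.h>
#include <ctype.h>   /* isascii() is a POSIX extension, not ISO C */

int main(void)
{
    int c, flag = 0;   /* flag becomes 1 once a non-ASCII byte is seen */

    /* Echo only the bytes outside 0..127 */
    while ((c = getchar()) != EOF)
        if (!isascii(c)) {
            putchar(c);
            flag = 1;
        }

    return flag;       /* exit status: 0 = all ASCII, 1 = non-ASCII found */
}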
Tom Zych
1

Just run `$JDK_HOME/bin/native2ascii` on the text file and search for `\u` in the output file. I'm assuming you want to find it so you can escape it anyway, and this will save you a step. ;)
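
For a sense of what that escaping step does, here is a minimal Java sketch that rewrites every character above 0x7F as a `\u` escape, so any such escape in the result marks a non-ASCII character. The class and method names are hypothetical, and this only approximates the idea: it is not native2ascii's own code, and the tool's real output format may differ in detail.

```java
final class UnicodeEscaper {
    // Replaces each UTF-16 code unit above 0x7F with a backslash, the letter u,
    // and four hex digits; ASCII characters are copied through unchanged.
    static String escapeNonAscii(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c > 0x7F) {
                out.append(String.format("\\u%04x", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumption: the text to escape is passed as the first argument.
        System.out.println(escapeNonAscii(args[0]));
    }
}
```

Characters outside the Basic Multilingual Plane come out as two escapes here, one per surrogate.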

jonathan.cone
0

I have no idea if this is legit: cast each char to an int and use a catch to identify the ones that fail. I'm also too lazy to write this in Java, so have some Groovy:

def chars = ['Ã', 'a', 'Â', 'ç', 'x', 'o', 'Ð'];

chars.each{
    try{ def asciiInt = (int) it }
    catch(Exception e){ print it + " "}
}

==> Ã Â ç Ð

awfulHack
0

In Java (assuming the string is specified as the first command-line argument):

public class Main
{
    public static void main(String[] args)
    {
        String stringToSearch = args[0];
        int len = stringToSearch.length();
        for (int i = 0; i < len; i++)
        {
            char ch = stringToSearch.charAt(i);
            if (ch >= 128) // non-ascii
            {
                System.out.print(ch + " ");
            }
        }
        System.out.println();
    }
}

To make this your own, replace stringToSearch with whatever you need.
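
The question is about a file rather than a command-line string. Below is a minimal sketch of a byte-level variant, assuming the file is small enough to read in one go and uses an ASCII-compatible encoding such as UTF-8 or ISO-8859-1, so that any byte above 0x7F is non-ASCII. `NonAsciiByteScan` and `firstNonAsciiOffset` are names made up for this example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

final class NonAsciiByteScan {
    // Returns the zero-based offset of the first byte above 0x7F,
    // or -1 if every byte is ASCII (including the empty input).
    static int firstNonAsciiOffset(byte[] data) {
        for (int i = 0; i < data.length; i++) {
            // Java bytes are signed, so mask to get the unsigned value 0..255.
            if ((data[i] & 0xFF) > 0x7F) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: args[0] names a small file, as in the question.
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        int offset = firstNonAsciiOffset(data);
        if (offset < 0) {
            System.out.println(args[0] + " is only ASCII");
        } else {
            System.out.println(args[0] + ": first non-ASCII byte at offset " + offset);
            System.exit(1);
        }
    }
}
```

Working on bytes avoids having to guess the file's encoding, at the cost of reporting byte offsets instead of characters.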

Nathan Moos
0

A simple Groovy example:

def str = [ "this doesn't have any unicode", "this one does ±ÁΘ·€ÔÅ" ]

str.each {
    if( it ==~ /[\x00-\x7F]*/ ) {
        println "all ascii: $it"
    } else {
        println "NOT ASCII: $it"
    }
}

It's as simple as this bit here: `it ==~ /[\x00-\x7F]*/`

Edit: I forgot to include a version for files. Oops:

def text = new File(args[0]).text
if( text ==~ /[\x00-\x7F]*/ ) {
    println "${args[0]} is only ASCII"
    System.exit(0)
} else {
    println "${args[0]} contains non-ASCII characters"
    System.exit(-1)
}

That version can be used as a command line script, and includes an exit status so it can be chained.

OverZealous
  • It doesn’t make any sense to read the whole file into memory. Note that **EVERY SINGLE STRING EVER CREATED** matches something like `/[\x00-\xFF]*/`, just as every single string also matches `/a*/`, even `"xxx"`. Zero or more means you’re content with 0. And `/[\x80-\xFF]/` is not ASCII! You need to match `/^[\x00-\x7F]+$/` to be all ASCII. A normal regex engine with the very most basic Unicode support would simply use `\p{ASCII}` vs `\P{ASCII}`. – tchrist Aug 31 '11 at 18:49
  • @tchrist I appreciate the feedback. Of course, it would be more efficient to stream the file. However, the original question specifically asked about scanning a **small file**. Your comment about the regex is incorrect, simply due to the fact that I actually tested my code before I posted it. Sorry if my range is incorrect - that might be a valid comment, but your comment is unnecessarily aggressive and rude. I was simply providing a working Groovy-based example, since the question mentioned it. – OverZealous Sep 01 '11 at 03:36
  • Also, you have to match the empty string, or empty files will show up as non-ASCII. I think that is incorrect behavior. – OverZealous Sep 01 '11 at 03:38
  • Nope, ASCII is code points 0 through 127. Your pattern matches 0 through 255. It is therefore wrong. – tchrist Sep 01 '11 at 11:48
  • I shouldn't bother responding, but I need to point out two things: First, you could have simply pointed that out, and suggested a fix, and I would have updated my suggestion. That's how StackExchange works - answers can be edited and cleaned up. Second, it's funny you are making such a big deal about the range, since that's the exact same range you suggested above! It's OK though, I understand that you would rather knock someone down than be helpful. – OverZealous Sep 01 '11 at 17:54
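
Two points from this thread, reading the file line by line instead of loading it whole and using `\P{ASCII}` instead of a hand-written range, can be combined in a short Java sketch. It assumes the file is UTF-8 text, and `AsciiLineCheck` and `isAscii` are hypothetical names. Because it searches for a single non-ASCII character rather than matching the whole input, an empty file counts as all ASCII.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Pattern;

final class AsciiLineCheck {
    // \P{ASCII} matches any single character outside the range 0x00-0x7F.
    private static final Pattern NON_ASCII = Pattern.compile("\\P{ASCII}");

    // True when the text contains no character above 0x7F (the empty string included).
    static boolean isAscii(String text) {
        return !NON_ASCII.matcher(text).find();
    }

    public static void main(String[] args) throws IOException {
        boolean clean = true;
        // Assumption: args[0] names a UTF-8 text file.
        try (BufferedReader reader =
                 Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            String line;
            int lineNumber = 0;
            while ((line = reader.readLine()) != null) {
                lineNumber++;
                if (!isAscii(line)) {
                    System.out.println("line " + lineNumber + " contains non-ASCII characters");
                    clean = false;
                }
            }
        }
        // Exit status 0 means all ASCII, 1 means at least one offending line.
        System.exit(clean ? 0 : 1);
    }
}
```

If the file is not valid UTF-8, the reader throws instead of reporting a line, so a byte-level check is the safer choice when the encoding is unknown.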