
I've been using GNU sed on and off for a couple of years now. It spins me out a bit sometimes, but it does a good job... for single-byte character sets!
Now and then I notice references to GNU sed being Unicode-aware, but the closest thing I've seen to this is its "binary" mode, and binary is not Unicode.
Can GNU sed process a Unicode text file at code-point resolution, including (and especially) \r\n Windows line endings? If it can, does it expect UTF-8, UTF-16, or something else? And how does sed detect the encoding?

Matthew Flaschen
Peter.O
  • Usually Unicode is specified with the \uXXXX option. Try this Japanese guy's compile: http://sky.geocities.jp/hp_gabo200x/room_tool.html – Mikhail Nov 01 '12 at 02:03

1 Answer


I don't know a ton about sed, but after some hard Googling it seems to support a variety of code pages through the LANG environment variable. I believe UTF-8 is in fact the default in the absence of LANG. I don't know how the Windows port is set up, though. I do have a strong suspicion that sed performs no encoding detection at all on the input stream.
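
To make the byte-versus-code-point distinction concrete, here is a minimal sketch in Python (the route mentioned in the comments) rather than in sed. It is only an analogy for what a single-byte versus a multibyte setting changes, not a description of sed's internals. The sample string and the helper name `substitute_utf16` are made up for illustration, and the sketch assumes the caller states the encoding up front, since nothing here tries to detect it.

```python
import re

# Hypothetical sample: one accented letter, one code point outside the BMP,
# and a Windows line ending.
text = "\u00e9\U0001d11e\r\n"

# Byte-level matching: every byte of the UTF-8 form counts as one "character".
as_bytes = re.sub(rb"[^\r\n]", b"X", text.encode("utf-8"))

# Code-point-level matching: each code point counts once; \r\n is left alone.
as_chars = re.sub(r"[^\r\n]", "X", text)


def substitute_utf16(data: bytes, pattern: str, repl: str) -> bytes:
    """Decode UTF-16 bytes, substitute per code point, and re-encode.

    Assumption: the input really is UTF-16 (the BOM, if present, selects
    the byte order). No detection of other encodings is attempted.
    """
    decoded = data.decode("utf-16")
    return re.sub(pattern, repl, decoded).encode("utf-16")


print(as_bytes)  # six replacements: 2 bytes + 4 bytes of UTF-8
print(as_chars)  # two replacements: one per code point
```

Calling `substitute_utf16(text.encode("utf-16"), r"[^\r\n]", "X")` returns UTF-16 bytes that decode to two `X` characters followed by the untouched `\r\n`, so no detour through UTF-8 files on disk is needed.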

Sources: https://stackoverflow.com/questions/67410/why-does-sed-fail-with-international-characters-and-how-to-fix http://omgili.com/mailinglist/cygwin/cygwin/com/20100520123926GA1432onderneming10xs4allnl.html

You could also try escape characters, as mentioned here: http://forums.whirlpool.net.au/forum-replies-archive.cfm/841095.html (that seems very cumbersome, though).

Vanessa Phipps
  • @Mathew: Thanks. It's starting to seem that the issue may lie not with sed.exe itself, but with the way Windows *doesn't* handle code pages very well in its cmd.exe console. Maybe it works in PowerShell, but if I have to go there, I'd rather focus on Python instead. As far as I can see, Windows' own pride and joy, UTF-16 (code page 1200, http://msdn.microsoft.com/en-us/library/dd317756%28VS.85%29.aspx), is available only to *managed applications*, whatever that means, but it surely doesn't work in the console. Convert to UTF-8 and back? No way! Python, here I come (or 'Nix and Bash). – Peter.O Aug 07 '10 at 02:11
  • Sounds good to me. I'm kind of confused how cmd.exe is causing the problem and how Python solves it (besides the well-known magic powers of Python (http://xkcd.com/353/)), but I'm not going to pretend to understand the vagaries of cmd.exe. Good luck to you! – Vanessa Phipps Aug 18 '10 at 20:56
  • @fred: Read: .NET. – Hello71 May 04 '11 at 22:34
  • I've solved the problem... I dropped MS completely about 6 months ago, and now use Ubuntu. It is as smooth as silk (and I'm not looking back). – Peter.O May 05 '11 at 10:33