I've been using GNU sed on and off for a couple of years now. It spins me out a bit sometimes, but it does a good job... for single-byte character sets!
Now and then I notice references to GNU sed being Unicode-aware, but the closest thing I've seen to this is its "binary" mode, and binary is not Unicode.
Can GNU sed process a Unicode text file at code-point resolution, including and especially \r\n (Windows) line endings? If it can, does it expect UTF-8, UTF-16, or something else? And how does sed detect the encoding?
Matthew Flaschen
Peter.O
Usually Unicode is specified with the \uXXXX option. Try this Japanese guy's compile: http://sky.geocities.jp/hp_gabo200x/room_tool.html – Mikhail Nov 01 '12 at 02:03
1 Answer
I don't know a ton about sed, but after some hard Googling it seems to support a variety of encodings through the locale, which it reads from environment variables such as LANG (LC_ALL and LC_CTYPE take precedence where they are set). With no locale set at all, the default is the C/POSIX locale, which treats input as single bytes, so for UTF-8 you need to set something like LANG=en_US.UTF-8. I don't know how the Windows port is set up, though. I do have a strong suspicion that sed performs no encoding detection at all on the input stream.
Sources: https://stackoverflow.com/questions/67410/why-does-sed-fail-with-international-characters-and-how-to-fix http://omgili.com/mailinglist/cygwin/cygwin/com/20100520123926GA1432onderneming10xs4allnl.html
You could also try escape characters as mentioned here: http://forums.whirlpool.net.au/forum-replies-archive.cfm/841095.html That seems very cumbersome though.
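To make the point about LANG and the lack of detection concrete, here is a minimal Python sketch (not sed itself) of a substitution done at code-point resolution. The function name `substitute_text` and the sample data are hypothetical; the only thing it is meant to show is that the caller states the encoding, nothing is detected from the bytes, and `\r\n` line endings pass through untouched. The byte-level call at the end is roughly what a single-byte locale does with the same input.

```python
import re

def substitute_text(raw: bytes, pattern: str, repl: str, encoding: str) -> bytes:
    """Apply a regex substitution at code-point resolution.

    The caller names the encoding; nothing is detected from the bytes.
    """
    text = raw.decode(encoding)          # bytes -> code points
    text = re.sub(pattern, repl, text)   # '.' matches one code point here
    return text.encode(encoding)         # code points -> bytes, same encoding

# "é" is two bytes in UTF-8 but a single code point.
sample = "café\r\nnaïve\r\n".encode("utf-8")

# Code-point level: '.' consumes the whole "é".
result = substitute_text(sample, r"caf.", "tea", "utf-8")

# Byte level (roughly a single-byte locale): '.' consumes only the first
# byte of "é", leaving a stray 0xA9 byte behind.
byte_result = re.sub(rb"caf.", b"tea", sample)

print(result)
print(byte_result)
```

The second result is no longer valid UTF-8, which is the kind of damage you see when a tool is run on multi-byte text under the wrong locale.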
Vanessa Phipps
@Mathew: Thanks. It's starting to seem that the issue may not lie with sed.exe itself, but in the way that Windows *doesn't* handle code pages very well in its cmd.exe console. Maybe it works in PowerShell, but if I have to go there, I'd rather focus on Python instead. As far as I can see, Windows' own pride and joy, UTF-16 (code page 1200, http://msdn.microsoft.com/en-us/library/dd317756%28VS.85%29.aspx), is available only to *managed applications*, whatever that means, but it surely doesn't work in the console. Convert to UTF-8 and back? No way! Python here I come (or 'Nix and Bash). – Peter.O Aug 07 '10 at 02:11
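For what it's worth, the Python route mentioned above does not need a manual UTF-8 round trip. A minimal sketch, assuming the file is UTF-16 with a BOM; the file name and contents are made up for illustration:

```python
import os
import tempfile

# Write a small UTF-16 sample with Windows line endings.
# newline="" stops Python from translating "\r\n" on write or read.
path = os.path.join(tempfile.mkdtemp(), "sample_utf16.txt")
with open(path, "w", encoding="utf-16", newline="") as f:
    f.write("naïve\r\ncafé\r\n")

# Read it back: the "utf-16" codec uses the BOM to pick the byte order.
with open(path, "r", encoding="utf-16", newline="") as f:
    text = f.read()

# Edit at code-point level, then write back in the same encoding.
edited = text.replace("é", "e")
with open(path, "w", encoding="utf-16", newline="") as f:
    f.write(edited)

# Inspect the raw bytes to confirm the BOM and line endings survived.
with open(path, "rb") as f:
    raw = f.read()
print(raw.decode("utf-16") == "naïve\r\ncafe\r\n")
```

Passing `newline=""` is the design choice that matters for the `\r\n` question: without it, text mode would translate line endings on the way in and out.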
Sounds good to me. I'm kind of confused how cmd.exe is causing the problem and how Python solves it (besides the well-known magic powers of Python (http://xkcd.com/353/)), but I'm not going to pretend to understand the vagaries of cmd.exe. Good luck to you! – Vanessa Phipps Aug 18 '10 at 20:56
I've solved the problem... I dropped MS completely about 6 months ago, and now use Ubuntu. It is as smooth as silk (and I'm not looking back). – Peter.O May 05 '11 at 10:33