Linux Shell Util uniq Unicode Bug
Here is a bug in uniq, from the GNU shell utilities on unix/linux.
Create a file of the following text, each character on its own line (33 lines in total):
═ ═ ═ ║ ║ ║ ╒ ╓ ╔ ╕ ╖ ╗ ╘ ╙ ╚ ╛ ╜ ╝ ╞ ╟ ╠ ╡ ╢ ╣ ╤ ╥ ╦ ╧ ╨ ╩ ╪ ╫ ╬
Save it as unicode.txt, then do cat unicode.txt | uniq -c.
The output is “33 ═”. It thinks I have 33 lines of the same ═ sign. Idiotic unix.
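The steps above can be sketched as a small shell script. The temp file from mktemp and the variable names are my own choices for this sketch, not from the original session. With plain byte comparison there should be 29 groups (═ three times, ║ three times, then 27 distinct characters); on an affected system the count collapses to 1.

```shell
# Write the 33 box-drawing characters, one per line, to a scratch file.
# (mktemp path and variable names are assumptions made for this sketch.)
f=$(mktemp)
printf '%s\n' ═ ═ ═ ║ ║ ║ ╒ ╓ ╔ ╕ ╖ ╗ ╘ ╙ ╚ ╛ ╜ ╝ ╞ ╟ ╠ ╡ ╢ ╣ ╤ ╥ ╦ ╧ ╨ ╩ ╪ ╫ ╬ > "$f"

# Sanity check: the file really has 33 lines.
nlines=$(wc -l < "$f" | tr -d ' ')

# Run uniq -c under whatever locale is currently set, and count
# how many groups of adjacent "equal" lines it reports.
uniq -c < "$f"
ngroups=$(uniq -c < "$f" | wc -l | tr -d ' ')

echo "lines: $nlines, groups seen by uniq: $ngroups"
rm -f "$f"
```

If the group count is less than 29, uniq is merging lines that differ byte-for-byte.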
◆ uniq --version
uniq (GNU coreutils) 8.13
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Richard M. Stallman and David MacKenzie.
The man page doesn't mention anything about Unicode. Here's my locale setting anyhow.
◆ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
I think, since about 2005, unix utils have been in a frenzied round of patching, trying to be Unicode compatible. It looks like the state is still shitty.
A related problem is grep. 〔see Problems of grep in Emacs〕 I thought the problem was somewhere in the complexity of emacs + cygwin + layers + environment variables. But now I know, it's unix!
I am not sure how many unix utils still have Unicode problems.
More About Uniq Unicode Bug
From a discussion on https://plus.google.com/113859563190964307534/posts/8U1vH89aBm8, Dario Bertini [https://plus.google.com/103384186109211000716/posts] gave great info about the bug.
It appears the GNU people don't consider it a bug. His comments:
I looked at uniq.c (from coreutils) and linebuffer.c (from gnulib) but I couldn't easily see the bug… the code is obviously unicode-oblivious, but in this case it shouldn't matter (you can just check byte-by-byte)
fedora has a huge patch for coreutils i18n... it covers also uniq.c (it uses wchar, so I'd expect it to still have bugs under Cygwin or other environments)
http://pkgs.fedoraproject.org/cgit/coreutils.git/tree/coreutils-i18n.patch?id=6e10f376996b64f538259091a524df2249b653fb;id2=HEAD
if you want to report it as a bug, this is the email: bug-coreutils@gnu.org
but, they will probably ignore it, as happened before: http://lists.gnu.org/archive/html/bug-coreutils/2012-07/msg00072.html
having LC_ALL set to "C" seems to workaround it
cat /tmp/unicod.txt | env LC_ALL=C uniq -c
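A minimal sketch of that workaround, assuming a scratch file made with mktemp (the file path and variable names here are mine, not from the discussion). With LC_ALL=C the comparison is done on raw bytes, so only truly identical lines get merged.

```shell
# Recreate the 33-line test file in a scratch location.
# (mktemp path and variable names are assumptions made for this sketch.)
f=$(mktemp)
printf '%s\n' ═ ═ ═ ║ ║ ║ ╒ ╓ ╔ ╕ ╖ ╗ ╘ ╙ ╚ ╛ ╜ ╝ ╞ ╟ ╠ ╡ ╢ ╣ ╤ ╥ ╦ ╧ ╨ ╩ ╪ ╫ ╬ > "$f"

# Force the C locale for this one command only; the rest of the
# shell session keeps its normal locale settings.
fixed=$(LC_ALL=C uniq -c < "$f")
printf '%s\n' "$fixed"

# Count the groups: 3x ═, 3x ║, then 27 singletons.
fixed_groups=$(printf '%s\n' "$fixed" | wc -l | tr -d ' ')
echo "groups under LC_ALL=C: $fixed_groups"
rm -f "$f"
```

Setting the variable on the command line like this avoids changing the locale for anything else, which matters because LC_ALL=C also changes sorting order and message language.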
Here's the quote from that mailing list reply:
Remember, 'uniq' is required by POSIX to use the same line comparison techniques as 'sort'; and 'sort' is required to use strcoll() (not strcmp) to compare lines. And in your particular choice of locale, strcoll() happens to state that '∨' and '∧' collate identically; hence uniq is correct in stating that you have a duplicated line according to your current locale.
[from http://lists.gnu.org/archive/html/bug-coreutils/2012-07/msg00072.html]
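Since the quote says uniq follows the same line comparison as sort, one rough way to check whether your locale's collation is what's merging the lines is to count distinct lines with sort -u under both the current locale and the C locale. This is only a sketch: the scratch file and variable names are my own, and I'm assuming sort -u on your system decides equality by the locale's collation alone.

```shell
# Recreate the 33-line test file in a scratch location.
# (mktemp path and variable names are assumptions made for this sketch.)
f=$(mktemp)
printf '%s\n' ═ ═ ═ ║ ║ ║ ╒ ╓ ╔ ╕ ╖ ╗ ╘ ╙ ╚ ╛ ╜ ╝ ╞ ╟ ╠ ╡ ╢ ╣ ╤ ╥ ╦ ╧ ╨ ╩ ╪ ╫ ╬ > "$f"

# Distinct lines according to the current locale's collation.
distinct_locale=$(sort -u < "$f" | wc -l | tr -d ' ')

# Distinct lines according to plain byte comparison.
distinct_c=$(LC_ALL=C sort -u < "$f" | wc -l | tr -d ' ')

echo "distinct lines, current locale: $distinct_locale"
echo "distinct lines, C locale:       $distinct_c"
rm -f "$f"
```

If the first number is smaller than the second, the locale is treating different characters as collating identically, which matches the explanation in the quote.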
Here's a StackOverflow question on this: http://stackoverflow.com/questions/20226851/how-do-locales-work-in-linux-posix-and-what-transformations-are-applied
BSD, OS X versions of uniq
Also, hongjiang_wang [http://weibo.com/woodcafe 2013-05-08] tells me that on Mac it works fine. (Some of Mac's shell tools are from BSD, some are from GNU.)