Linux Shell Util uniq Unicode Bug

By Xah Lee. Date:

Here is a bug in the GNU coreutils uniq command on unix/linux.

Create a file of the following text:

═
═
═
║
║
║
╒
╓
╔
╕
╖
╗
╘
╙
╚
╛
╜
╝
╞
╟
╠
╡
╢
╣
╤
╥
╦
╧
╨
╩
╪
╫
╬

Save it as unicode.txt, then run cat unicode.txt | uniq -c.

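The output is “33 ═”. It thinks I have 33 identical lines of “═” (the box-drawing double horizontal line), when the file actually has 3 lines of “═”, 3 lines of “║”, and 27 other lines that are all different. Idiotic unix.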
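A quick sanity check that these lines really differ at the byte level, so the merging is not caused by identical bytes. This is only a minimal sketch: hex_of is a helper name made up here, and it assumes the text is UTF-8 and that od and tr are available.

```shell
# hex_of (hypothetical helper): print the bytes of its argument
# as lowercase hex digits with no separators.
hex_of() {
  printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

# The first two distinct characters of the test file, as UTF-8 bytes.
hex_of '═'; echo    # e29590  (U+2550)
hex_of '║'; echo    # e29591  (U+2551)
```

The two characters differ in their last byte, so a plain byte-by-byte comparison would never treat them as the same line.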

◆ uniq --version
uniq (GNU coreutils) 8.13
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Richard M. Stallman and David MacKenzie.

The man page doesn't mention anything about Unicode. Here's my locale setting anyhow.

◆ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

I think that since about 2005, unix utils have been frantically patched to become Unicode compatible. It looks like the state is still shitty.

A related problem is grep. 〔see Problems of grep in Emacs〕 I thought the problem came from the complexity of the emacs + cygwin layers and their environment variables. But now I know: it's unix!

I am not sure how many unix utils still have Unicode problems.

More About Uniq Unicode Bug

In a discussion at https://plus.google.com/113859563190964307534/posts/8U1vH89aBm8, [Dario Bertini https://plus.google.com/103384186109211000716/posts] gave great info about the bug. It appears the GNU people don't consider it a bug. He wrote:

I looked at uniq.c (from coreutils) and linebuffer.c (from gnulib) but I couldn't easily see the bug… the code is obviously unicode-oblivious, but in this case it shouldn't matter (you can just check byte-by-byte)

fedora has a huge patch for coreutils i18n... it covers also uniq.c (it uses wchar, so I'd expect it to still have bugs under Cygwin or other environments) http://pkgs.fedoraproject.org/cgit/coreutils.git/tree/coreutils-i18n.patch?id=6e10f376996b64f538259091a524df2249b653fb;id2=HEAD

if you want to report it as a bug, this is the email: bug-coreutils@gnu.org

but, they will probably ignore it, as happened before: http://lists.gnu.org/archive/html/bug-coreutils/2012-07/msg00072.html

having LC_ALL set to "C" seems to workaround it

cat /tmp/unicod.txt | env LC_ALL=C uniq -c
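The workaround can be tried on a small sample, as sketched below. The helper name count_groups_c is made up for illustration, and the sample uses only the first few characters of the test file. With LC_ALL=C, uniq compares raw bytes, so each distinct character should get its own group.

```shell
# count_groups_c (hypothetical helper): number of groups that
# uniq -c reports for a file when forced into the C locale.
count_groups_c() {
  LC_ALL=C uniq -c "$1" | wc -l | tr -d ' '
}

tmp=$(mktemp)
# 9 lines, 5 distinct characters (first part of the test file).
printf '%s\n' ═ ═ ═ ║ ║ ║ ╒ ╓ ╔ > "$tmp"
count_groups_c "$tmp"    # 5
rm -f "$tmp"
```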

Here's the quote from that mailing list reply:

Remember, 'uniq' is required by POSIX to use the same line comparison techniques as 'sort'; and 'sort' is required to use strcoll() (not strcmp) to compare lines. And in your particular choice of locale, strcoll() happens to state that '∨' and '∧' collate identically; hence uniq is correct in stating that you have a duplicated line according to your current locale.

[from http://lists.gnu.org/archive/html/bug-coreutils/2012-07/msg00072.html]
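To see whether a given system is affected, you can compare what uniq does under the current locale with what it does under the C locale. This is a rough sketch, not a definitive test: locale_merges_lines is a made-up helper name, and the result depends on the installed uniq and on the locale's collation data.

```shell
# locale_merges_lines (hypothetical helper): succeeds (exit 0) when uniq
# under the current locale reports fewer groups than uniq under the
# C locale, i.e. when it merges lines that differ byte-wise.
locale_merges_lines() {
  n_locale=$(uniq "$1" | wc -l | tr -d ' ')
  n_bytes=$(LC_ALL=C uniq "$1" | wc -l | tr -d ' ')
  [ "$n_locale" -lt "$n_bytes" ]
}

tmp=$(mktemp)
printf '%s\n' ═ ║ ╒ > "$tmp"
if locale_merges_lines "$tmp"; then
  echo "current locale merges lines that differ byte-wise"
else
  echo "current locale agrees with byte comparison"
fi
rm -f "$tmp"
```

Byte-identical lines always compare equal, so the locale count can only be less than or equal to the C locale count, never greater.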

Here's a StackOverflow question on this: http://stackoverflow.com/questions/20226851/how-do-locales-work-in-linux-posix-and-what-transformations-are-applied

BSD, OS X versions of uniq

Also, hongjiang_wang [http://weibo.com/woodcafe 2013-05-08] tells me that on Mac it works fine. (Some of Mac's shell tools are from BSD, some are from GNU.)