Unix Shell Util uniq Unicode Bug


Here's a bug in the GNU shell utility uniq on unix/linux.

Create a file of the following text:

═
═
═
║
║
║
╒
╓
╔
╕
╖
╗
╘
╙
╚
╛
╜
╝
╞
╟
╠
╡
╢
╣
╤
╥
╦
╧
╨
╩
╪
╫
╬

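Save it as unicode.txt, then do cat unicode.txt | uniq -c. You get “33 ═”. That is, uniq treats all 33 lines as repeats of the first one, when it should report 3 for ═, 3 for ║, and 1 for each of the other 27 characters. Idiotic unix.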
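The steps above can be scripted. Here's a minimal sketch, assuming a POSIX shell and a shortened 5-line version of the file (the shorter list is my assumption, just to keep it small; the full 33-line list behaves the same way in principle). What uniq prints depends on your coreutils version and locale, so no expected output is shown for the last command.

```shell
# Build a short test file: three copies of one box-drawing character,
# followed by two different ones (a shortened stand-in for the full list).
printf '%s\n' '═' '═' '═' '║' '╒' > unicode.txt

# Count runs of adjacent identical lines.
# A correct uniq reports three groups (3, 1, 1).
# An affected uniq in a UTF-8 locale lumps everything into one group.
uniq -c < unicode.txt
```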

◆ uniq --version
uniq (GNU coreutils) 8.13
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later .
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Richard M. Stallman and David MacKenzie.

The man page doesn't mention anything about Unicode. Here's my locale setting anyhow.

◆ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

I think, since about 2005, unix utils have been in a frenzy of patching, trying to be Unicode compatible. It looks like the state is still sh*tty.

A related problem is grep. 〔☛ Problems of Calling Unix grep in Emacs〕 I thought the problem came from the complexity of the emacs + cygwin + environment variable layers. But now I know, it's unix!

I'm not sure how many unix utils still have Unicode problems.

see also: Complexity & Tedium of Software Engineering

More About Uniq Unicode Bug

From a discussion on g+, Dario Bertini gave great info about the bug. It appears the GNU people don't consider it a bug.

I looked at uniq.c (from coreutils) and linebuffer.c (from gnulib) but I couldn't easily see the bug… the code is obviously unicode-oblivious, but in this case it shouldn't matter (you can just check byte-by-byte)
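To illustrate the "check byte-by-byte" idea in the comment above, here is a rough sketch of a uniq -c replacement that compares adjacent lines as plain byte strings. The function name byte_uniq_c is made up for this example, and this is not how GNU uniq is implemented. It just shows that byte comparison is enough to tell these characters apart.

```shell
# Hypothetical helper: count runs of adjacent identical lines,
# comparing them as raw bytes (LC_ALL=C makes awk treat input as bytes).
byte_uniq_c() {
  LC_ALL=C awk '
    # Appending "" forces a string comparison, so "1" and "1.0" stay distinct.
    NR > 1 && ($0 "") != (prev "") { print n, prev; n = 0 }
    { prev = $0; n++ }
    END { if (NR > 0) print n, prev }
  ' "$@"
}

# Usage: two copies of one character, then a different one.
printf '%s\n' '═' '═' '║' | byte_uniq_c
```

The output format here is just "count line" with a single space, not the padded column that uniq -c prints.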

fedora has a huge patch for coreutils i18n... it covers also uniq.c (it uses wchar, so I'd expect it to still have bugs under Cygwin or other environments) http://pkgs.fedoraproject.org/cgit/coreutils.git/tree/coreutils-i18n.patch?id=6e10f376996b64f538259091a524df2249b653fb;id2=HEAD

if you want to report it as a bug, this is the email: bug-coreutils@gnu.org

but, they will probably ignore it, as happened before: http://lists.gnu.org/archive/html/bug-coreutils/2012-07/msg00072.html

setting LC_ALL to "C" seems to work around it

cat /tmp/unicod.txt | env LC_ALL=C uniq -c
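The workaround above can be wrapped up so the locale override applies only to the one command. A small sketch: the file name /tmp/boxchars.txt and the function name count_runs are placeholders I made up. In the C locale lines are compared as raw bytes, so the three characters below come out as three separate groups.

```shell
# Hypothetical wrapper: run uniq -c in the C locale for this call only,
# leaving the rest of the shell session in its normal UTF-8 locale.
count_runs() {
  LC_ALL=C uniq -c "$@"
}

# Sample data: 2 x ═, 3 x ║, 1 x ╒
printf '%s\n' '═' '═' '║' '║' '║' '╒' > /tmp/boxchars.txt

# Reports counts 2, 3 and 1 (column padding differs between uniq versions).
count_runs /tmp/boxchars.txt
```

The same LC_ALL=C prefix works for a one-off command in a pipeline, as in the line quoted above.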

BSD, OS X versions of uniq

Also, hongjiang_wang woodcafe tells me that on Mac it works fine. (Some of Mac's shell tools are from BSD, some are from GNU.)
