Linux Shell Util uniq Unicode Bug

By Xah Lee. Date:

Here's a bug in the GNU uniq shell utility on unix/linux.

Create a file of the following text:


Save it as unicode.txt, then run cat unicode.txt | uniq -c.

The output is “33 ═”. It thinks I have 33 identical lines of “═”. Idiotic unix.

◆ uniq --version
uniq (GNU coreutils) 8.13
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later .
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Richard M. Stallman and David MacKenzie.

The man page doesn't mention anything about Unicode. Here's my locale setting anyhow.

◆ locale

I think that since about 2005, unix utils have been frenziedly patched to become Unicode compatible. It looks like the state is still shitty.

A related problem is grep. 〔➤see Problems of Calling Unix grep in Emacs〕 I thought the problem came from the complexity of the emacs+cygwin layers and environment variables. But now I know, it's unix!

I'm not sure how many unix utils still have Unicode problems.

see also: Complexity & Tedium of Software Engineering

More About Uniq Unicode Bug

In a discussion on g+, Dario Bertini gave great info about the bug. It appears the GNU people don't consider it a bug.

I looked at uniq.c (from coreutils) and linebuffer.c (from gnulib) but I couldn't easily see the bug… the code is obviously unicode-oblivious, but in this case it shouldn't matter (you can just check byte-by-byte)

fedora has a huge patch for coreutils i18n... it covers also uniq.c (it uses wchar, so I'd expect it to still have bugs under Cygwin or other environments)

if you want to report it as a bug, this is the email:

but, they will probably ignore it, as happened before:

having LC_ALL set to "C" seems to workaround it

cat /tmp/unicod.txt | env LC_ALL=C uniq -c
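
Here is a minimal sketch of that workaround that can be run anywhere. The scratch file and its two-line contents are my own assumption for illustration, not the file from the original report. Setting LC_ALL=C on the command line affects only that one uniq call, so the rest of the shell session keeps its normal locale.

```shell
# Hypothetical test input: two different one-character lines.
tmpfile=$(mktemp)
printf '%s\n' '∨' '∧' > "$tmpfile"

# LC_ALL=C makes uniq compare lines byte by byte for this command only.
c_locale_count=$(LC_ALL=C uniq -c < "$tmpfile")

# Prints two lines, each with a count of 1, because the bytes differ.
printf '%s\n' "$c_locale_count"

rm -f "$tmpfile"
```

In the C locale the two lines are never merged, because their UTF-8 byte sequences are different.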

here's the quote:

Remember, 'uniq' is required by POSIX to use the same line comparison techniques as 'sort'; and 'sort' is required to use strcoll() (not strcmp) to compare lines. And in your particular choice of locale, strcoll() happens to state that '∨' and '∧' collate identically; hence uniq is correct in stating that you have a duplicated line according to your current locale.
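
One way to check whether your own system behaves as the quote describes is to run uniq twice on the same input, once in the current locale and once in the C locale, and compare how many lines survive. The sketch below assumes a helper name, compare_uniq, that I made up for illustration; it is not part of coreutils. The result for the current locale depends on your system's locale data, so I make no claim about what it prints there.

```shell
# compare_uniq is a hypothetical helper, not a coreutils command.
# $1 is a file path. It reports how many lines uniq keeps in the
# current locale versus the C locale (plain byte comparison).
compare_uniq() {
    current=$(uniq < "$1" | wc -l | tr -d ' ')
    bytewise=$(LC_ALL=C uniq < "$1" | wc -l | tr -d ' ')
    printf 'current locale: %s lines, C locale: %s lines\n' "$current" "$bytewise"
}

# Usage with a hypothetical scratch file holding the two characters
# named in the quote.
sample=$(mktemp)
printf '%s\n' '∨' '∧' > "$sample"
compare_uniq "$sample"
rm -f "$sample"
```

If the first number is smaller than the second, the locale's collation rules are treating distinct lines as equal, which is the behavior the quote calls correct.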


here's StackOverflow on this, dated

BSD, OS X versions of uniq

Also, hongjiang_wang woodcafe tells me that on Mac it works fine. (some of Mac's shell tools are from BSD, some are from GNU)