On Unix Filename Characters Problem

By Xah Lee.

Someone wrote (paraphrased):

Sometimes i save documents to disk from the web.

I wish to embed the article title and URL in the saved filename.

➢ for example: if article titled “News for the next Century”
at http://www.example.com/news/something.html

i want to save it in the filename such as
“News for the next Century http://www.example.com/news/something.html”.
But the special chars there cause problems.

is there some general char transformation scheme, so that special
chars in URL and title of article are replaced by other chars and can
be used as a filename?

Hmmm.  Maybe “---” for “/”?
What about “:”?
And what about “~”?
Plus other chars I've not thought of?

P.S.: Oh, I forgot.  tar shouldn't barf on the name. 

What you want to do is pretty hopeless. Chars in a URL are confusing enough, with percent-encoding (such as %20 for space and %7E for ~), and when a URL is used in HTML as a link, there's another layer of encoding for the CDATA (such as &amp; for &). Depending on the browser, or whatever tool you are using, the URL you get may or may not have been processed to remove some of these encodings. The encoding spec itself is not crystal clear, and in practice lots of URIs out there are invalid anyway. (See: URL Percent Encoding and Unicode; URL Percent Encoding and Ampersand Char.)
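To make the two layers concrete, here is a minimal Python sketch using only the standard library. The function name decode_href and the sample link are made up for illustration. It assumes the input is an href value exactly as written in HTML source, undoes the HTML character references first, then the percent-encoding. It is meant for producing a human-readable form only: fully decoding a URL can change its meaning (for example, a %26 inside a query value becomes a bare &), so the result should not be fed back to a browser.

```python
import html
from urllib.parse import unquote


def decode_href(href_as_written_in_html: str) -> str:
    """Return a human-readable form of an href copied from HTML source."""
    # Layer 1: HTML character references, e.g. "&amp;" becomes "&".
    url = html.unescape(href_as_written_in_html)
    # Layer 2: percent-encoding, e.g. "%20" becomes " " and "%7E" becomes "~".
    return unquote(url)


if __name__ == "__main__":
    sample = "http://www.example.com/%7Euser/my%20file.html?a=1&amp;b=2"
    print(decode_href(sample))
```

Whether a given tool hands you the URL before or after either of these steps is exactly the uncertainty described above.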

Chars in file names are also confusing. Different file systems allow different char sets with different special char meanings, and each generation of file system changes things slightly. (For example, Windows has C:\\ and \ and if you are using Cygwin you also get / … Mac has : in OS 9 and / in OS X, and there's complex char transform magic underneath. Unix is the worst: in practice it just allows A to Z, 0 to 9, and underscore “_”, and not even space. If you have anything like = ( ) , ; ' " " # $ & - ~ etc, you can expect most shell tools to erase your disk.) [see What Characters Are Not Allowed in File Names?]
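For what it's worth, a transformation of the kind asked about can be sketched, though the result mostly shows why it is unattractive. The Python sketch below is an illustration only: the function names to_safe_name and from_safe_name and the “_XX” escape format are invented here, not any standard. It keeps ASCII letters and digits and replaces every other char, including “_” itself, with “_” followed by the two-digit hex value of each of its UTF-8 bytes, so the mapping can be reversed.

```python
def to_safe_name(text: str) -> str:
    """Escape everything except ASCII letters and digits as _XX (hex of UTF-8 bytes)."""
    out = []
    for ch in text:
        if ch.isascii() and ch.isalnum():
            out.append(ch)
        else:
            # "_" is not alphanumeric, so it is escaped too (as _5F);
            # that keeps every "_" in the output an unambiguous escape marker.
            out.append("".join(f"_{b:02X}" for b in ch.encode("utf-8")))
    return "".join(out)


def from_safe_name(name: str) -> str:
    """Reverse to_safe_name. Assumes the input was produced by to_safe_name."""
    raw = bytearray()
    i = 0
    while i < len(name):
        if name[i] == "_":
            raw.append(int(name[i + 1:i + 3], 16))
            i += 3
        else:
            raw.append(ord(name[i]))
            i += 1
    return raw.decode("utf-8")


if __name__ == "__main__":
    original = "News for the next Century http://www.example.com/news/something.html"
    encoded = to_safe_name(original)
    print(encoded)
    print(from_safe_name(encoded) == original)
```

With this scheme “News for the next Century” becomes “News_20for_20the_20next_20Century” and “http://” becomes “http_3A_2F_2F”. The names are safe for tar and shell tools but hard to read, and they grow quickly, which matters because many file systems cap the length of a single file name.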

The best thing to do is just to create a file and name it readme.txt, then in that file put in the URL, date, or keywords and annotation. That's what i do.
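The readme.txt approach is easy to automate. Below is a minimal Python sketch. The function name write_readme and the field labels (Title, URL, Saved, Notes) are my own choices for illustration, and it assumes each saved article gets its own folder with a plain, safe name.

```python
import os
import tempfile
from datetime import date


def write_readme(folder: str, title: str, url: str, notes: str = "") -> str:
    """Write readme.txt in folder, recording the title, URL, and today's date."""
    path = os.path.join(folder, "readme.txt")
    lines = [
        f"Title: {title}",
        f"URL: {url}",
        f"Saved: {date.today().isoformat()}",
    ]
    if notes:
        lines.append(f"Notes: {notes}")
    # The title and URL go inside the file unchanged, so no char in them
    # ever has to be legal in a file name.
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        readme = write_readme(
            folder,
            "News for the next Century",
            "http://www.example.com/news/something.html",
        )
        with open(readme, encoding="utf-8") as f:
            print(f.read())
```

The file name stays boring, and the awkward chars live in the file content, where nothing needs to be escaped.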

What Characters Are Not Allowed in Unix Filenames?

Nikolaj Schumacher wrote:

Actually unix systems allow pretty much every character except / and the null character.

Unix file names, for much of unix's history up to ~2005, effectively allowed just the letters (A to Z), digits (0 to 9), FULL STOP “.”, and LOW LINE “_”. By contrast, Mac file names often contain punctuation and symbols such as , $ # ! * ( ) and space, and also allow non-ASCII chars such as:

[see Mac Keyboard Viewer]

Some of these chars were widely used throughout the 1990s. For example, it was common to see folder names ending in “ƒ”.

ASCII punctuation chars and non-ASCII chars such as the above have also been allowed in file names on Windows since about Microsoft Windows NT in the 1990s. Tools on Mac OS (such as AppleScript) and Windows support and expect these chars in file names.

Sure, you can use any of $ % ^ @ - # | > < = etc chars in unix file names, but the system is simply not designed for it. The majority of unix tools, including file name listing (ls), will choke if your file name contains these chars. The choking doesn't give you a nice error message. Instead things silently break, often resulting in unexpected and unpredictable behavior (such as silently creating new files, or screwing up your display and rendering your terminal session unusable). In a unix shell, it's painful to quote or escape these chars correctly. [see Problem of Calling Unix grep in Emacs]
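As an illustration of the quoting burden, here is a small Python sketch. The file name AWKWARD_NAME and the helper names are invented for this example. It creates a file whose name has a leading “-”, spaces, and several shell metacharacters, without going through a shell at all, and then shows what the same name has to look like when it must be passed on a shell command line. It uses shlex.quote from the standard library for the quoting, and “--”, which by convention tells a tool that the options have ended.

```python
import os
import shlex
import tempfile

# A legal but awkward name: leading "-", spaces, and shell metacharacters.
AWKWARD_NAME = "-n $HOME; (a&b)='x' #1.txt"


def create_and_list(directory: str, name: str) -> list:
    """Create an empty-ish file via the OS API (no shell) and list the directory."""
    path = os.path.join(directory, name)
    with open(path, "w", encoding="utf-8") as f:
        f.write("test\n")
    return os.listdir(directory)


def shell_command_for(name: str) -> str:
    """Build an ls command line that is safe to paste into a POSIX shell."""
    # "--" stops option parsing so the leading "-" is not read as a flag;
    # shlex.quote protects spaces, "$", ";", "&", quotes, and so on.
    return "ls -l -- " + shlex.quote(name)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        print(create_and_list(directory, AWKWARD_NAME))
    print(shell_command_for(AWKWARD_NAME))
```

The file system accepts the name without complaint. The pain starts only when the name travels through a shell or through a tool that parses its arguments loosely, which is where both the quoting and the “--” are needed.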

Issues like these often perpetuate the myth that unix is powerful, but in fact it's just raw and no-design.