On Unix Filename Characters Problem
On , someone wrote (paraphrased):
Sometimes I save documents to disk from the web. I wish to embed the article title and URL in the saved filename. For example: if an article is titled “News for the next Century” at http://www.example.com/news/something.html, I want to save it under a filename such as “News for the next Century http://www.example.com/news/something.html”. But the special chars there cause problems. Is there some general char transformation scheme, so that special chars in the URL and title of the article are replaced by other chars and the result can be used as a filename? Hmmm. Maybe “---” for “/”? What about “:”? And what about “~”? Plus other chars I've not thought of? P.S.: Oh, I forgot. tar shouldn't barf on the name.
What you want to do is pretty hopeless. Characters in a URL are confusing
enough, with percent-encoding (such as %20 for space and %7E for ~), and
when the URL is used as a link in HTML there is another layer of encoding
for CDATA (such as &amp; for &). Depending on the browser, or whatever
tool you are using, the URL you get may or may not have been processed to
remove some of these encodings. The encoding spec itself is not crystal
clear, and in practice lots of URIs are actually invalid anyway.
(See: URL Percent Encoding and Unicode • URL Percent Encoding and Ampersand Char.)
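To make the two layers concrete, here is a minimal Python sketch using only the standard library. The function name `decode_href` and the sample URL are made up for illustration; the point is only that the HTML layer and the percent-encoding layer are undone separately, and which of them a browser or tool has already undone for you is not something you can assume.

```python
import html
from urllib.parse import quote, unquote

def decode_href(href_as_written: str) -> tuple[str, str]:
    """Undo the two encoding layers of a link copied from HTML source.

    Returns (url, readable): the URL after removing the HTML layer,
    and the same URL after also removing percent-encoding.
    """
    # Layer 1: HTML character references, e.g. "&amp;" stands for "&".
    url = html.unescape(href_as_written)
    # Layer 2: percent-encoding, e.g. "%20" stands for space, "%7E" for "~".
    readable = unquote(url)
    return url, readable

# Hypothetical link text as it might appear inside an href attribute.
url, readable = decode_href("http://www.example.com/search?q=a%20b&amp;page=2")
print(url)
print(readable)

# Going the other way: with safe="" every reserved character is encoded.
print(quote("a b/c", safe=""))
```

Note that the fully decoded form is not necessarily a valid URL any more (it contains a literal space), which is one reason the string you end up with varies from tool to tool.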
Characters in file names are also confusing. Different file systems
allow different character sets with different special meanings, and
each generation of file system changes things slightly. (For example,
Windows has \ and if you are using Cygwin you also get / … Mac has : in
OS 9 and / in OS X, and there's complex character transform magic
underneath. Unix is the worst: in practice it allows just
A to Z, 0 to 9, and underscore “_”, and not even space. If you have anything like
= ( ) , ; ' " " # $ & - ~
etc., you can expect most shell tools to erase your disk.)
[see What Characters Are Not Allowed in File Names?]
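If you insist on a filename anyway, about the only scheme that survives every system is a lossy one: throw away everything outside the conservative set. The sketch below is an illustration, not a recommendation. The function name `conservative_name`, the choice of “_” as the replacement, and the 200-character cap are all assumptions of this example, and the original title and URL cannot be recovered from the result.

```python
import re

def conservative_name(title: str, url: str, max_len: int = 200) -> str:
    """Squeeze a title and URL into letters, digits, '.' and '_' only (lossy)."""
    raw = f"{title} {url}"
    # Replace every run of characters outside the conservative set with one "_".
    name = re.sub(r"[^A-Za-z0-9._]+", "_", raw)
    # Cap the length, then trim "." and "_" from both ends so the result
    # is neither a hidden file nor a name ending in a stray separator.
    name = name[:max_len].strip("._")
    return name or "untitled"

print(conservative_name(
    "News for the next Century",
    "http://www.example.com/news/something.html",
))
# → News_for_the_next_Century_http_www.example.com_news_something.html
```

Two different URLs can collapse to the same name under this scheme, which is part of why it does not really solve the problem.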
The best thing to do is just to create a file and name it
readme.txt, then in that file put in the URL, date, or keywords and
annotation. That's what I do.
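The readme.txt approach is easy to automate. Here is a minimal Python sketch, assuming one folder per saved article; the function name `write_readme` and the “title: / url: / saved: / notes:” layout are invented for this example and are not any standard format.

```python
from datetime import date
from pathlib import Path

def write_readme(folder: Path, url: str, title: str, notes: str = "") -> Path:
    """Append the URL, title, date and optional notes to folder/readme.txt."""
    folder.mkdir(parents=True, exist_ok=True)
    readme = folder / "readme.txt"
    lines = [
        f"title: {title}",
        f"url: {url}",
        f"saved: {date.today().isoformat()}",
    ]
    if notes:
        lines.append(f"notes: {notes}")
    # Append rather than overwrite, so one folder can describe several files.
    with readme.open("a", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n\n")
    return readme
```

The saved document itself can then keep whatever plain name the browser gave it, and the URL is stored exactly as copied, with no character transformation at all.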
What Characters Are Not Allowed in Unix Filenames?
Nikolaj Schumacher wrote:
Actually unix systems allow pretty much every character except / and the null character.
Unix file names, for much of their history up to ~2005, effectively allowed just letters (A to Z), digits (0 to 9), FULL STOP ., and LOW LINE _. By contrast, Mac file names often contain punctuation and symbols such as , $ # ! * ( ) and space, and also allow non-ASCII characters such as:
- euro lang chars: ç ö é
- euro lang punctuations: « » ¡
- printer's symbols: † ‡ °
- common symbols: ™ ® © £ ¢
- math symbols: ∫ µ ∂ ƒ π ∞ ≤ ≥ ≈ ≠
Some of these chars were widely used throughout the 1990s. For example, it was common to see folder names ending in “ƒ”.
ASCII punctuation and non-ASCII chars such as the above have also been allowed in Windows file names since about Microsoft Windows NT in the 1990s. Tools on Mac OS (such as AppleScript) and on Windows support and expect these chars in file names.
Sure, you can use any of $ % ^ @ - # | > < = etc. in Unix file names, but the system is simply not designed for it. The majority of Unix tools, including the file name lister (ls), will choke if your file name contains these chars. The choking doesn't give you a nice error message. Things break silently, often with unexpected and unpredictable results (such as silently creating new files, or screwing up your display and rendering your terminal session unusable). In a Unix shell, it's painful to quote or escape these chars correctly.
[see Problem of Calling Unix grep in Emacs]
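As an illustration of the quoting pain, here is a small Python sketch, assuming a Unix-like system with `cat` available. The helper names `shell_command_for` and `read_without_shell` are made up for this example. It shows two common workarounds: quoting the name with `shlex.quote` before building a shell command string, or skipping the shell entirely by passing an argument list.

```python
import shlex
import subprocess

def shell_command_for(name: str) -> str:
    """Build a shell command string that is safe to hand to sh -c."""
    # shlex.quote wraps the name in single quotes and escapes embedded ones.
    # "--" tells cat that what follows is a file name, not an option.
    return "cat -- " + shlex.quote(name)

def read_without_shell(path: str) -> str:
    """Run cat with an argument list, so no shell ever parses the name."""
    result = subprocess.run(
        ["cat", "--", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

print(shell_command_for("it's a $file; name"))
```

Even with these helpers, every script that touches the file has to remember to use them, which is the real cost of putting such chars in a file name.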
Issues like these often perpetuate the myth that unix is powerful, but in fact it's just raw and no-design.