URL Percent Encoding and Unicode
By spec, some characters in a URL must be percent encoded.
For example, [] in a URL should be %5B%5D
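As a minimal sketch, JavaScript's built-in encodeURI reproduces the bracket example. The variable names here are only for illustration.

```javascript
// Percent-encode characters that may not appear literally in a URL.
// encodeURI is a standard JavaScript built-in; the variable names are illustrative.
const raw = "[]";
const encoded = encodeURI(raw);
console.log(encoded); // %5B%5D

// A non-ASCII character is encoded as its UTF-8 bytes, one %XX per byte.
const umlaut = encodeURI("ü");
console.log(umlaut); // %C3%BC
```

The second call shows where the %C3%BC in the Wikipedia URLs on this page comes from: it is the two-byte UTF-8 form of ü.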
A browser's URL field does this encoding automatically when you paste a URL and visit the site. However, different browsers do not agree on which characters should be encoded.
This page discusses this issue.
Browser Behavior of Percent Encoded URL
Copy this:
http://en.wikipedia.org/wiki/Saint_Jerome_in_His_Study_%28D%C3%BCrer%29
Paste it into the browser's URL field.
Result:
Study_(D%C3%BCrer) ← Google Chrome
Study_%28Dürer%29 ← Safari
Study_%28D%C3%BCrer%29 ← Firefox
Study_(Dürer) ← Opera
Study_%28D%C3%BCrer%29 ← IE
Another example. Start with:
http://en.wikipedia.org/wiki/Sylvester%E2%80%93Gallai_theorem
Result:
Sylvester%E2%80%93Gallai_theorem ← Google Chrome
Sylvester–Gallai_theorem ← Safari
Sylvester%E2%80%93Gallai_theorem ← Firefox
Sylvester–Gallai_theorem ← Opera
Sylvester%E2%80%93Gallai_theorem ← IE
All results are on Windows Vista, using the latest publicly released versions of the browsers as of 2010-05-24.
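The results look different, but they all denote the same address. A small sketch, assuming JavaScript's built-in decodeURIComponent is an acceptable stand-in for "fully decoded", shows that each distinct form from the first example decodes to the same string:

```javascript
// The distinct forms of the tail of the first URL, as copied from the URL fields.
const forms = [
  "Study_(D%C3%BCrer)",      // Google Chrome
  "Study_%28Dürer%29",       // Safari
  "Study_%28D%C3%BCrer%29",  // Firefox, IE
  "Study_(Dürer)",           // Opera
];

// decodeURIComponent turns every %XX sequence back into the character it stands for
// and leaves characters that are already plain untouched.
const decoded = forms.map((s) => decodeURIComponent(s));
console.log(decoded);

// All forms collapse to one string once decoded.
const allSame = decoded.every((s) => s === "Study_(Dürer)");
console.log(allSame); // true
```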
Browser Behavior of URL Containing Unicode and Parentheses
Copy this line:
http://en.wikipedia.org/wiki/Saint_Jerome_in_His_Study_(Dürer)
Then go to the browser, open a new tab, paste the line into the URL field, and press Enter to load the page.
Then select the URL field, copy the URL, and paste it into a text editor. Here are the results (on Windows browsers):
Study_(D%C3%BCrer) ← Google Chrome
Study_(Dürer) ← Safari, Opera, Internet Explorer
Study_%28D%C3%BCrer%29 ← Firefox
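The three results can be imitated with a rough JavaScript sketch. It only reproduces the strings shown; it is not how any browser is actually implemented, and the helper name encodeAll is made up for this example.

```javascript
const pasted = "Study_(Dürer)";

// Chrome-like result: non-ASCII is encoded, parentheses are left alone.
// The built-in encodeURI happens to behave this way.
const chromeLike = encodeURI(pasted);

// Safari/Opera/IE-like result: the text stays as pasted.
const asPasted = pasted;

// Firefox-like result: parentheses are encoded too.
// encodeAll is a hypothetical helper, not a built-in.
function encodeAll(s) {
  return encodeURI(s).replace(/[()]/g, (c) =>
    "%" + c.charCodeAt(0).toString(16).toUpperCase()
  );
}
const firefoxLike = encodeAll(pasted);

console.log(chromeLike);  // Study_(D%C3%BCrer)
console.log(asPasted);    // Study_(Dürer)
console.log(firefoxLike); // Study_%28D%C3%BCrer%29
```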
Summary of Browser's Behaviors on URL Percent Encoding
Here is a summary of the behavior.
- Firefox (v 3.6.3) is the most aggressive in turning characters in the URL into percent-encoded form.
- Google Chrome (4.1.249.1064 (45376)) changes Unicode characters into percent-encoded form, but not parentheses.
- Safari (4.0.5 (531.22.7)) converts some percent-encoded characters into plain Unicode characters, but not all: it decoded ü and the en dash, but left %28 and %29 as is.
- Opera (v 10.10, build 1893) is the best: it shows Unicode, parentheses, and the en dash as is.
- IE (8.0.6001.18904) seems to do nothing to the URL. Whatever you paste in remains unchanged.
Conclusion
There are several issues here.
First, just which characters in a URL need to be encoded?
A URL, by spec, is just a sequence of characters. Originally, it was supposed to allow only a subset of ASCII characters, and certain ASCII characters are not allowed. Roughly, the disallowed characters are those encoded by the JavaScript function encodeURI .
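To see concretely which printable ASCII characters encodeURI changes, one can loop over the range U+20 to U+7E. This is only a sketch: it reports what the JavaScript function does, which is not necessarily identical to what any particular URL spec requires.

```javascript
// Collect every printable ASCII character that encodeURI changes.
const escaped = [];
for (let code = 0x20; code <= 0x7e; code++) {
  const ch = String.fromCharCode(code);
  if (encodeURI(ch) !== ch) {
    escaped.push(ch);
  }
}

// Prints the escaped characters as one string (the first one is a space).
console.log(JSON.stringify(escaped.join("")));
console.log(escaped.length); // 13
```

The list includes the square brackets from the example at the top of this page, but not the parentheses, which is why encodeURI leaves (Dürer) with its parentheses intact.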
Now, what browsers do in the URL field arguably has nothing to do with the URL spec, because it is just a user interface issue. Any encoding, if necessary, can be done when the browser actually transmits the URL. This is why we see different behaviors among browsers.
Further, IRI (Internationalized Resource Identifier, specified in RFC 3987) became popular around 2010. An IRI is basically a URL that allows non-ASCII characters to appear as is, without encoding.
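The relation between the two forms can be sketched as a conversion from IRI to plain URL: each non-ASCII character is replaced by the percent-encoded bytes of its UTF-8 form. The function name iriToUri is made up for this example, and the sketch deliberately ignores host names (which use Punycode instead) and any ASCII characters that might also need encoding.

```javascript
// Hypothetical helper: percent-encode only the non-ASCII characters of an IRI.
function iriToUri(iri) {
  const encoder = new TextEncoder(); // always encodes to UTF-8
  let out = "";
  for (const ch of iri) { // for...of iterates by code point, not by UTF-16 unit
    if (ch.codePointAt(0) < 0x80) {
      out += ch; // ASCII: copied unchanged in this sketch
    } else {
      for (const byte of encoder.encode(ch)) {
        out += "%" + byte.toString(16).toUpperCase().padStart(2, "0");
      }
    }
  }
  return out;
}

const uri = iriToUri("http://en.wikipedia.org/wiki/Sylvester–Gallai_theorem");
console.log(uri); // http://en.wikipedia.org/wiki/Sylvester%E2%80%93Gallai_theorem
```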
Furthermore, a separate issue that adds to the confusion is that, if the URL is inside an HTML document, then the ampersand character (U+26: AMPERSAND), U+3C: LESS-THAN SIGN, and U+3E: GREATER-THAN SIGN all need to be encoded in another way, called ampersand encoding.
See URL Percent Encoding and Ampersand Char
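A minimal sketch of ampersand encoding, applied on top of a URL that is already percent encoded. The helper name ampersandEncode and the example.com URL are assumptions made up for illustration.

```javascript
// Hypothetical helper: escape the three characters for use inside HTML.
function ampersandEncode(s) {
  return s
    .replace(/&/g, "&amp;") // must run first, or it would re-escape the & in &lt; and &gt;
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Illustrative URL: the brackets are percent encoded, the & is still raw.
const url = "http://example.com/search?q=a%5Bb%5D&lang=en";
const href = ampersandEncode(url);
console.log(`<a href="${href}">link</a>`);
// <a href="http://example.com/search?q=a%5Bb%5D&amp;lang=en">link</a>
```

The two encodings are independent layers: percent encoding makes the URL valid as a URL, and ampersand encoding makes that URL safe to embed in HTML source.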
Reference
- An Introduction to Multilingual Web Addresses. At http://www.w3.org/International/articles/idn-and-iri/
- Internationalized Resource Identifier. At https://en.wikipedia.org/wiki/Internationalized_Resource_Identifier
- XRI. At https://en.wikipedia.org/wiki/XRI
- Internationalized domain name. At https://en.wikipedia.org/wiki/Internationalized_domain_name
- Punycode. At https://en.wikipedia.org/wiki/Punycode