URL Percent Encoding and Unicode

By Xah Lee. Date: 2010-05-24. Last updated: 2023-02-11.

By spec, some characters in a URL must be percent encoded. For example, [] in a URL should be %5B%5D.

A browser's URL field does this encoding automatically when you paste a URL and visit the site. However, browsers do not agree on which characters should be encoded.

This page discusses the issue.

Browser Behavior of Percent Encoded URL

Copy this:

http://en.wikipedia.org/wiki/Saint_Jerome_in_His_Study_%28D%C3%BCrer%29

Paste it into the browser's URL field.

Result:

Study_(D%C3%BCrer) ← Google Chrome
Study_%28Dürer%29 ← Safari
Study_%28D%C3%BCrer%29 ← Firefox
Study_(Dürer) ← Opera
Study_%28D%C3%BCrer%29 ← IE

Another example. Start with:

http://en.wikipedia.org/wiki/Sylvester%E2%80%93Gallai_theorem

Result:

Sylvester%E2%80%93Gallai_theorem ← Google Chrome
Sylvester–Gallai_theorem ← Safari
Sylvester%E2%80%93Gallai_theorem ← Firefox
Sylvester–Gallai_theorem ← Opera
Sylvester%E2%80%93Gallai_theorem ← IE

All results are from Windows Vista, using the latest public release of each browser as of 2010-05-24.
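The different forms shown by the browsers are spellings of the same address. Here is a minimal sketch in JavaScript that checks this for the first test, using the standard function decodeURIComponent. The names variants, decoded, and allSame are made up for this illustration.

```javascript
// Path fragments as copied from the browsers in the first test above.
const variants = [
  "Study_(D%C3%BCrer)",      // Google Chrome
  "Study_%28Dürer%29",       // Safari
  "Study_%28D%C3%BCrer%29",  // Firefox, IE
  "Study_(Dürer)",           // Opera
];

// decodeURIComponent turns each %XX sequence back into the character it stands for.
// Characters that are not percent encoded pass through unchanged.
const decoded = variants.map((s) => decodeURIComponent(s));

// True when every decoded string equals the first one.
const allSame = decoded.every((s) => s === decoded[0]);

console.log(decoded[0]); // Study_(Dürer)
console.log(allSame);    // true
```

So the browsers differ only in how they display the URL, not in which page it names.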

Browser Behavior of URL Containing Unicode and Parentheses

Copy this line:

http://en.wikipedia.org/wiki/Saint_Jerome_in_His_Study_(Dürer)

Then go to the browser, open a new tab, paste the line into the URL field, and press Enter to load the page.

Then select the URL field, copy the URL, and paste it into a text editor. Here are the results (on Windows browsers):

Study_(D%C3%BCrer) ← Google Chrome
Study_(Dürer) ← Safari, Opera, Internet Explorer
Study_%28D%C3%BCrer%29 ← Firefox
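The %C3%BC in the results above is the UTF-8 encoding of ü (two bytes, C3 and BC), with each byte written as a percent sign followed by two hexadecimal digits. The following sketch shows the idea. The function percentEncodeAll is a hypothetical helper written for this page, not a built-in, and it encodes every character, including ones that a real encoder would leave alone.

```javascript
// percentEncodeAll is a hypothetical helper name, not a built-in.
// It percent encodes every byte of the UTF-8 encoding of a string.
function percentEncodeAll(text) {
  const bytes = new TextEncoder().encode(text); // TextEncoder always produces UTF-8
  return Array.from(
    bytes,
    (b) => "%" + b.toString(16).toUpperCase().padStart(2, "0")
  ).join("");
}

console.log(percentEncodeAll("ü")); // %C3%BC
console.log(percentEncodeAll("(")); // %28
console.log(percentEncodeAll(")")); // %29
console.log(percentEncodeAll("–")); // %E2%80%93 (en dash, U+2013)
```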

Summary of Browser Behavior on URL Percent Encoding

Here is a summary of the behavior.

Firefox (v 3.6.3) is the most aggressive in turning characters in the URL into percent encoded form.
Google Chrome (4.1.249.1064 (45376)) changes Unicode chars into percent encoded form, but not parentheses.
Safari (4.0.5 (531.22.7)) converts some percent encoded chars into plain Unicode chars, but not all.
Opera (v 10.10, build 1893) is the best: it shows Unicode, parentheses, and en dash as is.
IE (8.0.6001.18904) seems to do nothing to the URL. Whatever you paste in remains unchanged.

Conclusion

There are several issues here.

First, just which characters in a URL need to be encoded?

A URL, by spec, is just a sequence of characters. Originally, it was supposed to allow only a subset of ASCII characters, and certain ASCII characters are not allowed. The characters not allowed are those encoded by the JavaScript function encodeURI.
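To see which characters encodeURI encodes, here is a short sketch. The sample strings are chosen for illustration. For comparison, it also shows encodeURIComponent, which additionally encodes the URL delimiters.

```javascript
// encodeURI leaves letters, digits, and these characters alone:
//   ; / ? : @ & = + $ , #   - _ . ! ~ * ' ( )
// Everything else is replaced by its percent encoded UTF-8 bytes.
console.log(encodeURI("[]"));  // %5B%5D
console.log(encodeURI("a b")); // a%20b
console.log(encodeURI("http://en.wikipedia.org/wiki/Saint_Jerome_in_His_Study_(Dürer)"));
// http://en.wikipedia.org/wiki/Saint_Jerome_in_His_Study_(D%C3%BCrer)

// encodeURIComponent also encodes delimiters such as / ? =
console.log(encodeURIComponent("a/b?c=d")); // a%2Fb%3Fc%3Dd
```

Note that encodeURI leaves the parentheses alone, which is the same form Google Chrome showed in the tests above.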

Now, what browsers do in the URL field arguably has nothing to do with the URL spec, because it is just a user interface appearance issue. Any encoding, if necessary, can be done when the browser actually transmits the URL. This is why we see different behaviors in browsers.

Further, IRI (Internationalized Resource Identifier) became popular since about 2010. An IRI is basically a URL, except it allows non-ASCII characters to appear in the URL as is, without encoding.
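As a rough illustration of going between the two forms, the sketch below uses the standard URL class (available in browsers and in Node.js) to turn an address containing a non-ASCII character into its percent encoded form, and decodeURI to go back. The variable names iri and parsed are made up for this example, and it only covers the path part of the address, not internationalized domain names.

```javascript
// An address with a non-ASCII character in the path, written as is.
const iri = "http://en.wikipedia.org/wiki/Saint_Jerome_in_His_Study_(Dürer)";

// The URL class percent encodes non-ASCII characters in the path when parsing.
const parsed = new URL(iri);
console.log(parsed.href);
// http://en.wikipedia.org/wiki/Saint_Jerome_in_His_Study_(D%C3%BCrer)

// decodeURI turns the percent encoded bytes back into the readable form.
console.log(decodeURI(parsed.href));
// http://en.wikipedia.org/wiki/Saint_Jerome_in_His_Study_(Dürer)
```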

Furthermore, a separate issue adds to the confusion: if the URL is inside an HTML document, then the characters U+26: AMPERSAND, U+3C: LESS-THAN SIGN, and U+3E: GREATER-THAN SIGN all need to be encoded in another way, called ampersand encoding.

See URL Percent Encoding and Ampersand Char
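Here is a minimal sketch of ampersand encoding for the three characters named above. The function escapeForHtml is a hypothetical helper, not a built-in, and the example.com address is made up for illustration.

```javascript
// escapeForHtml is a hypothetical helper name, not a built-in.
// The ampersand is replaced first. Otherwise the & inside &lt; and &gt;
// would itself be encoded again.
function escapeForHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// A made-up URL with an ampersand in the query string.
const searchUrl = "http://example.com/search?q=cat&lang=en";
console.log(`<a href="${escapeForHtml(searchUrl)}">search</a>`);
// <a href="http://example.com/search?q=cat&amp;lang=en">search</a>
```

Ampersand encoding is separate from percent encoding: the browser undoes it when reading the HTML, before it looks at the URL.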

Reference

An Introduction to Multilingual Web Addresses, at http://www.w3.org/International/articles/idn-and-iri/
Internationalized Resource Identifier
XRI
Internationalized domain name
Punycode
