URL Percent Encoding and Unicode

By Xah Lee.

By spec, some characters in a URL must be percent-encoded. For example, [] in a URL should be written as %5B%5D.
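As a quick illustration of the transformation, here is a minimal Python sketch using the standard library's `urllib.parse.quote`. The helper name `percent_encode` is made up for this example. Passing `safe=""` makes it encode everything except ASCII letters, digits, and `_.-~`, which is stricter than what any particular browser does.

```python
from urllib.parse import quote

def percent_encode(text: str) -> str:
    """Percent-encode text: non-ASCII characters are first turned into
    UTF-8 bytes, then each byte is written as %XX."""
    # safe="" means no extra characters are exempt from encoding.
    return quote(text, safe="")

print(percent_encode("[]"))  # %5B%5D
print(percent_encode("ü"))   # %C3%BC
```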

A browser's URL field does this encoding automatically when you paste a URL and visit the site. However, different browsers do not agree on which characters should be encoded.

This page discusses the issue.

Browser Behavior of Percent Encoded URL

Copy this:

http://en.wikipedia.org/wiki/Saint_Jerome_in_His_Study_%28D%C3%BCrer%29

Paste it into the browser's URL field.

Result:


Another example. Start with:

http://en.wikipedia.org/wiki/Sylvester%E2%80%93Gallai_theorem

Result:

All results are from Windows Vista, using the latest publicly released versions of the browsers as of 2010-05-24.
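For reference, the two percent-encoded URLs above can be decoded to see which characters they stand for. This is a small Python sketch using the standard library's `urllib.parse.unquote`. The helper name `decode_url` is an assumption for illustration, and the output says nothing about what any particular browser displays.

```python
from urllib.parse import unquote

urls = [
    "http://en.wikipedia.org/wiki/Saint_Jerome_in_His_Study_%28D%C3%BCrer%29",
    "http://en.wikipedia.org/wiki/Sylvester%E2%80%93Gallai_theorem",
]

def decode_url(url: str) -> str:
    """Replace each %XX sequence with its byte, then decode the bytes as UTF-8."""
    return unquote(url)

for url in urls:
    print(decode_url(url))
# http://en.wikipedia.org/wiki/Saint_Jerome_in_His_Study_(Dürer)
# http://en.wikipedia.org/wiki/Sylvester–Gallai_theorem
```

In the first URL, %28 and %29 are the parentheses and %C3%BC is the two-byte UTF-8 form of ü. In the second, %E2%80%93 is the three-byte UTF-8 form of U+2013: EN DASH.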

Browser Behavior of URL Containing Unicode and Parentheses

Copy this line:

http://en.wikipedia.org/wiki/Saint_Jerome_in_His_Study_(Dürer)

Then go to the browser, open a new tab, paste the line into the URL field, and press Enter to load the page.

Then select the URL field and copy the URL. Paste it into a text editor. Here are the results (on Windows browsers):

Summary of Browser's Behaviors on URL Percent Encoding

Here is a summary of the behavior.

Conclusion

Several issues are involved here.

First, just which characters in a URL need to be encoded?

A URL, by spec, is just a sequence of characters. Originally, it was supposed to allow only a subset of ASCII characters, with certain ASCII characters disallowed. The disallowed characters are those that the JavaScript function encodeURI encodes.
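To make that character set concrete, here is a rough Python approximation of what encodeURI does. It is a sketch, not a port: the name `encode_uri_like` and the constant `ENCODE_URI_SAFE` are made up for this example. The list of characters left alone follows the ECMAScript definition of encodeURI (ASCII letters, digits, and `;,/?:@&=+$-_.!~*'()#`).

```python
from urllib.parse import quote

# Punctuation that encodeURI leaves unencoded, per the ECMAScript spec.
# ASCII letters and digits are always left alone by quote().
ENCODE_URI_SAFE = ";,/?:@&=+$-_.!~*'()#"

def encode_uri_like(url: str) -> str:
    """Approximate JavaScript's encodeURI: percent-encode (as UTF-8)
    every character that is not a letter, digit, or in ENCODE_URI_SAFE."""
    return quote(url, safe=ENCODE_URI_SAFE)

print(encode_uri_like("http://en.wikipedia.org/wiki/Saint_Jerome_in_His_Study_(Dürer)"))
# http://en.wikipedia.org/wiki/Saint_Jerome_in_His_Study_(D%C3%BCrer)
print(encode_uri_like("[]"))
# %5B%5D
```

Note that under this rule the parentheses stay as they are while ü and the square brackets get encoded.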

Now, what browsers do in the URL field arguably has nothing to do with the URL spec, because it is just a user interface issue. Any necessary encoding can be done when the browser actually transmits the URL. This is why we see different behaviors in different browsers.

Further, IRI (Internationalized Resource Identifier) became popular from around 2010. An IRI is basically a URL, except that it allows non-ASCII characters to appear as is, without encoding.
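The relationship between the two forms can be sketched in Python: to turn an IRI into a plain ASCII URL, encode only the non-ASCII characters and leave the rest alone. The function name `iri_to_uri` is an assumption for this example, and it is a simplification that treats the whole string the same way. A real conversion handles the host name separately.

```python
from urllib.parse import quote

def iri_to_uri(iri: str) -> str:
    """Percent-encode only the non-ASCII characters of an IRI (as UTF-8).
    Every ASCII character, including ( ) [ ] and %, is left untouched."""
    return "".join(ch if ord(ch) < 128 else quote(ch, safe="") for ch in iri)

print(iri_to_uri("http://en.wikipedia.org/wiki/Saint_Jerome_in_His_Study_(Dürer)"))
# http://en.wikipedia.org/wiki/Saint_Jerome_in_His_Study_(D%C3%BCrer)
```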

Furthermore, a separate issue adds to the confusion: if the URL is inside an HTML document, then the characters U+26: AMPERSAND, U+3C: LESS-THAN SIGN, and U+3E: GREATER-THAN SIGN all need to be encoded in another way, called ampersand encoding (HTML character entities such as &amp;).

See: URL Percent Encoding and Ampersand Char
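As a small illustration of ampersand encoding, here is a Python sketch using the standard library's `html.escape`. The helper name `url_for_html` and the example.com URL are hypothetical. This step is independent of percent encoding and is applied only when the URL is written into HTML source.

```python
from html import escape

def url_for_html(url: str) -> str:
    """Escape & < > (and quote characters) so the URL can be placed
    inside an HTML attribute such as href."""
    return escape(url, quote=True)

print(url_for_html("http://example.com/search?a=1&b=<2>"))
# http://example.com/search?a=1&amp;b=&lt;2&gt;
```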

