URL Percent Encoding and Unicode

By Xah Lee. Date: . Last updated: .

This page discusses which characters should be percent encoded in a URL, and how different browsers behave.

Browser Behavior

Here are some tests of browser behavior on URL encoding and decoding. Apparently, some browsers automatically decode parts of the percent encoding.

Copy this line:

http://en.wikipedia.org/wiki/Saint_Jerome_in_His_Study_(Dürer)

Then go to the browser, open a new tab, paste the line into the URL field, and press Enter to load the page.

Then select the URL field, copy the URL, and paste it into a text editor. Here are the results (on Windows browsers):


Now, try again, starting with this line:

http://en.wikipedia.org/wiki/Saint_Jerome_in_His_Study_%28D%C3%BCrer%29

result:


Another example. Start with:

http://en.wikipedia.org/wiki/Sylvester%E2%80%93Gallai_theorem

result:

All results are on Windows Vista, using the latest publicly released version of each browser as of .
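To see how the two forms of the test URLs relate, here is a minimal Python sketch. It assumes the percent-encoded bytes are UTF-8, which is the case for the Wikipedia URLs above. It uses only the standard library functions urllib.parse.quote and urllib.parse.unquote, and is not meant to reproduce what any particular browser does.

```python
from urllib.parse import quote, unquote

# The percent-encoded test URL from above.
encoded = "http://en.wikipedia.org/wiki/Saint_Jerome_in_His_Study_%28D%C3%BCrer%29"

# unquote turns each %XX into a byte, then decodes the bytes as UTF-8 (the default).
decoded = unquote(encoded)
print(decoded)  # http://en.wikipedia.org/wiki/Saint_Jerome_in_His_Study_(Dürer)

# The three bytes E2 80 93 in the second test URL are one character: EN DASH.
dash = unquote("%E2%80%93")
print(f"U+{ord(dash):04X}")  # U+2013

# Going the other way: quote encodes "(", ")" and "ü" by default.
print(quote("(Dürer)"))  # %28D%C3%BCrer%29
```

So a browser that shows the "decoded" form in the URL field is showing the same URL, just with the UTF-8 percent escapes turned back into characters.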

Summary

Here's a summary of the behavior.

Conclusion

Several issues are involved here.

First, just which characters in a URL need to be encoded?

A URL, by spec, is just a sequence of characters. Originally, it was supposed to allow only a subset of ASCII characters [see ASCII Table], and certain ASCII characters are not allowed. The disallowed characters are, roughly, those that the JavaScript function encodeURI percent-encodes [see JS: encodeURI].
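As a rough illustration of which characters get encoded, here is a Python sketch that approximates JavaScript's encodeURI. The function name encode_uri_like and the constant ENCODE_URI_SAFE are made up for this example. The safe list is the set of characters encodeURI leaves alone besides ASCII letters and digits. This is an approximation, not a port: edge cases such as lone surrogates are not handled identically.

```python
from urllib.parse import quote

# Characters that encodeURI leaves unescaped, besides A-Z a-z 0-9:
# reserved  ; , / ? : @ & = + $ #   and marks  - _ . ! ~ * ' ( )
ENCODE_URI_SAFE = ";,/?:@&=+$#-_.!~*'()"

def encode_uri_like(url: str) -> str:
    """Percent-encode every character outside the safe set, as UTF-8 bytes.

    Hypothetical helper for illustration; approximates JavaScript encodeURI.
    """
    return quote(url, safe=ENCODE_URI_SAFE, encoding="utf-8")

# Parentheses are kept, the non-ASCII letter is encoded.
print(encode_uri_like("http://en.wikipedia.org/wiki/Saint_Jerome_in_His_Study_(Dürer)"))

# Space, less-than, and a bare percent sign are all encoded.
print(encode_uri_like("a b<c%"))
```

Note that the parentheses stay as is here, while the fully encoded Wikipedia URL in the test above has them as %28 and %29. Both forms are accepted by the server.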

Now, what browsers do in the URL field arguably has nothing to do with the URL spec, because it is just a user interface display issue. Any necessary encoding can be done when the browser actually transmits the URL. This is why we see different behaviors across browsers.

Further, IRI (Internationalized Resource Identifier) has become popular since about 2010. An IRI is basically a URL that allows non-ASCII characters to appear as is, without percent encoding.
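The relation between an IRI and a plain ASCII URL can be sketched as follows. This is a simplified illustration: the hypothetical function iri_to_uri_sketch percent-encodes each non-ASCII character as UTF-8 and leaves every ASCII character untouched. It assumes the host name is already ASCII; a non-ASCII host name needs separate treatment, which is not shown here.

```python
from urllib.parse import quote

def iri_to_uri_sketch(iri: str) -> str:
    """Percent-encode only the non-ASCII characters of an IRI, as UTF-8.

    Hypothetical helper for illustration. ASCII characters, including
    existing %XX escapes, pass through unchanged.
    """
    return "".join(
        ch if ord(ch) < 128 else quote(ch, safe="", encoding="utf-8")
        for ch in iri
    )

iri = "http://en.wikipedia.org/wiki/Sylvester\u2013Gallai_theorem"  # contains EN DASH
print(iri_to_uri_sketch(iri))
# http://en.wikipedia.org/wiki/Sylvester%E2%80%93Gallai_theorem
```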

Furthermore, a separate issue adds to the confusion: if the URL is inside an HTML document, then the characters U+26: AMPERSAND, U+3C: LESS-THAN SIGN, and U+3E: GREATER-THAN SIGN all need to be encoded in another way, called ampersand encoding.

see URL Percent Encoding and Ampersand Char
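To show that ampersand encoding is a separate layer from percent encoding, here is a small Python sketch using the standard library's html.escape and html.unescape. The URL is a made-up example, not from any test above.

```python
import html

# A made-up URL with two query parameters, for illustration only.
url = "http://example.com/search?q=a&lang=en"

# Ampersand encoding for putting the URL inside an HTML attribute.
# html.escape rewrites & < > (and quotes, since quote=True) as entities.
in_html = html.escape(url, quote=True)
print(in_html)  # http://example.com/search?q=a&amp;lang=en

# The HTML parser undoes this before the URL is used.
assert html.unescape(in_html) == url

# Percent escapes are untouched by this layer: % is not special in HTML.
print(html.escape("D%C3%BCrer"))  # D%C3%BCrer
```

The percent encoding belongs to the URL itself, while the ampersand encoding belongs only to the HTML source text that carries the URL.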

Reference

URL Encoding

  1. URL Percent Encoding and Unicode
  2. URL Percent Encoding and Ampersand Char
  3. Semantic of Symbols: HTML Entities, Ampersand, Unicode