Google Code Prettify and Ampersand Encoding
When you write blog posts that contain programming language source code and you want to syntax color the code, there are 2 solutions. One is the JavaScript based Google Code Prettify. The other is span tags hardcoded in the html.
Here's my experience of using both.
I've been stuck on a programming problem for the last 2 days.
Recently I experimented with using Google Code Prettify instead of span tags to color code in html. See this video.
But now I've realized a big, subtle problem. Not sure I'm going to use it anymore.
It's logically impossible to determine, from the text alone, whether it is ampersand encoded or decoded.
You might say, if the text contains &amp;, then it is encoded.
Wrong!
E.g. if we have
str.replace("&", "&amp;")
That's not encoded.
Encoded version would be:
str.replace("&amp;", "&amp;amp;")
The fact that there is no simple logical way to determine if text is ampersand encoded or decoded is a problem, because if the ampersand gets encoded twice, the code is screwed. It won't run correctly anymore.
Why would it happen twice? Because when you put lots of source code in html on the web and you constantly edit it, i.e. a decode, edit, encode cycle, it's very easy to make the mistake of encoding twice or decoding twice over time.
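Here's a minimal sketch of the ambiguity, using Python's standard `html` module as a stand-in for whatever tool does the encoding. The variable names are just for illustration.

```python
import html

# Raw source code, exactly as the programmer would type it.
raw_source = 'str.replace("&", "&amp;")'

# Encode once for embedding in html (quote=False leaves " alone).
encoded_once = html.escape(raw_source, quote=False)

# Accidentally encode a second time during an edit cycle.
encoded_twice = html.escape(encoded_once, quote=False)

# Decoding the once-encoded text restores the source.
restored = html.unescape(encoded_once)

# Decoding the twice-encoded text gives back the once-encoded text,
# so the page would display the wrong code.
damaged = html.unescape(encoded_twice)

# The raw source already contains "&amp;", so a tool that guesses
# "it must be encoded" and decodes it produces different code.
over_decoded = html.unescape(raw_source)

print(encoded_once)
print(over_decoded)
```

Both `raw_source` and `over_decoded` are plausible code, so nothing in the text itself says which state you are looking at.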
Note, in html4 and xml, all
< > &
must be encoded.
In html5, the rule is relaxed. If such a char is surrounded by spaces, then it's ok, no need to encode it as an entity. (The actual rule is more complex than this.)
But why not just leave
< > &
as is, without encoding at all?
That's a problem, because suppose your code processes html tags.
Code that processes html is actually quite common.
So if you are writing a programming tutorial or blog, you may have lots of code snippets containing lots of html tags in strings or regexes.
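A rough sketch of what goes wrong, using Python's standard `html.parser` to stand in for a browser. The snippet and the `TagCollector` class are made up for illustration.

```python
import html
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record the name of every start tag the parser sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def start_tags(page: str) -> list:
    collector = TagCollector()
    collector.feed(page)
    collector.close()
    return collector.tags


# A code snippet that processes html, so it contains tags in strings.
snippet = 're.sub("<b>", "<strong>", text)'

# Pasted into the page as is: the tags inside the strings become real tags.
unencoded_page = "<pre>" + snippet + "</pre>"

# Ampersand encoded first: only the pre tag is seen as markup.
encoded_page = "<pre>" + html.escape(snippet, quote=False) + "</pre>"

print(start_tags(unencoded_page))
print(start_tags(encoded_page))
```

In the unencoded page, the parser treats the `b` and `strong` inside the string literals as markup, which is why the rest of the page breaks.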
So, at this point, the advantage of using JavaScript to syntax color the code in your blog is half gone. You still need to ampersand encode your code. And if it's web dev code (lots of
< > &
), your code is unreadable without decoding. Same as when using span tags to syntax color.
Using span tags to syntax color code on the web now has an advantage, i.e. it's easy to determine the encoded/decoded state (by simply checking for the existence of <span>). So, it's not prone to the error of encoding or decoding twice.
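A minimal sketch of that check. The function names and the `<span class=` marker are my assumptions for illustration, and it assumes the raw code does not itself contain that literal marker.

```python
import html
import re

# Hypothetical marker: colored code carries span tags with a class.
SPAN_MARKER = '<span class='


def is_span_colored(fragment: str) -> bool:
    """True if the fragment carries span markup, so it must be encoded."""
    return SPAN_MARKER in fragment


def to_plain_code(fragment: str) -> str:
    """Strip spans and decode entities, but only if the marker is present."""
    if not is_span_colored(fragment):
        return fragment  # already plain source, leave it alone
    without_spans = re.sub(r"</?span[^>]*>", "", fragment)
    return html.unescape(without_spans)


colored = '<span class="k">if</span> a &lt; b &amp;&amp; c'
plain = to_plain_code(colored)
print(plain)
```

Because the decode step refuses to run on text without the marker, calling `to_plain_code` twice does no extra damage.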
Here's an example of how unreadable code is after ampersand encoding, even if you are not using span tags for syntax coloring. Note, if you do not ampersand encode it, your html page is totally screwed after the code snippet.
This problem is a general problem of nesting. It happens with string escape sequences in programming languages. E.g. have you ever tried to grep for a string that comes from a regex pattern in a code snippet? Basically, it is impossible to figure out the backslash escape sequence.
That's why perl, php, ruby have here-doc, python has triple quotes, and golang has ``.
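A small sketch of the backslash nesting in Python, where raw strings remove one layer of escaping. The sample text is made up.

```python
import re

# Source text that contains a literal backslash followed by the letter n.
text = r'print("a\nb")'

# In a normal string, each backslash must be escaped for the string syntax,
# and then again for the regex syntax: 4 backslashes to match 1.
pattern_escaped = "\\\\n"

# In a raw string, only the regex layer of escaping remains.
pattern_raw = r"\\n"

match = re.search(pattern_raw, text)
print(pattern_escaped == pattern_raw)
print(match.group(0))
```

Both patterns are the same 3 characters once Python has read them. The raw form just makes that visible.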
Note, the problem can be solved trivially in XML by using
<![CDATA[like this]]>
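For instance, here's a minimal sketch with Python's standard `xml.etree.ElementTree`. The `snippet` element name is made up, and it assumes the code does not contain the `]]>` sequence.

```python
import xml.etree.ElementTree as ET

# Raw code full of < > & characters, with no encoding applied.
raw_code = 'if (a < b && c > d) { s = s.replace("&", "&amp;"); }'

# Wrap it in a CDATA section inside a hypothetical <snippet> element.
xml_doc = "<snippet><![CDATA[" + raw_code + "]]></snippet>"

# The parser hands back the content exactly as written.
parsed = ET.fromstring(xml_doc)
recovered = parsed.text
print(recovered)
```

The code inside CDATA stays readable in the document source, and there is no encoded/decoded state to lose track of.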
Unfortunately, CDATA does not work in html4 or html5. It's sad to be reminded of the entire coup against xml by wtfg html5 that was Apple and Google for $
see HTML Validation, History and Politics History of Web Tech
some related essays of the thread: Programing Language Design: String Syntax
Just took a 1 hour walk to think about a final decision on this problem that's been bugging me for a week. I decided not to use Google Code Prettify. It is possibly a worse solution than using span tags. Now I need to revert 1 day's work. It'll take half a day.