Syntax Coloring with Google-Code-Prettify

By Xah Lee.

This page gives some examples using Google-code-prettify technology, and evaluates its quality.

Google-code-prettify (GCP) is a JavaScript library that syntax-colors computer language source code in HTML. It does the job on the fly, in the browser.

See https://github.com/google/code-prettify

Basic Use

It is very easy to use. All you have to do is to download the JavaScript files. Then, in your web page, add these lines:

<link rel="stylesheet" type="text/css" href="gcp/prettify.css">
<script src="gcp/prettify.js"></script>
<body onload="prettyPrint()">

Then, in places you want your source code to be colored, wrap it with a “pre” tag like this:

<pre class="prettyprint">
x = 1+1;
# something something
</pre>

Advantages:

Disadvantages:

Examples

Here are some examples with different languages. For comparison, each example also has a version marked up with HTML span tags, generated by the htmlize elisp package in Emacs. The GCP version used here is “small-21-Jul-2010”.

Java:

Python:

Perl:

Emacs Lisp:

Conclusion

Google-code-prettify is suitable for a small number of lines or for non-critical writing such as wikis or blogs. It is probably not the right tool for large amounts of code that need more precision (for example: computer language documentation, tech books).

For code over two hundred lines, it also takes a second to load. In comparison, bulky HTML with span markup, which has more coloring and correct syntax, still renders instantaneously.

More Comments

It appears that a simple JavaScript-based parser that syntax-colors a number of languages on the fly is too good to be true. Note that a syntax coloring algorithm is usually specific to a language. When an editor syntax-colors Java code, it runs code that deals with Java syntax; when it colors Python, it calls code that deals with Python syntax; and for C#, F#, HTML, CSS, LaTeX, etc., there is code that deals with that particular language's syntax. Some programmer's editors have one generic syntax coloring module that reads in a language-specific syntax file. Either way, the program knows the special keywords of a particular language and their roles, and can thus color each one according to its role. It doesn't matter how it is implemented; the point is that each language is dealt with specifically.

In contrast, GCP is a generic colorer that attempts to deal with all languages, with some special code that acts as a helper for languages whose syntax is sufficiently different from C-like languages. This generic approach seems magical, but so far GCP's results do not come anywhere close to those of the language-specific approach.
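To make the "one generic module plus a language-specific syntax table" idea concrete, here is a minimal sketch in Python. It is not how GCP or any real editor is implemented. The LANGUAGES table, the colorize function, and the span class names are all made up for illustration, and the tables are far smaller than a real syntax file. It also ignores string literals, so a comment marker inside a string would be miscolored.

```python
import html
import re

# Hypothetical per-language tables (illustration only; real syntax files
# list many more keywords and also describe strings, numbers, etc.).
LANGUAGES = {
    "python": {
        "keywords": {"def", "return", "if", "else", "import"},
        "comment": "#",
    },
    "java": {
        "keywords": {"class", "public", "static", "void", "return", "if", "else"},
        "comment": "//",
    },
}


def colorize(source: str, lang: str) -> str:
    """Return source as HTML, with keywords and line comments wrapped in spans."""
    spec = LANGUAGES[lang]
    out = []
    for line in source.splitlines():
        # Everything from the comment marker to end of line is a comment.
        code, marker, comment = line.partition(spec["comment"])
        # Split the code part into word and non-word chunks, keeping both.
        for chunk in re.split(r"(\w+)", code):
            text = html.escape(chunk, quote=False)
            if chunk in spec["keywords"]:
                out.append('<span class="kwd">' + text + "</span>")
            else:
                out.append(text)
        if marker:
            escaped = html.escape(marker + comment, quote=False)
            out.append('<span class="com">' + escaped + "</span>")
        out.append("\n")
    return "".join(out)


line = "if x<y: # cmp"
print(colorize(line, "python"))
print(colorize(line, "java"))
```

The same input line is colored differently depending on which table is used: with the Python table the trailing `# cmp` is marked as a comment, while with the Java table it is left as plain text. That difference is the language-specific knowledge a purely generic colorer has to guess at.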

Another notable thing about GCP is that it uses dynamic HTML technology to color text in HTML. In particular, it seems to me, GCP does not read in the text and replace it with an HTML marked-up version for your browser to render. I'm not exactly sure how GCP works technically, but this approach seems much simpler and advantageous. However, this approach also has severe problems. When your code contains chars such as < >, they cause a lot of problems in browsers when they are not encoded as &lt; &gt;. If you take the time to pre-process your source code to encode these chars before putting it in HTML with GCP, then GCP loses its major advantage of not requiring a pre-processing step before pasting the code into an HTML page.

When you have text such as x<y (without space in between), the “less than” char MUST be encoded, else it is not valid HTML. This issue causes practical problems too. For example, languages such as Perl, PHP, and Python use regex heavily, and often these regexes parse URLs or HTML tags. For example:
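The pre-processing step mentioned above can be quite small. Here is a minimal sketch in Python, using the standard library's html.escape. The helper name to_prettyprint_block is made up for illustration and is not part of GCP.

```python
import html


def to_prettyprint_block(source: str) -> str:
    """Encode HTML-special chars in source and wrap it in a prettyprint pre tag."""
    # html.escape turns & < > into &amp; &lt; &gt;
    # quote=False leaves " and ' alone, since the text is not inside an attribute.
    escaped = html.escape(source, quote=False)
    return '<pre class="prettyprint">\n' + escaped + "\n</pre>"


print(to_prettyprint_block("if (x<y) {print 5;}"))
```

The output can be pasted into a page as is. The catch is the one already described: once you need a script like this, the "just paste your code" convenience is gone.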

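# Python

import re

# HTML fragment containing an image tag whose file extension we want to change
text = r'''<p>look at this <img src="./some.gif" width="30" height="20"> ...</p>'''
# replace the .gif extension in the src attribute with .png
new = re.sub(r'src\s*=\s*"([^"]+)\.gif"', r'src="\1.png"', text)
print(new)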

If you don't encode those < >, browsers will freak out. Because GCP does not change the text, these regexes are passed to the browser as is, and the browser will freak out when it encounters the raw < > in them, resulting in broken links or missing text following that point.

The source code examples on this page were picked more or less at random from my programming language tutorials. 3 out of 4 examples had coloring problems, but more seriously, GCP damaged the HTML links that come after the pre block. For example, if you just have this code: if (x<y) {print 5;}, then GCP will render it like this: if (x, and break all your HTML links after this pre block. (This may have been fixed in the current version.)

Theoretically, a JavaScript-based parser that syntax-colors any number of languages on the fly in a reasonable amount of time is not impossible. But GCP hasn't reached that state of maturity. The main source code of GCP is only 1.4k lines, too small to be realistic. In contrast, js2-mode and nxml mode in emacs each contain 10k lines of elisp (of course, they do much more than syntax coloring). The htmlize-mode, which translates emacs's text properties info into HTML/CSS (given an already syntax-colored buffer in emacs), is itself 1.7k lines of elisp.