2022-01-15

the problems of syntax coloring source code in html

Suppose you have html files of programming language tutorials.
So you have lots of 「pre」 or 「code」 tags containing programming language source code.
You want them syntax colored when viewed in a browser.

There are complex issues.

There are 2 general solutions. One is using js, the other is using html markup, namely 「span」 tags.
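
To make the js route concrete, here is a minimal sketch, not a real syntax coloring engine. It assumes the coloring happens at view time on plain text, and it only colors double-quoted strings. The function name colorizeStrings and the class name "string" are made up for illustration.

```javascript
// Hypothetical view-time colorizer (illustration only).
// Input: plain source text, for example the textContent of a pre tag.
// Output: html where double-quoted string literals are wrapped in a span.
function colorizeStrings(sourceText) {
  // Encode the 3 special chars so the result is safe to use as html.
  const escapeHtml = (s) =>
    s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");

  // Split on double-quoted strings. Because of the capture group,
  // the strings are kept in the result at the odd indexes.
  return sourceText
    .split(/("(?:[^"\\]|\\.)*")/)
    .map((part, i) =>
      i % 2 === 1
        ? `<span class="string">${escapeHtml(part)}</span>`
        : escapeHtml(part)
    )
    .join("");
}

console.log(
  colorizeStrings('document.getElementById("xyz").style.color="green";')
);
// prints
// document.getElementById(<span class="string">"xyz"</span>).style.color=<span class="string">"green"</span>;
```

In a browser, one would presumably read each pre element's textContent, pass it to a function like this, and assign the result to innerHTML. The html file itself keeps the source code free of span tags.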

The main problem of using html markup is that the code is not readable unless you view it in a browser. This is a significant point: if you use html to write lots of notes about programming, all the source code in the raw files will be basically unreadable, because just about every other word in the source code will be marked up by verbose span tags. You must view your notes in a browser.

For example, this code

document.getElementById("xyz").style.color="green";

becomes

<span class="function-name">document</span>.<span class="function-name">getElementById</span>(<span class="string">"xyz"</span>).<span class="function-name">style.color</span>=<span class="string">"green"</span>;

Another problem of using html markup is that it's a tedious process. You have to do it each time you modify the code: you have to unmarkup the code first, edit or run it, then markup again.
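
The unmarkup step can be sketched as below. This is a minimal sketch under strong assumptions: the markup consists only of span tags with a single class attribute, and the only entities used are &amp;lt; &amp;gt; &amp;amp;. The function name unmarkup is made up.

```javascript
// Hypothetical helper: recover plain source text from span-marked-up code.
// Assumes the only markup is <span class="..."> ... </span>
// and the only entities are &lt; &gt; &amp;
function unmarkup(markedUp) {
  return markedUp
    .replace(/<span class="[^"]*">/g, "") // drop opening span tags
    .replace(/<\/span>/g, "") // drop closing span tags
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&amp;/g, "&"); // decode & last, so "&amp;lt;" ends up as "&lt;"
}

const markedUp =
  '<span class="function-name">document</span>.' +
  '<span class="function-name">getElementById</span>(' +
  '<span class="string">"xyz"</span>).' +
  '<span class="function-name">style.color</span>=' +
  '<span class="string">"green"</span>;';

console.log(unmarkup(markedUp));
// prints
// document.getElementById("xyz").style.color="green";
```

This only round-trips correctly if the markup step encoded every < > & that belongs to the source code itself. Otherwise a literal span tag inside the source code would be stripped too.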

Also, another major problem is that the coloring would be hard to change. Suppose you have hundreds of html pages covering different languages.
Let's say, in js you have
<span class="function-name">document</span>
but in the next version of the syntax coloring engine you used for the markup, it became
<span class="function">document</span>
or
<span class="dom-class">document</span>

where the class name changed. In general, for a lang x, there's no one standardized classification of the symbols in the lang, so each syntax coloring engine will create its own set of html span classes. This makes it hard to change your syntax coloring scheme, and also bloats your css in an unmanageable way. And once you switch to a new syntax coloring engine, you have to redo the markup for all your hundreds of html pages covering different languages.

Also, it's impossible to reliably know programmatically if a piece of source code is already marked up.
Naively, you can check if it contains a span tag, or the ampersand encoding of the chars & < > (that is, &amp;amp; &amp;lt; &amp;gt;).
However, source code containing a span tag may not mean it's already marked up. For example, many languages have a heredoc mechanism, and that string can contain html code.
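
The naive check and its failure can be sketched as below. The function name looksMarkedUp and the Perl snippet are made up for illustration.

```javascript
// Hypothetical naive check: guess that code is already marked up if it
// contains a span tag or one of the entities &amp; &lt; &gt;
function looksMarkedUp(code) {
  return /<span\b|<\/span>|&(?:amp|lt|gt);/.test(code);
}

// Raw Perl source that was never marked up.
// Its heredoc happens to hold html code.
const rawPerl = [
  "print <<END;",
  '<span class="note">a &amp; b</span>',
  "END",
].join("\n");

console.log(looksMarkedUp('x = "green";')); // prints false
console.log(looksMarkedUp(rawPerl)); // prints true, yet nothing was marked up
```

The second call is a false positive: the span tag and the entity are part of the program's own text, not coloring markup.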

-----------

gah, been thinking about this for the past 2 hours. very taxing on the head.

and i haven't started to cover js issues and other issues yet.

one of the main issues is: when the source code contains < or > or &, do u ampersand encode them or not.

if ur doc is xml, then u absolutely have to. no choice.
but in html5, in general u don't have to.
but then, that means ur html5 will have problems when converting to xml, such as when putting a snippet of ur blog into an atom or rss webfeed.

but nevertheless, let's assume for now that our file is just html5, and not worry about xml.

still, the question remains: whether u ampersand encode < > & or not.

not having to do it is great, because that means u don't have to touch your source code. this is very significant, because u can run or edit your source code as is. btw, this is like in org mode, where the code blocks stay as plain source code.

in general, it is a great property that u don't faak with the source code by markup. because any markup, demarkup, has some chance of screwing up the source code.

however, the idea that u don't have to encode < > & is a pipe dream. because, for example, again, the source code may have a heredoc containing html code. in that case, it's not just individual chars like x < 4, but u have lots of <span> or <div> in the source code. u have to encode them into &lt;span&gt; else when ur html is shown in browser, it is very likely to be fucked up.
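
a minimal sketch of the encoding step, assuming plain js. the function name escapeForHtml and the sample line are made up. the & must be replaced first, else the & of a freshly made &amp;lt; would get encoded a second time.

```javascript
// Hypothetical helper: ampersand encode the 3 special chars of source code
// before placing it inside a pre or code tag.
function escapeForHtml(source) {
  return source
    .replace(/&/g, "&amp;") // must go first
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Source code that has both lone chars (x < 4) and an html tag in a string.
const source = 'if (x < 4 && s === "<span>") { run(); }';

console.log(escapeForHtml(source));
// prints
// if (x &lt; 4 &amp;&amp; s === "&lt;span&gt;") { run(); }
```

the encoded line is what goes into the html file. it is no longer runnable as is, which is exactly the cost discussed here.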

so, i think at this point, we conclude that any < > chars in source code must be ampersand encoded. something like that. actually, this is not a conclusion. it is exactly this issue that this whole thread is about: i'm trying to analyze it, and hope to get a clear understanding of when the source code must, and when it doesn't have to, encode the < > chars and or &.

this is the heart of the problem of this thread.

so far, our conclusion is just that u cannot always leave those < > unencoded, because of heredoc, and or in general any source code that does text processing of html.

if the source code is html, xml, or code to process html, xml, then, u absolutely have to encode at least the < and > chars in it.

gah. this gets us back to our original decision problem. naively, we think using js to do the syntax coloring lets us not have to diddle with the source code, so the source code remains readable and can run as is. but now, apparently, not so, when the source code is html or xml or deals with them.

jesssus. one mega problem.

btw, i have spent months dealing with these issues and investigating the google code syntax coloring js back in 2007 or so.

http://xahlee.info/js/google-code-prettify/index.html