HTML Correctness and Validators

By Xah Lee.

Condition of Website Correctness

Some notes about HTML correctness and HTML validators.

My website has close to 4000 HTML files, and all of them are valid. “Valid” here means passing the W3C validator at http://validator.w3.org/.

As a programming and correctness nerd, i care about correct HTML. (Correct markup has important practical benefits, such as easy parsing and transformation, as picked up by the XML movement. Ultimately, it is a foundation of the semantic web.)

In programming language communities, the Tech Geekers are fanatical about their favorite language's superiority, and in the case of functional langs, they are often proud of their correctness features. However, a look at their official docs or websites shows that they are ALL invalid HTML, with errors lighting up like a neon city.

Major Standards Orgs Hand Out Invalid HTML

Here are examples of major orgs handing out invalid HTML or tools that generate invalid HTML:

In the web development geeker communities, you can see how tight-assed they are about correct use of HTML and CSS. There are frequent and heated debates about the propriety of semantic markup, while practical issues are totally, absolutely ignored, as if the real world doesn't exist. They sneer at average HTML coders, and they don't hesitate to ridicule Microsoft's Internet Explorer browser (which was the first browser to drag Netscape out of proprietary tags, back in ~1996). However, a look at the HTML they produce shows that almost none of it is valid either.

In about 2006, i spent a few hours researching which major websites produce valid HTML. I found only one major site that does, and that is Wikipedia. This is fantastic. Wikipedia is produced by the MediaWiki engine, written in PHP. Many other wiki sites also run MediaWiki, so they are most likely valid too. As far as i know, a few other wiki or forum software packages also produce valid HTML, though they are more the exception than the norm. (I did check 7 random pages from “w3.org”, and it looks like they are all valid today.)

Personal Need For Validator

In 2008, as an experiment, i converted a few of the projects on my site from HTML 4 transitional to HTML 4 strict. The process was labor intensive, even though the files i started with were valid.

Here are some examples. In html4strict:

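Let's look at the image tag example. (In html4strict, an inline element such as <img> cannot be a direct child of <body>; it has to sit inside a block-level element such as <div> or <p>.) You might think the transformation is trivial, because you can simply use a regex to wrap a <div> around each image tag. However, it's not that simple, because i often have this form: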

<img src="pretty.jpg" alt="pretty girl" width="565" height="809">
<p>above: A pretty girl.</p>

The “p” tag immediately below an “img” tag functions as the image's caption. I have CSS set up so that this caption has no gap to the image above it, like this:

img + p {margin-top:0px;width:100%} /* img caption */

I have the width:100% because i have “p” set to a limited width, width:80ex, for reading. (my website is mostly in an essay format)

Now, if i simply wrap a “div” tag around each of my “img” tags, i will end up with this form:

<div><img src="pretty.jpg" alt="pretty girl" width="565" height="809"></div>
<p>above: A pretty girl.</p>

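Now this screws up my caption CSS. The <p> is no longer the sibling right after the <img>, and CSS 2 selectors cannot pick an element based on what its preceding sibling contains, so there is no selector that matches a <p> that comes after a <div><img ></div>.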
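The breakage can be reproduced in a few lines of perl. This is only a minimal sketch: the sub names wrap_images and caption_adjacent are made up for illustration, and the second regex is a rough stand-in for the “img + p” rule, not a real CSS selector engine.

```perl
use strict;
use warnings;

# Naive conversion: wrap every <img ...> tag in its own <div>.
sub wrap_images {
  my ($html) = @_;
  $html =~ s{(<img\b[^>]*>)}{<div>$1</div>}g;
  return $html;
}

# Rough stand-in for the CSS rule "img + p": true if some <p>
# directly follows an <img> tag, with only whitespace in between.
sub caption_adjacent {
  my ($html) = @_;
  return $html =~ m{<img\b[^>]*>\s*<p\b} ? 1 : 0;
}

my $html = <<'END';
<img src="pretty.jpg" alt="pretty girl" width="565" height="809">
<p>above: A pretty girl.</p>
END

# Show the markup before and after wrapping, with the caption check.
for my $version ($html, wrap_images($html)) {
  my $status = caption_adjacent($version)
    ? "caption rule matches"
    : "caption rule no longer matches";
  print $version, $status, "\n\n";
}
```

The first version reports a match and the wrapped version does not, because the closing </div> now sits between the image and its caption.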

Also, sometimes i have a sequence of images, rendered side by side from left to right. Wrapping each in a “div” would stack them vertically.

This is just a simplified example. In short, converting from html4transitional to html4strict while retaining appearance and markup semantics in practical ways is pretty much a manual pain. (The ultimate reason is that html4transitional is far from being a good semantic markup. html4strict is a bit better.)

Validators

In my work i need a batch validator. What i want is a command line utility that can validate all files in a dir in one run. Here are some solutions related to HTML validation.

I heavily rely on the above 2 Firefox tools. Every time i create or edit a page and then view it in the browser, the validity icon tells me if the file is not valid. However, the Firefox tools do not let me do batch validation (which is needed when i do massive regex operations on all files). Over the years i've searched for batch validation tools. Here is a list:

One semi-solution for batch validation i found is “Validator S.A.C.”, at http://habilis.net/validator-sac/. It is basically W3C's validator compiled for OS X with a GUI. However, it is not designed for batch operation. To run it from the command line, i call it like this: /Applications/Validator-SAC.app/Contents/Resources/weblet ‹html file path›. However, it outputs a whole report in HTML on the validation result (the same as the page you see in W3C validation). This is not what i want. What i want is simply for it to tell me if a file is valid or not. So, in order to use “Validator SAC” for batch jobs, i wrapped it in a perl script, which walks a dir and prints, for each HTML file, whether it is good or has a problem.

Here is the perl script:

# perl

# 2008-06-20
# validates a given dir's HTML files recursively
# requires the mac os x app Validator-SAC.app
# at http://habilis.net/validator-sac/
# as of 2008-06

use strict;
use warnings;
use File::Find;

my $dirPath = q(/Users/xah/web/emacs);
my $validator = q(/Applications/Validator-SAC.app/Contents/Resources/weblet);

# called by find() once for every file and dir under $dirPath
sub wanted {
  if ($_ =~ m{\.html$} && not -d $File::Find::name) {

    # run the validator, keep only the status header line from the top of its report
    my $output = qx{$validator "$File::Find::name" | head -n 11 | grep 'X-W3C-Validator-Status:'};

    if ($output ne qq(X-W3C-Validator-Status: Valid\n)) {
      print q(Problem: ), $File::Find::name, "\n";
    } else {
      print qq(Good: $_) ,"\n";
    }

  }
}

find(\&wanted, $dirPath);

print q(Done.), "\n";

However, for some reason, “Validator S.A.C.” took nearly 2 seconds to check each file. In contrast, the Firefox HTML validator add-on took a fraction of a second while also rendering the whole page completely. For example, suppose i have 20 files in a dir i need to validate. It's much faster to just open all of them in Firefox and eyeball the validity indicator than to run “Validator SAC” on them. At nearly 2 seconds per file, a pass over all of my close to 4000 files would take roughly 2 hours.
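One way to put a number on the per-file cost is to time the validator over a handful of files. The following is a minimal sketch, assuming the same weblet path as in the script above; the sub name average_seconds is made up for illustration, and the report itself is read and thrown away.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Run a command once per file, discard its output, and return the
# average wall-clock seconds per file.
# $cmd is an array ref holding the program and any fixed arguments;
# each file path is appended as the last argument.
sub average_seconds {
  my ($cmd, @files) = @_;
  return 0 unless @files;

  my $start = time;
  for my $file (@files) {
    # list-form pipe open: no shell involved, so odd file names are safe
    open(my $fh, '-|', @$cmd, $file) or die "cannot run $cmd->[0]: $!";
    while (my $line = <$fh>) { }   # read and throw away the report
    close $fh;
  }
  return (time - $start) / scalar @files;
}

# usage: pass the HTML files to time as command line arguments
if (@ARGV) {
  my @cmd = (q(/Applications/Validator-SAC.app/Contents/Resources/weblet));
  printf "%.2f seconds per file\n", average_seconds(\@cmd, @ARGV);
}
```

Because the command is passed in as a list, the same sub can time any other command line validator for comparison.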

I wrote to its author, Chuck Houpt, about this. It seems that the validator uses Perl and loads about 20 heavy-duty web-related perl modules to do its job, and overall it is wrapped as a Common Gateway Interface program. Perhaps there is a way to avoid these wrappers and call the parser or validator directly.

I'm still looking for a fast, batch, unix-based HTML validation tool.