HTML Correctness and Validators
Condition of Website Correctness
Some notes about HTML correctness and HTML validators.
My website has close to 4000 HTML files, all of them valid. “Valid” here means passing the W3C's validator at http://validator.w3.org/.
Being a programming and correctness nerd, i consider correct HTML important. (Correct markup has important practical benefits, such as easy parsing and transformation, as picked up by the XML movement. Ultimately, it is a foundation of the semantic web.)
In programming language communities, the Tech Geekers are fanatical about their favorite language's superiority, and in the case of functional langs, they are often proud of their correctness features. However, look at their official docs or websites: they are ALL invalid HTML, with errors lighting up like a neon city.
Major Standards Orgs Hand Out Invalid HTML
Here are examples of major orgs handing out invalid HTML, or tools that generate invalid HTML:
- GNU Texinfo Generates Invalid HTML. This means ALL documentation from GNU is invalid HTML. See: Programing: GNU Texinfo Problems; Invalid HTML.
- Python Doc is Invalid HTML (as of 2008-12-28). See: How to Improve Python Doc; Notes on Rewriting Python Regex Doc.
- Google Earth KML Tutorial gives invalid XML Examples. See: Google Earth KML Invalid.
- {Google, Facebook, Twitter} tell people to add invalid HTML to their sites. See: Google Pushes Invalid HTML to the World.
- Google and Amazon Generate Invalid HTML.
- Internet Engineering Task Force (IETF) home page (http://www.ietf.org/) is invalid (as of 2008-12-28). There are 32 errors, including “no doctype found”.
- Unicode Consortium page is invalid. Example: http://www.unicode.org/faq/utf_bom.html (as of 2008-12-18). Pretty ridiculous. (it became valid as of 2010-05-25)
In the web development geeker communities, you can see how they are tight-assed about correct use of HTML/CSS, etc. There are frequent and heated debates about the propriety of semantic markup, while practical issues are totally, absolutely ignored, as if the real world doesn't exist. They sneer at the average HTML coders, and they don't hesitate to ridicule Microsoft Internet Explorer (the first browser to drag Netscape away from proprietary tags, back in ~1996). However, look at the HTML they produce: almost none of it is valid.
In about 2006, i spent a few hours researching which major websites produce valid HTML. I found only one major site that does: Wikipedia. This is fantastic. Wikipedia is produced by the MediaWiki engine, written in PHP. Many other wiki sites also run MediaWiki, so they undoubtedly are valid too. As far as i know, a few other wiki or forum software packages also produce valid HTML, though they are the exception rather than the norm. (i did check 7 random pages from “w3.org”; looks like they are all valid today.)
Personal Need For Validator
In 2008, as an experiment, i converted a few of the projects on my site from HTML 4 transitional to HTML 4 strict. The process is labor intensive, even though the files i started with were valid.
Here are some examples. In html4strict:

- <br> must be inside block level tags.
- The image tag <img …> needs to be enclosed in a block level tag such as <div>.
- Content inside blockquote must be wrapped with a block level tag. For example, <blockquote>Time Flies</blockquote> would be invalid in html4strict; you must have <blockquote><p>Time Flies</p></blockquote>.
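For the simplest of these transforms, a regex suffices. Here is a minimal sketch in Perl for the blockquote case (the file handling is schematic; it assumes no nested blockquotes and that each blockquote's content is a single paragraph with no block level tags inside):

# sketch: wrap blockquote contents in a p tag.
# Assumes simple, non-nested, single-paragraph blockquotes.
use strict;
local $/;        # slurp mode: read the whole file at once
my $html = <>;
$html =~ s{<blockquote>(.*?)</blockquote>}{<blockquote><p>$1</p></blockquote>}gs;
print $html;

Run it as, say, perl wrap-bq.pl old.html > new.html. Even then, the output needs eyeballing: a blockquote that already contains block level tags would come out doubly wrapped.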
Let's look at the image tag example. You might think the transform is trivial: simply use a regex to wrap a <div> around every img tag. However, it's not that simple, because, for example, i often have this form:
<img src="pretty.jpg" alt="pretty girl" width="565" height="809"> <p>above: A pretty girl.</p>
The “p” tag immediately below an “img” tag functions as the image's caption. I have CSS set up so that the caption has no gap to the image above it, like this:
img + p {margin-top:0px;width:100%} /* img caption */
I have the width:100% because i have “p” set to a limited width (width:80ex) for reading. (my website is predominantly in essay format.)
Now, if i simply wrap a “div” tag around all my “img” tags, i end up with this form:
<div><img src="pretty.jpg" alt="pretty girl" width="565" height="809"></div> <p>above: A pretty girl.</p>
Now this screws up my caption CSS, and no CSS selector can match a <p> only when it comes right after a <div><img …></div>: the adjacent sibling selector div + p would also fire on paragraphs following divs that contain no image.
Also, sometimes i have a sequence of images, rendered side by side from left to right. Wrapping each in a “div” would stack them vertically instead.
This is just a simplified example. In short, converting from html4transitional to html4strict while hoping to retain appearance or markup semantics in practical ways is pretty much manual pain. (the ultimate reason is that html4transitional is far from being good semantic markup; html4strict is a bit better.)
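To make the problem concrete, here is that naive wrapping transform as a Perl sketch (again, schematic file handling). It produces valid html4strict, yet it is exactly the transform that breaks the img + p caption rule discussed above:

# sketch: naively wrap every img tag in a div.
# Valid html4strict, but it silently kills the
# “img + p” caption styling.
use strict;
local $/;        # slurp whole file
my $html = <>;
$html =~ s{(<img [^>]+>)}{<div>$1</div>}g;
print $html;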
Validators
In my work i need a batch validator: a command line utility that can validate all files in a dir in one go. Here are some solutions related to HTML validation.
- The standard validator service by W3C: http://validator.w3.org/ (see also: W3C Markup Validation Service). The problem is that it can't validate local files and can't run in batch. Using it to validate 4000 files over the network would not be acceptable, since that means massive web traffic. (my site is over 600 MB.)
- Firefox has a “Html Validator” add-on by Marc Gueury. https://addons.mozilla.org/en-US/firefox/addon/249. It is based on the same code as the W3C validator, works on local files, and is extremely fast. When browsing any page, it shows a green check mark in the window corner when the file is valid.
- Firefox has a “Web Developer” add-on by Chris Pederick. https://addons.mozilla.org/en-US/firefox/addon/60. Since Firefox “v.3”, it has an icon that indicates whether a page's CSS and JavaScript are valid, and also whether the page is rendered in quirks mode.
I rely heavily on the above 2 Firefox tools. Every time i create or edit a page and view it in the browser, the validity icon tells me if the file is invalid. However, the Firefox tools do not let me do batch validation (which is needed when i do massive regex operations on all files). Over the years i've searched for batch validation tools. Here's a list:
- HTML Tidy. A batch tool primarily for cleaning up HTML markup. I didn't find it useful for batch validation, nor for HTML conversion jobs. It doesn't serve my conversion needs because it is incapable of retaining your HTML formatting (that is, your linebreak locations). I do a lot of regex based text processing on my HTML files, so i rely on assumptions about how lines are formatted in them. If i ran tidy over my site, i'd have to abandon regex based text processing and instead treat my files with HTML and DOM parsers, which makes most practical text processing jobs quite a bit more complex and cumbersome. (It can still serve as a crude batch checker; see the sketch after this list.)
- A perl module “HTML::Lint”, at http://search.cpan.org/~petdance/HTML-Lint-2.06/lib/HTML/Lint.pm. Seems similar to HTML Tidy.
- http://htmlhelp.com/tools/validator/offline/index.html.en is another validation tool. I haven't looked into it yet. Their doc about differences from other validators, http://htmlhelp.com/tools/validator/differences.html.en, is quite interesting, and seems an advantage for my needs.
- OpenJade and OpenSP. http://openjade.sourceforge.net/ Seems a good tool. Haven't looked into it.
- Emacs's nxml mode http://www.thaiopensource.com/nxml-mode/, by the XML expert James Clark. It is written in elisp, over 10 thousand lines of code, and indicates whether your XML file is valid as you type. The package is very well received, reputed to make emacs the best XML editor. This is fantastic, but since my files are currently HTML, not XHTML, i haven't used it much. There is an emacs HTML mode based on this package, called nxhtml mode, but the code is still pretty alpha and i find it has a lot of problems.
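That said, Tidy's error report alone can serve as a crude batch checker, if you discard its rewriting function entirely. Here is a minimal sketch in Perl, along the same lines as my validator script below (the dir path is an example; note that Tidy is a lint-style checker, not a true DTD validator, so its verdicts won't always agree with the W3C validator):

# sketch: batch check using HTML Tidy's error report only.
# -q suppresses non-document chatter; -e means “show errors
# and warnings only, do not write a cleaned-up file”.
use strict;
use File::Find;

my $dirPath = q(/Users/xah/web/emacs);   # example dir

sub wanted {
  return unless $_ =~ m{\.html$} && not -d $File::Find::name;
  # tidy prints its report to stderr, hence the 2>&1
  my $report = qx{tidy -q -e "$File::Find::name" 2>&1};
  print q(Problem: ), $File::Find::name, "\n" if $report =~ m{Error:};
}

find(\&wanted, $dirPath);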
One semi-solution for batch validation i found is “Validator S.A.C.”, at http://habilis.net/validator-sac/. It is basically the W3C validator compiled for Mac OS X with a GUI interface. It is not designed for batch operation, but it can be run from the command line like this:

/Applications/Validator-SAC.app/Contents/Resources/weblet ‹html file path›

However, it outputs a whole report in HTML (the same page you see from the W3C validation service). That is not what i want; i want it to simply tell me whether a file is valid. So, to use “Validator S.A.C.” in a batch job, i wrap it in a perl script, which takes a dir and prints the name of any file that is invalid.
Here is the perl script:
# perl
# 2008-06-20
# validates a given dir's HTML files recursively
# requires the mac os x app Validator-SAC.app
# at http://habilis.net/validator-sac/
# as of 2008-06

use strict;
use File::Find;

my $dirPath = q(/Users/xah/web/emacs);
my $validator = q(/Applications/Validator-SAC.app/Contents/Resources/weblet);

# called by File::Find for each file under $dirPath
sub wanted {
  if ($_ =~ m{\.html$} && not -d $File::Find::name) {
    # the validator prints a HTTP-style report; the status
    # header near the top says whether the file is valid
    my $output = qx{$validator "$File::Find::name" | head -n 11 | grep 'X-W3C-Validator-Status:'};
    if ($output ne qq(X-W3C-Validator-Status: Valid\n)) {
      print q(Problem: ), $File::Find::name, "\n";
    } else {
      print qq(Good: $_), "\n";
    }
  }
}

find(\&wanted, $dirPath);
print q(Done.);
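To use it, adjust $dirPath and $validator for your setup, then run it with perl, for example perl validate-dir.pl (the file name is arbitrary).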
However, for some reason “Validator S.A.C.” takes nearly 2 seconds to check each file; in contrast, the Firefox HTML validator add-on takes a fraction of a second while also rendering the whole page. For example, suppose i have 20 files in a dir to validate. It's much faster to open all of them in Firefox and eyeball the validity indicator than to run “Validator S.A.C.” on them.
I wrote to its author Chuck Houpt about this. It seems the validator is written in Perl, loads about 20 heavy duty web related perl modules to do its job, and overall is wrapped as a Common Gateway Interface program. Perhaps there is a way to avoid these wrappers and call the parser or validator directly.
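(One avenue i haven't verified: the validator's actual parsing engine is OpenSP, mentioned above, and OpenSP's parser can be called directly via its onsgmls command, skipping the CGI layer. A sketch, assuming OpenSP is installed and you have a SGML catalog that resolves the HTML 4 DTDs; the catalog path here is hypothetical, and i don't know whether the verdicts would match the W3C validator exactly:)

# sketch: call OpenSP's onsgmls directly, skipping the CGI wrapper.
# -s parses and validates but suppresses normal output;
# any error messages go to stderr.
use strict;
$ENV{SGML_CATALOG_FILES} = q(/usr/local/share/sgml/catalog);  # hypothetical path
my $file = shift @ARGV;
my $errors = qx{onsgmls -s "$file" 2>&1};
if ($errors eq q()) { print qq(Good: $file\n); }
else { print qq(Problem: $file\n); }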
I'm still looking for a fast, batch, unix based HTML validation tool.