Web Keywords and Taxonomy

By Xah Lee. Date: 2006-10-30

This essay is a commentary on keywords, tagging, categorization, and related issues on websites.

My website is growing larger and larger. It currently has over 3 thousand pages. (850 pages are from the Elisp Manual, another 780 pages are copies of Literature Classics and the Lojban Language Reference, and the remaining 1500 or so pages are original content.)

The organization and hierarchy of these pages are becoming a concern as the pages grow more numerous. About a year ago i started to use the HTML keywords meta tag, e.g. <meta name="keywords" content="emacs, emacs tutorial">. As i added more and more pages with keywords, one problem began to surface, namely, taxonomy and the semantics of keywords.

Classification is a known problem, at least in organizing subjects of human knowledge. Namely: “how can one devise a hierarchy system so that every subject fits into one position in the tree unambiguously and clearly?” I recall that Encyclopedia Britannica has a long preface about its system of classifying knowledge, in its Propædia. (In particular, i recall reading that “Logic” holds a unique place in the Britannica's classification system, in that it is a meta science, apart from the sciences and humanities.)

In any case, the problem i'm facing for my web pages is about the semantics of keywords. For example, suppose i have a page that is an emacs tutorial. (emacs is software for editing text.) Now, should i use “emacs” and “emacs tutorial” as its keywords, or should i use “computing” and “tutorial”? If the tags are treated as keywords describing the content, then “emacs” and “emacs tutorial” would be fitting. However, if one wishes to classify or categorize, then “computing”, “tutorial”, “software” would be more fitting.

Clearly, there are advantages and disadvantages to either approach. Content keywords tend to be haphazard, and are somewhat redundant because most of them appear in the content itself, which is in fact how search engines index pages (among other lesser mechanisms). On the other hand, keywords chosen from a categorization perspective have more organizing power, since it takes some non-trivial analysis of the content to decide what category the content belongs to. Further, category keywords are few and somewhat fixed, so that they serve the purpose of categorization. For example, if pages are tagged by category keywords, then one can list all pages about “computing”. But if pages are tagged by content keywords, then “computing” will not turn up my emacs tutorial, nor any other page on any particular computing technology or software. (For example, pages about how to write HTML, or tutorials about the language lisp.) To list computing related pages, one would have to amass all possible keywords that have to do with computing.
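The difference between the two tagging schemes can be sketched in a few lines of Python. This is only an illustration: the page names, the tag sets, and the function pages_tagged are all made up for the example, and are not taken from my actual site.

```python
# Hypothetical pages. Each one carries free-form content keywords and a
# small, fixed vocabulary of category keywords. All names are made up.
PAGES = {
    "emacs_tutorial.html": {
        "keywords": {"emacs", "emacs tutorial"},
        "categories": {"computing", "tutorial", "software"},
    },
    "html_howto.html": {
        "keywords": {"html", "web page"},
        "categories": {"computing", "tutorial"},
    },
    "lisp_tutorial.html": {
        "keywords": {"lisp", "lisp tutorial"},
        "categories": {"computing", "tutorial"},
    },
    "alice_in_wonderland.html": {
        "keywords": {"alice", "lewis carroll"},
        "categories": {"literature"},
    },
}


def pages_tagged(pages, field, tag):
    """Return the sorted names of pages whose `field` tag set contains `tag`."""
    return sorted(name for name, meta in pages.items() if tag in meta[field])


# Searching the content keywords for "computing" finds nothing,
# because no page uses that word as a content keyword.
by_keyword = pages_tagged(PAGES, "keywords", "computing")

# Searching the category keywords finds every computing page.
by_category = pages_tagged(PAGES, "categories", "computing")

print(by_keyword)
print(by_category)
```

With content keywords alone, the only way to get the second list is to query for “emacs”, “html”, “lisp”, and every other computing term one can think of, then merge the results.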

So, as my web pages became numerous and i started to use the HTML keywords meta tag to organize them for an eventual need, the problem of taxonomy and tagging semantics came upon me. In recent years, a vast number of online social networking sites have cropped up. For example, online diary sites (livejournal.com), online photo sites (flickr.com), online video sites (youtube.com), online bookmark sites (del.icio.us, stumbleupon.com) … etc. They all employ some tag system for mechanized organization of the bewildering, unclassifiable pages the users create. One can look at them and learn about the taxonomy problem. For example, many of these sites, such as del.icio.us, use a “tag” system. The semantics of tags as used there is slightly different from keywords. Basically, tags are just any words the user thinks are associated with the page. So, tags are more freeform than keywords. …

It turns out that Wikipedia has something to say on the subject, in its articles “Tag (metadata)” and “Folksonomy”.

… as i think more about the subject, i think HTML should have another meta tag for “category keywords”. So, we'll have one meta tag for keywords, and another for category keywords, which are keywords used for categorizing the page. Of course, both would be optional. This way, we provide a basis for mechanical methods to categorize pages according to their subject matter. The category keywords tag is necessary because semantically it is different from keywords for a page's content. And as i said before, “keywords of page content” is rather redundant, since computing and database technologies are advanced enough that search engines today already scan a page's entire text with ease and automatically create keywords by statistical means. On the other hand, automatic categorization is still out of reach of AI. This is definitely an element of the futuristic concept of the Semantic Web.
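To make the proposal concrete, here is a rough sketch of how a program might read both kinds of meta tag from a page, using Python's standard html.parser module. The meta name "category" is purely an assumption for illustration; it is not part of any HTML standard. The class MetaTagParser, the function read_meta, and the sample page are likewise invented for this example.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect comma-separated values from selected <meta name=...> tags."""

    # "keywords" is the existing meta name. "category" is a hypothetical
    # name standing in for the proposed category keywords tag.
    WANTED = ("keywords", "category")

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = attr.get("content") or ""
        if name in self.WANTED:
            # Split on commas and drop surrounding whitespace and empty items.
            self.meta[name] = [w.strip() for w in content.split(",") if w.strip()]


def read_meta(html_text):
    """Return a dict mapping meta name to its list of values."""
    parser = MetaTagParser()
    parser.feed(html_text)
    return parser.meta


# A made-up page carrying both kinds of tag.
SAMPLE = """<html><head>
<meta name="keywords" content="emacs, emacs tutorial">
<meta name="category" content="computing, tutorial, software">
</head><body></body></html>"""

print(read_meta(SAMPLE))
```

A crawler or site index could then group pages by the values found under the category name, while still leaving the content keywords to the search engine's own text analysis.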