Elisp: HTML Processing: Split Annotation

By Xah Lee. Date:

This page shows an example of using emacs lisp to process HTML. The HTML files are classic novels with annotations, and the annotation markup needs to change from one format to another. There are hundreds of such pages to process.

Problem

For all HTML files in a directory, find any annotation markup containing the bullet “•” symbol:

<div class="anote781">A … • B … • C …</div>

Split the annotation into multiple markups, like this:

<div class="anote781">A … </div>
<div class="anote781">B … </div>
<div class="anote781">C … </div>
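
The transformation itself can be sketched on a plain string, before worrying about files and buffers. The function name `my-split-anote-markup` is made up for this illustration and is not part of the script further below. It assumes the input is exactly one non-nested anote781 div, and it skips the newline and <br> cleanup that the real script does.

```elisp
;; Hypothetical helper, for illustration only.
(defun my-split-anote-markup (markup)
  "Return MARKUP with one bulleted anote781 div split into one div per entry.
MARKUP is assumed to be a single, non-nested <div class=\"anote781\">…</div>.
Any other string is returned unchanged."
  (let ((tagBegin "<div class=\"anote781\">")
        (tagEnd "</div>"))
    (if (and (string-prefix-p tagBegin markup)
             (string-suffix-p tagEnd markup))
        ;; text between the begin tag and the end tag
        (let ((body (substring markup (length tagBegin) (- (length tagEnd)))))
          (mapconcat (lambda (entry) (concat tagBegin entry tagEnd))
                     ;; split at each bullet; the final t drops empty strings,
                     ;; so a leading bullet does not produce an empty div
                     (split-string body "• *" t)
                     "\n"))
      markup)))

(my-split-anote-markup "<div class=\"anote781\">A … • B … • C …</div>")
```

Evaluating the last line returns three anote781 divs, one per line, holding “A … ”, “B … ” and “C …”. The bullets themselves are dropped, as in the example above.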

Detail

If you are a contract web dev programmer, then you know that 99.99% of websites are a messy text soup. They are created by hundreds of tools and languages: word processors, HTML generators, tens of lightweight markup languages, frameworks in different languages (PHP, Perl, Python), from different web eras, by different programmers over the years. Even emacs has several modes that generate HTML. The results are not in any consistent form. Often they have mismatched tags too, which makes the HTML invalid.

It is in these situations that emacs shines, because its powerful embedded language, lisp, and its interactive nature let you maximize automation: work interactively while you are still getting a feel for the pattern, then use a keyboard macro or emacs lisp for the parts that can be automated.

For my website, i take the time to make sure that all my HTML is consistent. But still, the pages were written over a span of 15 years. Periodically i take the time to improve the markup, for example when new versions of CSS or HTML become mature and widely supported by web browsers. (CSS1 to 2 to 3, HTML 3 to 4 to HTML5.)

I have hundreds of pages of classic novels as HTML documents. These documents contain annotations in a special HTML markup. For example, here's a sample annotation from Titus Andronicus, Act 1 Scene 1:

SATURNINUS. 'Tis good, sir.
You are very short with us;
 But if we live we'll be as sharp with you.
• short ⇒ rudely brief. (AHD)
• sharp ⇒ Fierce, impetuous, hash, severe… (AHD)

Here's the raw HTML:

<div class="anote781">• short ⇒ rudely brief. (AHD)<br>
• sharp ⇒ Fierce, impetuous, hash, severe… (AHD)</div>

<pre class="text48074">SATURNINUS. 'Tis good, sir. You are very <span class="xntt380">short</span> with us;
  But if we live we'll be as <span class="xntt380">sharp</span> with you.
</pre>

Here's how the tags work. Each <span class="xntt380"> marks a word in the main text. When a word is marked by “span.xntt380”, it has a sidebar annotation. The sidebar section is marked by <div class="anote781">. Inside a “div.anote781”, there may be more than one entry. Each entry starts with the bullet symbol “•”. For example, in the above, the words “short” and “sharp” are both entries inside one “div.anote781” sidebar.
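
To make the relationship concrete, here is a small sketch that lists the words marked by “span.xntt380” in a piece of HTML. The function name `my-marked-words` is invented for this example. It assumes the span is written exactly as shown above and contains no nested tags.

```elisp
;; Hypothetical helper, for illustration only.
(defun my-marked-words (html)
  "Return a list of the words wrapped in <span class=\"xntt380\"> in HTML.
Assumes each span contains plain text with no nested tags."
  (let ((pos 0) (words '()))
    ;; find each span in turn, starting after the previous match
    (while (string-match "<span class=\"xntt380\">\\([^<]*\\)</span>" html pos)
      (push (match-string 1 html) words)
      (setq pos (match-end 0)))
    (nreverse words)))

(my-marked-words
 "You are very <span class=\"xntt380\">short</span> with us;
  But if we live we'll be as <span class=\"xntt380\">sharp</span> with you.")
```

For the passage above this returns the list ("short" "sharp"), which is the same set of words that should appear as bulleted entries in the sidebar.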

But recently, i think it is better to have one entry per sidebar. This makes the logic simpler, and makes it much easier to add JavaScript functionality later. For example, when the mouse hovers over a word in the main text, the corresponding annotation would be highlighted.

So, i want to write an elisp script to process all my files. If you simply read the spec for this job, splitting a markup by a particular character, you may think it's trivial and can be done in any language in 10 minutes. Why then the elaborate discussion about the text soup situation?

The important thing is that i DID NOT know what needed to be done to begin with. Only after using emacs, together with lisp scripts i wrote before, to look at and check the existing markup in hundreds of files did i know what state they were in and decide what to do. Also, this change must be done with the ability to visually check that all changes are correct, because the input may not be in the format i expect. (For example, an annotation might be missing the bullet “•”.)
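
As an illustration of that kind of checking, here is a minimal sketch that finds annotations with no bullet at all. The function name `my-anotes-missing-bullet` is hypothetical. It takes the HTML as a string and assumes an annotation never contains a nested <div>, so the first </div> after the begin tag is taken as its end.

```elisp
;; Hypothetical helper, for illustration only.
(defun my-anotes-missing-bullet (html)
  "Return the text of each anote781 div in HTML that contains no bullet.
Assumes annotations contain no nested <div>."
  (let ((tagBegin "<div class=\"anote781\">")
        (tagEnd "</div>")
        (result '())
        p1)
    (with-temp-buffer
      (insert html)
      (goto-char (point-min))
      (while (search-forward tagBegin nil t)
        (setq p1 (point))
        ;; the annotation text runs from p1 to the start of the end tag
        (when (search-forward tagEnd nil t)
          (let ((text (buffer-substring-no-properties p1 (match-beginning 0))))
            (unless (string-match-p "•" text)
              (push text result))))))
    (nreverse result)))

(my-anotes-missing-bullet
 "<div class=\"anote781\">• short ⇒ rudely brief.</div>
<div class=\"anote781\">sharp ⇒ fierce.</div>")
```

Here the second annotation lacks a bullet, so the call returns a list with its text. To check a file instead of a string, the same loop could be run after `insert-file-contents` in the temp buffer.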

For those Scheme Lisp academic computer science folks: you might wonder why i didn't “design” the annotation markup well to begin with. The reason is that when i write a blog article, or work on my literature annotation project, i want to focus on the writing first, the content, and get it done, rather than get distracted by CSS/HTML markup design. (One thing i do make sure of is that whatever CSS/HTML i devise can be easily changed systematically later by simple parsing.) I devote significantly more time to design than most people, but many factors necessitate change. For example, CSS in practice is rather complex and it takes years of experience to learn its quirks and tricks. Similarly, the best practices of HTML change with time. (e.g. see: Are You Intelligent Enough to Understand HTML5?.) Browsers change, standards change (e.g. HTML → XHTML → HTML5. See: HTML5 Doctype, Validation, X-UA-Compatible, and Why Do I Hate Hackers.), ideas of best practice change, and my needs for the annotation have also changed throughout the years.

Solution

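Here's the outline of steps:

1. Open each HTML file in the directory.
2. Search for each <div class="anote781"> and grab the text up to its closing </div>.
3. If the text contains a bullet “•”, remove the newlines and <br> that are no longer needed, then split the text at each bullet into separate anote781 tags.
4. Print the file name and the new entries to an output buffer, for checking.
5. Close the buffer if nothing changed. Otherwise leave it open for visual inspection.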

Here's the code:

;; -*- coding: utf-8 -*-
;; 2011-08-13
;; process all files in a dir.
;; split any markup like this:
;; <div class="anote781">… • … • …</div>
;; by the bullet •
;; into several anote781 tags

(setq inputDir "~/web/xahlee_org/p/" )

;; add an ending slash if not there
(setq inputDir (file-name-as-directory inputDir))

;; files to process
(setq fileList
[
"~/web/xahlee_org/p/arabian_nights/aladdin/aladdin4_1.html"
"~/web/xahlee_org/p/arabian_nights/aladdin/aladdin3.html"
]
)

(require 'sgml-mode) ; for sgml-skip-tag-forward

(defun my-process-file-xnote (fPath)
  "Split each bulleted anote781 annotation in the file at FPATH.
Print the new entries to `standard-output'. Leave the buffer open
if it was changed, so the result can be checked before saving."
  (let (myBuffer (xi 0) p1 p2 xtext
                 xtextNew
                 xitems
                 (changedItems '())
                 (tagBegin "<div class=\"anote781\">")
                 (tagEnd "</div>"))

    (setq myBuffer (find-file fPath))
    (goto-char (point-min))
    (while (search-forward tagBegin nil t)

      ;; capture the anote781 tag text
      (setq p1 (point))
      (backward-char 1)               ; move back inside the begin tag
      (sgml-skip-tag-forward 1)       ; jump to after the matching </div>
      (backward-char (length tagEnd)) ; move to before the </div>
      (setq p2 (point))
      (setq xtext (buffer-substring-no-properties p1 p2))

      ;; if it contains a bullet
      (when (string-match "•" xtext)
        (setq xi (1+ xi))

        ;; clean the text. Remove newlines and <br> that are no longer needed
        (setq xtext (replace-regexp-in-string "\n*• *" "•" xtext t t))
        ;; delete ending eol. \\' matches only at the end of the string
        (setq xtext (replace-regexp-in-string "\n\\'" "" xtext t t))
        (setq xtext (replace-regexp-in-string "<br>•" "•" xtext t t))

        ;; split into entries. Empty strings are dropped, so a leading
        ;; bullet does not produce an empty div
        (setq xitems (split-string xtext "•" t))

        ;; collect the new entries of every annotation, for later reporting
        (setq changedItems (append changedItems xitems))

        ;; join the entries with end/begin tags
        (setq xtextNew (mapconcat 'identity xitems (concat tagEnd "\n" tagBegin)))

        (goto-char p1)
        (delete-region p1 p2)
        (insert xtextNew)

        ;; remove the newline before end tag
        (when (looking-back "\n" (1- (point))) (delete-char -1))))

    ;; report if any annotation was split
    (when (> xi 0)
      (princ "-------------------------------------------\n")
      (princ (format "%d %s\n\n" xi fPath))
      (mapc (lambda (x) (princ (format "%s\n\n" x))) changedItems))

    ;; close buffer if there is no change. Else leave it open.
    (unless (buffer-modified-p myBuffer) (kill-buffer myBuffer))))

(require 'find-lisp)

(let ((outputBuffer "*xah anote781 output*"))
  (with-output-to-temp-buffer outputBuffer
    ;; (mapc 'my-process-file-xnote fileList)
    (mapc 'my-process-file-xnote (find-lisp-find-files inputDir "\\.html$"))
    (princ "Done deal!")))

Here's a sample output: elisp_text_processing_split_annotation.txt

I've put lots of comments in the code, so it should be easy to understand. If there is any part you don't understand, ask me. If you are new to elisp, check out the first few sections of the Emacs Lisp Tutorial.

I ♥ emacs.