Elisp: Process HTML, span, code, Key, Title, Markups

By Xah Lee.

This page is a tutorial, showing a real-world example of using emacs lisp to do many tag transformations.

Problem

I need to transform many HTML tags. Typically, they are of the form begin_delimiter…end_delimiter, where the delimiters may be curly quotes “…”, or an HTML tag pair such as <span class="xyz">…</span>.

I need to apply the transformation to over 4 thousand HTML pages, and it needs to be accurate, so most changes are done case by case under human watch.

Also, the delimiters may be nested, so a plain regex won't work. It either gets too much text (using the default greedy match) or not enough text (using a non-greedy match). With an elisp script, you can use “if” and other emacs functions to correctly find the matching ending tag, as well as automatically skip cases where the transform should not apply, which drastically reduces the need for human watch.
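Here's a small sketch of how both kinds of regex go wrong on nested tags. The sample text and the names my-sample-html and my-regex-match are made up for illustration; they are not from the actual site or script.

```elisp
;; Hypothetical sample text: one nested code span, then a second code span.
(defvar my-sample-html
  "<span class=\"code\">a <span>b</span> c</span> and <span class=\"code\">d</span>")

(defun my-regex-match (regex)
  "Return the text that REGEX matches in `my-sample-html', or nil."
  (when (string-match regex my-sample-html)
    (match-string 0 my-sample-html)))

;; Greedy: runs to the LAST </span> in the text, so it swallows
;; both code spans and everything between them.
(my-regex-match "<span class=\"code\">.*</span>")
;; ⇒ the whole sample string

;; Non-greedy: stops at the FIRST </span>, which closes the inner span,
;; so the match ends too early.
(my-regex-match "<span class=\"code\">.*?</span>")
;; ⇒ "<span class=\"code\">a <span>b</span>"
```

Neither result is the element we want, which is the first code span up to and including its own matching end tag.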

Detail

In the past week, i spent about 2 days doing a lot of text processing with elisp on the 4 thousand files of my site. Here are the changes i've made:

The purpose of the change is to make the syntactical markup more semantically precise. Before, they were all marked by double curly quotes. Now, if i want to find all books i cited on my site, i can do so easily by a simple search on the special bracket for book titles. These changes also make the text easier to read. In the future, if i want all book titles to be colored red for example, i can easily do that by changing the 《》 to an HTML markup (e.g. <span class="title">…</span>), or use JavaScript to do that on the fly. Same for emacs keybinding. For example, with this clear syntax, it's easier to write JavaScript so that when the mouse hovers over the keybinding notation, it shows a balloon with the command name for that key. 〔see JS: How to Create Tooltip/Balloon〕

All this is part of the HTML Microformat idea, which is part of the semantic web concept. The basic idea is that the syntax encodes semantics. This advantage is a major reason XML became so useful. (The other reason is its regular syntax.)

For info on various brackets used, see: Intro to Chinese Punctuation and Matching Brackets in Unicode.

Also, much of the HTML markup on my site has been cleaned up. For example:

There are several advantages in these changes. For example, <code> is much shorter than <span class="code">, and it has a standard meaning. It is also more specific than the “span” tag, which reduces parsing complexity when i need to process “span” tags.

〔see Keyboard Notation Design Issues〕

Solution

Simple cases of these tag transformations, such as

“file path” ⇒ 〔file path〕

where the delimiters are single characters and there is no nesting, can be done with emacs's dired-do-query-replace-regexp. 〔see Emacs: Interactive Find Replace Text in Directory〕

More complicated cases, with nested HTML tags, can be done with an elisp script. Here's the general plan.
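For the simple case, the regex itself is short. Here's a non-interactive sketch of the same replacement on a string, assuming the quoted text never contains another curly quote. The function name my-quote-to-bracket and the sample paths are hypothetical. The interactive command is still the better fit for real files, because it lets you confirm each match.

```elisp
(defun my-quote-to-bracket (text)
  "Return TEXT with each “…” pair rewritten as 〔…〕.
Assumes the quoted text contains no nested curly quotes."
  ;; [^“”]+ cannot cross another curly quote, so each match is one pair.
  ;; The 4th argument t keeps the replacement's case exactly as written.
  (replace-regexp-in-string "“\\([^“”]+\\)”" "〔\\1〕" text t))

(my-quote-to-bracket "open “~/.emacs” and “~/web/index.html”")
;; ⇒ "open 〔~/.emacs〕 and 〔~/web/index.html〕"
```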

  1. Open the file
  2. Search for the tag
  3. If found, move to the beginning of tag, mark positions of begin/end of the opening tag
  4. Use sgml-skip-tag-forward to move to the end matching tag
  5. Mark positions of begin/end of the ending tag
  6. Replace the begin/end tags with new tags
  7. Repeat

To open the file, we can use find-file.

To search for the tag, we do:

(while
 (search-forward "<span class=\"code\">"  nil t)
…
)

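We give “t” for the third argument. It means don't signal an error if the text is not found. In that case search-forward just returns nil, which ends the while loop.

The next step is to get the begin/end positions of the opening tag. The end position is simply the current cursor position, because search-forward automatically places the cursor there. To get the beginning position, we just use search-backward on “<”. (The complete code below uses sgml-skip-tag-backward for this step, which lands on the same “<” when the cursor is inside an opening tag.)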
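Here's a minimal sketch of that loop, run on a string in a temporary buffer instead of a real file. The function name my-count-tag is made up for this example.

```elisp
(defun my-count-tag (html)
  "Return how many times the opening code-span tag occurs in HTML."
  (with-temp-buffer
    (insert html)
    (goto-char (point-min))
    (let ((count 0))
      ;; With a non-nil third argument, `search-forward' returns nil
      ;; instead of signaling an error, so the loop ends quietly.
      (while (search-forward "<span class=\"code\">" nil t)
        (setq count (1+ count)))
      count)))

(my-count-tag "x <span class=\"code\">a</span> y <span class=\"code\">b</span>")
;; ⇒ 2
```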
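As a sketch, the two positions of the opening tag can be collected like this. The function name my-opening-tag-bounds is hypothetical, and the text is put in a temporary buffer just for the demonstration.

```elisp
(defun my-opening-tag-bounds (html)
  "Return (BEGIN . END) of the first code-span opening tag in HTML, or nil."
  (with-temp-buffer
    (insert html)
    (goto-char (point-min))
    (when (search-forward "<span class=\"code\">" nil t)
      (let ((end (point)))      ; point sits just after the tag's ">"
        (search-backward "<")   ; the nearest "<" is the start of the tag
        (cons (point) end)))))

(my-opening-tag-bounds "ab <span class=\"code\">x</span>")
;; ⇒ (4 . 23)
```

Buffer positions start at 1, so the “<” after the 3 leading characters is at position 4, and the 19-character tag ends at 23.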

Now, we need to get the begin/end positions of the matching end tag. This may be a problem because the tags are nested, so there may be many </span> before the one we want.

The good thing is that emacs's html-mode has the function sgml-skip-tag-forward. It moves the cursor from a beginning tag to just after its matching end tag.

Once we have the begin/end positions of the beginning and ending tags, the replacement is easy. Just use delete-region, then use insert to insert the new tag we want. One important thing is that we should replace the ending tag first, because if we replace the beginning tag first, the text after it shifts and the saved positions of the ending tag are no longer correct.
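Here's a small sketch of sgml-skip-tag-forward on nested spans. It assumes the function can be called outside html-mode once sgml-mode is loaded, which is why the example has an explicit require. The function name my-element-text is made up.

```elisp
(require 'sgml-mode) ; defines `sgml-skip-tag-forward'

(defun my-element-text (html)
  "Return the first code-span element in HTML, through its matching end tag."
  (with-temp-buffer
    (insert html)
    (goto-char (point-min))
    (when (search-forward "<span class=\"code\">" nil t)
      (search-backward "<")          ; back to the start of the opening tag
      (let ((begin (point)))
        ;; Skips nested <span>…</span> pairs and stops after the
        ;; </span> that matches the tag at point.
        (sgml-skip-tag-forward 1)
        (buffer-substring begin (point))))))

;; The inner </span> is skipped; the result should run to the outer one.
(my-element-text "<span class=\"code\">a <span>b</span> c</span> tail")
```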
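The ordering can be seen in a short sketch that works on a string in a temporary buffer. The function name my-span-to-bracket and its variable names are hypothetical, and it handles only the first code span, without the prompt and the checks of the real script.

```elisp
(require 'sgml-mode) ; defines `sgml-skip-tag-forward'

(defun my-span-to-bracket (html)
  "Return HTML with its first code-span element rewritten as 「…」."
  (with-temp-buffer
    (insert html)
    (goto-char (point-min))
    (when (search-forward "<span class=\"code\">" nil t)
      (let ((open-end (point))
            open-begin close-begin close-end)
        (search-backward "<")
        (setq open-begin (point))
        (sgml-skip-tag-forward 1)      ; to just after the matching </span>
        (setq close-end (point))
        (search-backward "<")
        (setq close-begin (point))
        ;; Ending tag first: this edit is after the opening tag,
        ;; so open-begin and open-end are still correct afterwards.
        (delete-region close-begin close-end)
        (goto-char close-begin)
        (insert "」")
        ;; Only now replace the beginning tag.
        (delete-region open-begin open-end)
        (goto-char open-begin)
        (insert "「")))
    (buffer-string)))

(my-span-to-bracket "x <span class=\"code\">a <span>b</span> c</span> y")
```

Another way to avoid the ordering problem is to save the positions as markers with copy-marker, since markers move along with the text. Plain integer positions plus end-tag-first is simpler for a one-off script.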

Complete Code

;; -*- coding: utf-8 -*-
;; 2010-08-25

;; change
;; <span class="code">…</span>
;; to
;; 「…」

(setq inputDir "~/web/xahlee_org/" ) ; dir should end with a slash

(defun my-process-file (fPath)
  "process the file at fullpath fPath …"
  (let ( myBuff changedQ p3 p4 p8 p9)

    ;; open the file
    ;; search for the tag
    ;; if found, move to the beginning of tag, mark positions of begin/end of < and >
    ;; use sgml-skip-tag-forward to move to the end matching tag </span>
    ;; mark positions of begin/end of < and >
    ;; replace them with 「 and 」
    ;; repeat
    (setq myBuff (find-file fPath ) )
    (setq changedQ nil )

    (goto-char (point-min))
    (while
        (search-forward "<span class=\"code\">"  nil t)
      (backward-char 1)
      (if (looking-at ">")
          (setq p4 (1+ (point)) )
        (error "expecting >" )
        )

      ;; go to beginning of "<span class="code">"
      (sgml-skip-tag-backward 1)
      (if (looking-at "<")
          (setq p3 (point) )
        (error "expecting <" )
        )
      (forward-char 2)

      ;; go to end of </span>
      (sgml-skip-tag-forward 1)
      (backward-char 1)
      (if (looking-at ">")
          (setq p9 (1+ (point)) )
        (error "expecting >" )
        )

      ;; go to beginning of </span>
      (backward-char 6)
      (if (looking-at "<")
          (setq p8 (point) )
        (error "expecting <" )
        )

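      (when (y-or-n-p "change? ")
        ;; replace the ending tag first, so p3 and p4 stay valid
        (delete-region p8 p9  )
        (goto-char p8)
        (insert "」")
        ;; then replace the beginning tag
        (delete-region p3 p4 )
        (goto-char p3)
        (insert "「")
        (setq changedQ t )
        ))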

    ;; if not changed, close it. Else, leave buffer open
    (if changedQ
        (progn (make-backup))                        ; leave it open
      (progn (kill-buffer myBuff))
      )
    ))

(require 'find-lisp)
(require 'sgml-mode) ; for sgml-skip-tag-forward and sgml-skip-tag-backward

(let (outputBuffer)
  (setq outputBuffer "*span tag to code tag*" )
  (with-output-to-temp-buffer outputBuffer
    (mapc 'my-process-file (find-lisp-find-files inputDir "\\.html$"))
    (princ "Done deal!")
    )
  )

In the code above, i also put extra checks to make sure that the position of beginning tag is really the < char. Same for ending tag. (probably redundant, but i tend to be extra careful.)

Also, i used the y-or-n-p function, so emacs prompts me before each change and i can visually check it.

For those files that are changed, i leave them open. So, if i decide on a whim that i don't want all these changes to happen on potentially hundreds of files, i can simply close all the buffers with 4 keystrokes in ibuffer. Same if i want to save them all. 〔see Emacs: ibuffer tutorial〕

For files where no change takes place, the buffer is simply closed.

In the above, i also called “make-backup”. I want to make a backup of each changed file, but without relying on emacs's automatic backup mechanism (i have it turned off). For the code, see: Emacs: Backup Current File 🚀.

Emacs is fantastic!