Process HTML with Emacs Lisp: Transform FAQ Tags

By Xah Lee. Date: 2007-11-30. Last updated: 2012-04-18.

This page shows an example of using Emacs Lisp for a text processing job. It shows how the Emacs buffer type gives a significant advantage over Perl or Python for processing nested text.

Problem

I want to write an elisp program that processes an HTML file in a somewhat complex way. Specifically, certain strings must be replaced only if they appear inside a particular tag, and/or only if they are the first child.

Detail

I have many web pages that are in Questions And Answers format. The following is an example of the raw HTML:

<p class="q">Q: Why …</p>
<p class="a">A: Because …</p>
<p class="a">You need to do …</p>
…
<p class="q">Q: How …</p>
<p class="a">A: Do this …</p>
<p class="a">And that …</p>
…

Basically, each Question section is a paragraph of class “q”, and each Answer section is several <p> tags with class “a”.

After a few years with this format, I started to use a better one. Specifically, an Answer section should just be wrapped in a single <div class="a">…</div>, and the “Q: ” and “A: ” strings are removed from the content, because CSS can insert them automatically, like this: p.q:before {content:"Q: "}. Here's an example of the new format:

<p class="q">Why?</p>

<div class="a">
<p>Because this.</p>
<p>You need to do that.</p>
</div>

The task I have now is to transform existing pages to this new format. Here's what needs to be done precisely:

For each run of consecutive <p class="a">…</p> blocks, wrap the run in <div class="a"> and </div>, then replace each <p class="a"> with <p>. Also, remove the “Q: ” and “A: ” prefixes.

Although this is simple in principle, it's hard to code as described without an HTML parser. Using an HTML parser has its own problems: the HTML/DOM model would make the code much more complex, and the output would change the placement of whitespace. Unless we are doing XML transformation on a larger scale, an HTML/DOM parser is usually not what we want. A text-based search-and-replace algorithm to achieve the above is as follows:

For each occurrence of <p class="q">, do the following:

Add a <div class="a"> right after <p class="q">…</p>.
Add a </div> right before <p class="q">.
Replace <p class="q">Q: with <p class="q">, and replace <p class="a">A: with <p class="a">.

then:

Remove the first </div>, the one that comes before the first <p class="q"> (no answer section is open at that point).
Add a </div> right after the last <p class="a">…</p>.
Replace every <p class="a"> with <p>.

We proceed to write elisp code to solve this problem.

Solution

The algorithm described sounds simple, but isn't trivial to do in Perl or Python. For example, one of the steps is:

Add a <div class="a"> right after <p class="q">…</p>.

It would take some coding to get the meaning of “right after” correct. Similarly, other steps involve finding a string immediately before or after occurrences of another string, with conditions such as no further occurrence of some string coming after.

With Emacs, this is much easier, because Emacs represents a file as a buffer with a pointer (called point) that can move back and forth. So, we can just search by string or regex, forward or backward, freely move our cursor, and compare positions to locate the right piece of text.
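As a small illustration of this idea, here is a sketch that is not part of the final program. It checks whether the first question is followed by an answer before the next question shows up, simply by moving point and comparing positions. The function name my-answer-follows-p and the sample HTML are made up for this example, and it assumes the text contains at least one question (otherwise the first search signals an error).

```elisp
(defun my-answer-follows-p (html)
  "Return non-nil if the first question in HTML is followed by an answer.
Hypothetical helper for illustration only."
  (with-temp-buffer
    (insert html)
    (goto-char (point-min))
    ;; Move point to just after the first question paragraph.
    (search-forward "<p class=\"q\">")
    (search-forward "</p>")
    ;; Look ahead twice from the same spot; save-excursion puts point back.
    (let ((next-answer (save-excursion
                         (search-forward "<p class=\"a\">" nil t)))
          (next-question (save-excursion
                           (search-forward "<p class=\"q\">" nil t))))
      ;; Compare positions: the answer must come before the next question.
      (and next-answer
           (or (null next-question)
               (< next-answer next-question))))))

(my-answer-follows-p
 "<p class=\"q\">Q: Why?</p>\n<p class=\"a\">A: Because.</p>\n<p class=\"q\">Q: How?</p>\n")
;; → t
```

The same look-ahead in a stream-oriented script would need manual index bookkeeping; here the buffer keeps track of the position for us.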

First, we write a prototype that just works on a single file. Here's the code:

(defun xx ()
  "temp test function"
  (interactive)
  (find-file "elisp_process_html_sample.html")
  (goto-char (point-min))

;; add opening and closing tags for answer section
;; this is done by locating the opening question tag,
;; then the first answer tag after it, and inserting <div class="a"> before that.
;; then, locate the next opening question tag but move backward to </p>,
;; then insert </div> right after that </p>
  (while (search-forward "<p class=\"q\">" nil t)
    (search-forward "<p class=\"a\">")
    (replace-match "<div class=\"a\">\n<p class=\"a\">")
    (if (search-forward "<p class=\"q\">" nil t)
        (progn
          (search-backward "</p>")
          (forward-char 4)
          (insert "\n</div>")
          )
      )
    )

;; add the last closing tag for answer section
  (end-of-buffer)
  (search-backward "<p class=\"a\">")
  (search-forward "</p>")
  (insert "\n</div>")

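;; take out the “Q: ”
  (beginning-of-buffer)
  (while (search-forward "<p class=\"q\">Q: " nil t)
    (replace-match "<p class=\"q\">"))

;; take out the “A: ”, replacing “<p class="a">A: ” by “<p>”
  (beginning-of-buffer)
  (while (search-forward "<p class=\"a\">A: " nil t)
    (replace-match "<p>"))

;; replace the remaining “<p class="a">” by “<p>”
  (beginning-of-buffer)
  (while (search-forward "<p class=\"a\">" nil t)
    (replace-match "<p>"))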
)

This is simple code. It takes advantage of Emacs's buffer data structure for files: it moves a pointer back and forth to the desired place, then searches, replaces, or inserts text. With the ability to move point to a particular string, we can locate the places where the tags should be inserted, without explicitly following the DOM model of parent-child relationships between tags.
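To try the logic without touching a file, the same steps can be run on a string in a temporary buffer. The following is a sketch under some assumptions: the name my-faq-transform-string is made up, every question is assumed to have at least one answer (otherwise the inner search signals an error), and a single regex with an optional “A: ” is used for the last step.

```elisp
(defun my-faq-transform-string (html)
  "Return HTML with each answer section wrapped in <div class=\"a\">.
Hypothetical helper; assumes every question has at least one answer."
  (with-temp-buffer
    (insert html)
    (goto-char (point-min))
    ;; Open a div before the first answer of each question, and close
    ;; the previous div right after the </p> preceding the next question.
    (while (search-forward "<p class=\"q\">" nil t)
      (search-forward "<p class=\"a\">")
      (replace-match "<div class=\"a\">\n<p class=\"a\">" t t)
      (when (search-forward "<p class=\"q\">" nil t)
        (search-backward "</p>")
        (forward-char 4)
        (insert "\n</div>")))
    ;; Close the last answer section.
    (goto-char (point-max))
    (when (search-backward "<p class=\"a\">" nil t)
      (search-forward "</p>")
      (insert "\n</div>"))
    ;; Strip the "Q: " prefix from questions.
    (goto-char (point-min))
    (while (search-forward "<p class=\"q\">Q: " nil t)
      (replace-match "<p class=\"q\">" t t))
    ;; Turn every answer tag into a plain <p>, dropping "A: " if present.
    (goto-char (point-min))
    (while (re-search-forward "<p class=\"a\">\\(?:A: \\)?" nil t)
      (replace-match "<p>" t t))
    (buffer-string)))
```

Calling it on a short sample, such as one question followed by two answer paragraphs, returns the new format as a string, which makes it easy to compare against the expected output before running anything on real pages.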

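In the above code, the search-forward function moves the cursor to the end of the matched text. When its third argument is non-nil, as in (search-forward … nil t), it returns nil if nothing is found; without that argument, a failed search signals an error. The search-backward function works similarly, but puts point at the beginning of the matched text.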

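The replace-match function replaces the text matched by the last search. The end-of-buffer function moves point to the end of the buffer, and beginning-of-buffer moves it to the beginning. Both are designed for interactive use (they also set the mark); in Lisp code, (goto-char (point-max)) and (goto-char (point-min)) are the recommended equivalents.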
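Here is a tiny sketch that shows these return values and point positions on a made-up string. The function name my-search-demo is hypothetical; the search functions are the standard ones.

```elisp
(defun my-search-demo ()
  "Return a list showing what the search functions return.
Hypothetical demo function."
  (with-temp-buffer
    (insert "abc def abc")
    (goto-char (point-min))
    (list
     ;; Found: returns the position just after "def" and leaves point there.
     (search-forward "def" nil t)
     ;; Not found, third argument t: returns nil and point does not move.
     (search-forward "xyz" nil t)
     (point)
     ;; Backward search: point ends up at the start of the match.
     (search-backward "abc" nil t))))

(my-search-demo) ; → (8 nil 8 1)
```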

String Search (ELISP Manual)

Now, if we want to process many files, we first need to change the code to take a file path, and add code to save and close the buffer. Like this:

(defun my-process-html (fPath)
  "Process a file at FPATH…"
  (let (myBuffer)
    (setq myBuffer (find-file fPath))
    ; code body here
    (save-buffer)
    (kill-buffer myBuffer)
  )
)
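As an alternative sketch, the file can be edited without visiting it, by reading it into a temporary buffer and writing the result back. The function name my-process-html-quietly is made up, and only the “Q: ” removal step is shown as the body. This is an assumption about what may be convenient for batch jobs, not a replacement for the template above.

```elisp
(defun my-process-html-quietly (file-path)
  "Strip \"Q: \" from question paragraphs in FILE-PATH, in place.
Hypothetical variant that edits the file without visiting it."
  (with-temp-buffer
    (insert-file-contents file-path)
    (goto-char (point-min))
    (while (search-forward "<p class=\"q\">Q: " nil t)
      (replace-match "<p class=\"q\">" t t))
    ;; Write the whole temporary buffer back over the original file.
    (write-region (point-min) (point-max) file-path)))
```

This skips major mode setup and hooks, which can help when there are many files, but it also makes no backup files, so it is wise to keep a copy of the originals first.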

To get the list of files containing the Q and A sections, we can simply use the Unix “find” and “grep” commands, like this: find . -name "*\.html" -exec grep -l '<p class="q">' {} \;. (Or just use Emacs. See: Elisp: Write grep.)
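For the Emacs route, a minimal sketch could look like the following. The function name my-files-containing is made up; it relies on directory-files-recursively, which needs Emacs 25.1 or later, and it reads every file fully, so it is only meant for modest directory trees.

```elisp
(defun my-files-containing (dir text)
  "Return the .html files under DIR whose contents include TEXT.
Hypothetical helper; needs Emacs 25.1 or later."
  (let (result)
    (dolist (file (directory-files-recursively dir "\\.html\\'"))
      (with-temp-buffer
        (insert-file-contents file)
        (goto-char (point-min))
        ;; Plain string search; the third argument t means return nil
        ;; instead of signaling an error when TEXT is absent.
        (when (search-forward text nil t)
          (push file result))))
    ;; push builds the list backwards, so restore the original order.
    (nreverse result)))

;; Example call, with a made-up directory:
;; (my-files-containing "~/web/emacs/" "<p class=\"q\">")
```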

Then, put the file names into a list and map over it, like this:

(mapc 'my-process-html
      (list
       "~/web/emacs/x160.html"
       "~/web/emacs/x085.html"
       "~/web/emacs/x493.html"
       ))

The mapc function is a Lisp idiom for applying a function to every element of a list. The first argument is a function; the second is a list. The single quote in front of the function name is necessary: it prevents the symbol from being evaluated as a variable, so that mapc receives the function's name itself. Otherwise, Lisp normally evaluates all arguments in an expression (f a b c …).
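A short sketch of the quoting, using a made-up variable my-seen and a made-up function my-record that just collects its arguments:

```elisp
(defvar my-seen nil
  "Hypothetical log of processed names, newest first.")

(defun my-record (name)
  "Push NAME onto `my-seen'."
  (push name my-seen))

;; Without the quote, Lisp would try to evaluate my-record as a variable
;; and signal a void-variable error.
(mapc 'my-record (list "x160.html" "x085.html"))

;; #'my-record means the same thing, and lets the byte compiler check
;; that a function by that name exists.
(mapc #'my-record (list "x493.html"))

my-seen ; → ("x493.html" "x085.html" "x160.html")
```

Note that mapc is used for its side effects and returns the original list; when the results of each call are wanted, mapcar collects them into a new list.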

(thanks to Ivanov Dmitry for a correction in the elisp code.)
