Emacs Lisp: Batch Transform HTML to HTML5 “figure” Tag

By Xah Lee. Date:

Another triumph of using elisp for text processing over {Perl, Python}.

Problem

I want to batch transform the image tags in 5 thousand HTML files to use HTML5's new “figure” and “figcaption” tags.

I want to be able to view each change interactively, with the option to give it a “go ahead” to do the whole job in batch.

Interactive eyeball verification on many cases lets me be reasonably sure the transform is done correctly. It also lets me see whether i want to push forward with this change.

Detail

HTML5 has the following new tags: “figure” and “figcaption”. They are used like this:

<figure>
<img src="cat.jpg" alt="my cat" width="167" height="106">
<figcaption>my cat!</figcaption>
</figure>

(For detail, see: HTML5 “figure” and “figcaption” Tags Browser Support)

On my website, i used a similar structure. They look like this:

<div class="img">
<img src="cat.jpg" alt="my cat" width="167" height="106">
<p class="cpt">my cat!</p>
</div>

So, i want to replace them with HTML5's new tags. For this simple case, it can be done with a regex. Here's the “find” regex:

<div class="img">
?<img src="\([^"]+?\)" alt="\([^"]+?\)" width="\([0-9]+?\)" height="\([0-9]+?\)">
?<p class="cpt">\([^<]+?\)</p>
?</div>

Here's the replacement string:

<figure>
<img src="\1" alt="\2" width="\3" height="\4">
<figcaption>\5</figcaption>
</figure>

Then, you can use find-dired and dired's dired-do-query-replace-regexp to work on your 5 thousand pages. Nice. [see Emacs: Interactive Find Replace Text in Directory]
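To see the find/replace pair at work outside of dired, here's a minimal sketch that applies it to a string in a temp buffer. The names my-find-regex, my-replacement and my-div-img-to-figure-regex are made up for illustration. The regex and replacement are the ones above with each line break made optional, written as elisp strings, so every backslash is doubled and every double quote is escaped.

```elisp
;; Hypothetical names, for illustration only.
(defvar my-find-regex
  (concat
   "<div class=\"img\">\n?"
   "<img src=\"\\([^\"]+?\\)\" alt=\"\\([^\"]+?\\)\""
   " width=\"\\([0-9]+?\\)\" height=\"\\([0-9]+?\\)\">\n?"
   "<p class=\"cpt\">\\([^<]+?\\)</p>\n?"
   "</div>")
  "The “find” regex, as an elisp string.")

(defvar my-replacement
  (concat
   "<figure>\n"
   "<img src=\"\\1\" alt=\"\\2\" width=\"\\3\" height=\"\\4\">\n"
   "<figcaption>\\5</figcaption>\n"
   "</figure>")
  "The replacement string, as an elisp string.")

(defun my-div-img-to-figure-regex (html)
  "Return HTML with simple one-image div.img blocks rewritten as figure blocks."
  (with-temp-buffer
    (insert html)
    (goto-char (point-min))
    (while (re-search-forward my-find-regex nil t)
      ;; second arg t: keep the letter case of the replacement as written
      (replace-match my-replacement t))
    (buffer-string)))
```

Calling it on the div.img sample above should give back the figure version. dired-do-query-replace-regexp does the same kind of match and substitution, but asks before each one.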

However, the problem here is more complicated. There may be more than one image per group. Also, the caption part may contain complicated HTML. Here are some examples:

<div class="img">
<img src="cat1.jpg" alt="my cat" width="200" height="200">
<img src="turtle.jpg" alt="my turtle" width="200" height="200">
<p class="cpt">my cat and my turtle</p>
</div>

<div class="img">
<img src="jamie_cat.jpg" alt="jamie's cat" width="167" height="106">
<p class="cpt">jamie's cat! Her blog is <a href="http://example.com/jamie/">http://example.com/jamie/</a></p>
</div>

So, a solution by regex is out.
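A quick check makes the point concrete. This is only a sketch: the variable names are made up, and the regex is the simple one-image pattern from earlier (line breaks optional), written as an elisp string. string-match-p returns nil when there is no match.

```elisp
;; Hypothetical names, for illustration only.
(defvar my-one-image-regex
  (concat
   "<div class=\"img\">\n?"
   "<img src=\"\\([^\"]+?\\)\" alt=\"\\([^\"]+?\\)\""
   " width=\"\\([0-9]+?\\)\" height=\"\\([0-9]+?\\)\">\n?"
   "<p class=\"cpt\">\\([^<]+?\\)</p>\n?"
   "</div>")
  "The simple one-image “find” regex.")

(defvar my-two-images
  (concat
   "<div class=\"img\">\n"
   "<img src=\"cat1.jpg\" alt=\"my cat\" width=\"200\" height=\"200\">\n"
   "<img src=\"turtle.jpg\" alt=\"my turtle\" width=\"200\" height=\"200\">\n"
   "<p class=\"cpt\">my cat and my turtle</p>\n"
   "</div>")
  "Sample with two images in one group.")

(defvar my-link-in-caption
  (concat
   "<div class=\"img\">\n"
   "<img src=\"jamie_cat.jpg\" alt=\"jamie's cat\" width=\"167\" height=\"106\">\n"
   "<p class=\"cpt\">jamie's cat! Her blog is "
   "<a href=\"http://example.com/jamie/\">http://example.com/jamie/</a></p>\n"
   "</div>")
  "Sample whose caption contains a link.")

;; Both return nil: the second <img> and the <a> inside the caption
;; are not allowed by the pattern.
(string-match-p my-one-image-regex my-two-images)
(string-match-p my-one-image-regex my-link-in-caption)
```

One could keep patching the regex for each new case, but a caption can hold arbitrarily nested tags, which is exactly what a regex is bad at.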

Solution

The solution is pretty simple. Here are the major steps:

  1. Use find-lisp-find-files to traverse a dir. Needs (require 'find-lisp).
  2. For each file, open it.
  3. Search for the string <div class="img">.
  4. Use sgml-skip-tag-forward to jump to its closing tag.
  5. Save the positions of these tag begin/end positions.
  6. Ask user if she wants to replace. If so, do it. (using delete-region and insert)
  7. Repeat.

Here's the code:

;; -*- coding: utf-8 -*-
;; 2011-07-03
;; replace image tags to use HTML5's “figure”  and “figcaption” tags.

;; Example. This:
;; <div class="img">…</div>
;; should become this
;; <figure>…</figure>

;; do this for all files in a dir.

;; rough steps:
;; find the <div class="img">
;; use sgml-skip-tag-forward to move to the ending tag.
;; save their positions.
;; ask user whether to replace, if so, delete them and insert new string

(defun my-process-file (fPath)
  "Process the file at FPATH"
  (let (myBuff p1 p2 p3 p4 )
    (setq myBuff (find-file fPath))

    (widen)
    (goto-char (point-min)) ;; in case buffer already open

    (while (search-forward "<div class=\"img\">" nil t)
      (progn
        (setq p2 (point) )
        (backward-char 17) ; beginning of “div” tag
        (setq p1 (point) )

        (forward-char 1)
        (sgml-skip-tag-forward 1) ; move to the closing tag
        (setq p4 (point) )
        (backward-char 6) ; beginning of the closing div tag
        (setq p3 (point) )
        (narrow-to-region p1 p4)

        (when (y-or-n-p "replace?")
          (progn
            (delete-region p3 p4 )
            (goto-char p3)
            (insert "</figure>")

            (delete-region p1 p2 )
            (goto-char p1)
            (insert "<figure>") ) )
        ;; widen even when the answer is “n”, else the rest of the file is never searched
        (widen) ) )

    (when (not (buffer-modified-p myBuff)) (kill-buffer myBuff) )

    ) )

(require 'find-lisp)

(let (outputBuffer)
  (setq outputBuffer "*xah img/figure replace output*" )
  (with-output-to-temp-buffer outputBuffer
    (mapc 'my-process-file (find-lisp-find-files "~/web/xahlee_org/emacs/" "\\.html$"))
    (princ "Done deal!")
    ) )

Seems pretty simple, right?

The “p1” and “p2” variables are the start/end positions of <div class="img">. The “p3” and “p4” are the start/end of its closing tag </div>.

We also used a little trick with widen and narrow-to-region. It lets me see just the part that i'm interested in. It narrows to the beginning/end of the div.img. This makes eyeballing a bit easier.
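A side note on the narrowing trick: here's a minimal sketch, assuming you'd rather have emacs undo the narrowing for you than call widen by hand. save-restriction is a built-in special form that restores the previous narrowing state when its body exits. The function name my-call-narrowed is made up for illustration.

```elisp
(defun my-call-narrowed (start end fn)
  "Narrow to START..END, call FN with no arguments, then restore the old view.
Return whatever FN returns."
  (save-restriction
    (narrow-to-region start end)
    (funcall fn)))
```

In the loop above, it could be used like (my-call-narrowed p1 p4 (lambda () (y-or-n-p "replace?"))). The buffer is back to its full view after the question is answered, whichever key was pressed.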

The real time-saver is the sgml-skip-tag-forward function from html-mode. Without that, one'd have to write a mini-parser to deal with HTML's nested ways to be able to locate the proper ending tag.
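Here's a small sketch of what sgml-skip-tag-forward does with nested tags of the same name. The helper name my-first-element is made up. The (require 'sgml-mode) line is there because the function is defined in that library, which may not be loaded yet if no HTML file has been opened.

```elisp
(require 'sgml-mode) ; defines sgml-skip-tag-forward

(defun my-first-element (html)
  "Return the element that HTML starts with, opening tag thru matching closing tag."
  (with-temp-buffer
    (insert html)
    (goto-char (point-min))
    ;; jumps over the nested <div>…</div> and stops after the outer </div>
    (sgml-skip-tag-forward 1)
    (buffer-substring-no-properties (point-min) (point))))

(my-first-element
 "<div class=\"img\"><div>nested</div><p class=\"cpt\">cap</p></div><p>after</p>")
;; returns "<div class=\"img\"><div>nested</div><p class=\"cpt\">cap</p></div>"
```

A plain search for the first </div> would have stopped at the inner one.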

Using the above code, i can comfortably eyeball and press “y” at a rate of about 5 per second. That makes 300 replacements per minute. I have 5000+ files. If we assume there are 6k replacements to be made, then 5 per second means 20 minutes of sitting there pressing “y”. Quite tiresome.

So, now, the next step is simply to remove the asking (y-or-n-p "replace?"). Or, if i'm absolutely paranoid, i can make emacs write into a log buffer for every replacement it makes (together with the file path). When the batch replacement is done (probably takes 1 or 2 minutes), i can simply scan thru the log to see if any replacement went wrong. For an example of that, see: Emacs Lisp: Multi-Pair String Replacement with Report.
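The logging idea can be sketched as a small helper. This is an assumption about how one might do it, not the code from the linked article. The buffer name and the function name my-log-replacement are made up.

```elisp
(defvar my-figure-log-name "*figure replace log*"
  "Hypothetical name of the buffer that collects the report.")

(defun my-log-replacement (file old-text)
  "Append FILE and OLD-TEXT to the log buffer, followed by a blank line."
  (with-current-buffer (get-buffer-create my-figure-log-name)
    (goto-char (point-max))
    (insert file "\n" old-text "\n\n")))
```

In my-process-file, one would call (my-log-replacement fPath (buffer-substring-no-properties p1 p4)) just before the two delete-region calls, so the log holds the original text of each block together with the file it came from.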

Also note that i left each changed file unsaved in emacs. If i decide i don't want to commit the changes, i can exit emacs without saving. Or, i can go to ibuffer and press 3 keys to save them all: * u S. But if you want them saved with elisp, you can just add (save-buffer). Note that emacs automatically makes a backup~ of the original files if you haven't turned that off.
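For saving with elisp, here's a minimal sketch. The function name my-save-and-close is made up. It saves a buffer only if it visits a file and has unsaved changes, then kills it.

```elisp
(defun my-save-and-close (buffer)
  "Save BUFFER if it visits a file and has unsaved changes, then kill it."
  (with-current-buffer buffer
    (when (and buffer-file-name (buffer-modified-p))
      (save-buffer)))
  (kill-buffer buffer))
```

In my-process-file, the last (when (not (buffer-modified-p myBuff)) (kill-buffer myBuff)) could become (my-save-and-close myBuff). Whether a backup~ gets written on that first save is controlled by the variable make-backup-files, so check it before running this on files you care about.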

But what about replacing <p class="cpt">…</p> with <figcaption>…</figcaption>?

I simply copy-pasted the above code into a new file, and made changes in 4 places. So, the figcaption replacement is done in a separate, second batch job. Of course, one could spend an extra hour to make the code do them both in one pass, but that extra time of thinking and coding isn't worthwhile for this one-time job.
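If you'd rather not copy-paste, the places that differ between the two jobs can be turned into parameters. Here's a sketch of that, not the code i actually ran. The name my-retag is made up, it works on the current buffer without asking, and it adds one check that the original code doesn't have: it skips a block when the text found at the end is not the expected closing tag.

```elisp
(require 'sgml-mode) ; defines sgml-skip-tag-forward

(defun my-retag (old-open old-close new-open new-close)
  "In the current buffer, replace each OLD-OPEN…OLD-CLOSE pair of tags
with NEW-OPEN…NEW-CLOSE.  Return the number of pairs replaced."
  (let ((count 0) p1 p2 p3 p4)
    (save-excursion
      (goto-char (point-min))
      (while (search-forward old-open nil t)
        (setq p2 (point)
              p1 (- p2 (length old-open)))
        (goto-char p1)
        (sgml-skip-tag-forward 1) ; move past the matching closing tag
        (setq p4 (point)
              p3 (max (point-min) (- p4 (length old-close))))
        ;; replace only when we really landed after OLD-CLOSE
        (when (string= (buffer-substring-no-properties p3 p4) old-close)
          ;; closing tag first, so p1 and p2 stay valid
          (delete-region p3 p4)
          (goto-char p3)
          (insert new-close)
          (delete-region p1 p2)
          (goto-char p1)
          (insert new-open)
          (setq count (1+ count)))))
    count))
```

The two jobs then become (my-retag "<div class=\"img\">" "</div>" "<figure>" "</figure>") and (my-retag "<p class=\"cpt\">" "</p>" "<figcaption>" "</figcaption>").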

I ♥ Emacs, do you?

Change in Current Buffer

Here's the code that changes both {div.img, p.cpt} to {figure, figcaption} in one shot, on the current buffer. It outputs the changes to a temp buffer, so you can scan them.

(defun xah-fix-wrap-img-figure ()
  "Change current buffer's <div class=\"img\"> to <figure> and <p class=\"cpt\"> to <figcaption>."
  (interactive)

  (save-excursion
    (let (p1 p2 p3 p4
             myStr
             $changes
             (changedItems '())
             (myBuff (current-buffer))
             )

      (goto-char (point-min)) ;; start from the top of the buffer
      (while (search-forward "<div class=\"img\">" nil t)
        (progn
          (setq p2 (point) )
          (backward-char 17)
          (setq p1 (point) )

          (forward-char 1)
          (sgml-skip-tag-forward 1)
          (setq p4 (point) )
          (backward-char 6)
          (setq p3 (point) )

          (when t
            (setq myStr (buffer-substring-no-properties p1 p4))
            (setq changedItems (cons myStr changedItems ) )

            (progn
              (delete-region p3 p4 )
              (goto-char p3)
              (insert "</figure>")

              (delete-region p1 p2 )
              (goto-char p1)
              (insert "<figure>")
               )
            ) ) )

      (goto-char (point-min)) ;; start over from the top for the captions
      (while (search-forward "<p class=\"cpt\">" nil t)
        (progn
          (setq p2 (point) )
          (backward-char 15)
          (setq p1 (point) )

          (forward-char 1)
          (sgml-skip-tag-forward 1)
          (setq p4 (point) )
          (backward-char 4)
          (setq p3 (point) )

          (when t
            (setq myStr (buffer-substring-no-properties p1 p4))
            (setq changedItems (cons myStr changedItems ) )

            (progn
              (delete-region p3 p4 )
              (goto-char p3)
              (insert "</figcaption>")

              (delete-region p1 p2 )
              (goto-char p1)
              (insert "<figcaption>")
               )
            ) ) )

      (with-output-to-temp-buffer "*changed items*"
        (mapc (lambda ( $changes) (princ $changes) (princ "\n\n") ) changedItems)
        (set-buffer "*changed items*")
        (funcall 'html-mode)
        (set-buffer myBuff)
        ) )) )

PS: if you are wondering about that weird char “$” in the variable name, don't mind it, it's my personal experiment in variable naming. See: Variable Naming: English Words Considered Harmful.