ELisp: Process HTML Files, Create ZIP Archive

By Xah Lee. Date: . Last updated: .

Problem

I create a downloadable version of a website for people to read offline.

This is a fairly complex text processing task. This tutorial walks through the complete solution, step by step.

Detail

Consider this emacs tutorial you are reading.

One simple solution is to make a copy of the directory, zip it, and let people download that. But that alone won't work, because many relative links would be broken.

The relative links in the HTML files need to be fixed up, so that the copy works when read offline.

It would be nice if I could just press a button in emacs and have this archive generated automatically. And whenever I have updated my emacs tutorial, I can run the script again to regenerate a fresh downloadable version.

Solution

The general plan is simple:

  1. Copy the directories into a destination directory.
  2. Call shell commands to delete temp files such as emacs backup files in the destination dir.
  3. Have a function that processes each HTML file, to change relative links and remove the Google Analytics JavaScript code.
  4. Call shell commands to archive this dir.

First, we define some user input parameters for the script:

;; web root dir
(setq webroot "/Users/xah/web/") ; must end in slash

;; list of source dirs I want to archive
;; Each is relative to webroot. Must not end in slash.
(setq sourceDirsList (list "emacs" "elisp"))

;; Destination dir path, relative to webroot
;; This is the dir I want the archive to be in
(setq destDirRelativePath "diklo")

;; dest zip archive name (without the “.zip” suffix)
;; for example here, the download file will be xah_emacs_tutorial.zip
(setq zipCoreName "xah_emacs_tutorial")

;; whether to use gzip or zip.
(setq use-gzip-p nil)

Then, we define some convenient constants.

(setq destRoot (concat webroot destDirRelativePath "/"))
(setq destDir (concat destRoot zipCoreName "/"))

So, destRoot would be /Users/xah/web/diklo/ and destDir would be /Users/xah/web/diklo/xah_emacs_tutorial/. The final download archive would be /Users/xah/web/diklo/xah_emacs_tutorial.zip (or xah_emacs_tutorial.tar.gz when use-gzip-p is non-nil).

Now, we copy the source dirs to destination.

;;; copy to destination
(mapc
 (lambda (x)
   (let (fromDir toDir)
     (setq fromDir (concat webroot x))
     (setq toDir
           (drop-last-slashed-substring
            (concat webroot destDirRelativePath "/" zipCoreName "/" x)) )
     (make-directory toDir t)
     (shell-command (concat "cp -R " fromDir " " toDir))
     )
   )
 sourceDirsList)

The above code uses the function mapc. It has the form “(mapc 'myfunc myList)”, where the function myfunc is applied to each element of myList. The function we use here is “(lambda (x) …)”, with “x” being the argument. Inside the lambda, the source dir and dest dir paths are constructed, then make-directory is called; with a non-nil second argument, it creates all the parent dirs of the given path. Finally, a cp -R command line string is constructed and shell-command is called to copy the dirs.
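
For example, here's a minimal illustration of mapc (not part of the script; it just prints each element of a list):

;; apply a function to each element of a list, for its side effect
(mapc
 (lambda (x) (message "processing %s" x))
 (list "emacs" "elisp"))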

Also, we called “drop-last-slashed-substring”, which is defined as follows:

(defun drop-last-slashed-substring (path)
  "Drop the last path separated by “/”.
For example:
“/a/b/c/d” → “/a/b/c”
“/a/b/c/d/” → “/a/b/c/d”
“/” → “”
“//” → “/”
“” → “”"
  (if (string-match "\\(.*/\\)+" path)
      (substring path 0 (1- (match-end 0)))
    path))
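
For a quick sanity check, here's a sample call (the path is only an illustration):

(drop-last-slashed-substring "/Users/xah/web/emacs")
;; ⇒ "/Users/xah/web"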

Copying a bunch of directories seems a trivial operation, but it actually took me a couple hours to arrive at the final code, due to some “smart” behavior of unix cp -R.

Originally, I thought the code would be something simple, like several calls of (shell-command (concat "cp -R " fromDir " " toDir)), one for each source dir, where fromDir and toDir are full paths. However, it turns out the problem is slightly more complex. Here's a summary:

(1) The copying behavior depends on whether the destination node exists. When the destination node exists, cp copies the source node itself into it; otherwise, it creates the destination node and copies the source node's children into it. (2) However, when the destination node's parent doesn't exist, it's an error.
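
To spell out the cases (an illustration of the behavior described above; the dir names here are made up):

;; cp -R bar baz      ⇒ if dir “baz” exists, the result is baz/bar
;; cp -R bar baz      ⇒ if “baz” doesn't exist, it is created and receives bar's children
;; cp -R bar qux/baz  ⇒ error, if the parent dir “qux” doesn't exist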

Now, we copy over my site's style sheets and the icons dir.

;; copy the style sheets over, and icons dir
(shell-command (concat "cp /Users/xah/web/style1.css " destDir))
(shell-command (concat "cp /Users/xah/web/style2.css " destDir))
(shell-command (concat "cp /Users/xah/web/style3.css " destDir))
(shell-command (concat "cp -R /Users/xah/web/ics " destDir))

Now, do some file cleanup.

; remove emacs backup files, temp files, mac os x files, etc.
(shell-command (concat "find " destDir " -name \"*~\"  -exec rm {} \\;"))
(shell-command (concat "find " destDir " -name \"#*#\"  -exec rm {} \\;"))
(shell-command (concat "find " destDir " -type f -name \"xx*\"  -exec rm {} \\;"))
(shell-command (concat "find " destDir " -type f -name \"\\.DS_Store\"  -exec rm {} \\;"))
(shell-command (concat "find " destDir " -type f -empty -exec rm {} \\;"))
(shell-command (concat "find " destDir " -type d -empty -exec rmdir {} \\;"))
(shell-command (concat "find " destDir " -type d -name \"xx*\" -exec rm -R {} \\;"))

Now, we need to modify the relative links: if a link points to a file that is not part of the downloadable copy, change it to a http://xahlee.org/… based link.

For example, a file in my emacs tutorial, say /Users/xah/web/emacs/xxx.html, contains the link <a href="../python/index.html">Python tutorial</a>, which points to a file outside the emacs dir. When a user downloads my emacs tutorial, this link will point to a file that doesn't exist on their disk. The link ../python/index.html should be changed to “http://xahlee.org/python/index.html”.

Also, my HTML files contain a JavaScript snippet for Google Analytics, like this: <script src="http://www.google-analytics.com/urchin.js" type="text/javascript"></script><script type="text/javascript"> _uacct = "UA-104620-2"; urchinTracker();</script>. It lets me see my web traffic statistics. The downloaded version shouldn't have this line.

Here's the code to process each HTML file for the above problems:

;;; change local links to “http://” links.
;;; Delete the google JavaScript snippet, and other small fixes.
(setq make-backup-files nil)
(require 'find-lisp)
(mapc (lambda (x)
        (mapc
         (lambda (fPath) (clean-file fPath (concat webroot (substring fPath (length destDir)))))
         (find-lisp-find-files (concat destDir "/" x) "\\.html$"))
        )
      sourceDirsList
)

In the above code, we use mapc to apply a function to all HTML files. find-lisp-find-files generates a list of all files in a dir whose names match a given regexp. Here, we actually call mapc twice, one nested inside the other.
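
For example, a call like the following (the path here is hypothetical, mirroring the dest dir layout above) returns a list of full paths of HTML files under that dir:

(require 'find-lisp)
(find-lisp-find-files "/Users/xah/web/diklo/xah_emacs_tutorial/emacs" "\\.html$")
;; ⇒ ("/Users/xah/web/diklo/xah_emacs_tutorial/emacs/index.html" …)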

The sourceDirsList is a list of dirs. So, the first mapc maps a function over each of these dirs. Then, for each dir, we want to apply a function to all its HTML files; that's what the inner mapc is for. The function that actually processes each HTML file is “clean-file”. It takes 2 arguments: the first is the full path to the HTML file to be processed, the second is the full path to the “same” file in the source dir. The second argument is necessary, because we need the original directory structure to compute the correct URL for the relative links that need to be fixed. Here's the code:

(defun clean-file (fPath originalFilePath)
  "Modify the HTML file at fPath, to make it ready for download bundle.

This function change local links to “http://” links,
Delete the google JavaScript snippet, and other small changes,
so that the file is nicer to be viewed offline at some computer
without the entire xahlee.org's web dir structure.

The google JavaScript is the Google Analytics web bug that tracks
 web stat to xahlee.org.

fPath is the full path to the HTML file that will be processed.
originalFilePath is full path to the “same” file in the original web structure.
originalFilePath is used to construct new relative links."
  (let (myBuffer bds p1 p2 linkPath linkPathSansJumper)

    (setq myBuffer (find-file fPath))

    (goto-char (point-min)) ;in case buffer already open
    (while (search-forward "<script src=\"http://www.google-analytics.com/urchin.js\" type=\"text/javascript\"></script><script type=\"text/javascript\"> _uacct = \"UA-104620-2\"; urchinTracker();</script>" nil t)
      (replace-match ""))

    (goto-char (point-min))
    (while (search-forward "<a href=\"http://xahlee.org/PageTwo_dir/more.html\">Xah Lee</a>" nil t)
      (replace-match "<a href=\"http://xahlee.org/PageTwo_dir/more.html\">Xah Lee↗</a>"))

    ;; go thru each link, if the link is local,
    ;;then check if the file exist.
    ;;if not, replace the link with proper http://xahlee.org/ url
    (goto-char (point-min)) ; in case buffer already open

    (while (re-search-forward "<[[:blank:]]*a[[:blank:]]+href[[:blank:]]*=[[:blank:]]*" nil t)
      (forward-char 1)
      (setq bds (bounds-of-thing-at-point 'filename))
      (setq p1 (car bds))
      (setq p2 (cdr bds))
      (setq linkPath (buffer-substring-no-properties p1 p2))

      (when (not (string-match "^http://" linkPath))

        ;; get rid of trailing jumper, for example, “Abstract-Display.html#top”
        (setq linkPathSansJumper (replace-regexp-in-string "^\\([^#]+\\)#.+" "\\1" linkPath t))

        (when (not (file-exists-p linkPathSansJumper))
          (delete-region p1 p2)
          (let (newLinkPath)
            (setq newLinkPath
                  (compute-url-from-relative-link originalFilePath linkPath webroot "xahlee.org"))
            (insert newLinkPath))
          (search-forward "</a>")
          (backward-char 4)
          (insert "↗")
          )
        )
      )
    (save-buffer)
    (kill-buffer myBuffer)))

In the above function “clean-file”, the hard part is to construct the correct URL for a relative link.

Given a file, there are many relative links, and each link may or may not still be good in the download copy. For example, if a relative link does not start with ../, then it is still good. However, if it starts with ../, it may or may not still be good. For example, in my emacs tutorial project, both /Users/xah/web/emacs/ and /Users/xah/web/elisp/ are part of the download archive. So, if some file under the emacs dir has a relative link starting with ../elisp/, then it is still a good link, and we don't want to replace it with a http:// version. To handle the relative links correctly, we need to know the original dir structure.

Computing relative links is conceptually trivial: each occurrence of ../ means one dir level up. But coding it correctly took a while, due to various little issues. For example, some links have a trailing jumper of the form Abstract-Display.html#top; the trailing #top must be removed before the string can be used to check whether the file exists. Theoretically, all it takes to resolve a relative link is the path of the file that contains the link, the relative link string, and the dir tree structure surrounding the file. In particular, when we move a dir and wish to construct or fix relative links, we do not need to check whether the linked file still exists in the new dir. In practice, though, it's much simpler to first determine whether the relative link is still good, by checking whether the linked file exists in the new download copy's dir structure.
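
For example, stripping the trailing jumper with the same regex used in clean-file:

(replace-regexp-in-string "^\\([^#]+\\)#.+" "\\1" "Abstract-Display.html#top" t)
;; ⇒ "Abstract-Display.html"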

The clean-file function first grabs a relative link string from the HTML file, then determines whether the link needs to be fixed, and if so calls “compute-url-from-relative-link”, which returns the proper “http://” based URL. The function compute-url-from-relative-link takes 4 parameters: fPath, linkPath, webDocRoot, hostName. See the doc string below:

(defun compute-url-from-relative-link (fPath linkPath webDocRoot hostName)
  "returns a “http://” based URL of a given linkPath,
based on its fPath, webDocRoot, hostName.

fPath is the full path to a HTML file.
linkPath is a string that's relative path to another file,
from a “<a href=\"…\"> tag.”
webDocRoot is the full path to a parent dir of fPath.
Returns a URL of the form “http://hostName/‹urlPath›”
that points to the same file as linkPath.

For example, if
fPath is /Users/xah/web/Periodic_dosage_dir/t2/mirrored.html
linkPath is ../../p/demonic_males.html
webDocRoot is /Users/xah/web/
hostName is xahlee.org
then result is http://xahlee.org/p/demonic_males.html

Note that webDocRoot may or may not end in a slash."
  (concat "http://" hostName "/"
          (substring
           (file-truename (concat (file-name-directory fPath) linkPath))
           (length (file-name-as-directory (directory-file-name webDocRoot))))))
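
Here's a sample call, using the values from the doc string:

(compute-url-from-relative-link
 "/Users/xah/web/Periodic_dosage_dir/t2/mirrored.html"
 "../../p/demonic_males.html"
 "/Users/xah/web/"
 "xahlee.org")
;; ⇒ "http://xahlee.org/p/demonic_males.html"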

Finally, we zip up the dest dir.

;; zip the dir
(let (ff)
  (setq ff (concat webroot destDirRelativePath "/" zipCoreName ".zip"))
  (when (file-exists-p ff) (delete-file ff))
  (setq ff (concat webroot destDirRelativePath "/" zipCoreName ".tar.gz"))
  (when (file-exists-p ff) (delete-file ff)))

(setq default-directory (concat webroot destDirRelativePath "/"))

(when (equal
       0
       (if use-gzip-p
           (shell-command (concat "tar cfz " zipCoreName ".tar.gz " zipCoreName))
         (shell-command (concat "zip -r " zipCoreName ".zip " zipCoreName))
         ))
  (shell-command (concat "rm -R " destDir))
)

In the above code, we first delete the previous archive if it exists. Then we set default-directory to the dest dir's parent, so the shell command runs there and the archive stores the dir under its short name. Finally, if the zip or tar command exits with status 0, we delete the intermediate dir.

Now, all is done. With all the code above in a buffer, I can just eval-buffer to generate my downloadable archive, or I can run the script from the operating system's command line, like emacs --script make_download_copy.el. I decided to go one step further, by wrapping the whole script into a function, like this:

(defun make-downloadable-copy (webroot sourceDirsList destDirRelativePath
zipCoreName &optional use-gzip-p)
  "Make a copy of web dir of XahLee.org for download.

This function depends on the structure of XahLee.org,
and is not useful in general.

• webroot is the website doc root dir. (must end in slash)
For example: “/Users/xah/web/”

• sourceDirsList is a list of dir paths relative to webroot,
to be copied for download. Must not end in slash.
e.g. (list \"p/time_machine\")

• destDirRelativePath is the destination dir of the download.
it's a dir path, relative to webroot.
e.g. “diklo”

• zipCoreName is the downloadable archive name, without the suffix.
e.g. “time_machine”

• use-gzip-p: if non-nil, use gzip (tar.gz) for the final archive;
otherwise use zip."
  (let (…)
  ;; all the code above here except functions.
  )
)

Here's how I call it:

;; emacs tutorial and elisp manual in one download archive
;; gzip format
(make-downloadable-copy
 "/Users/xah/web/"
 (list "emacs" "elisp")
 "diklo" "xah_emacs_tutorial" "gzip")

;; elisp manual. zip format.
(make-downloadable-copy
 "/Users/xah/web/"
 (list "elisp")
 "diklo" "elisp_manual")

Emacs 🧡