Elisp: Fix Dead Links

By Xah Lee.

This page shows you how to write an elisp script that checks thousands of HTML files and fixes dead links.

Problem

I have 2 thousand HTML files that contain about 70 dead local links. I need to write an elisp script to change these links to non-links. For example, this is a dead link:

<a href="../widget/index.html#Top">Introduction</a>

I need it to be:

<span class="εlink" title="../widget/index.html#Top">Introduction</span>

The script should run in batch, and it should generate a report of all changed links.

Detail

I have a copy of the emacs manuals, at:

These manuals sometimes link to other info manuals that are not part of emacs. For example, the page “Changing Files - GNU Emacs Lisp Reference Manual” contains a link to GNU coreutils like this:

<a href="../coreutils/File-Permissions.html">File Permissions</a>

I need to change these links to non-links.

Solution

Here's an outline of the steps.

  1. Open each file.
  2. Search for “href=”.
  3. Get the link URL.
  4. Check if the link is a local file and exists.
  5. If not, change the entire link tag into a “span” tag.
  6. Repeat the above, until no more links are found.

First, we start like this:

(setq inputDir "~/web/xahlee_org/emacs_manual/" )

(defun my-process-file (fPath)
  "Process the file at FPATH"
  …
)

;; traverse the directory on all HTML files
(require 'find-lisp)
(mapc 'my-process-file (find-lisp-find-files inputDir "\\.html$"))
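Before running anything that edits files, it can help to see what the traversal line hands to “my-process-file”. Here is a small sketch for that. The name my-list-html-files is made up for illustration; find-lisp-find-files is the same call used above, and it returns full paths of matching files in the directory and its subdirectories.

```elisp
(require 'find-lisp)

(defun my-list-html-files (dir)
  "Return a list of full paths of all .html files under DIR, recursively.
Hypothetical helper for illustration; it only wraps `find-lisp-find-files'."
  (find-lisp-find-files dir "\\.html$"))

;; example call (not run here, the directory is the one from this page):
;; (my-list-html-files "~/web/xahlee_org/emacs_manual/")
```

The regex is matched against each file name, so “\\.html$” picks up “index.html” but skips “notes.txt”.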

The important part is the “my-process-file” function. Here's the basic code:

(defun my-process-file (fPath)
  "Process the file at FPATH"
  (let (…)

    ;; open file
    (setq myBuff (find-file fPath))

    (while
        ;; search local link
        (search-forward "href=\"../" nil t)

      ;; get the URL string, dropping any “#…” fragment so it names a file
      (setq urlStr (replace-regexp-in-string "\\.html#.+" ".html" (thing-at-point 'filename) ) )

      ;; if the URL is a dead link
      (when (not (file-exists-p urlStr))
        (progn

          ;; set p1 and p2 to be the start/end of the link tag
          ;; and get the entire link string
          (sgml-skip-tag-backward 1)
          (setq p1 (point) ) ; start of link tag
          (sgml-skip-tag-forward 1)
          (setq p2 (point) ) ; end of link tag
          (setq wholeLinkStr (buffer-substring-no-properties p1 p2) )

          ;; get link text
          (search-backward "</a>")
          (setq p4 (point) ) ; end of link text
          (search-backward ">")
          (forward-char 1)
          (setq p3 (point) ) ; start of link text
          (setq linkText (buffer-substring-no-properties p3 p4) )

          ;; remove the link, replace it with a non-link span text.
          (delete-region p1 p2)
          (insert
           "<span class=\"εlink\" title=\""
           urlStr
           "\">"
           linkText
           "</span>"
           )
          )
        )
      )

    ;; close the file if no changes made
    (when (not (buffer-modified-p myBuff)) (kill-buffer myBuff) )

    ) )
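The existence check is the heart of the script, so it may help to try it in isolation. The sketch below is only one way to package that step; my-dead-link-p and its DIR argument are hypothetical names, not part of the script. Inside “my-process-file” the directory is implicit, because file-exists-p resolves a relative path against the default-directory of the buffer being visited.

```elisp
(defun my-dead-link-p (url dir)
  "Return non-nil if relative URL, resolved against directory DIR, names no file.
Hypothetical helper.  Any #fragment after .html is dropped first,
because `file-exists-p' needs a plain file path."
  (let ((path (replace-regexp-in-string "\\.html#.+" ".html" url)))
    (not (file-exists-p (expand-file-name path dir)))))

;; example call (result depends on what is on your disk):
;; (my-dead-link-p "../widget/index.html#Top" "~/web/xahlee_org/emacs_manual/elisp/")
```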

Complete Code

Here's the complete code.

;; -*- coding: utf-8 -*-
;; 2011-09-25
;; replace dead links in emacs manual on my website
;;
;; Example. This:
;; <a href="../widget/index.html#Top">Introduction</a>
;;
;; should become this
;;
;; <span class="εlink" title="../widget/index.html#Top">Introduction</span>
;;
;; do this for all files in a dir.

;; rough steps:
;; go thru each file
;; search for link
;; if the link is 「../xx/」 where the file doesn't exist, then replace the whole link tag.

(setq inputDir "~/web/xahlee_org/emacs_manual/" ) ; dir should end with a slash

(defun my-process-file (fPath)
  "Process the file at FPATH"
  (let (
        myBuff
        urlStr
        linkText
        wholeLinkStr
        p1 p2
        p3 p4
        )
    (setq myBuff (find-file fPath))
    (widen) ; in case it's open and narrowed
    (goto-char (point-max)) ; work from bottom, so that changes in point are preserved. (actually, doesn't really matter for this script)

    (while
        (search-backward "href=\"../" nil t)
      (forward-char 7)
      (setq urlStr (replace-regexp-in-string "\\.html#.+" ".html" (thing-at-point 'filename) ) )

      (when (not (file-exists-p urlStr))
        (progn
          (sgml-skip-tag-backward 1)
          (setq p1 (point) )                      ; start of link tag
          (sgml-skip-tag-forward 1)
          (setq p2 (point) )                      ; end of link tag

          (setq wholeLinkStr (buffer-substring-no-properties p1 p2) )

          (search-backward "</a>")
          (setq p4 (point) )                      ; end of link text
          (search-backward ">")
          (forward-char 1)
          (setq p3 (point) )                      ; start of link text

          (setq linkText (buffer-substring-no-properties p3 p4) )

          (princ (buffer-file-name))
          (princ "\n")
          (princ wholeLinkStr)
          (princ "\n")
          (princ "----------------------------\n")

          (delete-region p1 p2)
          (insert
           "<span class=\"εlink\" title=\""
           urlStr
           "\">"
           linkText
           "</span>"
           )
          )
        )
      )

    (when (not (buffer-modified-p myBuff)) (kill-buffer myBuff) )

    ) )

(require 'find-lisp)

(font-lock-mode 0)

(with-output-to-temp-buffer "*xah elisp dead link replace output*"
    (mapc 'my-process-file (find-lisp-find-files inputDir "\\.html$"))
    (princ "Done deal!")
    )

(font-lock-mode 1)

Here are a few interesting parts.

Turn Syntax Coloring Off

We turn font lock off with (font-lock-mode 0). With font lock on, processing 2 thousand HTML files takes about 50 minutes. With syntax coloring off, it takes 3 minutes.
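One caveat worth checking on your own setup: font-lock-mode is a buffer-local minor mode, so calling it at the top of a script affects the buffer that is current at that moment. global-font-lock-mode is the switch that covers buffers opened later. The following is a sketch under that assumption, with the hypothetical name my-call-without-font-lock. It turns the global mode off around a function call and restores it afterwards, even if the call signals an error.

```elisp
(defun my-call-without-font-lock (fn)
  "Call function FN with `global-font-lock-mode' off, then restore it.
Hypothetical helper; returns whatever FN returns."
  (let ((was-on global-font-lock-mode))
    (global-font-lock-mode 0)
    (unwind-protect
        (funcall fn)
      ;; turn syntax coloring back on only if it was on before
      (when was-on (global-font-lock-mode 1)))))

;; example call, wrapping the traversal from the script:
;; (my-call-without-font-lock
;;  (lambda () (mapc 'my-process-file (find-lisp-find-files inputDir "\\.html$"))))
```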

Leave Changed Files Open

If a file has changes, we leave it open. This way, we don't have to revert to backup files if there's a mistake. If we like the result, just call ibuffer and press * u to mark all unsaved buffers, then S to save them all, then D to close them all. If you do not want to save them, mark all unsaved with * u, then press D to close them all.

This is extremely useful while you are still working on the code and doing some test runs. This interactive nature of emacs is what beats {Perl, Python, etc} for text processing.

If you do want to save the file in the script, simply call (save-buffer) or (write-file (buffer-file-name))

When the file is not modified, we close it. Like this: (when (not (buffer-modified-p myBuff)) (kill-buffer myBuff) ).
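The two policies, leave open for review or save right away, can be put behind one switch. This is a minimal sketch, not what the script above does; my-finish-buffer and its SAVE-P argument are hypothetical names.

```elisp
(defun my-finish-buffer (buff save-p)
  "Finish processing buffer BUFF.  Hypothetical helper.
Unmodified: kill it.  Modified and SAVE-P non-nil: save, then kill it.
Modified and SAVE-P nil: leave it open for review.
Return one of the symbols closed, saved, left-open."
  (with-current-buffer buff
    (cond
     ((not (buffer-modified-p)) (kill-buffer buff) 'closed)
     (save-p (save-buffer) (kill-buffer buff) 'saved)
     (t 'left-open))))
```

While testing, call it with SAVE-P nil and review the open buffers in ibuffer. Once the output looks right, pass t.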

Use sgml-skip-tag-forward

sgml-skip-tag-forward and sgml-skip-tag-backward come with html-mode (they are defined in sgml-mode, the library that implements html-mode). They move the cursor to the beginning or end of a tag. They are extremely useful: they save you a lot of time writing code to parse tags, especially when tags are nested. Here's how we used them.

Suppose there's this link in a file:

<a href="../widget/index.html#Top">Introduction</a>

After we did the search with

(while
    (search-backward "href=\"../" nil t)
  …
  )

the cursor is at the “h” of “href”, which is inside the opening tag. While the cursor is inside the tag, we call:

(sgml-skip-tag-backward 1)
(setq p1 (point) ) ; start of link tag
(sgml-skip-tag-forward 1)
(setq p2 (point) ) ; end of link tag

(setq wholeLinkStr (buffer-substring-no-properties p1 p2) )

This sets the value of wholeLinkStr to the entire anchor tag <a …>…</a>.
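To experiment with these two functions without opening a real file, the same calls can be run on a string in a temporary buffer. This is a sketch; my-link-at-href is a hypothetical name, and the explicit (require 'sgml-mode) is there because no HTML file is visited to load the library.

```elisp
(require 'sgml-mode) ; defines sgml-skip-tag-backward and sgml-skip-tag-forward

(defun my-link-at-href (html)
  "Return the whole <a …>…</a> element around the last href in string HTML.
Hypothetical helper for experimenting; return nil if there is no href."
  (with-temp-buffer
    (insert html)
    ;; point is at the end of the buffer, so search from the bottom up
    (when (search-backward "href=\"" nil t)
      (let (p1 p2)
        (sgml-skip-tag-backward 1)
        (setq p1 (point))            ; start of link tag
        (sgml-skip-tag-forward 1)
        (setq p2 (point))            ; end of link tag
        (buffer-substring-no-properties p1 p2)))))

;; example call:
;; (my-link-at-href
;;  "<p>See <a href=\"../widget/index.html#Top\">Introduction</a> for more.</p>")
```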

Print Output to Your Own Buffer

Printing output is done here using with-output-to-temp-buffer and princ. Like this:

(with-output-to-temp-buffer "*xah elisp dead link replace output*"
    (mapc 'my-process-file (find-lisp-find-files inputDir "\\.html$"))
    (princ "Done deal!")
    )

Inside the “my-process-file” function, we write:

(princ (buffer-file-name))
(princ "\n")
(princ wholeLinkStr)
(princ "\n")
(princ "----------------------------\n")
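The shape of one report entry can be checked without running the whole script. The sketch below swaps in with-output-to-string, which captures what princ prints and returns it as a string instead of showing it in a buffer. The name my-report-entry is hypothetical.

```elisp
(defun my-report-entry (file link)
  "Return the report text for one changed LINK found in FILE.
Hypothetical helper; `with-output-to-string' captures the `princ' output."
  (with-output-to-string
    (princ file)
    (princ "\n")
    (princ link)
    (princ "\n")
    (princ "----------------------------\n")))

;; example call:
;; (my-report-entry "/tmp/x.html" "<a href=\"../widget/index.html#Top\">Introduction</a>")
```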

Here's the output from the script: elisp_fix_dead_links_output.txt. It lets me easily see if there are any errors. There are a total of 68 changes.

For details about printing in elisp, see: Elisp: Print, Output.