Elisp: Fix Dead Links
This page shows you how to write an elisp script that checks thousands of HTML files and fixes dead links.
Problem
I have about 2 thousand HTML files that contain about 70 dead local links. I need to write an elisp script to change these links to non-links. For example, this is a dead link:
<a href="../widget/index.html#Top">Introduction</a>
I need it to be:
<span class="εlink" title="../widget/index.html#Top">Introduction</span>
The script should run in batch and generate a report of all changed links.
Detail
I have a copy of the emacs manuals, at:
- GNU Emacs Manual (~690 files)
- GNU Emacs Lisp Reference Manual (~900 files)
These manuals sometimes link to other info manuals that are not part of emacs. For example, the page “Changing Files - GNU Emacs Lisp Reference Manual” contains a link to GNU coreutils like this:
<a href="../coreutils/File-Permissions.html">File Permissions</a>
I need to change these links to non-links.
Solution
Here's an outline of the steps.
- Open each file.
- Search for “href=”.
- Get the link URL.
- Check if the link is a local file and exists.
- If not, change the entire link tag into a “span” tag.
- Repeat the above, until no link found.
First, we start like this:
(setq inputDir "~/web/xahlee_org/emacs_manual/" )

(defun my-process-file (fPath)
  "Process the file at FPATH"
  … )

;; traverse the directory on all HTML files
(require 'find-lisp)
(mapc 'my-process-file (find-lisp-find-files inputDir "\\.html$"))
The important part is the “my-process-file” function. Here's the basic code:
(defun my-process-file (fPath)
  "Process the file at FPATH"
  (let (…)

    ;; open file
    (setq myBuff (find-file fPath))

    (while
        ;; search local link
        (search-forward "href=\"../" nil t)

      ;; get the URL string
      (setq urlStr (thing-at-point 'filename) )

      ;; if the URL is a dead link
      (when (not (file-exists-p urlStr))
        (progn
          ;; set p1 and p2 to be the start/end of the link tag
          ;; and get the entire link string
          (sgml-skip-tag-backward 1)
          (setq p1 (point) ) ; start of link tag
          (sgml-skip-tag-forward 1)
          (setq p2 (point) ) ; end of link tag
          (setq wholeLinkStr (buffer-substring-no-properties p1 p2) )

          ;; get link text
          (search-backward "</a>")
          (setq p4 (point) ) ; end of link text
          (search-backward ">")
          (forward-char 1)
          (setq p3 (point) ) ; start of link text
          (setq linkText (buffer-substring-no-properties p3 p4) )

          ;; remove the link, replace it with a non-link span text.
          (delete-region p1 p2)
          (insert "<span class=\"εlink\" title=\"" urlStr "\">" linkText "</span>" ) ) ) )

    ;; close the file if no changes made
    (when (not (buffer-modified-p myBuff)) (kill-buffer myBuff) ) ) )
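The dead-link test in the basic code depends on two details: file-exists-p resolves a relative path against the current buffer's directory, and a trailing fragment such as “#Top” is not part of the file name. Here is a minimal sketch of that check as a standalone function. The name my-local-link-exists-p and its BASE-DIR parameter are hypothetical, not part of the script on this page.

```elisp
;; Hypothetical helper, for illustration only.
(defun my-local-link-exists-p (url-str base-dir)
  "Return non-nil if URL-STR, taken relative to BASE-DIR, is an existing file.
Any #fragment at the end of URL-STR is dropped before the check."
  (let ((path (replace-regexp-in-string "#.*\\'" "" url-str)))
    ;; expand-file-name resolves the ../ parts against BASE-DIR,
    ;; so the result does not depend on default-directory.
    (file-exists-p (expand-file-name path base-dir))))
```

For example, (my-local-link-exists-p "../widget/index.html#Top" "~/web/xahlee_org/emacs_manual/elisp/") checks for the file index.html in the sibling “widget” directory. The “elisp/” subdirectory in that call is an assumed path.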
Complete Code
Here's the complete code.
;; -*- coding: utf-8 -*-
;; 2011-09-25
;; replace dead links in emacs manual on my website
;;
;; Example. This:
;; <a href="../widget/index.html#Top">Introduction</a>
;;
;; should become this
;;
;; <span class="εlink" title="../widget/index.html#Top">Introduction</span>
;;
;; do this for all files in a dir.

;; rough steps:
;; go thru each file
;; search for link
;; if the link is 「../xx/」 where the file doesn't exist, then replace the whole link tag.

(setq inputDir "~/web/xahlee_org/emacs_manual/" ) ; dir should end with a slash

(defun my-process-file (fPath)
  "Process the file at FPATH"
  (let ( myBuff urlStr linkText wholeLinkStr p1 p2 p3 p4 )
    (setq myBuff (find-file fPath))
    (widen) ; in case it's open and narrowed
    (goto-char (point-max)) ; work from bottom, so that changes in point are preserved. (actually, doesn't really matter for this script)

    (while
        (search-backward "href=\"../" nil t)
      (forward-char 7)
      (setq urlStr (replace-regexp-in-string "\\.html#.+" ".html" (thing-at-point 'filename) ) )

      (when (not (file-exists-p urlStr))
        (progn
          (sgml-skip-tag-backward 1)
          (setq p1 (point) ) ; start of link tag
          (sgml-skip-tag-forward 1)
          (setq p2 (point) ) ; end of link tag
          (setq wholeLinkStr (buffer-substring-no-properties p1 p2) )

          (search-backward "</a>")
          (setq p4 (point) ) ; end of link text
          (search-backward ">")
          (forward-char 1)
          (setq p3 (point) ) ; start of link text
          (setq linkText (buffer-substring-no-properties p3 p4) )

          (princ (buffer-file-name))
          (princ "\n")
          (princ wholeLinkStr)
          (princ "\n")
          (princ "----------------------------\n")

          (delete-region p1 p2)
          (insert "<span class=\"εlink\" title=\"" urlStr "\">" linkText "</span>" ) ) ) )

    (when (not (buffer-modified-p myBuff)) (kill-buffer myBuff) ) ) )

(require 'find-lisp)

(font-lock-mode 0)

(with-output-to-temp-buffer "*xah elisp dead link replace output*"
  (mapc 'my-process-file (find-lisp-find-files inputDir "\\.html$"))
  (princ "Done deal!") )

(font-lock-mode 1)
Here are a few interesting parts.
Turn Syntax Coloring Off
We turn font lock off with (font-lock-mode 0). With font lock on, processing the 2 thousand HTML files takes about 50 minutes. With syntax coloring off, it takes about 3 minutes.
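A related way to avoid the coloring cost, when you only need to read a file, is to load it into a temporary buffer with insert-file-contents, which does not set up a major mode for the file. This is a different technique from the one the script uses, and it is only a sketch: the function name my-count-parent-links is hypothetical, and it just counts candidate links without changing anything.

```elisp
;; Hypothetical read-only scan, for illustration only.
(defun my-count-parent-links (fpath)
  "Return how many times href=\"../ occurs in the file at FPATH."
  (with-temp-buffer
    ;; insert-file-contents leaves point at the start of the inserted text
    (insert-file-contents fpath)
    (let ((count 0))
      (while (search-forward "href=\"../" nil t)
        (setq count (1+ count)))
      count)))
```

A quick pass like this can tell you which files are worth opening with find-file at all.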
Leave Changed Files Open
If a file has changes, we leave its buffer open. This way, we don't have to revert to backup files if there's a mistake. If we like the result, just call ibuffer and press * u to mark all unsaved buffers, then S to save them all, then D to close them all. If you do not want to save them, mark all unsaved buffers with * u, then press D to close them all.
This is extremely useful while you are still working on the code and doing some test runs. This interactive nature of emacs is what beats {Perl, Python, etc} for text processing.
If you do want to save the file in the script, simply call
(save-buffer)
or
(write-file (buffer-file-name))
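If you prefer to do the save step from code after reviewing the results, something like the following could stand in for the ibuffer keystrokes. It is a minimal sketch: the name my-save-modified-file-buffers is hypothetical, and it saves every modified file-visiting buffer in the session, not only the ones this script opened.

```elisp
;; Hypothetical helper, for illustration only.
(defun my-save-modified-file-buffers ()
  "Save every modified buffer that visits a file. Return how many were saved."
  (let ((count 0))
    (dolist (buf (buffer-list) count)
      (with-current-buffer buf
        ;; skip scratch/report buffers (no file) and unchanged buffers
        (when (and (buffer-file-name) (buffer-modified-p))
          (save-buffer)
          (setq count (1+ count)))))))
```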
When the file is not modified, we close it, like this:
(when (not (buffer-modified-p myBuff))
  (kill-buffer myBuff) )
Use sgml-skip-tag-forward
The functions sgml-skip-tag-forward and sgml-skip-tag-backward are from html-mode. They move the cursor to the beginning or end of a tag. They are extremely useful, and they save you a lot of time writing code to parse tags, especially when tags are nested. Here's how we used them.
Suppose there's this link in a file:
<a href="../widget/index.html#Top">Introduction</a>
After we did the search with
(while (search-backward "href=\"../" nil t) … )
the cursor is on the “h”. While the cursor is inside the tag, we call:
(sgml-skip-tag-backward 1)
(setq p1 (point) ) ; start of link tag
(sgml-skip-tag-forward 1)
(setq p2 (point) ) ; end of link tag
(setq wholeLinkStr (buffer-substring-no-properties p1 p2) )
This sets the value of wholeLinkStr to the entire anchor tag <a …>…</a>.
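To try the two tag-skipping calls without touching any real file, the same steps can be run on a string in a temporary buffer. This is a small sketch: the function name my-demo-whole-link is hypothetical, and it loads sgml-mode explicitly because no HTML file is opened here.

```elisp
(require 'sgml-mode) ; defines sgml-skip-tag-backward and sgml-skip-tag-forward

;; Hypothetical demo function, for illustration only.
(defun my-demo-whole-link (html)
  "Return the last <a href=\"../…\">…</a> element found in the string HTML, or nil."
  (with-temp-buffer
    (insert html)
    (goto-char (point-max))
    (when (search-backward "href=\"../" nil t)
      (let (p1 p2)
        (sgml-skip-tag-backward 1)
        (setq p1 (point)) ; start of link tag
        (sgml-skip-tag-forward 1)
        (setq p2 (point)) ; end of link tag
        (buffer-substring-no-properties p1 p2)))))
```

Calling it on a paragraph that contains the example link above should give back just the anchor element, without the surrounding text.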
Print Output to Your Own Buffer
Printing output is done here using with-output-to-temp-buffer and princ, like this:
(with-output-to-temp-buffer "*xah elisp dead link replace output*"
  (mapc 'my-process-file (find-lisp-find-files inputDir "\\.html$"))
  (princ "Done deal!") )
Inside the “my-process-file” function, we write:
(princ (buffer-file-name))
(princ "\n")
(princ wholeLinkStr)
(princ "\n")
(princ "----------------------------\n")
Here's the output from the script: elisp_fix_dead_links_output.txt. It lets me easily see if there are any errors. There are a total of 68 changes.
For detail about printing in elisp, see: Elisp: Print, Output.
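The reason the princ calls inside “my-process-file” end up in the report is that princ writes to whatever standard-output is bound to, and the wrapping macro rebinds it for everything called inside its body. The sketch below shows the same mechanism with with-output-to-string, a different macro that collects the output as a string, which makes it easy to inspect. The function names my-report-link and my-report-to-string are hypothetical.

```elisp
;; Hypothetical helpers, for illustration only.
(defun my-report-link (file-name whole-link-str)
  "Print one report entry for WHOLE-LINK-STR found in FILE-NAME."
  (princ file-name)
  (princ "\n")
  (princ whole-link-str)
  (princ "\n")
  (princ "----------------------------\n"))

(defun my-report-to-string (entries)
  "Return a report string for ENTRIES, a list of (FILE-NAME . LINK-STRING)."
  (with-output-to-string
    ;; princ calls made anywhere inside this body are captured
    (dolist (entry entries)
      (my-report-link (car entry) (cdr entry)))
    (princ "Done deal!")))
```

Swapping with-output-to-string for with-output-to-temp-buffer and a buffer name sends the same text to a buffer, as the script does.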