Elisp: Generate Web Links Report

By Xah Lee.

Here's how to use elisp to process files in a directory, search them for a text pattern, and generate a report.

Problem

You have 7 thousand HTML files. You want to generate a report that lists all links to Wikipedia, and the file each link is from.

In this tutorial, you'll learn:

how to use a hash table to collect results.
how to process a single file, searching for a text pattern.
how to walk a directory tree.
how to put it all together to generate a HTML report.

Solution

Here are the basic steps we need:

Walk the directory, visiting every HTML file.
In each file, search for links to Wikipedia.
Record each link in a hash table: key is the URL, value is a list of the files that contain it.

Once we have the data in a hash table, it is very flexible. We can then generate a report in plain text or HTML, or do other processing.

Hash Table

First, we'll need to know how to use a hash table in emacs. 〔see Elisp: Hash Table〕
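Here's a minimal hash table warm-up, showing just the calls we'll need (a sketch, with made-up data):

(setq xh (make-hash-table :test 'equal)) ; 'equal, because our keys are strings

(puthash "https://en.wikipedia.org/wiki/Emacs" (list "a.html") xh)

(gethash "https://en.wikipedia.org/wiki/Emacs" xh) ; ⇒ ("a.html")

(gethash "no such key" xh) ; ⇒ nil

(maphash (lambda (k v) (message "%s %S" k v)) xh) ; visit every entry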

Process a Single File

We want the hash table key to be the full URL string, and the value a list. Each element in the list is the full path of a file that contains the link.

Here is the code that processes a single file. It opens the file and searches for URLs. For each URL found, it checks whether the URL exists in the hash table: if not, add a new entry, else prepend the file path to the existing entry.

(setq myfile "/Users/xah/web/ergoemacs_org/emacs/GNU_Emacs_dev_inefficiency.html")

(setq myhash (make-hash-table :test 'equal))

(defun ff (myfile)
  "Test code to process a single file MYFILE."
  (let (myBuff url)

    (setq myBuff (find-file myfile)) ; open file

    (goto-char (point-min))

    ;; search for URL till not found
    (while
        (re-search-forward
         "href=\"\\(https://..\\.wikipedia\\.org/[^\"]+\\)\">\\([^<]+\\)</a>"
         nil t)

      (when (match-string 0)        ; if URL found
        (setq url (match-string 1)) ; set the url to matched string

        (print url)

        ;; if exist in hash, prepend to existing entry, else just add
        (if (gethash url myhash)
            (puthash url (cons myfile (gethash url myhash)) myhash)
          (puthash url (list myfile) myhash))))

    (kill-buffer myBuff) ; close file
    ))

(ff myfile)

(print myhash)
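The printed hash looks something like this (entry is illustrative; older Emacs versions also print the default size and rehash parameters):

#s(hash-table test equal data
              ("https://en.wikipedia.org/wiki/Emacs"
               ("/Users/xah/web/ergoemacs_org/emacs/GNU_Emacs_dev_inefficiency.html")))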

Walk a Directory

Now apply ff to every HTML file under a directory:

(mapc
 'ff
 (directory-files-recursively "~/web/ergoemacs_org/" "\\.html$"))

〔see Elisp: Walk Directory, List Files〕

Pretty Print Helpers

We need a helper that, given a Wikipedia URL, returns a HTML link string.

For example:

http://en.wikipedia.org/wiki/Emacs

becomes

<a href="http://en.wikipedia.org/wiki/Emacs">Emacs</a>

(require 'gnus-util) ; for gnus-url-unhex-string

(defun wikipedia-url-to-link (url)
  "Return the URL as HTML link string.
Example:
 http://en.wikipedia.org/wiki/Emacs%20Lisp
becomes
 <a href=\"http://en.wikipedia.org/wiki/Emacs%20Lisp\">Emacs Lisp</a>
"
  (let ((linkText url))
    (setq linkText (gnus-url-unhex-string linkText nil)) ; decode percent encoding. For example: %20
    (setq linkText (car (last (split-string linkText "/")))  ) ; get last part
    (setq linkText (replace-regexp-in-string "_" " " linkText )  ) ; low line → space
    (format "<a href=\"%s\">%s</a>" url linkText)
    ))
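A quick test, evaluating in the scratch buffer (the %20 decodes to a space):

(wikipedia-url-to-link "http://en.wikipedia.org/wiki/Emacs%20Lisp")
;; ⇒ "<a href=\"http://en.wikipedia.org/wiki/Emacs%20Lisp\">Emacs Lisp</a>"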

We also need a helper that, given a file path, returns a link string.

For example:

/Users/xah/web/ergoemacs_org/index.html

becomes

<a href="../index.html">ErgoEmacs</a>

where the link text comes from the file's “title” tag. Here is the function that extracts the title:

(defun get-html-file-title (fName)
  "Return FNAME <title> tag's text.
Assumes that the file contains the string
“<title>…</title>”."
  (with-temp-buffer
    (insert-file-contents fName nil nil nil t)
    (goto-char (point-min))
    ;; title text sits between "<title>" and "</title>"; 8 = length of "</title>"
    (buffer-substring-no-properties
     (search-forward "<title>")
     (- (search-forward "</title>") 8))))
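The full script below calls xahsite-filepath-to-href-value, a function from the author's personal library that is not shown in this article; it computes the href of a file relative to the report file. Here is a minimal stand-in sketch using the built-in file-relative-name (the name filepath-to-link is mine, not from the original):

(defun filepath-to-link (fPath outputPath)
  "Return FPATH as a HTML link string, href relative to OUTPUTPATH.
A simplified stand-in for the author's xahsite-filepath-to-href-value.
Link text is the file's title tag text."
  (format "<a href=\"%s\">%s</a>"
          (file-relative-name fPath (file-name-directory outputPath))
          (get-html-file-title fPath)))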

Putting It All Together

;; -*- coding: utf-8; lexical-binding: t; -*-
;; emacs lisp.

;; started: 2008-01-03.
;; version: 2019-06-11
;; author: Xah Lee
;; url: http://ergoemacs.org/emacs/elisp_link_report.html
;; purpose: generate a report of wikipedia links.

;; traverse a given dir, visiting every html file, find links to Wikipedia in those files, collect them, and generate a html report of these links and the files they are from, then write it to a given file. (overwrite if exists)


;; ssss---------------------------------------------------

(setq InputDir "/Users/xah/web/ergoemacs_org/" )

;; Overwrites existing
(setq OutputPath "/Users/xah/web/xahlee_org/wikipedia_links.html")

;; ssss---------------------------------------------------

;; add an ending slash to InputDir if not there
(when (not (string-equal "/" (substring InputDir -1)))
  (setq InputDir (concat InputDir "/")))

(when (not (file-exists-p InputDir)) (error "input dir does not exist: %s" InputDir))

(setq XahHeaderText
"<!doctype html><html><head><meta charset=\"utf-8\" />
<title>Links to Wikipedia from Xah Sites</title>
</head>
<body>

<nav class=\"n1\"><a href=\"index.html\">XahLee.org</a></nav>
")

(setq XahFooterText
  "
</body></html>
"
)

;; ssss---------------------------------------------------

;; hash table. key is string Wikipedia url, value is a list of file paths.
(setq LinksHash (make-hash-table :test 'equal :size 8000))

;; ssss---------------------------------------------------

(defun xah-add-link-to-hash (filePath hashTableVar)
  "Find Wikipedia links in FILEPATH and add them to the hash table HASHTABLEVAR."
  (let ( xurl)
    (with-temp-buffer
      (insert-file-contents filePath nil nil nil t)
      (goto-char (point-min))
      (while
          (re-search-forward
           "href=\"\\(https://..\\.wikipedia\\.org/[^\"]+\\)\">\\([^<]+\\)</a>"
           nil t)
        (setq xurl (match-string 1))
        (when (and
               xurl ; if url found
               ;; skip links that are not Wikipedia articles, e.g. user profile pages, edit history pages, search result pages
               (not (string-match "=" xurl))
               ;; skip links whose name is mostly percent-encoded unicode
               (not (string-match "%..%.." xurl))
               )

          ;; if exist in hash, prepend to existing entry, else just add
          (if (gethash xurl hashTableVar)
              (puthash xurl (cons filePath (gethash xurl hashTableVar)) hashTableVar) ; not using add-to-list because each Wikipedia URL likely only appear once per file
            (puthash xurl (list filePath) hashTableVar)) )) ) ) )

(defun xah-print-each (ele)
  "Print each item. ELE is of the form (url (list filepath1 filepath2 etc)).
Print it like this:
‹link to url› : ‹link to file1›, ‹link to file2›, …"
  (let (xwplink xfiles)
    (setq xwplink (car ele))
    (setq xfiles (cadr ele))

    (insert "<li>")
    (insert (wikipedia-url-to-linktext xwplink))
    (insert "—")

    (dolist (xx xfiles nil)
      (insert
       (format "<a href=\"%s\">%s</a>•"
               (xahsite-filepath-to-href-value xx OutputPath )
               (xah-html-get-html-file-title xx))))
    (delete-char -1)
    (insert "</li>\n"))
  )

;; ssss---------------------------------------------------

(defun wikipedia-url-to-linktext (P-url)
  "Return the title of a Wikipedia link.
Example:
http://en.wikipedia.org/wiki/Emacs
becomes
Emacs"
  (require 'url-util)
  (decode-coding-string
   (url-unhex-string ; decode percent encoding, e.g. %20
    (replace-regexp-in-string
     "_" " " ; underscore → space
     (replace-regexp-in-string
      "&amp;" "&" ; decode the HTML entity for ampersand
      ;; the article name is the last path segment of the URL
      (car (last (split-string P-url "/"))))))
   'utf-8))

(defun wikipedia-url-to-link (P-url)
  "Return the P-url as html link string.\n
Example:
http://en.wikipedia.org/wiki/Emacs
becomes
<a href=\"http://en.wikipedia.org/wiki/Emacs\">Emacs</a>"
  (format "<a href=\"%s\">%s</a>" P-url (wikipedia-url-to-linktext P-url)))

(defun xah-hash-to-list (HashTable)
  "Return a list that represent the hashtable HashTable.
Each element is a proper list: (key value).

URL `http://xahlee.info/emacs/emacs/elisp_hash_table_to_list.html'
Created: 2019-06-11
Version: 2022-05-28 2022-09-22"
  (let (xx)
    (maphash
     (lambda (k v)
       (push (list k v) xx))
     HashTable)
    xx))
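A quick check of the result shape (maphash visits entries in no particular order, so the list order is not guaranteed):

(let ((xh (make-hash-table :test 'equal)))
  (puthash "a" 1 xh)
  (puthash "b" 2 xh)
  (xah-hash-to-list xh))
;; ⇒ (("b" 2) ("a" 1)) or (("a" 1) ("b" 2))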

;; ssss---------------------------------------------------
;;;; main

;; fill LinksHash
(mapc
 (lambda (xx) (xah-add-link-to-hash xx LinksHash))
 (directory-files-recursively InputDir "\\.html$"))

;; fill LinksList
(setq LinksList
      (sort (xah-hash-to-list LinksHash)
            ;; (hash-table-keys LinksHash)
            (lambda (a b) (string< (downcase (car a)) (downcase (car b))))))

;; write to file
(with-temp-file OutputPath
  (insert XahHeaderText)
  (goto-char (point-max))

  (insert
   "<h1>Links To Wikipedia from XahLee.org</h1>\n\n"
   "<p>This page contains all existing links from xah sites to Wikipedia, as of ")

  (insert (format-time-string "%Y-%m-%d"))

  (insert
   ". There are a total of " (number-to-string (length LinksList)) " links.</p>\n\n"
   "<p>This file is automatically generated by a <a href=\"http://ergoemacs.org/emacs/elisp_link_report.html\">emacs lisp script</a>.</p>

"
   )

  (insert "<ol>\n")

  (mapc 'xah-print-each LinksList)

  (insert "</ol>

")

  (insert XahFooterText)
  (goto-char (point-max)))

;; clear memory
;; (clrhash LinksHash)
;; (setq LinksList nil)

;; open the file
(find-file OutputPath )
