Elisp: Generate Web Links Report
Here's how to use elisp to process all files in a directory, search each one for a text pattern, and generate a report.
Problem
You have 7 thousand HTML files. You want to generate a report that lists all links to Wikipedia, together with the path of the file each link is found in.
In this tutorial, you'll learn:
- how to walk a directory. 〔see Elisp: Walk Directory, List Files〕
- how to build a hash-table. 〔see Elisp: Hash Table〕
- the elisp idiom for quickly opening a large number of files.
Solution
Here are the basic steps we need:
- Given a file, extract links and put them into a hash table.
- Use elisp to traverse a given directory, open each file.
- Some pretty printing functions. For example, convert a URL string into HTML link string.
Once we have the data in a hash-table, it is very flexible. We can then generate a report in plain text or HTML, or do other processing.
Hash Table
First, we'll need to know how to use hash table in emacs. 〔see Elisp: Hash Table〕
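As a minimal refresher (a sketch, not part of the final script; the variable name xh is illustrative): create the table with the equal test, since our keys are strings, then use puthash and gethash to store and retrieve.

```elisp
;; minimal hash table sketch. xh is an illustrative name.
(setq xh (make-hash-table :test 'equal)) ; 'equal, because keys are strings

;; store: key is a URL, value is a list of file paths
(puthash "https://en.wikipedia.org/wiki/Emacs" (list "a.html") xh)

;; prepend another file path to the same key's list
(puthash "https://en.wikipedia.org/wiki/Emacs"
         (cons "b.html" (gethash "https://en.wikipedia.org/wiki/Emacs" xh))
         xh)

(gethash "https://en.wikipedia.org/wiki/Emacs" xh) ; ("b.html" "a.html")
```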
Process a Single File
We want the hash table key to be the full URL string, and the value a list. Each element in the list is the full path of a file that contains the link.
Here is test code that processes a single file. It opens the file and searches for URLs. For each URL found, it checks whether the URL already exists in the hash table: if not, it adds a new entry; otherwise it prepends the file path to the existing entry.
(setq myfile "/Users/xah/web/ergoemacs_org/emacs/GNU_Emacs_dev_inefficiency.html")

(setq myhash (make-hash-table :test 'equal))

(defun ff (&optional filePath)
  "Test code to process a single file.
If FILEPATH is nil, process the file at the variable `myfile'."
  (interactive)
  (let ((fpath (or filePath myfile))
        myBuff url)
    (setq myBuff (find-file fpath)) ; open file
    (goto-char (point-min))
    ;; search for URL till not found
    (while (re-search-forward "href=\"\\(https://..\\.wikipedia\\.org/[^\"]+\\)\">\\([^<]+\\)</a>" nil t)
      (when (match-string 0) ; if URL found
        (setq url (match-string 1)) ; set url to the matched string
        (print url)
        ;; if exist in hash, prepend to existing entry, else just add
        (if (gethash url myhash)
            (puthash url (cons fpath (gethash url myhash)) myhash)
          (puthash url (list fpath) myhash))))
    (kill-buffer myBuff))) ; close file

(ff)
(print myhash)
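The test code uses find-file, which is convenient but slow when opening thousands of files, because it runs major modes, font lock, and file hooks. The fast idiom, used in the final script, is with-temp-buffer plus insert-file-contents. Here is a sketch of that idiom (the function name my-count-wikipedia-links is illustrative, not from the final script; the regex is the same as above):

```elisp
(defun my-count-wikipedia-links (filePath)
  "Return the number of Wikipedia links in FILEPATH.
Illustrative sketch of the fast file-reading idiom:
read the file content into a temp buffer, skipping modes and hooks."
  (let ((n 0))
    (with-temp-buffer
      (insert-file-contents filePath)
      (goto-char (point-min))
      (while (re-search-forward
              "href=\"\\(https://..\\.wikipedia\\.org/[^\"]+\\)\">\\([^<]+\\)</a>"
              nil t)
        (setq n (1+ n))))
    n))
```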
Walk a Directory
(mapc 'ff (directory-files-recursively "~/web/ergoemacs_org/" "\\.html$"))
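directory-files-recursively returns a flat list of full paths, including files in subdirectories, whose names match the regexp. Before running the full pass over thousands of files, it can be useful to preview what will be processed. A sketch (the helper name my-html-files is illustrative):

```elisp
(defun my-html-files (dir)
  "Return all HTML file paths under DIR, recursively.
Illustrative helper for previewing what a directory walk will visit."
  (directory-files-recursively dir "\\.html$"))

;; preview before running the full pass, e.g.
;; (length (my-html-files "~/web/ergoemacs_org/"))
```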
〔see Elisp: Walk Directory, List Files〕
Pretty Print Helpers
Given a Wikipedia URL, return an HTML link string.
For example:
http://en.wikipedia.org/wiki/Emacs
becomes
<a href="http://en.wikipedia.org/wiki/Emacs">Emacs</a>
(require 'gnus-util) ; for gnus-url-unhex-string

(defun wikipedia-url-to-link (url)
  "Return the URL as HTML link string.
Example:
http://en.wikipedia.org/wiki/Emacs%20Lisp
becomes
<a href=\"http://en.wikipedia.org/wiki/Emacs%20Lisp\">Emacs Lisp</a>"
  (let ((linkText url))
    (setq linkText (gnus-url-unhex-string linkText nil)) ; decode percent encoding, e.g. %20
    (setq linkText (car (last (split-string linkText "/")))) ; get last part
    (setq linkText (replace-regexp-in-string "_" " " linkText)) ; low line → space
    (format "<a href=\"%s\">%s</a>" url linkText)))
Given a file path, return a link string.
For example:
/Users/xah/web/ergoemacs_org/index.html
becomes
<a href="../index.html">ErgoEmacs</a>
where the link text comes from the file's “title” tag.
(defun get-html-file-title (fName)
  "Return FNAME <title> tag's text.
Assumes that the file contains the string “<title>…</title>”."
  (with-temp-buffer
    (insert-file-contents fName nil nil nil t)
    (goto-char (point-min))
    (buffer-substring-no-properties
     (search-forward "<title>")
     (- (search-forward "</title>") 8)))) ; 8 is the length of "</title>"
Putting It All Together
;; -*- coding: utf-8; lexical-binding: t; -*-
;; emacs lisp.
;; started: 2008-01-03.
;; version: 2019-06-11
;; author: Xah Lee
;; url: http://ergoemacs.org/emacs/elisp_link_report.html
;; purpose: generate a report of wikipedia links.
;; traverse a given dir, visiting every html file, find links to
;; Wikipedia in those files, collect them, and generate a html report
;; of these links and the files they are from, then write it to a
;; given file. (overwrite if exist)

;; ssss---------------------------------------------------

(setq InputDir "/Users/xah/web/ergoemacs_org/")

;; Overwrites existing
(setq OutputPath "/Users/xah/web/xahlee_org/wikipedia_links.html")

;; ssss---------------------------------------------------

;; add a ending slash to InputDir if not there
(when (not (string-equal "/" (substring InputDir -1)))
  (setq InputDir (concat InputDir "/")))

(when (not (file-exists-p InputDir))
  (error "input dir does not exist: %s" InputDir))

(setq XahHeaderText
      "<!doctype html><html><head><meta charset=\"utf-8\" />
<title>Links to Wikipedia from Xah Sites</title>
</head>
<body>
<nav class=\"n1\"><a href=\"index.html\">XahLee.org</a></nav>
")

(setq XahFooterText "
</body></html>
")

;; ssss---------------------------------------------------

;; hash table. key is string Wikipedia url, value is a list of file paths.
(setq LinksHash (make-hash-table :test 'equal :size 8000))

;; ssss---------------------------------------------------

(defun xah-add-link-to-hash (filePath hashTableVar)
  "Get links in FILEPATH and add them to the hash table at the variable HASHTABLEVAR."
  (let (xurl)
    (with-temp-buffer
      (insert-file-contents filePath nil nil nil t)
      (goto-char (point-min))
      (while (re-search-forward "href=\"\\(https://..\\.wikipedia\\.org/[^\"]+\\)\">\\([^<]+\\)</a>" nil t)
        (setq xurl (match-string 1))
        (when (and xurl ; if url found
                   ;; do not include links that are not Wikipedia articles,
                   ;; e.g. user profile pages, edit history pages, search result pages
                   (not (string-match "=" xurl))
                   ;; do not include links with lots of percent-encoded unicode
                   (not (string-match "%..%.." xurl)))
          ;; if exist in hash, prepend to existing entry, else just add
          (if (gethash xurl hashTableVar)
              ;; not using add-to-list, because each Wikipedia URL likely
              ;; appears only once per file
              (puthash xurl (cons filePath (gethash xurl hashTableVar)) hashTableVar)
            (puthash xurl (list filePath) hashTableVar)))))))

(defun xah-print-each (ele)
  "Print each item.
ELE is of the form (url (list filepath1 filepath2 etc)).
Print it like this:
‹link to url› : ‹link to file1›, ‹link to file2›, …"
  (let (xwplink xfiles)
    (setq xwplink (car ele))
    (setq xfiles (cadr ele))
    (insert "<li>")
    (insert (wikipedia-url-to-linktext xwplink))
    (insert "—")
    (dolist (xx xfiles nil)
      (insert (format "<a href=\"%s\">%s</a>•"
                      (xahsite-filepath-to-href-value xx OutputPath)
                      (xah-html-get-html-file-title xx))))
    (delete-char -1)
    (insert "</li>\n")))

;; ssss---------------------------------------------------

(defun wikipedia-url-to-linktext (P-url)
  "Return the title of a Wikipedia link.
Example:
http://en.wikipedia.org/wiki/Emacs
becomes
Emacs"
  (require 'url-util)
  (decode-coding-string
   (url-unhex-string
    (replace-regexp-in-string
     "_" " "
     (replace-regexp-in-string
      "&amp;" "&"
      (car (last (split-string P-url "/"))))))
   'utf-8))

(defun wikipedia-url-to-link (P-url)
  "Return the P-URL as html link string.
Example:
http://en.wikipedia.org/wiki/Emacs
becomes
<a href=\"http://en.wikipedia.org/wiki/Emacs\">Emacs</a>"
  (format "<a href=\"%s\">%s</a>" P-url (wikipedia-url-to-linktext P-url)))

(defun xah-hash-to-list (HashTable)
  "Return a list that represents the hashtable HASHTABLE.
Each element is a proper list: (key value).
URL `http://xahlee.info/emacs/emacs/elisp_hash_table_to_list.html'
Created: 2019-06-11
Version: 2022-05-28 2022-09-22"
  (let (xx)
    (maphash (lambda (k v) (push (list k v) xx)) HashTable)
    xx))

;; ssss---------------------------------------------------
;;;; main

;; fill LinksHash
(mapc
 (lambda (xx) (xah-add-link-to-hash xx LinksHash))
 (directory-files-recursively InputDir "\\.html$"))

;; fill LinksList, sorted by URL, case-insensitive
(setq LinksList
      (sort (xah-hash-to-list LinksHash)
            ;; (hash-table-keys LinksHash)
            (lambda (a b) (string< (downcase (car a)) (downcase (car b))))))

;; write to file
(with-temp-file OutputPath
  (insert XahHeaderText)
  (goto-char (point-max))
  (insert "<h1>Links To Wikipedia from XahLee.org</h1>\n\n"
          "<p>This page contains all existing links from xah sites to Wikipedia, as of ")
  (insert (format-time-string "%Y-%m-%d"))
  (insert ". There are a total of "
          (number-to-string (length LinksList))
          " links.</p>\n\n"
          "<p>This file is automatically generated by a <a href=\"http://ergoemacs.org/emacs/elisp_link_report.html\">emacs lisp script</a>.</p>\n")
  (insert "<ol>\n")
  (mapc 'xah-print-each LinksList)
  (insert "</ol>\n")
  (insert XahFooterText)
  (goto-char (point-max)))

;; clear memory
;; (clrhash LinksHash)
;; (setq LinksList nil)

;; open the file
(find-file OutputPath)