Elisp: Create Sitemap

By Xah Lee.

This page shows how to use Emacs Lisp to create a sitemap.

Problem

Write an elisp script to generate a sitemap. That is: create a file in sitemap format that lists all files in a directory.

Detail

A sitemap is an XML file that lists the URLs of all files in a website for web crawlers to crawl.

A sitemap file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">

   <url>
      <loc>http://www.example.com/</loc>
      <lastmod>2005-01-01</lastmod>
      <changefreq>monthly</changefreq>
      <priority>0.8</priority>
   </url>

   …

</urlset>

  1. The file can have many <url>…</url> items.
  2. Each <url> container represents one file, plus optional info about it.
  3. The <loc> is the URL of the file.
  4. The <lastmod>, <changefreq>, and <priority> are optional.
  5. A sitemap file can list at most 50,000 URLs.
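
To make the format concrete, here is a minimal sketch of building one <url> item as a string. The function names `my-sitemap-escape` and `my-sitemap-url-entry` are made up for illustration and are not part of the script on this page. The escaping step is there because the sitemap format requires characters such as & in a URL to be written as XML entities.

```elisp
(defun my-sitemap-escape (str)
  "Return STR with the five XML special characters replaced by entities.
Hypothetical helper, for illustration only."
  (replace-regexp-in-string
   "[&<>\"']"
   (lambda (m)
     ;; look up the matched character in a small table
     (cdr (assoc m '(("&" . "&amp;") ("<" . "&lt;") (">" . "&gt;")
                     ("\"" . "&quot;") ("'" . "&apos;")))))
   str t t))

(defun my-sitemap-url-entry (loc &optional lastmod)
  "Return a <url> item for LOC as a string.
LASTMOD, if non-nil, is a date string such as \"2005-01-01\".
Hypothetical helper, for illustration only."
  (concat "<url><loc>" (my-sitemap-escape loc) "</loc>"
          (if lastmod (concat "<lastmod>" lastmod "</lastmod>") "")
          "</url>\n"))
```

For example, `(my-sitemap-url-entry "http://www.example.com/")` gives a single line containing only the <loc> part, since the other tags are optional.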

The purpose of a sitemap file is to let web crawlers easily discover all files that exist on your site.

Solution

The general plan is very simple. Here's one way to do it.

  1. Create a new file, insert XML header tags.
  2. Traverse the web root dir. For each file, determine whether it should be listed in the sitemap.
  3. If so, generate the proper URL tag and insert it into the new file.
  4. When done visiting files, insert the XML footer tags. Save the file.
;; -*- coding: utf-8; lexical-binding: t; -*-
;; version: 2019-06-11
;; home page: /Users/xah/web/ergoemacs_org/emacs/make_sitemap.html

(require 'seq)

(defvar xah-web-root-path "/Users/xah/web/" "Root dir that holds one subdir per domain.")

(defvar xahsite-external-docs nil "A vector of dir paths.")
(setq  xahsite-external-docs
       [
        "ergoemacs_org/emacs_manual/"
        "xahlee_info/REC-SVG11-20110816/"
        "xahlee_info/clojure-doc-1.8/"
        "xahlee_info/css_2.1_spec/"
        "xahlee_info/css_transitions/"
        "xahlee_info/js_es2011/"
        "xahlee_info/js_es2015/"
        "xahlee_info/js_es2015_orig/"
        "xahlee_info/js_es2016/"
        "xahlee_info/js_es2018/"
        "xahlee_info/node_api/"
        ])

(defun xahsite-generate-sitemap (@domain-name)
  "Generate a sitemap.xml file of xahsite at doc root.
@domain-name must match an existing one.
Version 2018-09-17"
  (interactive
   (list (ido-completing-read "choose:" '( "ergoemacs.org" "wordyenglish.com" "xaharts.org" "xahlee.info" "xahlee.org" "xahmusic.org" "xahsl.org" ))))
  (let (
        ($sitemapFileName "sitemap.xml" )
        ($websiteDocRootPath (concat xah-web-root-path (replace-regexp-in-string "\\." "_" @domain-name "FIXEDCASE" "LITERAL") "/")))
    ;; (print (concat "begin: " (format-time-string "%Y-%m-%dT%T")))
    (let (
          ($filePath (concat $websiteDocRootPath $sitemapFileName ))
          ($sitemapBuffer (generate-new-buffer "sitemapbuff"))
          ;; bound here so the setq inside the lambda does not create a global variable
          ($pageMoved-p nil))
      (with-current-buffer $sitemapBuffer
        (set-buffer-file-coding-system 'utf-8-unix) ; match the encoding declared in the XML header
        (insert "<?xml version=\"1.0\" encoding=\"UTF-8\"?>
<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">
"))
      (mapc
       (lambda ($f)
         (setq $pageMoved-p nil)
         (when (not (or
                     (string-match "/xx" $f) ; dir/file starting with xx are not public
                     (string-match "403error.html" $f)
                     (string-match "404error.html" $f)))
           (with-temp-buffer
             (insert-file-contents $f nil 0 100)
             (when (search-forward "page_moved_64598" nil t)
               (setq $pageMoved-p t)))
           (when (not $pageMoved-p)
             (with-current-buffer $sitemapBuffer
               (insert "<url><loc>"
                       "http://" @domain-name "/" (substring $f (length $websiteDocRootPath))
                       "</loc></url>\n"
                       )))))
       (seq-filter
        (lambda (path)
          (not (seq-some
                (lambda (x) (string-match-p (regexp-quote x) path)) ; treat dir path as literal text, not a regex
                xahsite-external-docs
                )))
        (directory-files-recursively $websiteDocRootPath "\\.html$" )))
      (with-current-buffer $sitemapBuffer
        (insert "</urlset>")
        (write-region (point-min) (point-max) $filePath nil 3)
        (kill-buffer ))
      (find-file $filePath)
      )
    ;; (print (concat "done: " (format-time-string "%Y-%m-%dT%T")))
    ))

(defun xahsite-generate-sitemap-all ()
  "Generate the sitemap for every xahsite domain.
Version 2016-08-15"
  (interactive)
  (xahsite-generate-sitemap "ergoemacs.org" )
  (xahsite-generate-sitemap "wordyenglish.com" )
  (xahsite-generate-sitemap "xaharts.org" )
  (xahsite-generate-sitemap "xahlee.info" )
  (xahsite-generate-sitemap "xahlee.org" )
  (xahsite-generate-sitemap "xahmusic.org" )
  (xahsite-generate-sitemap "xahsl.org"  ))
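
The script above writes only the <loc> tag. If you also want the optional <lastmod>, one possible way is to take it from the file's modification time. This is a sketch only: the name `my-sitemap-lastmod` is made up for illustration, and it assumes Emacs 26 or later for `file-attribute-modification-time`.

```elisp
(defun my-sitemap-lastmod (file)
  "Return the modification date of FILE as a YYYY-MM-DD string, in UTC.
Hypothetical helper, not part of the script above.
Assumes Emacs 26 or later."
  (format-time-string
   "%Y-%m-%d"
   (file-attribute-modification-time (file-attributes file))
   t)) ; t means format the time in UTC
```

The result could be inserted as `"<lastmod>" (my-sitemap-lastmod $f) "</lastmod>"` right after the `</loc>` in the insert call. Note that a file's modification time is not always the date its content last changed, e.g. after copying the site to a new disk.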

On a site with 3515 HTML files (about 10 times more if counting image files etc.), the script takes 5 seconds to run. (Timing is from running it a second time, thus not counting disk reading time.)
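
If you want to time a run yourself, one way is the `benchmark-run` macro that comes with Emacs. Here is a small sketch. The wrapper name `my-time-thunk` is made up for illustration.

```elisp
(require 'benchmark)

(defun my-time-thunk (thunk)
  "Call the function THUNK once and return the elapsed seconds as a float.
Hypothetical helper, for illustration only."
  ;; benchmark-run returns a list (ELAPSED-SECONDS GC-COUNT GC-SECONDS)
  (car (benchmark-run 1 (funcall thunk))))
```

For example, `(my-time-thunk (lambda () (xahsite-generate-sitemap "xahlee.info")))` returns the seconds taken for one domain. The number will differ by machine and by whether the files are already in the disk cache.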

See also: Golang: Generate Sitemap
