ELisp: Create Sitemap

By Xah Lee.

This page shows how to use Emacs Lisp to create a sitemap.

Problem

Write an Emacs Lisp script to generate a sitemap, that is, a file in sitemap format that lists all files in a directory.

Detail

A sitemap is an XML file that lists the URLs of all files in a website, for web crawlers to crawl.

A sitemap file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">

   <url>
      <loc>http://www.example.com/</loc>
      <lastmod>2005-01-01</lastmod>
      <changefreq>monthly</changefreq>
      <priority>0.8</priority>
   </url>

   …

</urlset>

  1. The file can have many <url>…</url> items.
  2. Each <url> container represents one file, plus other info about it.
  3. The <loc> is the URL of the file.
  4. The <lastmod>, <changefreq>, and <priority> are optional.
  5. A sitemap file can list at most 50,000 URLs.

The purpose of a sitemap file is to let web crawlers easily find all files that exist on your site.
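
As a minimal sketch, here is one way a single <url> entry with the optional <lastmod> could be built from a file's modification time. The function name `my-sitemap-url-entry` and its parameters are made up for illustration and are not part of the script on this page. The sketch does not escape characters such as & in the URL, which a real sitemap needs.

```elisp
;; -*- lexical-binding: t; -*-
;; Hypothetical helper, for illustration only.

(defun my-sitemap-url-entry (url &optional file)
  "Return a sitemap <url> element for URL, as a string.
If FILE is non-nil and exists, add a <lastmod> element taken from
its modification time."
  (let ((mtime (and file
                    (file-exists-p file)
                    (file-attribute-modification-time (file-attributes file)))))
    (concat "<url><loc>" url "</loc>"
            (if mtime
                (concat "<lastmod>"
                        (format-time-string "%Y-%m-%d" mtime)
                        "</lastmod>")
              "")
            "</url>\n")))

;; usage, without a file there is no <lastmod>
(my-sitemap-url-entry "http://www.example.com/")
;; returns "<url><loc>http://www.example.com/</loc></url>\n"
```

`file-attribute-modification-time` needs Emacs 26.1 or later.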

Solution

The general plan is very simple. Here's one way to do it.

  1. Create a new file, insert XML header tags.
  2. Traverse the web root dir. For each file, determine whether it should be listed in the sitemap.
  3. If so, generate the proper URL tag and insert it into the new file.
  4. When done visiting files, insert the XML footer tags. Save the file.

;; -*- coding: utf-8; lexical-binding: t; -*-
;; version: 2019-06-11 2022-05-26
;; http://xahlee.info/emacs/emacs/make_sitemap.html

(require 'seq)

(defvar xah-web-root-path "~/web/"
  "Directory that contains one subdirectory per website.")

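;; Requires the variable xahsite-external-docs, defined elsewhere:
;; a list of regexps matching file paths to leave out of the sitemap.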

(defun xahsite-generate-sitemap (DomainName)
  "Generate a sitemap.xml file of xahsite at doc root.
DomainName must match an existing one.
Version: 2018-09-17 2021-08-05 2022-05-26"
  (interactive
   (list (completing-read "choose:" '( "xahlee.info" "xahlee.org" ))))
  (let (
        $pageMovedQ
        ($sitemapFileName "sitemap.xml" )
        ($websiteDocRootPath (concat xah-web-root-path (replace-regexp-in-string "\\." "_" DomainName t t) "/")))
    ;; (print (concat "begin: " (format-time-string "%Y-%m-%dT%T")))
    (let (
          ($filePath (concat $websiteDocRootPath $sitemapFileName ))
          ($sitemapBuffer (generate-new-buffer "sitemapbuff")))
      (with-current-buffer $sitemapBuffer
        (set-buffer-file-coding-system 'unix)
        (insert "<?xml version=\"1.0\" encoding=\"UTF-8\"?>
<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">
"))
      (mapc
       (lambda ($f)
         (setq $pageMovedQ nil)
         (when (not (or
                     (string-match "/xx" $f) ; dirs and files whose names start with xx are not public
                     (string-match "403error.html" $f)
                     (string-match "404error.html" $f)))
           (with-temp-buffer
             (insert-file-contents $f nil 0 100)
             (when (search-forward "page_moved_64598" nil t)
               (setq $pageMovedQ t)))
           (when (not $pageMovedQ)
             (with-current-buffer $sitemapBuffer
               (insert "<url><loc>"
                       "http://" DomainName "/" (substring $f (length $websiteDocRootPath))
                       "</loc></url>\n"
                       )))))
       (seq-filter
        (lambda (path)
          (not (seq-some
                (lambda (x) (string-match x path))
                xahsite-external-docs
                )))
        (directory-files-recursively $websiteDocRootPath "\\.html$" )))
      (with-current-buffer $sitemapBuffer
        (insert "</urlset>\n")
        (write-region (point-min) (point-max) $filePath nil 3)
        (kill-buffer ))
      (find-file $filePath))
    ;; (print (concat "done: " (format-time-string "%Y-%m-%dT%T")))
    ))
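
The script refers to `xahsite-external-docs`, which is not defined on this page. From the way it is used, it is a list of regexps, and any file path matching one of them is skipped. A minimal sketch of such a definition, with made-up directory names, and of the same filter applied to a sample list (the function name `my-remove-external-docs` is also made up):

```elisp
;; -*- lexical-binding: t; -*-
(require 'seq)

;; Hypothetical value. The real list is not shown on this page.
(defvar xahsite-external-docs
  '("/vendor_docs/" "/mirror/")
  "List of regexps. A file path matching any of them is left out of the sitemap.")

(defun my-remove-external-docs (paths)
  "Return PATHS without those matching a regexp in `xahsite-external-docs'."
  (seq-filter
   (lambda (path)
     (not (seq-some
           (lambda (regex) (string-match-p regex path))
           xahsite-external-docs)))
   paths))

;; usage
(my-remove-external-docs
 '("~/web/xahlee_info/index.html"
   "~/web/xahlee_info/vendor_docs/fs.html"))
;; returns ("~/web/xahlee_info/index.html")
```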

On a site with 7091 HTML files (10 times more if counting image files etc.), the script takes 3 seconds to run. (The timing is from a second run, so it does not count disk reading time.)
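
To take a similar measurement yourself, one option is the `benchmark-run` macro from the `benchmark` library that comes with Emacs. This is only a sketch, and not necessarily how the number above was obtained. The helper name `my-time-call` is made up.

```elisp
;; -*- lexical-binding: t; -*-
(require 'benchmark)

(defun my-time-call (fn &rest args)
  "Call FN with ARGS once. Return the elapsed time in seconds."
  ;; benchmark-run returns a list (ELAPSED GC-COUNT GC-ELAPSED).
  (car (benchmark-run 1 (apply fn args))))

;; usage, assuming the sitemap script is loaded and the site dir exists:
;; (my-time-call #'xahsite-generate-sitemap "xahlee.info")
```

Run it twice and take the second result if you want to leave out the time spent reading from disk.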

See also: Golang: Generate Sitemap