Elisp: Parsing HTML, XML

By Xah Lee.

Parsing HTML, XML

libxml-available-p

function. (libxml-available-p) returns true if your emacs build has libxml2 support, that is, if libxml-parse-html-region and libxml-parse-xml-region are usable.
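A minimal sketch of guarding a parse call with this check. The helper name my-libxml-ready-p is hypothetical, and the fboundp test is an assumption added for older emacs versions that lack libxml-available-p (there the helper just returns nil).

```elisp
;; -*- lexical-binding: t; -*-
;; my-libxml-ready-p is a hypothetical helper name, not part of emacs.
(defun my-libxml-ready-p ()
  "Return t if the libxml2 parsing functions can be used in this emacs."
  (and (fboundp 'libxml-available-p) ; the function may not exist in old emacs
       (libxml-available-p)
       t))

(if (my-libxml-ready-p)
    (message "libxml is available")
  (message "libxml is not available. libxml-parse-html-region will not work"))
```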

libxml-parse-html-region

(libxml-parse-html-region &optional START END BASE-URL)

  • parse HTML.
  • START END defaults to point-min and point-max.
  • return the parse tree.

💡 TIP: use xml-remove-comments to remove HTML comments first.
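A small sketch of the tip, assuming the HTML is in a string. The helper name my-parse-html-no-comments is hypothetical. xml-remove-comments takes the begin and end positions of a region and deletes the comments from the buffer text, so here it is run on a temp buffer.

```elisp
;; -*- lexical-binding: t; -*-
(require 'xml) ; provides xml-remove-comments

;; my-parse-html-no-comments is a hypothetical helper name.
(defun my-parse-html-no-comments (html-string)
  "Parse HTML-STRING after deleting its comments. Return the parse tree."
  (with-temp-buffer
    (insert html-string)
    ;; delete <!-- ... --> from the buffer text before parsing
    (xml-remove-comments (point-min) (point-max))
    (libxml-parse-html-region (point-min) (point-max))))

(my-parse-html-no-comments
 "<html><body><!-- secret note --><p>hello</p></body></html>")
```

Without the xml-remove-comments call, comments would show up in the parse tree as nodes named comment.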

In the parse tree, each HTML node is represented by a list in which the first element is a symbol representing the node name, the second element is an alist of node attributes, and the remaining elements are the subnodes.

sample return value

(html nil
 (head nil)
 (body ((width . "101"))
  (div ((class . "thing"))
   "Foo"
   (div nil
    "Yes"))))
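
Because the parse tree is a plain list, it can be read with car and friends, or with the helper functions in dom.el. A small sketch on the sample value above. The variable name my-sample-tree is hypothetical.

```elisp
;; -*- lexical-binding: t; -*-
(require 'dom)

;; my-sample-tree is a hypothetical name. The value is the sample parse tree.
(defvar my-sample-tree
  '(html nil
         (head nil)
         (body ((width . "101"))
               (div ((class . "thing"))
                    "Foo"
                    (div nil
                         "Yes")))))

;; node name is the first element
(car my-sample-tree)     ; html
(dom-tag my-sample-tree) ; html, same thing via dom.el

;; attributes alist is the second element
(let ((xbody (car (dom-by-tag my-sample-tree 'body))))
  (dom-attr xbody 'width)) ; "101"

;; all div nodes, at any depth
(length (dom-by-tag my-sample-tree 'div)) ; 2

;; text of a node and its subnodes, joined by space
(dom-texts (car (dom-by-class my-sample-tree "thing"))) ; "Foo Yes"
```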

libxml-parse-xml-region

similar to libxml-parse-html-region but uses strict XML syntax.
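A minimal sketch of parsing an XML string, assuming an emacs built with libxml2. The helper name my-parse-xml-string and the sample XML are made up for illustration.

```elisp
;; -*- lexical-binding: t; -*-
;; my-parse-xml-string is a hypothetical helper name.
(defun my-parse-xml-string (xml-string)
  "Parse XML-STRING with libxml-parse-xml-region. Return the parse tree."
  (with-temp-buffer
    (insert xml-string)
    (libxml-parse-xml-region (point-min) (point-max))))

;; the tree has the same shape as for HTML: (NAME ATTRIBUTES SUBNODES...)
(my-parse-xml-string "<root><item id=\"1\">a</item></root>")
```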

shr-insert-document

(shr-insert-document DOM)

  • insert parsed html as fontified text into current buffer.
  • DOM is the result from libxml-parse-html-region or similar.

Example libxml-parse-html-region

;; -*- coding: utf-8; lexical-binding: t; -*-
;; test libxml-parse-html-region

(let ((xhtml "<!DOCTYPE html>
<html>
<head>
<meta charset=\"utf-8\" />
<meta name=viewport content=\"width=device-width, initial-scale=1\" />

<title>untitled</title>
</head>
<body>

<main>

<h1>untitled</h1>

<p>
some
</p>

</main>

</body>
</html>")
      xtree
      (xoutbuf (generate-new-buffer "*html parse out*")))

  (setq xtree
        (with-temp-buffer
          (insert xhtml)
          (libxml-parse-html-region (point-min) (point-max))))

  (with-current-buffer xoutbuf (shr-insert-document xtree))
  (pop-to-buffer xoutbuf))

Reference