Elisp: Parsing HTML, XML
Parsing HTML, XML
libxml-available-p
function. Returns true if your Emacs build has libxml2 support, that is, if libxml-parse-html-region is usable.
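A minimal sketch of guarding code on libxml support before parsing. The helper name my-libxml-usable-p is hypothetical (not part of Emacs). The fboundp check is an assumption that the code may also run on older Emacs versions where libxml-available-p does not exist.

```elisp
;; `my-libxml-usable-p' is a hypothetical helper name, not part of Emacs.
(defun my-libxml-usable-p ()
  "Return t if this Emacs can call `libxml-parse-html-region'."
  (and (fboundp 'libxml-available-p) ; guard for older Emacs versions
       (libxml-available-p)
       t))

(if (my-libxml-usable-p)
    (message "libxml2 support found")
  (message "no libxml2 support, parsing HTML via libxml is unavailable"))
```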
libxml-parse-html-region
(libxml-parse-html-region &optional START END BASE-URL)
parse HTML.
- START and END default to point-min and point-max.
- Returns the parse tree.
💡 TIP: use xml-remove-comments to remove HTML comments first.

In the parse tree, each HTML node is represented by a list in which the first element is a symbol representing the node name, the second element is an alist of node attributes, and the remaining elements are the subnodes.
Sample return value:

(html nil
 (head nil)
 (body ((width . "101"))
  (div ((class . "thing"))
   "Foo"
   (div nil "Yes"))))
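The node layout described above can be explored with plain list functions or with the accessors in dom.el, which ships with Emacs. This is a sketch only. The variable name my-sample-tree is hypothetical, and it holds the sample return value written as a quoted list, so no parsing is needed to try it.

```elisp
(require 'dom)

;; `my-sample-tree' is a hypothetical variable name, not part of Emacs.
(defvar my-sample-tree
  '(html nil
         (head nil)
         (body ((width . "101"))
               (div ((class . "thing"))
                    "Foo"
                    (div nil "Yes"))))
  "Sample parse tree, in the shape returned by `libxml-parse-html-region'.")

;; A node is (NAME ATTRIBUTES . SUBNODES), so plain list access works.
(car my-sample-tree)   ; node name, the symbol html
(cadr my-sample-tree)  ; attribute alist, nil for this node
(cddr my-sample-tree)  ; subnodes, here the head and body nodes

;; The same kind of access through dom.el.
(let* ((body (car (dom-by-tag my-sample-tree 'body)))
       (thing (car (dom-by-class my-sample-tree "thing"))))
  (list (dom-tag body)          ; the symbol body
        (dom-attr body 'width)  ; the string "101"
        (dom-texts thing)))     ; all text under the div, joined by spaces
```

dom-by-tag and dom-by-class return lists of matching nodes, which is why car is used to pick the first match.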
libxml-parse-xml-region
similar to libxml-parse-html-region but uses strict XML syntax.
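A short sketch of parsing an XML string. The helper name my-parse-xml-string and the sample XML are hypothetical, made up for illustration. It assumes your Emacs has libxml2 support.

```elisp
;; `my-parse-xml-string' is a hypothetical helper, not part of Emacs.
(defun my-parse-xml-string (xml)
  "Return the parse tree of the string XML."
  (with-temp-buffer
    (insert xml)
    (libxml-parse-xml-region (point-min) (point-max))))

;; The tree has the same (NAME ATTRIBUTES . SUBNODES) shape as for HTML.
(let ((tree (my-parse-xml-string "<note id=\"1\"><to>Ann</to></note>")))
  (list (car tree)     ; root node name
        (cadr tree)))  ; root attribute alist
```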
shr-insert-document
(shr-insert-document DOM)
insert parsed HTML as fontified text into current buffer.
- DOM is the result from libxml-parse-html-region or similar.
Example libxml-parse-html-region
;; -*- coding: utf-8; lexical-binding: t; -*-
;; test libxml-parse-html-region

(let ((xhtml "<!DOCTYPE html>
<html>
<head>
<meta charset=\"utf-8\" />
<meta name=viewport content=\"width=device-width, initial-scale=1\" />
<title>untitled</title>
</head>
<body>
<main>
<h1>untitled</h1>
<p>
some
</p>
</main>
</body>
</html>")
      xtree
      (xoutbuf (generate-new-buffer "*html parse out*")))
  ;; parse the HTML string in a temp buffer
  (setq xtree
        (with-temp-buffer
          (insert xhtml)
          (libxml-parse-html-region (point-min) (point-max))))
  ;; render the parse tree into the output buffer and show it
  (with-current-buffer xoutbuf
    (shr-insert-document xtree))
  (pop-to-buffer xoutbuf))
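The tip about xml-remove-comments can be combined with the parse step. Here is a sketch under the assumption that the input is a string. The helper name my-parse-html-no-comments and the sample HTML are hypothetical.

```elisp
(require 'xml) ; provides xml-remove-comments

;; `my-parse-html-no-comments' is a hypothetical helper, not part of Emacs.
(defun my-parse-html-no-comments (html)
  "Parse the string HTML after deleting its <!-- ... --> comments."
  (with-temp-buffer
    (insert html)
    ;; delete comments in the buffer before handing it to the parser
    (xml-remove-comments (point-min) (point-max))
    (libxml-parse-html-region (point-min) (point-max))))

(my-parse-html-no-comments "<p>kept<!-- dropped --></p>")
```

The comment is removed from the temp buffer only, so the original string is left unchanged.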