Emacs: HTML, Extract URL 🚀

By Xah Lee.

Here's a command to extract all URLs in the current text block or selection of an HTML file.

Put this in your Emacs init file:

(defun xah-html-extract-url (Begin End &optional RootDIR FullPathQ)
  "Extract URLs in current block or selection to `kill-ring'.

When called interactively, copy result to `kill-ring', each URL in a line.
If the URL is a local file relative path, convert it to full path.
If `universal-argument' is called first, don't convert relative URL to full path.

This command extracts all text of the forms
 <‹letter› … href=‹path› …>
 <‹letter› … src=‹path› …>
The quote for ‹path› may be double or single quote.

When called in lisp code,
• Begin End are region begin/end positions.
• RootDIR is a dir path that relative paths are computed from.
• Optional FullPathQ, if true, convert local links to full path, with respect to RootDIR.

Returns a list of strings.

URL `http://xahlee.info/emacs/emacs/elisp_extract_url_command.html'
Version: 2021-02-20 2022-09-30 2023-09-04"
  (interactive
   (let (xp1 xp2)
     (let ((xbds (xah-get-bounds-of-thing-or-region 'block))) (setq xp1 (car xbds) xp2 (cdr xbds)))
     (list xp1 xp2 default-directory
           (not current-prefix-arg))))
  (let ((xregionText (buffer-substring-no-properties Begin End))
        (xurlList (list)))
    (with-temp-buffer
      (insert xregionText)
      (goto-char (point-min))
      (while (search-forward "<" nil t)
        (replace-match "\n<" t t))
      (goto-char (point-min))
      (while (re-search-forward
              "<[A-Za-z]+.+?\\(href\\|src\\)[[:blank:]]*?=[[:blank:]]*?\\([\"']\\)\\([^\"']+?\\)\\2" nil t)
        (push (match-string-no-properties 3) xurlList)))
    (setq xurlList (reverse xurlList))
    (when FullPathQ
      (setq xurlList
            (mapcar
             (lambda (xx)
               ;; \\` anchors at string start; ^ would also match after a newline
               (if (string-match "\\`https?:" xx)
                   xx
                 ;; expand relative to RootDIR, as the docstring says
                 (expand-file-name xx (or RootDIR default-directory))))
             xurlList)))
    (when (called-interactively-p 'any)
      (let ((xprintedResult (mapconcat 'identity xurlList "\n")))
        (kill-new xprintedResult)
        (message "%s" xprintedResult)))
    xurlList))
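
To see the two core steps in isolation, here is a minimal sketch that runs the same regex over a string and then expands one path. The names my-html-urls-in-string and my-url-to-full-path are hypothetical helpers for illustration, not part of the command above. The sample HTML and directory are made up.

```elisp
;; Hypothetical helpers for illustration only; not part of xah-html-extract-url.

(defun my-html-urls-in-string (html)
  "Return a list of href/src values found in string HTML, in document order."
  (let ((urls '()))
    (with-temp-buffer
      (insert html)
      (goto-char (point-min))
      ;; Put each tag on its own line. The `.+?' in the regex below does not
      ;; match newline, so a match cannot run from one tag into the next.
      (while (search-forward "<" nil t)
        (replace-match "\n<" t t))
      (goto-char (point-min))
      ;; Group 2 is the opening quote, group 3 is the URL, \\2 is the closing quote.
      (while (re-search-forward
              "<[A-Za-z]+.+?\\(href\\|src\\)[[:blank:]]*?=[[:blank:]]*?\\([\"']\\)\\([^\"']+?\\)\\2"
              nil t)
        (push (match-string-no-properties 3) urls)))
    (nreverse urls)))

(defun my-url-to-full-path (url root-dir)
  "Return URL as is if it starts with http: or https:, else expand it in ROOT-DIR."
  (if (string-match-p "\\`https?:" url)
      url
    (expand-file-name url root-dir)))

(my-html-urls-in-string
 "<p><a href=\"https://example.com/\">x</a> <img src='i/cat.png' alt=\"cat\"></p>")
;; → ("https://example.com/" "i/cat.png")

(my-url-to-full-path "i/cat.png" "/home/me/web/")
;; → "/home/me/web/i/cat.png" on a Unix-like system
```

Note that anything not starting with http: or https: is treated as a local file path, so values such as mailto: links or #fragment links get expanded as if they were files.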

Requires the package Emacs: xah-get-thing.el, which provides xah-get-bounds-of-thing-or-region.
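
The package is only needed by the interactive form. If you just want to try the command without it, a rough stand-in could look like the sketch below. It assumes that a "block" means text delimited by blank lines, and that the result is a cons of begin and end positions (the command above takes car and cdr of it). The name my-block-or-region-bounds is hypothetical, and the real package may handle edge cases differently.

```elisp
;; Hypothetical stand-in, not the real xah-get-thing.el code.
(defun my-block-or-region-bounds ()
  "Return (BEGIN . END) of the active region, else of the text block around point.
A text block here means text delimited by blank lines or the buffer edges."
  (if (use-region-p)
      (cons (region-beginning) (region-end))
    (save-excursion
      (let ((beg (if (re-search-backward "\n[ \t]*\n" nil t)
                     (match-end 0)
                   (point-min))))
        ;; Search forward from the block start, so that the blank line
        ;; just found is not matched again.
        (goto-char beg)
        (cons beg
              (if (re-search-forward "\n[ \t]*\n" nil t)
                  (match-beginning 0)
                (point-max)))))))
```

With that, the car and cdr of the returned pair can be passed as Begin and End when calling xah-html-extract-url from lisp code.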

This is part of Emacs: Xah HTML Mode.
