Emacs: HTML, Extract URL 🚀
Here's a command to extract all URLs in an HTML file.
Put this in your Emacs Init File:
(defun xah-html-extract-url (Begin End &optional RootDIR FullPathQ)
  "Extract URLs in current block or selection to `kill-ring'.

When called interactively, copy result to `kill-ring', each URL in a line.

If the URL is a local file relative path, convert it to full path.
If `universal-argument' is called first, don't convert relative URL to full path.

This command extracts all text of the forms
 <‹letter› … href=‹path› …>
 <‹letter› … src=‹path› …>

The quote for ‹path› may be double or single quote.

When called in lisp code,
• Begin End are region begin/end positions.
• RootDIR is a dir path that relative paths are computed from.
• Optional FullPathQ, if true, convert local links to full path, with respect to RootDIR.

Returns a list of strings.

URL `http://xahlee.info/emacs/emacs/elisp_extract_url_command.html'
Version: 2021-02-20 2022-09-30 2023-09-04"
  (interactive
   (let (xp1 xp2)
     (let ((xbds (xah-get-bounds-of-thing-or-region 'block)))
       (setq xp1 (car xbds) xp2 (cdr xbds)))
     (list xp1 xp2 default-directory (not current-prefix-arg))))
  (let ((xregionText (buffer-substring-no-properties Begin End))
        (xurlList (list)))
    (with-temp-buffer
      (insert xregionText)
      ;; put each tag on its own line
      (goto-char (point-min))
      (while (search-forward "<" nil t)
        (replace-match "\n<" t t))
      ;; collect the value of each href or src attribute
      (goto-char (point-min))
      (while (re-search-forward
              "<[A-Za-z]+.+?\\(href\\|src\\)[[:blank:]]*?=[[:blank:]]*?\\([\"']\\)\\([^\"']+?\\)\\2" nil t)
        (push (match-string-no-properties 3) xurlList)))
    (setq xurlList (reverse xurlList))
    (when FullPathQ
      (setq xurlList
            (mapcar
             (lambda (xx)
               (if (string-match "^http:\\|^https:" xx)
                   xx
                 ;; base dir: dir part of the buffer's file name if the buffer
                 ;; visits a file, else dir part of RootDIR or `default-directory'
                 (expand-file-name
                  xx
                  (file-name-directory
                   (if buffer-file-name
                       buffer-file-name
                     (or RootDIR default-directory))))))
             xurlList)))
    (when (called-interactively-p 'any)
      (let ((xprintedResult (mapconcat 'identity xurlList "\n")))
        (kill-new xprintedResult)
        (message "%s" xprintedResult)))
    xurlList))
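To see what the regex picks up, here's a minimal sketch that runs the same two steps (put each tag on its own line, then search for href or src values) on a string instead of a buffer region. The function name my-demo-extract-url and the sample HTML are made up for illustration; they are not part of Xah HTML Mode, and the sketch skips the relative path conversion.

```elisp
;; Hypothetical demo function, not part of xah-html-mode.
;; It applies the same regex as the command above to a string.
(defun my-demo-extract-url (HtmlText)
  "Return a list of href and src values found in string HtmlText."
  (let ((xresult nil))
    (with-temp-buffer
      (insert HtmlText)
      ;; put each tag on its own line, so that .+? cannot run across tags
      (goto-char (point-min))
      (while (search-forward "<" nil t)
        (replace-match "\n<" t t))
      (goto-char (point-min))
      ;; group 2 is the opening quote, group 3 is the path,
      ;; and \\2 requires the same quote char to close it
      (while (re-search-forward
              "<[A-Za-z]+.+?\\(href\\|src\\)[[:blank:]]*?=[[:blank:]]*?\\([\"']\\)\\([^\"']+?\\)\\2"
              nil t)
        (push (match-string-no-properties 3) xresult)))
    (nreverse xresult)))

(my-demo-extract-url
 "<a href=\"https://example.com/\">x</a><img src='i/cat.png' alt=\"cat\">")
;; returns ("https://example.com/" "i/cat.png")
```

The closing tag </a> is skipped because the regex wants a letter right after the <.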
Requires package Emacs: xah-get-thing.el
This is part of Emacs: Xah HTML Mode.