Elisp: Command to Extract URL
Here's a command to extract all URLs from the current text block or text selection in an HTML file.
For example, if you have:
<a href="../cats.html">cats</a>, <a href="http://en.wikipedia.org/wiki/Idiom">Idiom</a>, <a class="b" href="computing.html"></a>
After calling the command, the following is copied to kill-ring:
/Users/xah/web/ergoemacs_org/cats.html
http://en.wikipedia.org/wiki/Idiom
/Users/xah/web/ergoemacs_org/emacs/computing.html
If there's no text selection, the current text block (the text between blank lines) is used.
If universal-argument is called first, relative paths are not converted to full paths. Example:
../cats.html
http://en.wikipedia.org/wiki/Idiom
computing.html
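The path conversion can be tried on its own. The following is a minimal sketch, not part of the command itself: the name my-url-to-full-path and its base-dir parameter are made up for illustration, and the directory is the one implied by the example output above.

```elisp
;; Hypothetical helper for illustration only.
(defun my-url-to-full-path (url base-dir)
  "Return URL as is if it begins with http: or https:.
Otherwise treat URL as a file path relative to BASE-DIR and expand it."
  (if (string-match-p "^http:\\|^https:" url)
      url
    (expand-file-name url base-dir)))

;; Assumed directory of the HTML file (a Unix-like file system).
(my-url-to-full-path "../cats.html" "/Users/xah/web/ergoemacs_org/emacs/")
;; "/Users/xah/web/ergoemacs_org/cats.html"

(my-url-to-full-path "http://en.wikipedia.org/wiki/Idiom" "/Users/xah/web/ergoemacs_org/emacs/")
;; "http://en.wikipedia.org/wiki/Idiom"
```

Note that only http: and https: links are left alone. Other forms, such as mailto: links, would be treated as file paths by this test.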
Solution
(defun xah-html-extract-url (@begin @end &optional @not-full-path-p)
  "Extract URLs in current block or region to `kill-ring'.

When called interactively, copy result to `kill-ring'. Each URL in a line.

If the URL is a local file relative path, convert it to full path.
If `universal-argument' is called first, don't convert relative URL to full path.

This command extracts all text of the forms
 <‹letter› … href=‹path› …>
 <‹letter› … src=‹path› …>
The quote for ‹path› may be double or single quote.

When called in lisp code, @begin @end are region begin/end positions.
If @not-full-path-p is non-nil, relative URLs are not converted.
Returns a list of strings.

URL `http://xahlee.info/emacs/emacs/elisp_extract_url_command.html'
Version 2020-01-22"
  (interactive
   (let ($p1 $p2)
     ;; set region boundary $p1 $p2
     (if (use-region-p)
         (setq $p1 (region-beginning) $p2 (region-end))
       ;; no selection: use the text block around point, bounded by blank lines
       (save-excursion
         (if (re-search-backward "\n[ \t]*\n" nil "NOERROR")
             (progn (re-search-forward "\n[ \t]*\n")
                    (setq $p1 (point)))
           (setq $p1 (point)))
         (if (re-search-forward "\n[ \t]*\n" nil "NOERROR")
             (progn (re-search-backward "\n[ \t]*\n")
                    (setq $p2 (point)))
           (setq $p2 (point)))))
     ;; a prefix argument means: do not convert to full path
     (list $p1 $p2 current-prefix-arg)))
  (let (($regionText (buffer-substring-no-properties @begin @end))
        ($urlList (list)))
    (with-temp-buffer
      (insert $regionText)
      ;; put each tag on its own line, so one match never spans two tags
      (goto-char (point-min))
      (while (re-search-forward "<" nil t)
        (replace-match "\n<" "FIXEDCASE" "LITERAL"))
      ;; collect the value of each href or src attribute
      (goto-char (point-min))
      (while (re-search-forward "<[A-Za-z]+.+?\\(href\\|src\\)[[:blank:]]*?=[[:blank:]]*?\\([\"']\\)\\([^\"']+?\\)\\2" nil t)
        (push (match-string 3) $urlList)))
    (setq $urlList (reverse $urlList))
    ;; convert relative paths to full paths, unless told not to
    (unless @not-full-path-p
      (setq $urlList
            (mapcar
             (lambda ($x)
               (if (string-match "^http:\\|^https:" $x)
                   $x
                 (expand-file-name
                  $x
                  (file-name-directory
                   (if (buffer-file-name)
                       (buffer-file-name)
                     default-directory)))))
             $urlList)))
    (when (called-interactively-p 'any)
      (let (($printedResult (mapconcat 'identity $urlList "\n")))
        (kill-new $printedResult)
        (message "%s" $printedResult)))
    $urlList))
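To see what the regex step does in isolation, here is a small sketch that runs the same two passes on a string instead of a buffer region. The function name my-extract-urls-from-string is made up for this example. It skips the region detection, path conversion, and kill-ring parts of the command.

```elisp
;; Hypothetical helper for illustration only.
(defun my-extract-urls-from-string (html)
  "Return a list of href and src values found in string HTML, in order."
  (let ((urls '()))
    (with-temp-buffer
      (insert html)
      ;; Pass 1: start a new line before every "<".
      ;; In Emacs regex, "." does not match a newline, so the ".+?"
      ;; in pass 2 cannot run from one tag into the next.
      (goto-char (point-min))
      (while (re-search-forward "<" nil t)
        (replace-match "\n<" t t))
      ;; Pass 2: group 2 is the opening quote, group 3 is the URL,
      ;; and the back reference \\2 requires the same closing quote.
      (goto-char (point-min))
      (while (re-search-forward
              "<[A-Za-z]+.+?\\(href\\|src\\)[[:blank:]]*?=[[:blank:]]*?\\([\"']\\)\\([^\"']+?\\)\\2"
              nil t)
        (push (match-string 3) urls)))
    (nreverse urls)))

(my-extract-urls-from-string
 "<a href=\"../cats.html\">cats</a>, <a href=\"http://en.wikipedia.org/wiki/Idiom\">Idiom</a>, <a class=\"b\" href=\"computing.html\"></a>")
;; ("../cats.html" "http://en.wikipedia.org/wiki/Idiom" "computing.html")
```

Closing tags such as </a> are skipped because the regex requires a letter right after the "<".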
For the latest version of this code, see Emacs: Xah HTML Mode.