Elisp: Replacing HTML Entities with Unicode Characters
Problem
I have a file with content like this:
pound sign &#163; and &#164; &#165; but also &#166;, &#167;, &#169;
I need it to be like this:
pound sign £ and ¤ ¥ but also ¦, §, ©
How do you do it?
Solution
Use find-and-replace with an Elisp function as the replacement.
- Open the file.
- Press Alt+x, then type query-replace-regexp.
- Give the regex:
  &#\([0-9]+\);
  This matches a decimal HTML entity and captures its decimal code.
- In the replacement input, tell Emacs to use an Elisp function, like this:
  \,(ff)
  where “ff” is my function name.
- Then type y or n for each match, or type ! to replace all remaining occurrences in the file.
The key here is writing the replacement function ff.
Your function ff takes no arguments. It reads the captured decimal code from the current match and returns the Unicode character with that codepoint. For example, if the captured string is "945", then ff should return the string "α".
Here's the code:
(defun ff ()
  "Temp function. Return a string based on the current regex match.
This is for the regex: &#\\([0-9]+\\);"
  (char-to-string (string-to-number (match-string 1))))
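To see how ff depends on the current match, it can help to drive it from a plain search loop instead of the interactive prompt. The following is a minimal sketch, not part of the original solution: the helper name my-demo-decode-entities is hypothetical, and it works on a temporary buffer so that no file is touched.

```elisp
;; Sketch only: `my-demo-decode-entities' is a hypothetical helper name,
;; not something provided by Emacs.

(defun ff ()
  "Return the character for the decimal code captured by the last match.
This is for the regex: &#\\([0-9]+\\);"
  (char-to-string (string-to-number (match-string 1))))

(defun my-demo-decode-entities (text)
  "Return TEXT with decimal entities such as &#945; replaced by characters."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    ;; Each successful search sets the match data that `ff' reads.
    (while (re-search-forward "&#\\([0-9]+\\);" nil t)
      ;; FIXEDCASE and LITERAL are both t, so the text is inserted as-is.
      (replace-match (ff) t t))
    (buffer-string)))

(my-demo-decode-entities "alpha is &#945;, copyright is &#169;")
;; returns "alpha is α, copyright is ©"
```

Calling ff outside such a loop (with no preceding successful search) is not meaningful, because match-string would then read stale or missing match data.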
A Shortcut
Once you become familiar with using a Lisp expression for regex replacement, you can simply use this code for the replacement:
\,(char-to-string (string-to-number \1))
Carlos at comp.lang.lisp, and Jon Snader (jcs) on his blog (irreal.org) gave the following nice solutions:
\,(char-to-string \#1)
or
\,(format "%c" \#1)
When using a Lisp expression in query-replace-regexp, \1 is the first captured string, and \#1 is the first captured string as a number.
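The difference between the two forms can be checked outside the replacement prompt. This is a rough sketch under the assumption that \1 behaves like (match-string 1) and \#1 like that string converted with string-to-number; the function name my-demo-entity-forms is hypothetical.

```elisp
;; Sketch only: `my-demo-entity-forms' is a hypothetical name used to
;; compare the three replacement expressions on a single entity.

(defun my-demo-entity-forms (entity)
  "Return the results of the three replacement forms for ENTITY.
ENTITY is a string such as \"&#945;\"."
  (when (string-match "&#\\([0-9]+\\);" entity)
    (let* ((as-string (match-string 1 entity))         ; what \1 stands for
           (as-number (string-to-number as-string)))   ; what \#1 stands for
      (list (char-to-string (string-to-number as-string)) ; the \1 form
            (char-to-string as-number)                    ; first \#1 form
            (format "%c" as-number)))))                   ; second \#1 form

(my-demo-entity-forms "&#945;")
;; returns ("α" "α" "α")
```

All three expressions produce the same one-character string, so the choice between them is a matter of brevity.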