Elisp: Replacing HTML Entities with Unicode Characters
This page shows you how to use Emacs to replace HTML entities with the corresponding Unicode characters (e.g. &#945; ⇒ α).
Problem
I have a file with content like this:
… <tr><td>pound</td><td>&#163;</td></tr> <tr><td>curren</td><td>&#164;</td></tr> <tr><td>yen</td><td>&#165;</td></tr> <tr><td>brvbar</td><td>&#166;</td></tr> <tr><td>sect</td><td>&#167;</td></tr> <tr><td>copy</td><td>&#169;</td></tr> …
I need it to be like this:
… <tr><td>pound</td><td>£</td></tr> <tr><td>curren</td><td>¤</td></tr> <tr><td>yen</td><td>¥</td></tr> <tr><td>brvbar</td><td>¦</td></tr> <tr><td>sect</td><td>§</td></tr> <tr><td>copy</td><td>©</td></tr> …
How do you do this using the power of Emacs?
Note: the syntax &#n; in HTML represents the Unicode character whose codepoint is the integer n. Strictly speaking this is a numeric character reference; it is commonly grouped together with HTML entities.
〔see Character Sets and Encoding in HTML〕
〔see HTML XML Entities〕
Solution 1
Write an Emacs Lisp command. See: Emacs: Replace HTML Entities 🚀
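As a rough idea of what such a command might look like, here is a minimal sketch. It is not the command from the linked page. The name my-replace-decimal-entities is made up for illustration, it handles only decimal references of the form &#n;, and it assumes every number is a valid codepoint.

```elisp
(defun my-replace-decimal-entities (begin end)
  "Replace decimal references like &#945; between BEGIN and END.
Hypothetical helper for illustration.  Return the number of replacements."
  (interactive "r")
  (let ((count 0))
    (save-excursion
      (save-restriction
        ;; Limit the search to the region, so text after END is untouched.
        (narrow-to-region begin end)
        (goto-char (point-min))
        (while (re-search-forward "&#\\([0-9]+\\);" nil t)
          ;; Group 1 holds the decimal digits; convert them to a character.
          ;; The two t arguments keep case as-is and insert the text literally.
          (replace-match
           (char-to-string (string-to-number (match-string 1)))
           t t)
          (setq count (1+ count)))))
    count))
```

To try it, select a region and Alt+x my-replace-decimal-entities.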
Solution 2
Emacs lets you do find and replace with the replacement computed by an elisp function. Here's an outline of the solution.
- Open the file.
- Alt+x query-replace-regexp.
- Give the regex &#\([0-9]+\); . This matches a decimal HTML entity and captures the decimal code.
- In the replacement input, tell Emacs to call an elisp function, like this: \,(ff) , where “ff” is my function name.
- Then type y or n for each match, or type ! to replace all remaining occurrences in the file.
The key here is writing the replacement function ff.
Your function ff takes no arguments. It reads the captured text of the current match, then returns the Unicode character whose codepoint is that number. For example, if the captured string is "945", then ff should return the string "α".
Here's the code:
(defun ff ()
  "Temp function. Return a string based on the current regex match.
This is for the regex: &#\\([0-9]+\\);"
  (char-to-string (string-to-number (match-string 1))))
Let's go through the code. The code (match-string 1) gives me the 1st captured string. Let's say the captured string is "945".
In Emacs, the character datatype is just an integer: a character is represented by its Unicode codepoint. For example, if you run this code: (insert 945) , it'll insert “α”. (try it now)
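A few expressions you can evaluate (for example with Alt+x ielm) to see that characters and integers are the same thing. This is only a small sketch using built-in functions; the value noted in each comment is what the expression on that line returns.

```elisp
(list (characterp 945)        ; t: 945 is a valid character
      (= ?α 945)              ; t: the read syntax ?α is the integer 945
      (char-to-string 945)    ; "α"
      (string-to-char "α"))   ; 945
```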
So, I change the matched string into a character datatype (integer) by (string-to-number (match-string 1)) , then I change this char to a string by (char-to-string …) .
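To watch each step produce its value, here is a small sketch that runs the same conversion on a string instead of a buffer match. The name my-trace-entity-conversion is hypothetical, introduced only for this illustration.

```elisp
(defun my-trace-entity-conversion (entity)
  "Return the intermediate values produced while converting ENTITY.
ENTITY is a string such as \"&#945;\".  Hypothetical helper for illustration."
  (when (string-match "&#\\([0-9]+\\);" entity)
    (let* ((captured (match-string 1 entity))       ; e.g. "945"
           (codepoint (string-to-number captured))  ; e.g. 945
           (result (char-to-string codepoint)))     ; e.g. "α"
      (list captured codepoint result))))
```

Evaluating (my-trace-entity-conversion "&#945;") returns ("945" 945 "α"): the captured string, the integer, and the final one-character string.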
A Shortcut
Once you become familiar with using a lisp expression for regex replacement, you can simply use this code for the replacement:
\,(char-to-string (string-to-number \1))
.
No need to write a function ff. But writing out the function makes it clear what we are doing, and it is easier when the transformation you need is more complex.
Carlos at comp.lang.lisp, and Jon Snader (jcs) on his blog (irreal.org) gave the following nice solutions:
\,(char-to-string \#1)
\,(format "%c" \#1)
When using a lisp expression in query-replace-regexp , the \1 is the 1st captured string. The \#1 is the first captured string as a number. Alt+x describe-function on query-replace-regexp for details.
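The \, and \#1 notations belong to the interactive replacement prompt. In lisp code that works on a string, a similar effect can be sketched with replace-regexp-in-string, which accepts a function as the replacement. The name my-decode-decimal-entities is hypothetical, and the sketch assumes every number is a valid codepoint.

```elisp
(defun my-decode-decimal-entities (text)
  "Return TEXT with decimal references like &#945; replaced by characters.
Hypothetical helper for illustration."
  (replace-regexp-in-string
   "&#\\([0-9]+\\);"
   ;; The function receives the whole matched text; the match data
   ;; refers to that same text, so group 1 is read from it.
   (lambda (whole)
     (char-to-string (string-to-number (match-string 1 whole))))
   text
   t t))  ; keep case as-is, insert the returned text literally
```

For example, (my-decode-decimal-entities "a &#945; b &#169;") returns "a α b ©".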
Using an elisp function as the replacement has many uses. For several examples, see: Elisp: Using a Elisp Function for Replacement String.
〔see Emacs: Regular Expression〕