Elisp: Replacing HTML Entities with Unicode Characters

By Xah Lee.

This page shows you how to use emacs to replace HTML entities with the corresponding Unicode characters. (e.g. &#945; → α)

Problem

I have a file with content like this:

…
<tr><td>pound</td><td>&#163;</td></tr>
<tr><td>curren</td><td>&#164;</td></tr>
<tr><td>yen</td><td>&#165;</td></tr>
<tr><td>brvbar</td><td>&#166;</td></tr>
<tr><td>sect</td><td>&#167;</td></tr>
<tr><td>copy</td><td>&#169;</td></tr>
…

I need it to be like this:

…
<tr><td>pound</td><td>£</td></tr>
<tr><td>curren</td><td>¤</td></tr>
<tr><td>yen</td><td>¥</td></tr>
<tr><td>brvbar</td><td>¦</td></tr>
<tr><td>sect</td><td>§</td></tr>
<tr><td>copy</td><td>©</td></tr>
…

How do you do it using emacs's power?

Note: the syntax &#n; in HTML represents the Unicode character whose codepoint is the integer n. This form is a numeric character reference, commonly grouped under the name HTML entities. 〔see Character Sets and Encoding in HTML〕 〔see HTML XML Entities〕

Solution 1

Write an emacs lisp command. See: Emacs: Replace HTML Entities 🚀
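Here is a minimal sketch of what such a command could look like, assuming you want to replace every decimal reference in the current buffer without being asked about each match. The name my-replace-decimal-entities is made up for illustration. It is not part of emacs, and it is not necessarily the command on the linked page.

```elisp
(defun my-replace-decimal-entities ()
  "Replace decimal references like &#945; with their characters.
Works on the whole current buffer.
Hypothetical helper for illustration; not part of Emacs."
  (interactive)
  (save-excursion
    (goto-char (point-min))
    ;; Group 1 of the regex captures the decimal digits.
    (while (re-search-forward "&#\\([0-9]+\\);" nil t)
      ;; FIXEDCASE t and LITERAL t: insert the character exactly as given.
      (replace-match
       (char-to-string (string-to-number (match-string 1)))
       t t))))
```

Call it with Alt+x my-replace-decimal-entities. It only handles the decimal form, not named entities such as &copy;.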

Solution 2

Emacs lets you do find replace with the replacement being an elisp function. Here's an outline of the solution.

  1. Open the file.
  2. Alt+x query-replace-regexp.
  3. Give the regex &#\([0-9]+\);. This matches an HTML entity and captures the decimal code.
  4. In the replacement input, tell emacs to use an elisp function, like this: \,(ff), where “ff” is my function name.
  5. Then, type y or n for each match, or type ! to replace all remaining occurrences in the file.

The key here is writing the replacement function ff.

Your function ff reads the captured string from the current match, then returns the Unicode character that has the codepoint given by that string. For example, if the captured string is "945", then ff should return the string "α".

Here's the code:

(defun ff ()
  "Temp function. Return a string based on the current regex match.
This is for the regex: &#\\([0-9]+\\);"
  (char-to-string (string-to-number (match-string 1))))

Let's go thru the code. The code (match-string 1) gives me the 1st captured string. Let's say the captured string is "945".

In emacs, the character datatype is just an integer. A character is just its Unicode codepoint. For example, if you run this code: (insert 945), it'll insert “α”. (try it now)

Character Type (ELISP Manual)
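A quick way to check this yourself is a small sketch you can evaluate in the *scratch* buffer. It uses only built-in predicates, and the expected value in the comment assumes a normal Unicode-capable emacs.

```elisp
;; A character literal such as ?α reads as a plain integer.
(list (characterp 945)   ; 945 is a valid character code
      (integerp ?α)      ; a character literal is an integer
      (= ?α 945)         ; α has codepoint 945
      (= ?£ 163))        ; £ has codepoint 163, as in the sample file
;; evaluates to (t t t t)
```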

So, I change the matched string into a character datatype (integer) by (string-to-number (match-string 1)), then I change this char to a string by (char-to-string …).
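The two conversions can be tried one at a time. This is only a sketch: it uses string-match on a literal string in place of a real buffer search, so that it can be evaluated anywhere.

```elisp
(string-to-number "945")   ; 945
(char-to-string 945)       ; "α"

;; Both steps together, applied to a regex match on a string.
(let ((s "&#945;"))
  (when (string-match "&#\\([0-9]+\\);" s)
    (char-to-string (string-to-number (match-string 1 s)))))
;; evaluates to "α"
```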

A Shortcut

Once you become familiar with using a lisp expression for regex replacement, you can simply use this code for the replacement:
\,(char-to-string (string-to-number \1)).

No need to write a function ff. But writing out a function makes it clear what we are doing, and it is easier when the transformation you need is a bit complex.

Carlos at comp.lang.lisp, and Jon Snader (jcs) on his blog (irreal.org) gave the following nice solutions:

\,(char-to-string \#1)
\,(format "%c" \#1)

When using a lisp expression in query-replace-regexp, \1 is the 1st captured string, and \#1 is the 1st captured string converted to a number. Alt+x describe-function on query-replace-regexp for details.
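Outside of query-replace-regexp, the \1 and \#1 shorthands are not available, but you can imitate them to see that the two solutions agree. A rough sketch, assuming \#1 behaves like string-to-number applied to the 1st captured string:

```elisp
(let ((s "&#169;"))
  (string-match "&#\\([0-9]+\\);" s)
  ;; n plays the role of \#1: the 1st captured string as a number.
  (let ((n (string-to-number (match-string 1 s))))
    (list (char-to-string n)
          (format "%c" n))))
;; evaluates to ("©" "©")
```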

Using an elisp function as replacement has many uses. For several examples, see: Elisp: Using a Elisp Function for Replacement String.

〔see Emacs: Regular Expression〕
