Elisp: Replacing HTML Entities with Unicode Characters

By Xah Lee.

Problem

I have a file with content like this:

pound sign &#163; and &#164; &#165; but also &#166;, &#167;, &#169;

I need it to be like this:

pound sign £ and ¤ ¥ but also ¦, §, ©

How do you do it?

Solution

Do a regex find-and-replace, with an elisp function as the replacement.

  1. Open the file.
  2. Alt+x query-replace-regexp
  3. Give the regex &#\([0-9]+\); as the search pattern. It matches a decimal HTML entity and captures its numeric code.
  4. In the replacement input, tell emacs to use an elisp function, like this: \,(ff), where “ff” is the name of my function.
  5. Then, type y or n for each match, or type ! to replace all occurrences in the file.
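
If you'd rather not answer the prompts, the same steps can be run from code. Here's a minimal sketch, assuming you want every decimal entity in the buffer replaced with no confirmation. The name my-decimal-entities-to-chars is made up for this example, and it does no checking that the number is a valid codepoint.

```elisp
;; Hypothetical helper; the name is not from the article.
(defun my-decimal-entities-to-chars ()
  "Replace decimal HTML entities such as &#945; in the current buffer.
Each entity becomes the character with that codepoint.
Return the number of replacements made."
  (interactive)
  (let ((count 0))
    (save-excursion
      (goto-char (point-min))
      ;; Same regex as in step 3, with backslashes doubled for a Lisp string.
      (while (re-search-forward "&#\\([0-9]+\\);" nil t)
        ;; FIXEDCASE and LITERAL are both t, so the text is inserted verbatim.
        (replace-match
         (char-to-string (string-to-number (match-string 1))) t t)
        (setq count (1+ count))))
    count))
```

Evaluate the definition, then call it with Alt+x my-decimal-entities-to-chars in the buffer you want to convert.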

The key here is writing the replacement function ff.

Your function ff reads the captured digits from the current match, then returns the Unicode character that has that codepoint. For example, if the captured string is "945", then ff should return the string "α".

Here's the code:

(defun ff ()
  "Temp function. Return a string based on the current regex match.
This is for the regex: &#\\([0-9]+\\);"
  (char-to-string (string-to-number (match-string 1))))
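
To try ff outside of query-replace-regexp, you can set up a match by hand. This is just a sketch for checking the function: it repeats the definition of ff so the snippet stands alone, and the temp buffer and its content are made up for the test. Note that ff takes no argument; it depends on the match data left by the most recent search.

```elisp
;; `ff' repeated from the article so this snippet is self-contained.
(defun ff ()
  "Return the character named by capture group 1 of the last match."
  (char-to-string (string-to-number (match-string 1))))

(with-temp-buffer
  (insert "&#945;")
  (goto-char (point-min))
  ;; The search sets the match data that `ff' reads.
  (when (re-search-forward "&#\\([0-9]+\\);" nil t)
    (ff)))
;; evaluates to "α"
```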

A Shortcut

Once you become familiar with using a lisp expression for regex replacement, you can skip defining a function and simply type this expression as the replacement:
\,(char-to-string (string-to-number \1))


Carlos at comp.lang.lisp, and Jon Snader (jcs) on his blog (irreal.org) gave the following nice solutions:

\,(char-to-string \#1)

or

\,(format "%c" \#1)

When using a lisp expression in query-replace-regexp, \1 is the first captured string, and \#1 is the first captured string converted to a number.
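
In plain elisp terms, the difference between \1 and \#1 looks roughly like this. It's a sketch, not what emacs literally runs: the function name my-entity-views is made up, and it matches against a string instead of a buffer so that it's easy to evaluate.

```elisp
;; Hypothetical helper for illustration; the name is not from the article.
(defun my-entity-views (text)
  "Return the capture in TEXT as string, as number, and as character."
  (when (string-match "&#\\([0-9]+\\);" text)
    (let* ((as-string (match-string 1 text))          ; like \1
           (as-number (string-to-number as-string)))  ; like \#1
      (list as-string
            as-number
            (char-to-string as-number)
            (format "%c" as-number)))))

(my-entity-views "&#945;")
;; evaluates to ("945" 945 "α" "α")
```

Both char-to-string and format with "%c" need a number, which is why the \#1 form saves the call to string-to-number.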

Function as Replacement String