Emacs: Remove Accent Marks
Here's an Emacs command that removes accent marks, or more generally converts some Unicode characters into ASCII (aka Zap Gremlins).
For example:
- café → cafe
- naïve → naive
(defun xah-asciify-text (&optional Begin End)
  "Remove accents in some letters. e.g. café → cafe.
Change European language characters into equivalent ASCII ones.
When called interactively, work on current line or text selection.

URL `http://xahlee.info/emacs/emacs/emacs_zap_gremlins.html'
Version 2018-11-12 2021-09-17"
  (interactive)
  (let (($charMap
         [
          ["ß" "ss"]
          ["á\\|à\\|â\\|ä\\|ā\\|ǎ\\|ã\\|å\\|ą\\|ă\\|ạ\\|ả\\|ả\\|ấ\\|ầ\\|ẩ\\|ẫ\\|ậ\\|ắ\\|ằ\\|ẳ\\|ặ" "a"]
          ["æ" "ae"]
          ["ç\\|č\\|ć" "c"]
          ["é\\|è\\|ê\\|ë\\|ē\\|ě\\|ę\\|ẹ\\|ẻ\\|ẽ\\|ế\\|ề\\|ể\\|ễ\\|ệ" "e"]
          ["í\\|ì\\|î\\|ï\\|ī\\|ǐ\\|ỉ\\|ị" "i"]
          ["ñ\\|ň\\|ń" "n"]
          ["ó\\|ò\\|ô\\|ö\\|õ\\|ǒ\\|ø\\|ō\\|ồ\\|ơ\\|ọ\\|ỏ\\|ố\\|ổ\\|ỗ\\|ộ\\|ớ\\|ờ\\|ở\\|ợ" "o"]
          ["ú\\|ù\\|û\\|ü\\|ū\\|ũ\\|ư\\|ụ\\|ủ\\|ứ\\|ừ\\|ử\\|ữ\\|ự" "u"]
          ["ý\\|ÿ\\|ỳ\\|ỷ\\|ỹ" "y"]
          ["þ" "th"]
          ["ď\\|ð\\|đ" "d"]
          ["ĩ" "i"]
          ["ľ\\|ĺ\\|ł" "l"]
          ["ř\\|ŕ" "r"]
          ["š\\|ś" "s"]
          ["ť" "t"]
          ["ž\\|ź\\|ż" "z"]
          [" " " "]       ; thin space etc
          ["–" "-"]       ; dash
          ["—\\|一" "--"] ; em dash etc
          ])
        ($p1 (if Begin
                 Begin
               (if (region-active-p)
                   (region-beginning)
                 (line-beginning-position))))
        ($p2 (if End
                 End
               (if (region-active-p)
                   (region-end)
                 (line-end-position)))))
    (let ((case-fold-search t))
      (save-restriction
        (narrow-to-region $p1 $p2)
        (mapc
         (lambda ($pair)
           (goto-char (point-min))
           (while (re-search-forward (elt $pair 0) (point-max) t)
             (replace-match (elt $pair 1))))
         $charMap)))))

(defun xah-asciify-string (String)
  "Returns a new string. e.g. café → cafe.
See `xah-asciify-text'
Version 2015-06-08"
  (with-temp-buffer
    (insert String)
    (xah-asciify-text (point-min) (point-max))
    (buffer-string)))
[see Accent Marks: Trema, Umlaut, Macron, Circumflex]
(Thanks to robert_nagy for adding chars.)
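The core technique of xah-asciify-text is a table of regexps, each paired with an ASCII replacement and applied in turn to the text. Here is a miniature sketch of that technique. The function name my-demo-asciify and its three-row table are hypothetical, made up for illustration; it is not a replacement for the full command.

```elisp
;; Hypothetical miniature of the table-driven technique.
;; Only three rows; the real table in xah-asciify-text is much larger.
(defun my-demo-asciify (string)
  "Return a copy of STRING with a few accented letters replaced by ASCII."
  (let ((char-map '(("é\\|è\\|ê\\|ë" . "e")
                    ("ï\\|í\\|ì\\|î" . "i")
                    ("ß" . "ss"))))
    (with-temp-buffer
      (insert string)
      ;; Bind inside the temp buffer so the search there is case-insensitive.
      (let ((case-fold-search t))
        (dolist (pair char-map)
          ;; Restart from the top of the buffer for every table row.
          (goto-char (point-min))
          (while (re-search-forward (car pair) nil t)
            (replace-match (cdr pair)))))
      (buffer-string))))
```

For example, (my-demo-asciify "café naïve") returns "cafe naive". Each row costs one full pass over the text, so the running time grows with the number of rows times the text length.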
Accumulator vs Parallel Programming
This problem makes a good parallel programming exercise. See: Parallel Programing Exercise: asciify-string.
Alternative Solution with “iconv” or perl
Yuri Khan and Teemu Likonen suggested using the “iconv” shell command (see man iconv). Here's Teemu's code.
(defun asciify-string (string)
  "Convert STRING to ASCII string.
For example:
“passé” becomes “passe”"
  ;; Code originally by Teemu Likonen
  (with-temp-buffer
    (insert string)
    (call-process-region (point-min) (point-max) "iconv" t t nil "--to-code=ASCII//TRANSLIT")
    (buffer-substring-no-properties (point-min) (point-max))))
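Because the iconv approach depends on an external program, a caller may want to guard against the program being missing or failing. The following is a minimal sketch of such a guard, not part of Teemu's code. The name my-asciify-string-maybe-iconv is hypothetical, and returning the input unchanged on failure is an assumed policy.

```elisp
;; Hypothetical wrapper: use iconv when it is available and succeeds,
;; otherwise return the input unchanged (an assumed fallback policy).
(defun my-asciify-string-maybe-iconv (string)
  "Return STRING transliterated to ASCII by iconv, or STRING itself on failure."
  (if (not (executable-find "iconv"))
      string
    (with-temp-buffer
      (insert string)
      ;; call-process-region returns the exit status (or a signal description).
      (let ((status (call-process-region
                     (point-min) (point-max)
                     "iconv" t t nil "--to-code=ASCII//TRANSLIT")))
        (if (equal status 0)
            (buffer-string)
          string)))))
```

The result for accented input depends on the installed iconv and the current locale, so it may differ between systems.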
Julian Bradfield suggested Perl. Here's his one-liner, which removes chars with accent marks.
perl -e 'use encoding utf8; use Unicode::Normalize; while ( <> ) { $_ = NFKD($_); s/\pM//g; print; }'
http://groups.google.com/group/comp.emacs/msg/8d58b6e9b2bd07fd
Though, it would be nice to have a pure elisp solution, because “iconv” is not in Windows or Mac OS X as of .
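One possible pure-elisp direction mirrors the Perl one-liner: decompose the text with Unicode normalization form NFKD, then drop the combining marks. The sketch below uses the ucs-normalize library bundled with Emacs. The function name my-strip-accents-nfkd is hypothetical. Letters that have no Unicode decomposition, such as ß, æ, ø, đ and ł, are left untouched, so this covers only part of what the lookup table in xah-asciify-text does.

```elisp
(require 'ucs-normalize)

;; Hypothetical helper: NFKD-decompose, then remove combining marks.
(defun my-strip-accents-nfkd (string)
  "Return STRING with combining marks removed after NFKD decomposition."
  (let ((decomposed (ucs-normalize-NFKD-string string)))
    (concat
     (delq nil
           (mapcar
            (lambda (ch)
              ;; Keep a character unless its general category is a
              ;; mark (Mn, Mc or Me).
              (unless (memq (get-char-code-property ch 'general-category)
                            '(Mn Mc Me))
                ch))
            decomposed)))))
```

For example, (my-strip-accents-nfkd "café") returns "cafe". Because NFKD is a compatibility decomposition, it also rewrites characters such as ligatures and full-width forms, which may or may not be wanted.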