Annoying Invisible ZERO WIDTH NO-BREAK SPACE Character from Google Plus, Twitter


These days, when copying text from Google Plus or Twitter, you'll often get an invisible ZERO WIDTH NO-BREAK SPACE (aka BYTE ORDER MARK) (Unicode #65279, U+FEFF). If you write a blog, that's really annoying. It taints your blog. Later, when you apply a regex to systematically process your site, it may silently fail because of the invisible character. Even more common is the NO-BREAK SPACE (Unicode #160, U+00A0).

So, I use this Emacs Lisp code to solve the problem:

;; count ZERO WIDTH NO-BREAK SPACE
(xah-find-count (char-to-string 65279) ">" "0" "~/web/" "\\.html$")

;; list text that contains ZERO WIDTH NO-BREAK SPACE
(xah-find-text (char-to-string 65279) "~/web/" "\\.html$" "fixed-case-search" "print-context")

;; remove ZERO WIDTH NO-BREAK SPACE (Unicode #65279)
(xah-find-replace-text (char-to-string 65279) "" "~/web/" "\\.html$" "fixed-case-search" "fixed-case-replace")

;; count RIGHT-TO-LEFT MARK
(xah-find-count (char-to-string 8207) ">" "0" "~/web/" "\\.html$")

For interactive use in the current buffer, here's a command:

(defun replace-BOM-mark-etc ()
  "Query-replace some invisible Unicode chars with nothing.
The chars searched for are:
 RIGHT-TO-LEFT MARK 8207 x200f
 ZERO WIDTH NO-BREAK SPACE 65279 xfeff

Starts at cursor position and goes to the end of buffer."
  (interactive)
  (query-replace-regexp "\u200f\\|\ufeff" ""))

These commands are from xah_file_util.el at http://code.google.com/p/ergoemacs/source/checkout

You can write a {Perl, Python, Ruby, Bash} script to solve the problem. See:
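
For example, here is a minimal Python sketch of such a script. It is not from xah_file_util.el. The names `strip_invisible` and `clean_tree` are made up for illustration, and it assumes your files are UTF-8. It removes the same 2 characters as the Emacs Lisp command above (RIGHT-TO-LEFT MARK and ZERO WIDTH NO-BREAK SPACE), and reports how many it removed per file.

```python
import re
from pathlib import Path

# Same characters as the Emacs Lisp command above:
# U+200F RIGHT-TO-LEFT MARK and U+FEFF ZERO WIDTH NO-BREAK SPACE.
INVISIBLE = re.compile("[\u200f\ufeff]")


def strip_invisible(text):
    """Return (cleaned_text, number_of_characters_removed)."""
    return INVISIBLE.subn("", text)


def clean_tree(root, pattern="*.html"):
    """Rewrite, in place, every matching file under root that contains the chars.

    Assumes files are UTF-8. Returns a dict of {file path: removal count}
    for the files that were changed.
    """
    changed = {}
    for path in Path(root).expanduser().rglob(pattern):
        if not path.is_file():
            continue
        # newline="" keeps the file's original line endings intact
        with open(path, encoding="utf-8", newline="") as f:
            original = f.read()
        cleaned, count = strip_invisible(original)
        if count:
            with open(path, "w", encoding="utf-8", newline="") as f:
                f.write(cleaned)
            changed[str(path)] = count
    return changed
```

Calling `clean_tree("~/web/")` would process the same directory as the Emacs Lisp examples. It overwrites files, so make a backup first.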

See also: Unicode BOM Byte Order Mark Hack
