Hunting for Unicode in Emacs

<nerdy>

Emacs has wonderful Unicode support. Copy and paste text from a Word document and Emacs will happily preserve your smart quotes, ellipses, and em dashes. There isn’t a canonical way, however, to convert these “special” characters to their more sane ASCII counterparts.

The Unix command tidy does a good job of converting Unicode characters, but you are left with ugly HTML entities instead of the usual quote character. We’ll need an alternative for Emacs, preferably written in Emacs Lisp.

Targeting special characters

Since Emacs understands Unicode, you can search for special characters by typing C-s and then the character. For example, you can search for the Unicode TRADE MARK SIGN (\u2122) with C-s ™. That’s all fine and dandy, but what if you don’t know how to type ™ and instead want to search for an arbitrary Unicode character by its code? Emacs provides quoted-insert (C-q) for typing control characters like newline (C-q C-j), and it also accepts a numeric character code, which covers ™.

Unicode characters can be typed using quoted-insert and the character code, but there is a bit of a catch. The variable read-quoted-char-radix defaults to 8, which means you’ll need to type character codes in octal. Since Unicode code points are usually written in hexadecimal, you can switch the radix with:

(setq read-quoted-char-radix 16) ; or 10, if you want to type in decimal

Once that’s done you can find the trade mark sign with C-s C-q 2 1 2 2 RET.
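To make the radix difference concrete, here is a small sketch that shows the digits you would type after C-q for one code point under each radix. The helper name my/code-point-in-radixes is made up for this example and is not part of Emacs.

```elisp
;; Hypothetical helper: show the digits to type after C-q for CODE
;; under each common value of `read-quoted-char-radix'.
(defun my/code-point-in-radixes (code)
  "Return an alist mapping a radix to the digits for CODE in that radix."
  (list (cons 8  (format "%o" code))
        (cons 10 (format "%d" code))
        (cons 16 (format "%X" code))))

(my/code-point-in-radixes #x2122)
;; => ((8 . "20442") (10 . "8482") (16 . "2122"))
```

So with the default radix of 8, the trade mark sign would be C-q 2 0 4 4 2 RET instead.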

If you’d like to find more information about the character under the point in Emacs, type C-u C-x = (what-cursor-position with a prefix argument).
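If you would rather hunt down every special character at once, the lookup can be scripted. The following is a rough sketch, assuming an Emacs recent enough to have get-char-code-property and the [:ascii:] regexp class; my/non-ascii-chars is a hypothetical name chosen for this example.

```elisp
;; Hypothetical helper: collect every non-ASCII character in STRING
;; together with its hexadecimal code point and Unicode name.
(defun my/non-ascii-chars (string)
  "Return a list of (CHAR-STRING HEX-CODE NAME) for non-ASCII chars in STRING."
  (with-temp-buffer
    (insert string)
    (goto-char (point-min))
    (let (found)
      ;; [^[:ascii:]] matches any single character outside the ASCII range.
      (while (re-search-forward "[^[:ascii:]]" nil t)
        (let ((ch (char-before)))
          (push (list (string ch)
                      (format "%04X" ch)
                      (get-char-code-property ch 'name))
                found)))
      ;; Entries were pushed in reverse; restore order of appearance.
      (nreverse found))))

(my/non-ascii-chars "Acme\u2122 says \u201Chi\u201D")
```

The hex codes it reports are the same ones you would type after C-q once read-quoted-char-radix is set to 16.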

A “tidy” function in Emacs Lisp

When you’re writing Emacs lisp you need not fool with quote characters. A simple

(replace-string "\u2122" "TM")

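will suffice. Here is a version of “tidy” to replace all the tricky characters in your buffer. It uses the cl-loop macro from cl-lib, the successor to the older cl package and its loop macro.

(require 'cl-lib)  ; provides cl-loop

(defun tidy ()
  "Tidy up a buffer by replacing all special Unicode characters
(smart quotes, etc.) with their more sane cousins."
  (interactive)
  ;; Each entry is (REGEXP . REPLACEMENT).  Inside a character class the
  ;; characters are simply listed; a \| there would match a literal "|".
  (let ((unicode-map '(("[\u2018\u2019\u201A\uFFFD]" . "'")
                       ("[\u201C\u201D\u201E]" . "\"")
                       ("[\u2013\u2014]" . "-")
                       ("\u2026" . "...")
                       ("\u00A9" . "(c)")
                       ("\u00AE" . "(r)")
                       ("\u2122" . "TM")
                       ("[\u02DC\u00A0]" . " "))))
    (save-excursion
      (cl-loop for (key . value) in unicode-map
               do
               (goto-char (point-min))
               ;; Search and replace by hand; replace-regexp is meant
               ;; for interactive use.
               (while (re-search-forward key nil t)
                 (replace-match value t t))))))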
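For text that lives in a string rather than a buffer, the same idea can be sketched with replace-regexp-in-string. This is an illustrative variant rather than part of the function above: my/asciify-map and my/asciify-string are hypothetical names, and the map is deliberately shortened.

```elisp
;; Hypothetical string-level variant with a shortened replacement map.
(defvar my/asciify-map
  '(("[\u2018\u2019]" . "'")
    ("[\u201C\u201D]" . "\"")
    ("[\u2013\u2014]" . "-")
    ("\u2026" . "...")
    ("\u2122" . "TM"))
  "Alist of (REGEXP . REPLACEMENT) pairs used by `my/asciify-string'.")

(defun my/asciify-string (string)
  "Return STRING with the characters in `my/asciify-map' replaced."
  (dolist (pair my/asciify-map string)
    ;; FIXEDCASE and LITERAL are both t, so the replacement text is
    ;; inserted verbatim.
    (setq string
          (replace-regexp-in-string (car pair) (cdr pair) string t t))))

(my/asciify-string "\u201CWait\u2026\u201D \u2014 Acme\u2122")
;; => "\"Wait...\" - AcmeTM"
```

Working on strings keeps point and the current buffer untouched, which can be handy when cleaning text before it is ever inserted.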

I was surprised to discover how powerful the Emacs quoted-insert (C-q) can be when dealing with Unicode. For the curious you can also see a table of commonly used characters.