Emacs: Regular Expression Syntax

By Xah Lee.

The regular expression syntax below is shown without the string delimiters.

In Lisp code, a regex is written as a string, so each backslash in the regex needs another backslash in front of it. 〔see Elisp: Regex Backslash in Lisp Code〕
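The doubling comes from how Lisp reads strings, not from the regex engine. A minimal sketch, with sample strings made up for illustration:

```elisp
;; In a Lisp string, "\\" is one backslash character.
;; So the 2-char regex \( is written "\\(" in Lisp code.
(length "\\(")
;; 2

;; The regex \(ab\)+ written as a Lisp string.
;; string-match returns the index where the match starts.
(string-match "\\(ab\\)+" "xxababab")
;; 2

;; To match one literal backslash, the regex is \\ (2 chars),
;; which takes 4 backslashes in a Lisp string.
;; The text "a\\b" is the 3 chars: a, backslash, b.
(string-match "\\\\" "a\\b")
;; 1
```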

Character Alternative (Wildcard)

.

Any single character except newline char.

[chars]

Any of the chars. e.g. [12x] means any of 1, 2, or x.

[a-z]

Any char in the range a to z.

A hyphen inside the square bracket has special meaning. There can be multiple ranges, e.g. [0-9a-z] means digits and letters.

It also can mix with other chars, e.g. [😍0-9a-z] means including the emoji 😍.

The range of characters is ordered by Unicode codepoint. e.g. a is 97, z is 122, so [a-z] means any char from codepoint 97 to codepoint 122, inclusive.

To include a hyphen, put it at the beginning or end, e.g. [-a-c] means any of a, b, c, or hyphen.

[^chars_or_range]

Not any of the chars_or_range. e.g. [^12x] means any char that is not 1, 2, or x.

To include a CIRCUMFLEX ACCENT ^ in the set, put it anywhere except the first position, e.g. as second character or at the end. [^0-9^] means any char that is neither a digit nor CIRCUMFLEX ACCENT.
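The bracket rules above can be checked on plain strings with string-match. This is a small sketch, the sample strings are made up:

```elisp
;; string-match returns the start index of the first match, or nil.

;; [12x] : the first char that is 1, 2, or x
(string-match "[12x]" "abc2")
;; 3

;; [-a-c]+ : a hyphen placed first is a literal hyphen
(let ((text "zz a-b-c zz"))
  (string-match "[-a-c]+" text)
  (match-string 0 text))
;; "a-b-c"

;; [^0-9^]+ : one or more chars that are neither a digit nor ^
(let ((text "12^ab^34"))
  (string-match "[^0-9^]+" text)
  (match-string 0 text))
;; "ab"
```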

;; match html links

(progn
  (re-search-forward "<a href=\"\\([^>]+\\)\">\\([^<]+\\)</a>")
  (replace-match "URL: \\1 , Link text: \\2" t))

;; <a href="http://example.com/big.html">xyz</a>

;; becomes

;; URL: http://example.com/big.html , Link text: xyz
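To try the link example without touching a real buffer, it can be wrapped in a temp buffer. A sketch only: my-rewrite-link is a hypothetical helper name, not part of Emacs.

```elisp
;; my-rewrite-link is a hypothetical helper, not a built-in
(defun my-rewrite-link (html)
  "Return HTML with its first link rewritten as plain text."
  (with-temp-buffer
    (insert html)
    (goto-char (point-min))
    ;; third arg t: return nil instead of signaling an error when nothing matches
    (when (re-search-forward "<a href=\"\\([^>]+\\)\">\\([^<]+\\)</a>" nil t)
      ;; second arg t: do not adjust letter case of the replacement
      (replace-match "URL: \\1 , Link text: \\2" t))
    (buffer-string)))

(my-rewrite-link "<a href=\"http://example.com/big.html\">xyz</a>")
;; "URL: http://example.com/big.html , Link text: xyz"

(my-rewrite-link "no link here")
;; "no link here"
```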

Repetition

regex+

Match previous pattern 1 or more times. e.g. a+ means 1 or more occurrences of “a”. [0-9]+ means 1 or more occurrences of a digit.

(re-search-forward "x+")

;; abc xxx

;; examples of Common Wildcard with Repetition

;; HHHH---------------------------------------------------
;; One or more digits

(re-search-forward "[0-9]+")

;; 100

;; HHHH---------------------------------------------------
;; One or more letters of English alphabet.

(re-search-forward "[A-Za-z]+")

;; SOME or some

;; HHHH---------------------------------------------------
;; One or more {letter, digit, hyphen}

(re-search-forward "[-A-Za-z0-9]+")

;; x-list5

;; HHHH---------------------------------------------------
;; One or more {letter, digit, underscore}

(re-search-forward "[_A-Za-z0-9]+")

;; x_list5

;; HHHH---------------------------------------------------
;; One or more {letter, digit, hyphen, underscore}

(re-search-forward "[-_A-Za-z0-9]+")

;; x-list5 or
;; x_list5

regex+?

Match previous pattern 1 or more times, but with minimal match (aka non-greedy).

(re-search-forward "x+?")

;; abc xxx
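The difference between greedy and non-greedy shows up when you look at what text was matched. A sketch, where my-first-match is a hypothetical helper and the sample strings are made up:

```elisp
;; my-first-match is a hypothetical helper, not a built-in
(defun my-first-match (regex text)
  "Return the first substring of TEXT that matches REGEX, or nil."
  (when (string-match regex text)
    (match-string 0 text)))

(my-first-match "x+" "abc xxx")
;; "xxx"

(my-first-match "x+?" "abc xxx")
;; "x"

;; It matters most when something follows the repetition.
(my-first-match "<b>.+</b>" "<b>one</b> <b>two</b>")
;; "<b>one</b> <b>two</b>"

(my-first-match "<b>.+?</b>" "<b>one</b> <b>two</b>")
;; "<b>one</b>"
```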
regex*

Match previous pattern 0 or more times.

(list-matching-lines "https*")

;; http://example.com/
;; https://example.com/

regex*?

Match previous pattern 0 or more times, but with minimal match (aka non-greedy).

regex?

Match previous pattern 0 or 1 time.

regex??

Match previous pattern 0 or 1 time, but with minimal match (aka non-greedy).
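A short sketch of the star and question mark operators on plain strings. The helper my-first-match and the sample strings are made up for illustration:

```elisp
;; my-first-match is a hypothetical helper, not a built-in
(defun my-first-match (regex text)
  "Return the first substring of TEXT that matches REGEX, or nil."
  (when (string-match regex text)
    (match-string 0 text)))

;; colou?r : the u is optional
(my-first-match "colou?r" "color")   ;; "color"
(my-first-match "colou?r" "colour")  ;; "colour"

;; https* : zero s is fine too
(my-first-match "https*" "see http://example.com/")
;; "http"

;; ? takes the optional char when it can, ?? prefers to skip it
(my-first-match "ab?" "ab")   ;; "ab"
(my-first-match "ab??" "ab")  ;; "a"
```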

regex\{m\}

Match previous pattern m times.

(re-search-forward "x\\{2\\}")

;; x and xx

regex\{m,n\}

Match previous pattern m to n times.

(list-matching-lines "x\\{3,4\\}")
;; x
;; xx
;; xxx
;; xxxx
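To see which of the sample lines above actually match, the counted repetition can be tried on strings. A sketch, my-first-match is a hypothetical helper. The last form, \{m,\} with no upper bound, is not covered above but is part of Emacs regex syntax.

```elisp
;; my-first-match is a hypothetical helper, not a built-in
(defun my-first-match (regex text)
  "Return the first substring of TEXT that matches REGEX, or nil."
  (when (string-match regex text)
    (match-string 0 text)))

(my-first-match "x\\{2\\}" "x and xx")   ;; "xx"

(my-first-match "x\\{3,4\\}" "xx")       ;; nil
(my-first-match "x\\{3,4\\}" "xxx")      ;; "xxx"
(my-first-match "x\\{3,4\\}" "xxxxxx")   ;; "xxxx"

;; \{m,\} means m or more
(my-first-match "x\\{2,\\}" "xxxxxx")    ;; "xxxxxx"
```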

Capture Match

A pattern that occurs in a string can be captured and given a number, so that the matched text can be referred to later.

In interactive commands, captured text can be represented as \1 for the first group, \2 for the second group, etc., and \0 for the entire match (the Emacs manual documents \& for the entire match).

In Emacs Lisp code, to get a captured group, use match-string (or match-string-no-properties), match-beginning, etc. 〔see Elisp: Regex Functions〕
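A sketch of the match data functions used on a string instead of a buffer. When matching a string with string-match, pass the same string to match-string. The sample text is made up.

```elisp
(let ((text "there are 10 cats"))
  (when (string-match "\\([0-9]+\\) \\([a-z]+\\)" text)
    (list (match-string 0 text)   ; entire match
          (match-string 1 text)   ; first group
          (match-string 2 text)   ; second group
          (match-beginning 1)     ; index where group 1 starts
          (match-end 1))))        ; index just after group 1
;; ("10 cats" "10" "cats" 10 12)

;; in a replacement string, \1 and \2 refer to the groups
(replace-regexp-in-string "\\([0-9]+\\) \\([a-z]+\\)" "\\2: \\1" "there are 10 cats")
;; "there are cats: 10"
```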

\(regex\)

Capture. Captured text can be used later for text replacement, or be referenced in the same regex for a pattern that occurs multiple times.

(progn
  ;; capture digit sequence
  (re-search-forward "\\([0-9]+\\)")
  (match-string-no-properties 1))

;; there are 10 cats

;; return
;; "10"

;; Examples of Capture

;; HHHH---------------------------------------------------
;; Capture digit sequence.

(progn
  (re-search-forward "\\([0-9]+\\)")
  (match-string-no-properties 1))

;; there are 10 cats


;; HHHH---------------------------------------------------
;; Capture English letter sequence. Do not use [A-z], because that'll match some punctuation chars too.

(progn
  (re-search-forward "\\([A-Za-z]+\\)")
  (match-string-no-properties 1))

;; some thing

;; HHHH---------------------------------------------------
;; Capture English letter sequence plus hyphen.

(progn
  (re-search-forward "\\([-A-Za-z]+\\)")
  (match-string-no-properties 1))

;; non-breaking

;; HHHH---------------------------------------------------
;; Capture English letter sequence plus hyphen and low line.

(progn
  (re-search-forward "\\([-_A-Za-z]+\\)")
  (match-string-no-properties 1))

;; some_thing

;; HHHH---------------------------------------------------
;; Capture alphanumeric sequence plus hyphen and low line.

(progn
  (re-search-forward "\\([-_A-Za-z0-9]+\\)")
  (match-string-no-properties 1))

;; a-mm_x55

;; HHHH---------------------------------------------------
;; capture between quotes

(progn
  (re-search-forward "\"\\([^\"]+\\)\"")
  (match-string-no-properties 1))

;; he said "how are you"

\(?n:regex\)
  • Explicitly numbered capture.
  • Capture and name it by an explicit number n.
  • (Normally, captures are numbered automatically in order. First capture is 1, second is 2, etc.)
  • If multiple captures are given the same number, that number refers to the last one that matched.

;; capture, and name it 2
(progn
  (re-search-forward "\\(?2:[0-9]+\\)")
  (match-string-no-properties 2))

;; some 99 cats

;; return
;; "99"

;; if multiple captures have the same number, that number refers to the last match
(progn
  ;; capture both digits and word, both named 2
  (re-search-forward "\\(?2:[0-9]+\\) \\(?2:[a-z]+\\)")
  (match-string-no-properties 2))

;; some 99 cats and 71 dogs

;; return
;; "cats"
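One use of giving the same number to more than one capture is an alternative where either side should end up in the same group. A sketch under that assumption, my-quoted-text is a hypothetical helper:

```elisp
;; my-quoted-text is a hypothetical helper, not a built-in
(defun my-quoted-text (text)
  "Return the first double-quoted or single-quoted text in TEXT, or nil."
  ;; both alternatives store their capture as group 1
  (when (string-match "\"\\(?1:[^\"]*\\)\"\\|'\\(?1:[^']*\\)'" text)
    (match-string 1 text)))

(my-quoted-text "say \"hi\" now")
;; "hi"

(my-quoted-text "say 'yo' now")
;; "yo"

(my-quoted-text "nothing quoted")
;; nil
```

Without the explicit number, the second capture would be group 2, and the caller would have to check both groups.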
\n
  • Reference the nth capture that occurred to the left of it.
  • This lets you match a pattern that is repeated.
;; capture a repeated number
(re-search-forward "\\([0-9]+\\).+\\1")

;; 766 44 212 20 099 44 519

(match-string-no-properties 1)
;; "44"
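The same backreference can be tried on a string. A common use is finding a doubled word. A sketch, with made-up sample text and a hypothetical helper name:

```elisp
;; the example above, on a string
(let ((text "766 44 212 20 099 44 519"))
  (when (string-match "\\([0-9]+\\).+\\1" text)
    (match-string 1 text)))
;; "44"

;; my-doubled-word is a hypothetical helper, not a built-in
(defun my-doubled-word (text)
  "Return the first lowercase word in TEXT that is repeated right after itself, or nil."
  (when (string-match "\\b\\([a-z]+\\) \\1\\b" text)
    (match-string 1 text)))

(my-doubled-word "this is is a test")
;; "is"

(my-doubled-word "this is a test")
;; nil
```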

Grouping

\(?:regex\)
  • This is called shy group.
  • Group for precedence, but no capture.
(re-search-forward "\\(?:[0-9]+\\)\\|\\(?:[a-z]+\\)" )
;; 10 some

Alternative

a\|b

Match either pattern.

The alternative operator has very low precedence: everything on its left side is one alternative, and everything on its right side is the other.

You can use the capture \(expr\) or shy group \(?:expr\) to specify precedence.

(re-search-forward "jpg\\|jpeg")

;; cat.jpeg

;; match ax or bx
;; using an alternative with shy group
(re-search-forward "\\(?:a\\|b\\)x")

;; ax
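The effect of precedence is easiest to see when the regex must match the whole string. A sketch, my-whole-match-p is a hypothetical helper that wraps the regex in a shy group between \` and \' anchors:

```elisp
;; my-whole-match-p is a hypothetical helper, not a built-in
(defun my-whole-match-p (regex text)
  "Return t if REGEX matches all of TEXT, else nil."
  (and (string-match (concat "\\`\\(?:" regex "\\)\\'") text) t))

;; a\|bx means: a, or bx
(my-whole-match-p "a\\|bx" "a")    ;; t
(my-whole-match-p "a\\|bx" "bx")   ;; t
(my-whole-match-p "a\\|bx" "ax")   ;; nil

;; \(?:a\|b\)x means: a or b, followed by x
(my-whole-match-p "\\(?:a\\|b\\)x" "ax")  ;; t
(my-whole-match-p "\\(?:a\\|b\\)x" "a")   ;; nil
```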

Character Classes

Note: these character classes are only valid inside a bracket expression, so each one ends up with two pairs of square brackets, e.g. [[:ascii:]].

(re-search-forward "[[:ascii:]]+" )
;; search one or more ascii chars

[:ascii:]

Any ASCII character (codepoint 0 to 127, inclusive).

[:nonascii:]

Any character that's not ASCII.

[:alnum:]
  • Any letter or digit.
  • It matches characters whose Unicode general-category property (see Character Properties) indicates they are alphabetic or decimal number characters.
[:digit:]

any 0 to 9.

[:xdigit:]

Hexadecimal digits. 0 to 9, a to f, A to F.

[:alpha:]

Matches characters whose Unicode general-category property indicates they are alphabetic characters.

[:graph:]
  • This matches graphic characters.
  • Anything except whitespace, control characters, surrogates, and codepoints unassigned by Unicode, as indicated by the Unicode general-category property.
[:print:]

This matches any printing character—either whitespace, or a graphic character matched by [:graph:].

[:blank:]

any horizontal whitespace, as defined by Unicode. Includes ASCII spaces and tab characters.

[:cntrl:]

Any ASCII control character. 〔see ASCII Characters〕

[:lower:]
  • Any lower-case letter, as determined by the current case table.
  • If case-fold-search is non-nil, this also matches any upper-case letter.
[:upper:]
  • This matches any upper-case letter, as determined by the current case table.
  • If case-fold-search is non-nil, this also matches any lower-case letter.
[:multibyte:]

matches any multibyte character.

[:unibyte:]

This matches any unibyte character.
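A few of the classes above tried on ASCII strings. This is a sketch, my-first-match is a hypothetical helper and the sample strings are made up. The last two lines show what happens when the outer brackets are left out: the regex becomes an ordinary set of the chars : d i g t.

```elisp
;; my-first-match is a hypothetical helper, not a built-in
(defun my-first-match (regex text)
  "Return the first substring of TEXT that matches REGEX, or nil."
  (when (string-match regex text)
    (match-string 0 text)))

(my-first-match "[[:digit:]]+" "abc 123 def")   ;; "123"
(my-first-match "[[:alpha:]]+" "123 abc")       ;; "abc"
(my-first-match "[[:alnum:]]+" "--x5--")        ;; "x5"
(my-first-match "[[:xdigit:]]+" "#ff8800;")     ;; "ff8800"

;; a class can be mixed with other chars in the same brackets
(my-first-match "[[:alnum:]_-]+" "x-list_5!")   ;; "x-list_5"

;; without the outer brackets, this is just a set of chars
(string-match "[:digit:]" "123")   ;; nil
(string-match "[:digit:]" "dig")   ;; 0
```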

Character Classes of Emacs Syntax Table

Some character classes in Emacs regular expressions have a unique feature: they are based on the Emacs Syntax Table. This means these character classes can match different characters depending on the major mode of the current buffer.

〔see Regex Named Character Class and Syntax Table〕

The following character classes are based on Syntax Table.

[:punct:]
  • This matches any punctuation character.
  • (At present, for multibyte characters, it matches anything that has non-word syntax in Syntax Table.)
[:space:]

This matches any character that has whitespace syntax in Syntax Table.

🛑 WARNING: often it does not include newline.

[:word:]

This matches any character that has word syntax in Syntax Table.

\w

This matches any character that has word syntax in Syntax Table.

\W

This matches any character that does not have word syntax in Syntax Table.

\scode

matches any character whose Syntax Class is code

;; check if next char is a string delimiter (of current syntax table)
(looking-at "\\s\"")

\Scode

matches any character whose Syntax Class is not code
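To see the dependence on the syntax table, the same regex can be run under two tables. A sketch: my-first-word and my-underscore-table are hypothetical names. It assumes the standard syntax table, where the underscore has symbol syntax, not word syntax.

```elisp
;; my-first-word is a hypothetical helper, not a built-in
(defun my-first-word (text table)
  "Return the first run of word-syntax chars in TEXT, using syntax TABLE."
  (with-syntax-table table
    (when (string-match "\\w+" text)
      (match-string 0 text))))

;; a hypothetical table: like the standard one, but underscore is a word char
(defvar my-underscore-table
  (let ((table (make-syntax-table)))   ; inherits from the standard syntax table
    (modify-syntax-entry ?_ "w" table)
    table))

(my-first-word "foo_bar" (standard-syntax-table))
;; "foo"

(my-first-word "foo_bar" my-underscore-table)
;; "foo_bar"

;; \s_ matches a char of symbol syntax class
(with-syntax-table (standard-syntax-table)
  (string-match "\\s_" "foo_bar"))
;; 3
```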

\cc

Matches any character whose category is c. 〔see Categories (ELISP Manual)〕

\Cc

Matches any character whose category is not c.

Boundary Anchors

^regex

The pattern must match starting from beginning of {line, string, buffer}

\`regex

The pattern must match starting from beginning of {string, buffer}

regex$

The pattern must match to end of {line, string, buffer}

regex\'

The pattern must match to end of {string, buffer}

\=

Matches the empty string, but only at point (the cursor position).

\b

word boundary marker

\B

marker for: not a word boundary

\<
  • marker for: beginning of word.
  • Word characters based on current Syntax Table
\>
  • marker for: end of word.
  • Word characters based on current Syntax Table
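\_<
  • marker for: beginning of symbol.
  • Symbol characters (word or symbol syntax) based on current Syntax Table
\_>
  • marker for: end of symbol.
  • Symbol characters (word or symbol syntax) based on current Syntax Table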
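A sketch comparing the anchors on strings. my-anchor-demo is a hypothetical name, the sample strings are made up, and the standard syntax table is used so the result does not depend on the current major mode. Each entry is the index where the match starts, or nil.

```elisp
;; my-anchor-demo is a hypothetical helper, not a built-in
(defun my-anchor-demo ()
  "Return match positions for several anchor patterns."
  (with-syntax-table (standard-syntax-table)
    (list
     (string-match "^b" "a\nb")                   ; b at beginning of a line
     (string-match "\\`b" "a\nb")                 ; b at beginning of string
     (string-match "a$" "a\nb")                   ; a at end of a line
     (string-match "a\\'" "a\nb")                 ; a at end of string
     (string-match "\\bcat\\b" "concat cat")      ; cat as a whole word
     (string-match "\\<foo\\>" "foo_bar foo")     ; foo as a word
     (string-match "\\_<foo\\_>" "foo_bar foo"))))  ; foo as a symbol

(my-anchor-demo)
;; (2 nil 0 nil 7 0 8)
```

The last two differ because the underscore has symbol syntax in the standard table: foo in foo_bar is a complete word, but not a complete symbol.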

Matching Unicode Characters

A Unicode character can be used literally, e.g. "♥", or it can be represented by an escape sequence. 〔see Elisp: Unicode Escape Sequence〕
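A small sketch of both forms. The escape \u2665 is the codepoint of ♥ in hexadecimal. The sample strings are made up.

```elisp
;; literal char in the regex
(string-match "♥" "I ♥ Emacs")
;; 2

;; the escape sequence is read as the same char
(string= "\u2665" "♥")
;; t

(string-match "\u2665+" "I ♥♥ Emacs")
;; 2

;; a range of non-ASCII chars works too, ordered by codepoint
(string-match "[α-ω]+" "abc δεζ")
;; 4
```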
