Emacs: Regular Expression Syntax

By Xah Lee. Date: 2022-07-27. Last updated: 2024-05-27.

Regular expression syntax below is shown without the string delimiters.

In Lisp code, a regex is written as a string, so each backslash in the regex needs another backslash in front of it. See Elisp: Regex Backslash in Lisp Code
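As a minimal sketch of the backslash rule, the following expressions compare a regex with its Lisp string form. The sample strings are made up for illustration.

```elisp
;; The regex \. (a literal dot) is written "\\." in a Lisp string.
(length "\\.")                      ;; => 2 (a backslash and a dot)
(string-match "\\." "ab.c")         ;; => 2 (index of the literal dot)
(string-match "." "ab.c")           ;; => 0 (unescaped dot matches any char)

;; The regex \(a\|b\) needs every backslash doubled.
(string-match "\\(a\\|b\\)" "xxb")  ;; => 2

;; regexp-quote escapes special chars so the text is matched literally.
(regexp-quote "1+1")                ;; => "1\\+1"
```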

Character Alternative (Wildcard)

.

Any single character except newline char.

[chars]

Any of the chars. e.g. [12x] means any of 1, 2, or x.

[a-z]

Any char in the range a to z.

A hyphen inside the square bracket has special meaning. There can be multiple ranges, e.g. [0-9a-z] means digits and letters.

It also can mix with other chars, e.g. [♥0-9a-z] means digits and letters and ♥.

The range of characters is the range of the corresponding Unicode codepoints. [see Unicode: Character Set, Encoding, UTF-8, Codepoint] For example, a has codepoint 97, z has codepoint 122, so [a-z] means any char from codepoint 97 to codepoint 122, inclusive.

To include a hyphen, put it at the beginning or end, e.g. [-a-c] means any of a, b, c, or hyphen.
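Because ranges follow codepoints, a range that spans both letter cases also picks up the chars that sit between them. A small sketch, with made-up test strings:

```elisp
?a                              ;; => 97
?z                              ;; => 122

;; _ is codepoint 95, which lies between Z (90) and a (97).
(string-match "[A-z]" "_")      ;; => 0
(string-match "[A-Za-z]" "_")   ;; => nil

;; A leading hyphen is a literal hyphen, not a range operator.
(string-match "[-a-c]" "xyz-")  ;; => 3
```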

[^chars_or_range]

Any char that is not one of chars_or_range. e.g. [^12x] means any char that is not 1, 2, or x.

To include a CIRCUMFLEX ACCENT ^, add it as second character or at the end. e.g. [^0-9^] means not any digit nor CIRCUMFLEX ACCENT.
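The negated forms can be checked with string-match, which returns the index of the first match. The sample strings here are assumptions chosen for illustration.

```elisp
(string-match "[^12x]" "x21y")    ;; => 3 (y is the first char not in the set)
(string-match "[^0-9^]" "12^a")   ;; => 3 (digits and ^ are both excluded)
(string-match "[^0-9]" "12^a")    ;; => 2 (here ^ is an ordinary non-digit)

;; Unlike the dot, a negated set can match a newline.
(string-match "[^a]" "a\nb")      ;; => 1
```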

Repetition

regex+

Match previous pattern 1 or more times. e.g. a+ means 1 or more occurrence of “a”. [0-9]+ means 1 or more occurrence of digit.

(let ((case-fold-search nil))
  (re-search-forward "x+"))
;; abc xxx

regex+?

Match previous pattern 1 or more times, but with minimal match (aka non-greedy).

(let ((case-fold-search nil))
  (re-search-forward "x+?"))
;; abc xxx

regex*

Match previous pattern 0 or more times.

regex*?

Match previous pattern 0 or more times, but with minimal match (aka non-greedy).

regex?

Match previous pattern 0 or 1 time.

regex??

Match previous pattern 0 or 1 time, but with minimal match (aka non-greedy).
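The difference between greedy and minimal matching is easiest to see on the matched text itself. A short sketch on made-up strings, using string-match and match-string:

```elisp
;; Greedy .+ runs to the last >, minimal .+? stops at the first one.
(let ((s "<b>bold</b>"))
  (list (and (string-match "<.+>" s) (match-string 0 s))
        (and (string-match "<.+?>" s) (match-string 0 s))))
;; => ("<b>bold</b>" "<b>")

;; x? prefers one occurrence, x?? prefers zero.
(let ((s "xx"))
  (list (and (string-match "x?" s) (match-string 0 s))
        (and (string-match "x??" s) (match-string 0 s))))
;; => ("x" "")

;; x* can match zero times, so it succeeds at index 0 with an empty match.
(string-match "x*" "abc xxx")  ;; => 0
```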

regex\{m\}

Match previous pattern m times.

(let ((case-fold-search nil))
  (re-search-forward "x\\{2\\}"))
;; x and xx

regex\{m,n\}

Match previous pattern m to n times.

(let ((case-fold-search nil))
  (re-search-forward "x\\{2,3\\}"))
;; x xx xxx
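The counted forms can also be tried on a string instead of a buffer. This sketch reuses the sample text from the comments above and returns match positions and matched text:

```elisp
(let ((s "x xx xxx"))
  (list (string-match "x\\{2\\}" s)                             ; first xx starts at index 2
        (and (string-match "x\\{2,3\\}" s) (match-string 0 s))  ; only 2 x are available there
        (and (string-match "x\\{3\\}" s) (match-string 0 s))))  ; skips ahead to xxx
;; => (2 "xx" "xxx")
```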

Examples of Common Wildcard with Repetition

[0-9]+

One or more digits

(let ((case-fold-search nil))
  (re-search-forward "[0-9]+"))
;; 100 cats

[A-Za-z]+

One or more letters of English alphabet.

[-A-Za-z0-9]+

One or more {letter, digit, hyphen}

[_A-Za-z0-9]+

One or more {letter, digit, underscore}

[-_A-Za-z0-9]+

One or more {letter, digit, hyphen, underscore}
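To compare the patterns above side by side, this sketch applies each one to the same made-up file name and collects the first match:

```elisp
(let ((case-fold-search nil)
      (s "my-file_name.txt"))
  (list (and (string-match "[A-Za-z]+" s) (match-string 0 s))
        (and (string-match "[-A-Za-z0-9]+" s) (match-string 0 s))
        (and (string-match "[_A-Za-z0-9]+" s) (match-string 0 s))
        (and (string-match "[-_A-Za-z0-9]+" s) (match-string 0 s))))
;; => ("my" "my-file" "my" "my-file_name")
```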

Capture

In interactive commands, captured text can be represented in the replacement as \1 for the first group, \2 for the second group, etc., and \0 for the entire match.

In Emacs Lisp code, to get a captured group, use match-string, match-beginning, etc. See Elisp: Regex Functions
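In Lisp code, group references also work in the replacement argument of replace-regexp-in-string, where \& stands for the entire match. A minimal sketch on made-up strings. The fourth argument t (FIXEDCASE) is passed so the replacement text is inserted without case conversion.

```elisp
;; Swap the two captured groups.
(replace-regexp-in-string "\\([a-z]+\\)-\\([0-9]+\\)" "\\2-\\1" "abc-123" t)
;; => "123-abc"

;; Wrap every whole match in angle brackets.
(replace-regexp-in-string "[0-9]+" "<\\&>" "10 cats, 20 dogs" t)
;; => "<10> cats, <20> dogs"
```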

\(regex\)

Capture. Captured text can be used later for text replacement, or be referenced in the same regex for a pattern that occurs multiple times.

;; capture digit sequence
(let ((case-fold-search nil))
  (re-search-forward "\\([0-9]+\\)")
  ;; there are 10 cats
  (match-string-no-properties 1))
;; 10

\(?n:regex\)

Explicitly numbered capture.
Capture and give the group the explicit number n.
(Normally, captures are numbered automatically in order: the first capture is 1, the second is 2, etc.) If multiple captures have the same number, the last one to match wins.

(let ((case-fold-search nil))
  (re-search-forward "\\(?2:[a-z]+\\)")
  (match-string-no-properties 2))
  ;; match
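One use of explicit numbering is to give the same number to a group in each branch of an alternative, so the caller reads one group no matter which branch matched. A sketch under that assumption, with made-up input strings:

```elisp
(let ((re "\\(?1:[0-9]+\\)px\\|\\(?1:[0-9]+\\)em")
      (a "width: 12px")
      (b "width: 3em"))
  (list (and (string-match re a) (match-string 1 a))
        (and (string-match re b) (match-string 1 b))))
;; => ("12" "3")
```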

\n

Reference the nth capture that occurred to the left of it.
This lets you match a sub-pattern that is repeated in different places.

;; capture a repeated word on both sides of and
(let ((case-fold-search nil))
  (re-search-forward "\\(\\b[a-z]+\\b\\) and \\1")
  (match-string-no-properties 0))

;; this and that and some and else and and and

;; result is
;; "and and and"

Examples of Capture

\([0-9]+\)

Capture digit sequence.

(re-search-forward "\\([0-9]+\\)" )
;; there are 10 cats
(match-string-no-properties 1 )
;; 10

\([A-Za-z]+\)

Capture English letter sequence. Do not use [A-z], because that'll match some punctuation chars too.

\([-A-Za-z]+\)

Capture English letter sequence plus hyphen.

\([-_A-Za-z]+\)

Capture English letter sequence plus hyphen and low line.

\([-_A-Za-z0-9]+\)

Capture alphanumeric sequence plus hyphen and low line.

"\([^"]+\)"

Match a double-quoted text, including the quotes. The group captures the text between the quotes, without the quotes.

(let ((case-fold-search nil))
  (re-search-forward "\"\\([^\"]+\\)\"")
  (match-string-no-properties 1))
;; he said "how are you"

Grouping

\(?:regex\)

This is called a shy group. It groups for precedence, but does not capture.

(re-search-forward "\\(?:[0-9]+\\)\\|\\(?:[a-z]+\\)" )
;; 10 some
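A shy group does not take a group number, so the capture groups after it keep the numbers they would have without it. A small sketch on a made-up string, comparing a shy group with a capture group:

```elisp
(let ((s "ababab7"))
  ;; Shy group: the digit is group 1.
  (string-match "\\(?:ab\\)+\\([0-9]\\)" s)
  (list (match-string 0 s) (match-string 1 s)))
;; => ("ababab7" "7")

(let ((s "ababab7"))
  ;; Capture group: "ab" takes number 1, so the digit becomes group 2.
  (string-match "\\(ab\\)+\\([0-9]\\)" s)
  (list (match-string 1 s) (match-string 2 s)))
;; => ("ab" "7")
```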

Alternative

a\|b

Match either pattern.
The alternative operator has very low precedence: everything on either side of it, up to the enclosing group or the ends of the regex, is treated as one branch. You can use the capture \(expr\) or shy group \(?:expr\) to specify precedence.

(let ((case-fold-search nil))
  (re-search-forward "jpg\\|jpeg"))
;; cat.jpeg

;; match ax or bx
;; using an alternative with shy group
(let ((case-fold-search nil))
  (re-search-forward "\\(?:a\\|b\\)x"))
;; ax
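The low precedence matters most when anchors are involved. In this sketch, on made-up strings, the ungrouped regex means "starts with cat, or ends with dog", while the grouped one means "the whole line is cat or dog".

```elisp
(string-match "^cat\\|dog$" "catalog")          ;; => 0 (the branch ^cat matches)
(string-match "^\\(?:cat\\|dog\\)$" "catalog")  ;; => nil
(string-match "^\\(?:cat\\|dog\\)$" "dog")      ;; => 0
```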

Character Classes

Note, a character class such as [:ascii:] is only valid inside a bracket expression, so it must be enclosed in another pair of square brackets, like [[:ascii:]].

(re-search-forward "[[:ascii:]]+" )
;; search one or more ascii chars

[:ascii:]: any of the ASCII Characters (codepoint 0 to 127, inclusive).
[:nonascii:]: any Character that's not ASCII.

[:alnum:]: any letter or digit. For multibyte characters, it matches characters whose Unicode general-category property (see Character Properties) indicates they are alphabetic or decimal number characters.
[:digit:]: any 0 to 9.
[:xdigit:]: for hexadecimal digits. 0 to 9, a to f, A to F.

[:alpha:]: any letter. For multibyte characters, it matches characters whose Unicode general-category property indicates they are alphabetic characters.
[:graph:]: This matches graphic characters. Anything except whitespace, ASCII and non-ASCII control characters, surrogates, and codepoints unassigned by Unicode, as indicated by the Unicode general-category property.
[:print:]: This matches any printing character—either whitespace, or a graphic character matched by [:graph:].

[:blank:]: any horizontal whitespace, as defined by Unicode. Includes ASCII spaces and tab characters.
[:cntrl:]: any character whose code is in the range 0–31. [see ASCII Characters]

[:lower:]: any lower-case letter, as determined by the current case table. If case-fold-search is non-nil, this also matches any upper-case letter.
[:upper:]: This matches any upper-case letter, as determined by the current case table. If case-fold-search is non-nil, this also matches any lower-case letter. Case Tables (ELISP Manual)

[:multibyte:]: This matches any multibyte character. Text Representations (ELISP Manual)
[:unibyte:]: This matches any unibyte character. Text Representations (ELISP Manual)
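A few of the classes above, tried on made-up strings. A class can be mixed with other chars in the same bracket, and the last two lines sketch the case-fold-search effect described for [:upper:].

```elisp
(string-match "[[:digit:]]+" "abc 123")       ;; => 4
(string-match "[[:alpha:]_]+" "123 foo_bar")  ;; => 4 (class plus a literal _)
(string-match "[[:nonascii:]]" "I ♥ Emacs")   ;; => 2
(string-match "[^[:ascii:]]" "I ♥ Emacs")     ;; => 2

(let ((s "#ff00AA;"))
  (and (string-match "[[:xdigit:]]+" s) (match-string 0 s)))
;; => "ff00AA"

(let ((case-fold-search nil)) (string-match "[[:upper:]]" "abC"))  ;; => 2
(let ((case-fold-search t)) (string-match "[[:upper:]]" "abC"))    ;; => 0
```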

Character Classes of Emacs Syntax Table

Some character classes in Emacs regular expressions have a unique feature: they are based on the Emacs Syntax Table. In effect, these character classes may match different characters depending on the major mode of the current buffer.

[see Regex Named Character Class and Syntax Table]

The following character classes are based on Syntax Table.

[:punct:]

This matches any punctuation character. (At present, for multibyte characters, it matches anything that has non-word syntax in Syntax Table.)

[:space:]

This matches any character that has whitespace syntax in Syntax Table

[:word:]

This matches any character that has word syntax in Syntax Table

\w

This matches any character that has word syntax in Syntax Table

\W

This matches any character that does not have word syntax in Syntax Table

\scode

matches any character whose Syntax Class is code

;; check if next char is a string delimiter (of current syntax table)
(looking-at "\\s\"")

\Scode

matches any character whose Syntax Class is not code

\cc

matches any character whose category is c. Categories (ELISP Manual)

\Cc

matches any character whose category is not c. Categories (ELISP Manual)
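To see the dependence on the syntax table, this sketch matches \w+ against the same made-up string under two tables: the standard syntax table, where low line has symbol syntax, and a child table (the variable name table is only for illustration) where low line is changed to word syntax.

```elisp
(let ((table (make-syntax-table))        ; inherits from the standard syntax table
      (s "foo_bar"))
  (modify-syntax-entry ?_ "w" table)     ; make _ a word char in the child table
  (list
   (with-syntax-table (standard-syntax-table)
     (and (string-match "\\w+" s) (match-string 0 s)))
   (with-syntax-table table
     (and (string-match "\\w+" s) (match-string 0 s)))))
;; => ("foo" "foo_bar")

(with-syntax-table (standard-syntax-table)
  (list (string-match "\\s-+" "a \t b")  ; whitespace syntax class
        (string-match "\\s_" "foo_bar")  ; symbol syntax class
        (string-match "\\W" "foo bar"))) ; first char without word syntax
;; => (1 3 3)
```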

Boundary Anchors

^regex: The pattern must match starting from beginning of {line, string, buffer}
regex$: The pattern must match to end of {line, string, buffer}
\`regex: The pattern must match starting from beginning of {string, buffer}
regex\': The pattern must match to end of {string, buffer}
\=: marker for the current cursor position.
[see Cursor Position Functions]
\b: word boundary marker
\B: marker for: not a word boundary
\<: marker for: beginning of word.
Word characters based on current Syntax Table
\>: marker for: end of word.
Word characters based on current Syntax Table
\_<: marker for: beginning of Symbol.
Symbol characters based on current Syntax Table
\_>: marker for: end of Symbol.
Symbol characters based on current Syntax Table
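The anchors above can be compared on a multi-line string. This sketch uses made-up strings, and the word and symbol examples assume the standard syntax table, where low line is a symbol char but not a word char.

```elisp
(let ((s "ab\ncd"))
  (list (string-match "^cd" s)      ; ^ also matches right after a newline
        (string-match "\\`cd" s)    ; \` matches only at the start of the string
        (string-match "ab$" s)      ; $ also matches right before a newline
        (string-match "ab\\'" s)    ; \' matches only at the end of the string
        (string-match "cd\\'" s)))
;; => (3 nil 0 nil 3)

(with-syntax-table (standard-syntax-table)
  (list (string-match "\\bcat\\b" "concat cat")  ; whole word only
        (string-match "\\<bar" "foo_bar bar")    ; a word begins after _
        (string-match "\\_<bar" "foo_bar bar"))) ; foo_bar is one symbol
;; => (7 4 8)
```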

Matching Unicode Characters

A Unicode character can be used literally, e.g. "♥", or it can be represented by an escape sequence. See Elisp: Unicode Escape Sequence
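A short sketch of both forms, assuming the source file is read as UTF-8. The escape \u2665 in a Lisp string is the same character as a literal ♥, and both work inside a bracket expression.

```elisp
(string= "\u2665" "♥")                         ;; => t
(string-match "♥" "I ♥ Emacs")                 ;; => 2
(string-match "\u2665" "I ♥ Emacs")            ;; => 2
(string-match "[\u2665\u2666]+" "I ♥♦ Emacs")  ;; => 2 (set of ♥ and ♦)
```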

Reference

Syntax of Regexps (ELISP Manual)