Python: Regex Syntax

By Xah Lee.

this page is under revision

A regex pattern is a string, used as a pattern to match other strings. For example, 'a+' matches repeated occurrences of the letter “a”, such as the “aaa” in "aaahh!".

For another example, r'\w+@[A-Za-z]+\.com' is a pattern that matches the email address xyz@somewhere.com.

In the example 'a+', the plus char “+” means “one or more” of the previous char. In regex, many chars have special meanings. Other chars that have special meaning include: [ ] ( ) \ . ^ $ * ? and more.
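Here is a minimal sketch of the two intro examples, using re.search. The email pattern is written as r'\w+@[A-Za-z]+\.com', and the sample sentence is made up for illustration.

```python
import re

# 'a+' matches one or more consecutive "a" characters.
repeated_a = re.search(r'a+', 'aaahh!').group(0)   # 'aaa'

# A simple email pattern (illustrative only, not a complete email matcher).
email_pattern = r'\w+@[A-Za-z]+\.com'
email = re.search(email_pattern, 'write to xyz@somewhere.com today').group(0)
# email is 'xyz@somewhere.com'
```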

This page documents the meaning of each special character and special construct.

Wildcards

\d Digits

Represents any digit. That is, any of 0 1 2 3 4 5 6 7 8 9.

\D Not Digits

Represents any non-digit character. That is, any char not \d.
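A quick sketch of \d and \D on a made-up sample string:

```python
import re

text = 'Room 42, floor 7'

digits = re.findall(r'\d', text)            # ['4', '2', '7']
digit_runs = re.findall(r'\d+', text)       # ['42', '7']
non_digit_runs = re.findall(r'\D+', text)   # ['Room ', ', floor ']
```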

\w alphanumeric plus underscore

Represents any alphanumeric character and the underscore. That is, any of a to z, A to Z, 0 to 9, or '_'. (\w is equivalent to the regex pattern r'[a-zA-Z0-9_]'.)

With re.LOCALE flag set, it will match the set [0-9_] plus whatever characters are defined as alphanumeric for the current locale. 〔see Python: Regex Flags〕

If re.UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.
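The sketch below shows the difference between ASCII-only and Unicode-aware \w. It assumes Python 3, where str patterns are Unicode-aware by default and the re.ASCII flag restores the [a-zA-Z0-9_] behavior. In Python 2, which this page describes, the Unicode-aware result would instead need unicode strings plus the re.UNICODE flag. The sample text is made up.

```python
import re

text = 'caf\u00e9_1 na\u00efve'   # "café_1 naïve", written with escapes

# Python 3 default: accented letters count as alphanumeric.
unicode_words = re.findall(r'\w+', text)
# ['caf\u00e9_1', 'na\u00efve']

# re.ASCII: \w is only [a-zA-Z0-9_], so accented letters split the words.
ascii_words = re.findall(r'\w+', text, re.ASCII)
# ['caf', '_1', 'na', 've']
```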

todo: NOTE TO DOC WRITERS: need an explicit example here.

\W

Represents any char that is not matched by the regex \w. Note: re.LOCALE and re.UNICODE flags apply the same way as with \w.

\s

Represents any whitespace character. They are: space, tab, linefeed, carriage return, form feed, vertical tab. Their ASCII values are: 32, 9, 10, 13, 12, 11. These characters are represented in Python by ' \t\n\r\f\v'. The regex \s is equivalent to r'[ \t\n\r\f\v]'.

\S

Represents any non-whitespace character; this is equivalent to the set [^ \t\n\r\f\v].
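As a small illustration, \s+ can split on runs of whitespace and \S+ can pick out the non-whitespace runs. The sample text is made up.

```python
import re

text = 'one\ttwo\nthree  four'

# Split on any run of whitespace (tab, newline, spaces).
pieces = re.split(r'\s+', text)     # ['one', 'two', 'three', 'four']

# Or collect the runs of non-whitespace directly.
tokens = re.findall(r'\S+', text)   # ['one', 'two', 'three', 'four']
```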

.

A period matches any character except a newline. e.g. .h matches both “oh” and “ah”. If the re.DOTALL flag has been specified, this matches any character including a newline. See regex functions about setting flags.
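A short sketch of the period with and without re.DOTALL:

```python
import re

pairs = re.findall(r'.h', 'oh ah')                  # ['oh', 'ah']

# Without DOTALL, "." refuses to match the newline between "a" and "b".
no_flag = re.search(r'a.b', 'a\nb')                 # None

# With DOTALL, "." matches the newline too.
with_flag = re.search(r'a.b', 'a\nb', re.DOTALL)    # matches 'a\nb'
```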

[]

Square brackets are used to represent a set of characters. e.g. [aeiou] matches any single char that is one of “a e i o u”.

[] can also be used to represent a set of characters not listed inside [], by placing a caret “^” in the beginning. e.g. [^aeiou] will match any character that is not any of “a e i o u”.

A range of characters can be specified by a hyphen. Typically, [0-9] matches any digit, [a-z] matches any lowercase letter, and [A-Z] matches any capital letter. This syntax can be combined, for example [A-Za-z] to mean any capital or lowercase letter.

Regex of character class such as \w or \s can also be used inside square brackets. e.g. [\w,. ] will match any alphanumeric char or underscore, or one of comma, period, space.

Most characters that have special meanings in regex lose them when used inside []. e.g. [b+] does not mean one or more b; it just matches “b” or “+”.

To include characters such as bracket “]” or dash “-” or backslash “\”, put a backslash before the char. e.g. r'[\\a\-]' will match any of “\ a -”. For historical reasons, if “]” appears as the first char in the bracket, it is treated literally, and so is “-” when it is the first or last char. e.g. '[]b]' is legal syntax. It will match “]” or “b”.
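The character set rules can be checked with a few findall calls. The sample strings are made up.

```python
import re

vowels = re.findall(r'[aeiou]', 'regex')        # ['e', 'e']
non_vowels = re.findall(r'[^aeiou]', 'regex')   # ['r', 'g', 'x']

# Inside [], "+" is just a literal plus sign.
b_or_plus = re.findall(r'[b+]', 'a+b')          # ['+', 'b']

# Escaped backslash and dash inside a set. r'a-\z' is the 4 chars: a - \ z
escaped = re.findall(r'[\\a\-]', r'a-\z')       # ['a', '-', '\\']

# "]" as the first char in the set is taken literally.
bracket_first = re.findall(r'[]b]', 'a]b')      # [']', 'b']
```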

Repetition Qualifiers

*

The asterisk char represents zero or more repetitions of the preceding character or pattern. For example, ah* will match any of “a”, “ah”, “ahh”.

Note that a pattern group can be used in front of “*” or any repetition qualifiers such as “+” or “?”. e.g. a(xy)*b will match any “ab”, “axyb”, “axyxyb”, “axyxyxyb”.

+

The plus char represents one or more repetitions of the preceding regex. For example, ab+ will match the “ab” in “abc” or the “abb” in “abbc”, but will not match anything in “ac”.

?

A question mark represents 0 or 1 repetitions of the preceding regex. For example, “ab?” will match “a” or “ab”.
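A rough sketch comparing * + ? on whole strings. It assumes Python 3.4 or later for re.fullmatch; the helper name full and the test strings are made up for illustration.

```python
import re

def full(pattern, text):
    """Return True if pattern matches the whole of text (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# "*" : zero or more of the preceding char or group.
star = [s for s in ['a', 'ah', 'ahh', 'h'] if full(r'ah*', s)]
# ['a', 'ah', 'ahh']
group_star = [s for s in ['ab', 'axyb', 'axyxyb', 'axb'] if full(r'a(xy)*b', s)]
# ['ab', 'axyb', 'axyxyb']

# "+" : one or more.
plus = [s for s in ['a', 'ab', 'abb'] if full(r'ab+', s)]
# ['ab', 'abb']

# "?" : zero or one.
optional = [s for s in ['a', 'ab', 'abb'] if full(r'ab?', s)]
# ['a', 'ab']
```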

*?, +?, ??

These *?, +?, ?? are the non-greedy versions of * + ?.

That is, they match as few chars as possible.

The repetition qualifiers * + ? are all “greedy”. That is, they will match as much as possible.

For example, if the regex '<.*>' is matched against '<h1>Cat Physiology</h1>' , it will match the entire string, and not just '<h1>'.

By contrast, '<.*?>' will match only '<h1>' in '<h1>Cat Physiology</h1>'.

To capture the title in '<h1>title</h1>', one can either use '<.+?>(.+?)<.+?>' or '<[^>]+>([^<]+)<'.
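The greedy and non-greedy behavior described above can be sketched like this:

```python
import re

html = '<h1>Cat Physiology</h1>'

greedy = re.search(r'<.*>', html).group(0)    # '<h1>Cat Physiology</h1>'
lazy = re.search(r'<.*?>', html).group(0)     # '<h1>'

# Two ways to capture the title text between the tags.
title_lazy = re.search(r'<.+?>(.+?)<.+?>', html).group(1)   # 'Cat Physiology'
title_set = re.search(r'<[^>]+>([^<]+)<', html).group(1)    # 'Cat Physiology'
```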

{m}

Specifies that exactly m copies of the previous regex should be matched; fewer matches cause the entire regex not to match. e.g. a{6} will match exactly six “a” characters, but not five.

{m,n}

Represents m to n repetitions of the preceding regex, attempting to match as many repetitions as possible. For example, a{3,5} will match from 3 to 5 "a" characters. Omitting m specifies a lower bound of zero, and omitting n specifies an infinite upper bound. As an example, a{4,}b will match aaaab or a thousand “a” characters followed by “b”, but not “aaab”.

{m,n}?

Represents m to n repetitions of the preceding regex, attempting to match as few repetitions as possible. This is the non-greedy version of the previous qualifier. For example, on the 6-character string 'aaaaaa', 'a{3,5}' will match 5 "a" characters, while 'a{3,5}?' will only match 3 characters.
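A small sketch of the counted repetitions, using the same strings as the text:

```python
import re

six_a = 'aaaaaa'

greedy = re.search(r'a{3,5}', six_a).group(0)    # 'aaaaa'
lazy = re.search(r'a{3,5}?', six_a).group(0)     # 'aaa'

# {4,} means four or more.
too_few = re.search(r'a{4,}b', 'aaab')           # None
enough = re.search(r'a{4,}b', 'aaaab').group(0)  # 'aaaab'

# {6} needs exactly six; five "a" chars are not enough.
five_only = re.search(r'a{6}', 'aaaaa')          # None
```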

Backslash escapes

\

A backslash followed by a character will in general represent the character literally.

In regex, many chars have special meaning. For example: “()[]{}*+?.\” and more. Sometimes you want to search for these chars exactly. This can be done by adding a backslash in front of the char that has special meaning. e.g. to match a string containing the question mark, use the regex r'\?'.

If a char does not have special meaning, adding a backslash in front may or may not represent the character literally. e.g. many of the “character class” wildcards start with a backslash. e.g. \w \W \d \D. (see Wildcards section above.) Also, a backslash followed by a number represents the captured pattern. (see Captures section below.)
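To see what the backslash changes, compare an escaped and unescaped plus sign. re.escape is the standard library helper that adds the backslashes for you; the sample text is made up.

```python
import re

text = '1+1=2, 11'

# Unescaped: "1+" means one or more "1", so this finds '11', not '1+1'.
unescaped = re.findall(r'1+1', text)      # ['11']

# Escaped: "\+" is a literal plus sign.
escaped = re.findall(r'1\+1', text)       # ['1+1']

# re.escape builds the escaped pattern from a plain string.
auto = re.findall(re.escape('1+1'), text) # ['1+1']

question = re.search(r'\?', 'why not?').group(0)   # '?'
```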

todo: NOTE TO DOC WRITERS: the following are not clear in meaning.

Most of the standard escapes supported by Python string literals are also accepted by the regular expression parser:

\a      \b      \f      \n
\r      \t      \v      \x
\\

Octal escapes are included in a limited form: If the first digit is a 0, or if there are three octal digits, it is considered an octal escape. Otherwise, it is a group reference. As for string literals, octal escapes are always at most three digits in length.

Boundary Anchors

^

The caret matches the start of the string. e.g. '^aha gocha' matches 'aha gocha!', but does not match 'haha gocha!'. The ^ char is normally used as the first char in a pattern. If the MULTILINE flag is set, the caret also matches immediately after each newline. For example, in the following:

re.search(r'^aha', 'why not?\naha, i see.', re.MULTILINE)

it will return a MatchObject because 'aha' appeared in the beginning of second line. If the re.MULTILINE flag is not given, then None is returned since no 'aha' appears at the beginning of the string.

$

The dollar sign matches the end of the string or just before the newline at the end of the string. e.g. 'gocha$' matches 'aha gocha' and 'aha gocha\n' but does not match 'gocha!'.

If re.MULTILINE flag is given, then it also matches before any newline.

For example, re.compile('gocha$',re.MULTILINE) will now also match “gocha\nthis time”.

Regex such as ^ and $ are called anchors, because they force a regex pattern to match at the start or end of a string or of lines in a string.
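The anchor examples above, gathered into one sketch:

```python
import re

text = 'why not?\naha, i see.'

# "^" only matches at the very start unless MULTILINE is given.
start_only = re.search(r'^aha', text)                 # None
each_line = re.search(r'^aha', text, re.MULTILINE)    # match at index 9

# "$" matches at the end, or just before a final newline.
end_ok = re.search(r'gocha$', 'aha gocha\n')          # match
end_fail = re.search(r'gocha$', 'gocha!')             # None

# With MULTILINE, "$" also matches before any newline.
mid_plain = re.search(r'gocha$', 'gocha\nthis time')                 # None
mid_multi = re.search(r'gocha$', 'gocha\nthis time', re.MULTILINE)   # match
```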

\A

Matches only at the start of the string.

\Z

Matches only at the end of the string.
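The sketch below shows how \A and \Z differ from ^ and $. \A and \Z ignore re.MULTILINE, and \Z does not match before a trailing newline. The sample strings are made up.

```python
import re

# "$" tolerates a final newline; "\Z" does not.
dollar = re.search(r'gocha$', 'aha gocha\n')        # match
strict_end = re.search(r'gocha\Z', 'aha gocha\n')   # None

# With MULTILINE, "^" matches at each line start; "\A" only at the string start.
caret_lines = re.findall(r'^\w+', 'one\ntwo', re.MULTILINE)    # ['one', 'two']
string_start = re.findall(r'\A\w+', 'one\ntwo', re.MULTILINE)  # ['one']
```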

\b

Matches the beginning or end of a word. A word is a sequence of alphanumeric chars plus underscore “_”. More specifically, \b matches the boundary between the regex \w and \W. This means the set of alphanumeric chars is affected by re.UNICODE and re.LOCALE flags. e.g. if the re.UNICODE flag is set, then Chinese chars are considered alphanumeric.

Inside a character set [], \b means backspace character.

\B

Matches the empty string, but only when it is not at the beginning or end of a word. This is just the opposite of \b, so is also subject to the settings of re.LOCALE and re.UNICODE.
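A rough illustration of \b and \B, listing the positions where “cat” is found in a made-up string:

```python
import re

text = 'cat concat cats cat.'

# \bcat\b : "cat" as a whole word only.
whole_words = [m.start() for m in re.finditer(r'\bcat\b', text)]   # [0, 16]

# \Bcat : "cat" that does NOT start at a word boundary (inside "concat").
inside = [m.start() for m in re.finditer(r'\Bcat', text)]          # [7]
```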

Alternatives

|

The vertical bar is used to express alternatives in regex. For example, r'regex1|regex2|regex3' will match any of the regexes, trying them from left to right. Once one alternative matches, the ones to its right are not tried at that position. For example, if regex2 matches, regex3 will not be tried, even if it would also match and would match a longer substring than regex2.

Alternatives can be used inside capture groups as well (see Captures below).

To match the vertical bar | exactly, use \|.
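The left-to-right rule means the order of alternatives matters. A small sketch with made-up words:

```python
import re

# "cat" is tried first and succeeds, so the longer "category" is never tried.
short_first = re.search(r'cat|category', 'category').group(0)   # 'cat'

# Put the longer alternative first to prefer it.
long_first = re.search(r'category|cat', 'category').group(0)    # 'category'

# An escaped bar matches a literal "|".
literal_bar = re.findall(r'a\|b', 'a|b ab')                     # ['a|b']
```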

Captures

()

If a regex is enclosed in parentheses, then any string matching the enclosed regex will be “captured”, and can be referred to later using the form \n. This is often used in replacement.

In the following example, the link URL is captured and referenced as \1, and the link text is captured and referenced as \2.

import re

print re.sub(r'<a href="([^"]+?)">([^<]+)</a>', r'link: \1, text \2', r'<a href="cat.jpg">my cat</a>')
# prints
# link: cat.jpg, text my cat

To match parentheses literally, use \( and \).

\n

\n represents the nth captured match. n is a positive integer from 1 to 99.

Captures are numbered starting from 1. The pattern \n is mostly used in the replacement string, but can also be used as a regex pattern. e.g. r'\b(\w+) \1\b' matches "the the" or "55 55", but not "the end".
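A sketch of a backreference used inside a pattern, here to find doubled words. The word-bounded pattern r'\b(\w+) \1\b' is chosen for this illustration, and the sample sentences are made up.

```python
import re

doubled = re.compile(r'\b(\w+) \1\b')

found_the = doubled.search('the the').group(1)    # 'the'
found_55 = doubled.search('55 55').group(1)       # '55'
not_found = doubled.search('the end')             # None

# The same backreference in the replacement string collapses the repeats.
fixed = doubled.sub(r'\1', 'it is is the the end')   # 'it is the end'
```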

(?P<name>…)

This form is similar to parentheses, but the captured substring is given the name name, so that it can be referred to by name as well as by \n.

Naming captured groups has advantages. In particular, when a complex regex is edited with captures added or deleted, references using the named form will remain stable.

To refer to a named captured group in a replacement string, use the form r'\g<name>'. To refer to a named captured group in a regex, use the form (?P=name).

In the following example, a file name and link string are extracted from an HTML document's link anchor:

import re

# named capture example

patternObj = re.compile(r'([^<]+)<a href="(?P<fileName>[^"]+)">(?P<linkStr>[^<]+)</a>')

matchObj = patternObj.search('look: <a href="some.jpg">my cat</a>.')

print matchObj.expand(r'file name is: \g<fileName>, link string is: \g<linkStr>')
# prints: file name is: some.jpg, link string is: my cat

(?P=name)

Matches whatever text was matched by the earlier group named name.
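As a short sketch, (?P=name) can find a repeated word, and \g<name> can reorder named captures in a replacement. The group names word, first, second and the sample strings are made up.

```python
import re

# (?P=word) must match the same text that group "word" matched.
m = re.search(r'(?P<word>\w+) (?P=word)', 'say bye bye now')
repeated = m.group('word')     # 'bye'
whole = m.group(0)             # 'bye bye'

# \g<name> refers to a named group in the replacement string.
swapped = re.sub(r'(?P<first>\w+) (?P<second>\w+)',
                 r'\g<second> \g<first>', 'hello world')   # 'world hello'
```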

Other Advanced Constructs

(?…)

This is an extension notation. The first character after the ? determines the meaning. Extensions do not create a new group, except (?P<name>…).

(?iLmsux)

(One or more letters from the set "i", "L", "m", "s", "u", "x".) The group matches the empty string; the letters set the corresponding flags (re.I, re.L, re.M, re.S, re.U, re.X) for the entire regular expression. This is useful if you wish to include the flags as part of the regular expression, instead of passing a flag argument to the compile() function.

Note that the (?x) flag changes how the expression is parsed. It should be used first in the expression string, or after one or more whitespace characters. If there are non-whitespace characters before the flag, the results are undefined.

(?:…)

A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.

(?#…)

A comment; the contents of the parentheses are simply ignored.
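A brief sketch of the three constructs above: an inline flag, a non-capturing group, and a comment. The sample strings are made up.

```python
import re

# (?i) at the start sets re.I for the whole pattern.
inline_flag = re.search(r'(?i)hello', 'Say HeLLo').group(0)   # 'HeLLo'

# (?:...) groups for repetition without capturing; only (c) becomes group 1.
m = re.search(r'(?:ab)+(c)', 'xababc')
whole = m.group(0)       # 'ababc'
captured = m.groups()    # ('c',)

# (?#...) is ignored by the regex engine.
commented = re.search(r'ab(?#a comment)c', 'abc').group(0)    # 'abc'
```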

(?=…)

Matches if … matches next, but doesn't consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it's followed by 'Asimov'.

(?!…)

Matches if … doesn't match next. This is a negative lookahead assertion. For example, Isaac (?!Asimov) will match 'Isaac ' only if it's not followed by 'Asimov'.
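The two lookahead examples can be run like this. Note that the matched text is only 'Isaac ' in both cases, because a lookahead does not consume anything.

```python
import re

# Positive lookahead: succeeds only when "Asimov" follows.
pos_hit = re.search(r'Isaac (?=Asimov)', 'Isaac Asimov').group(0)   # 'Isaac '
pos_miss = re.search(r'Isaac (?=Asimov)', 'Isaac Newton')           # None

# Negative lookahead: succeeds only when "Asimov" does NOT follow.
neg_hit = re.search(r'Isaac (?!Asimov)', 'Isaac Newton').group(0)   # 'Isaac '
neg_miss = re.search(r'Isaac (?!Asimov)', 'Isaac Asimov')           # None
```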

(?<=…)

Matches if the current position in the string is preceded by a match for … that ends at the current position. This is called a positive lookbehind assertion. (?<=abc)def will find a match in "abcdef", since the lookbehind will back up 3 characters and check if the contained pattern matches. The contained pattern must only match strings of some fixed length, meaning that abc or a|b are allowed, but a* and a{3,4} are not. Note that patterns which start with positive lookbehind assertions will never match at the beginning of the string being searched; you will most likely want to use the search() function rather than the match() function:

# -*- coding: utf-8 -*-
# python 2
import re

m = re.search('(?<=abc)def', 'abcdef')
print m.group(0) # prints def

This example looks for a word following a hyphen:

# -*- coding: utf-8 -*-
# python 2
import re

m = re.search(r'(?<=-)\w+', 'spam-egg')
print m.group(0) # prints egg

(?<!…)

Matches if the current position in the string is not preceded by a match for …. This is called a negative lookbehind assertion. Similar to positive lookbehind assertions, the contained pattern must only match strings of some fixed length. Patterns which start with negative lookbehind assertions may match at the beginning of the string being searched.
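A small sketch of negative lookbehind, reusing the spam-egg idea: find the words that do not come right after a hyphen. The sample string is made up.

```python
import re

# (?<!-) rejects positions preceded by "-"; \b keeps matches at word starts.
words = re.findall(r'(?<!-)\b\w+', 'spam-egg ham')   # ['spam', 'ham']

# At the very start of the string nothing precedes, so the assertion succeeds.
at_start = re.match(r'(?<!-)\w+', 'spam-egg').group(0)   # 'spam'
```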

(?(id/name)yes-pattern|no-pattern)

Will try to match with yes-pattern if the group with given id or name exists, and with no-pattern if it doesn't. |no-pattern is optional and can be omitted. For example, (<)?(\w+@\w+(?:\.\w+)+)(?(1)>) is a poor email matching pattern, which will match with '<user@host.com>' as well as 'user@host.com', but not with '<user@host.com'.
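The conditional pattern above can be tried with re.match, which anchors at the start of the string. With re.search, the last case would instead find 'user@host.com' starting at the second character.

```python
import re

email = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)')

# Group 1 matched "<", so the conditional requires a closing ">".
bracketed = email.match('<user@host.com>').group(0)   # '<user@host.com>'

# Group 1 did not match, so nothing more is required.
plain = email.match('user@host.com').group(0)         # 'user@host.com'

# Opening "<" without a closing ">" fails.
unclosed = email.match('<user@host.com')              # None
```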
