Python: Regex Syntax
A regex pattern is a string used as a pattern to match other strings. For example, 'a+' matches repeated occurrences of the letter “a”, such as the “aaa” in "aaahh!".
For another example, r'\w+@[A-Za-z]+\.com' is a pattern that matches the email address xyz@somewhere.com.
In the example 'a+', the plus char “+” means “one or more” of the previous char. In regex, many chars have special meanings. Other chars that have special meaning include: [ ] ( ) \ . ^ $ * ? and more.
This page documents the meaning of each special character and special construct.
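As a quick illustration of the two introductory patterns, here is a minimal sketch. The sample strings and variable names are made up for this example, and the email pattern is only the simplified one discussed above, not a general email matcher.

```python
import re

# 'a+' matches one or more consecutive "a" characters.
run_of_a = re.search(r'a+', 'aaahh!').group(0)   # 'aaa'

# Simplified email pattern: word chars, "@", letters, then ".com".
email_re = re.compile(r'\w+@[A-Za-z]+\.com')
email = email_re.search('contact: xyz@somewhere.com').group(0)

print(run_of_a)
print(email)
```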
Wildcards
\d
Digits
Represents any digit. That is, any of 0 1 2 3 4 5 6 7 8 9.
\D
Not Digits
Represents any non-digit character. That is, any char not \d
.
\w
Alphanumeric plus underscore
Represents any alphanumeric character or the underscore. That is, any of a to z, A to Z, 0 to 9, or '_'. (\w is equivalent to the regex pattern r'[a-zA-Z0-9_]'.)
With the re.LOCALE flag set, it matches the set [0-9_] plus whatever characters are defined as alphanumeric for the current locale. 〔see Python: Regex Flags〕
If re.UNICODE is set, it matches the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.
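Here is a small sketch of how the flags change what \w matches. It assumes Python 3, where str patterns use Unicode matching by default and re.ASCII is the flag that restricts \w to [a-zA-Z0-9_]. In Python 2 the default is the reverse, and re.UNICODE turns Unicode matching on. The sample text is made up.

```python
import re

# "naive cafe 42" with accented letters, written with escapes for clarity.
text = 'na\u00efve caf\u00e9 42'

# Python 3 default for str patterns: accented letters count as alphanumeric.
unicode_words = re.findall(r'\w+', text)

# With re.ASCII, \w is limited to [a-zA-Z0-9_], so accented letters split words.
ascii_words = re.findall(r'\w+', text, re.ASCII)

print(unicode_words)
print(ascii_words)
```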
\W
Represents any char that does not match the regex \w.
Note: the re.LOCALE and re.UNICODE flags apply the same way as with \w.
\s
Represents any whitespace character. They are: space, tab, linefeed, carriage return, form feed, vertical tab. Their ASCII values are: 32, 9, 10, 13, 12, 11.
These characters are represented in Python by ' \t\n\r\f\v'.
The regex \s is equivalent to r'[ \t\n\r\f\v]'.
\S
Represents any non-whitespace character; this is equivalent to the set [^ \t\n\r\f\v].
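The digit and whitespace classes can be tried out with re.findall and re.sub. This is a minimal sketch on a made-up sample string.

```python
import re

text = 'Room 42,\tfloor 7\n'

digits = re.findall(r'\d', text)          # ['4', '2', '7']
digits_only = re.sub(r'\D', '', text)     # '427' (every non-digit removed)
spaces = re.findall(r'\s', text)          # [' ', '\t', ' ', '\n']
non_space_runs = re.findall(r'\S+', text) # ['Room', '42,', 'floor', '7']
```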
.
A period matches any character except a newline. e.g. .h
matches both “oh” and “ah”. If the re.DOTALL
flag has been specified, this matches any character including a newline. See regex functions about setting flags.
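A short sketch of the period with and without re.DOTALL, using a made-up two-line string:

```python
import re

text = 'oh\nah'

# "." matches any char except newline, so both "oh" and "ah" are found.
pairs = re.findall(r'.h', text)                 # ['oh', 'ah']

# Without re.DOTALL the "." cannot cross the newline between the two lines.
no_flag = re.search(r'oh.ah', text)             # None

# With re.DOTALL the "." also matches the newline.
with_dotall = re.search(r'oh.ah', text, re.DOTALL)
```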
[]
Square brackets represent a set of characters. e.g. [aeiou] matches any single character that is one of “a e i o u”.
[] can also represent the set of characters not listed inside it, by placing a caret “^” at the beginning. e.g. [^aeiou] will match any character that is not one of “a e i o u”.
A range of characters can be specified by a hyphen. Typically, [0-9] matches any digit, [a-z] matches any lowercase letter, and [A-Z] matches any capital letter. This syntax can be combined, for example [A-Za-z] to mean any capital or lowercase letter.
Regex character classes such as \w or \s can also be used inside square brackets. e.g. [\w,. ] will match any alphanumeric char or underscore, or one of comma, period, space.
Most characters that have special meanings in regex do not have special meanings when used inside []. e.g. [b+] does not mean one or more b; it just matches “b” or “+”.
To include characters such as bracket “]” or dash “-” or backslash “\”, put a backslash before the char. e.g. r'[\\a\-]' will match any of “\ a -”. For historical reasons, if “]” appears as the first char in the bracket, it is treated literally, and so is “-” when it appears as the first or last char. e.g. '[]b]' is legal syntax. It will match “]” or “b”.
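The character set rules can be checked with a few one-liners. This is a sketch with made-up sample strings.

```python
import re

vowels = re.findall(r'[aeiou]', 'regex')        # ['e', 'e']
non_vowels = re.findall(r'[^aeiou]', 'regex')   # ['r', 'g', 'x']
letters = re.findall(r'[A-Za-z]+', 'Ab3cd')     # ['Ab', 'cd']

# Inside [], "+" is just a literal plus sign.
b_or_plus = re.findall(r'[b+]', 'a+b')          # ['+', 'b']

# Backslash-escaped "\" and "-" inside a set; the target is the 4 chars a \ - b.
escaped = re.findall(r'[\\a\-]', r'a\-b')       # ['a', '\\', '-']

# "]" placed first in the set is taken literally.
bracket_first = re.findall(r'[]b]', 'a]b')      # [']', 'b']
```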
Repetition Qualifiers
*
The asterisk char represents zero or more repetitions of the preceding character or pattern.
For example, ah*
will match any of “a”, “ah”, “ahh”.
Note that a pattern group can be used in front of “*” or any repetition
qualifiers such as “+” or “?”. e.g. a(xy)*b
will match any “ab”, “axyb”, “axyxyb”, “axyxyxyb”.
+
The plus char represents one or more repetitions of the preceding regex.
For example, ab+
will match “abc” or “abbc”, but will not match “ac”.
?
A question mark represents 0 or 1 repetitions of the preceding regex. For example, “ab?” will match “a” or “ab”.
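The three basic repetition qualifiers can be compared side by side. In this sketch re.fullmatch (Python 3.4 or later) is used where the whole string should match, and re.search where a match anywhere in the string is enough.

```python
import re

# "*": zero or more of the preceding char or group.
star = [bool(re.fullmatch(r'ah*', s)) for s in ('a', 'ah', 'ahh')]
group_star = [bool(re.fullmatch(r'a(xy)*b', s)) for s in ('ab', 'axyb', 'axyxyb')]

# "+": one or more, so "ac" has no match because at least one "b" is required.
plus = [bool(re.search(r'ab+', s)) for s in ('abc', 'abbc', 'ac')]

# "?": zero or one, so at most one "b" is taken after each "a".
optional = re.findall(r'ab?', 'a ab abb')

print(star, group_star, plus, optional)
```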
*?, +?, ??
These are the non-greedy versions of *, +, ?. That is, they match as few characters as possible.
The repetition qualifiers *, +, ? are all “greedy”. That is, they will match as much as possible.
For example, if the regex '<.*>' is matched against '<h1>Cat Physiology</h1>', it will match the entire string, and not just '<h1>'.
In contrast, '<.*?>' will match only '<h1>' in '<h1>Cat Physiology</h1>'.
To capture the title in '<h1>title</h1>'
, one can either use '<.+?>(.+?)<.+?>'
or '<[^>]+>([^<]+)<'
.
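The greedy and non-greedy behavior described above can be reproduced directly. This sketch uses the same sample string as the text.

```python
import re

html = '<h1>Cat Physiology</h1>'

greedy = re.search(r'<.*>', html).group(0)    # the entire string
lazy = re.search(r'<.*?>', html).group(0)     # '<h1>'

# Two ways to capture the text between the tags.
title_a = re.search(r'<.+?>(.+?)<.+?>', html).group(1)
title_b = re.search(r'<[^>]+>([^<]+)<', html).group(1)

print(greedy, lazy, title_a, title_b)
```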
{m}
Specifies that exactly m copies of the previous regex should be matched; fewer matches cause the entire regex not to match. e.g. a{6}
will match exactly six “a” characters, but not five.
{m,n}
Represents m to n repetitions of the preceding regex, attempting to match as many repetitions as possible. For example, a{3,5} will match from 3 to 5 “a” characters. Omitting m specifies a lower bound of zero, and omitting n specifies an infinite upper bound. As an example, a{4,}b will match “aaaab” or a thousand “a” characters followed by “b”, but not “aaab”. Do not put a space after the comma: a{3, 5} is not treated as a repetition qualifier.
{m,n}?
Represents m to n repetitions of the preceding regex, attempting to match as few repetitions as possible. This is the non-greedy version of the previous qualifier. For example, on the 6-character string 'aaaaaa', 'a{3,5}' will match 5 “a” characters, while 'a{3,5}?' will only match 3 characters.
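A sketch of the counted repetition forms, using the same sample strings as the text:

```python
import re

s = 'aaaaaa'

exact = re.match(r'a{6}', s).group(0)       # 'aaaaaa'
too_few = re.match(r'a{6}', 'aaaaa')        # None: only five "a"

greedy = re.match(r'a{3,5}', s).group(0)    # 'aaaaa' (as many as allowed)
lazy = re.match(r'a{3,5}?', s).group(0)     # 'aaa'   (as few as allowed)

# Open upper bound: four or more "a", then "b".
open_ended = [bool(re.fullmatch(r'a{4,}b', t))
              for t in ('aaaab', 'a' * 1000 + 'b', 'aaab')]
```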
Backslash escapes
\
A backslash followed by a character will in general represent the character literally.
In regex, many chars have special meaning. For example: “()[]{}*+?.\” and more. Sometimes you want to search for these chars exactly. This can be done by adding a backslash in front of the char that has special meaning. e.g. to match a string containing the question mark, use the regex r'\?'.
If a char does not have special meaning, adding a backslash in front may or may not represent the character literally. e.g. many of the “character class” wildcards start with a backslash. e.g. \w \W \d \D. (see Wildcards section above.) Also, a backslash followed by a number represents the captured pattern. (see Captures section below.)
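A small sketch of the difference between an escaped and an unescaped special char, on made-up strings:

```python
import re

# Escaped question mark matches a literal "?".
literal_q = re.findall(r'\?', 'why? how?')       # ['?', '?']

# Unescaped "." matches any char; escaped "\." matches only a period.
unescaped_dot = re.findall(r'1.5', '1.5 1x5')    # ['1.5', '1x5']
escaped_dot = re.findall(r'1\.5', '1.5 1x5')     # ['1.5']
```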
Most of the standard escapes supported by Python string literals are also accepted by the regular expression parser:
\a \b \f \n \r \t \v \x \\
Octal escapes are included in a limited form: If the first digit is a 0, or if there are three octal digits, it is considered an octal escape. Otherwise, it is a group reference. As for string literals, octal escapes are always at most three digits in length.
Boundary Anchors
^
The caret matches the start of the string. e.g. '^aha gocha' matches 'aha gocha!', but does not match 'haha gocha!'.
The ^ char is normally used as the first char in a pattern.
If the re.MULTILINE flag is set, the caret also matches immediately after each newline.
For example, in the following:
re.search(r'^aha', 'why not?\naha, i see.', re.MULTILINE)
it will return a MatchObject because 'aha' appeared in the beginning of second line. If the re.MULTILINE flag is not given, then None is returned since no 'aha' appears at the beginning of the string.
$
The dollar sign matches the end of the string or just before the newline at the end of the string. e.g. 'gocha$' matches 'aha gocha' and 'aha gocha\n' but does not match 'gocha!'.
If re.MULTILINE
flag is given, then it
also matches before any newline.
For example, re.compile('gocha$',re.MULTILINE)
will now also match “gocha\nthis time”.
Regex such as ^ and $ are called anchors, because they force a regex pattern to match at the start or end of a string or of a line.
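The anchor examples above can be run as follows. This sketch reuses the sample strings from the text.

```python
import re

text = 'why not?\naha, i see.'

start_default = re.search(r'^aha', text)                # None
start_multi = re.search(r'^aha', text, re.MULTILINE)    # match on the second line

# "$" matches at the very end, or just before a final newline.
end_plain = [bool(re.search(r'gocha$', s))
             for s in ('aha gocha', 'aha gocha\n', 'gocha!')]

# With re.MULTILINE, "$" also matches before an inner newline.
end_default = re.search(r'gocha$', 'gocha\nthis time')              # None
end_multi = re.search(r'gocha$', 'gocha\nthis time', re.MULTILINE)  # match
```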
\A
Matches only at the start of the string.
\Z
Matches only at the end of the string.
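One way to see the difference from ^ and $ is to turn on re.MULTILINE. In this sketch on a made-up two-line string, ^ and $ match on every line, while \A and \Z still match only at the start and end of the whole string.

```python
import re

text = 'one\ntwo'

caret_hits = re.findall(r'^\w+', text, re.MULTILINE)    # ['one', 'two']
a_hits = re.findall(r'\A\w+', text, re.MULTILINE)       # ['one']

dollar_hits = re.findall(r'\w+$', text, re.MULTILINE)   # ['one', 'two']
z_hits = re.findall(r'\w+\Z', text, re.MULTILINE)       # ['two']
```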
\b
Matches the beginning or end of a word. A word is a sequence of alphanumeric chars plus underscore “_”. More specifically, \b matches the boundary between the regex \w and \W (or between \w and the start or end of the string). This means the set of alphanumeric chars is affected by the re.UNICODE and re.LOCALE flags. e.g. if the re.UNICODE flag is set, then Chinese chars are considered alphanumeric.
Inside a character set [], \b means the backspace character.
\B
Matches the empty string, but only when it is not
at the beginning or end of a word. This is just the opposite of \b
, so is also subject to the settings of re.LOCALE
and re.UNICODE
.
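A sketch of \b and \B on a made-up sentence. Note that the underscore counts as a word char, so "cat_nap" has no word boundary after "cat".

```python
import re

# "cat" as a whole word only.
whole_word = re.findall(r'\bcat\b', 'cat concat cat_nap cat.')   # ['cat', 'cat']

# "cat" that does NOT start at a word boundary (the one inside "concat").
inner = re.findall(r'\Bcat', 'cat concat')                       # ['cat']

# Inside a character set, \b is the backspace char (ASCII 8).
backspace = re.search(r'[\b]', 'a\bc').group(0)
```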
Alternatives
|
The vertical bar is used to express alternatives in regex. For example, r'regex1|regex2|regex3' will match any of the regexes, trying them from left to right. Once an alternative matches at a given position, the ones to its right are not tried there: if regex2 matches, regex3 will not be tried, even if regex3 would also match and would match a longer substring. Alternatives can be used inside capture groups as well (see Captures below).
To match the vertical bar | exactly, use \|.
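The left-to-right rule matters when one alternative is a prefix of another. Here is a minimal sketch with made-up words:

```python
import re

# The leftmost alternative that matches wins, even if a later one is longer.
first_wins = re.search(r'cat|category', 'category').group(0)     # 'cat'
longer_first = re.search(r'category|cat', 'category').group(0)   # 'category'

# Alternatives inside a group.
in_group = [bool(re.search(r'gr(a|e)y', s)) for s in ('gray', 'grey', 'groy')]

# Escaped bar matches a literal "|".
literal_bar = re.findall(r'\|', 'a|b|c')                         # ['|', '|']
```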
Captures
(…)
If a regex is enclosed in parentheses, then any string matching the enclosed regex will be “captured”, and can be referred to later using the form \n. This is often used in replacement.
In the following example, the link target is captured and referenced as \1, and the link text is captured and referenced as \2.
import re
print(re.sub(r'<a href="([^"]+?)">([^<]+)</a>', r'link: \1, text \2', r'<a href="cat.jpg">my cat</a>'))
# prints
# link: cat.jpg, text my cat
To match parentheses literally, use \( and \).
\n
\n represents the nth captured match. n is a positive integer from 1 to 99. Captures are numbered starting from 1. The pattern \n is mostly used in the replacement string, but can also be used as a regex pattern. e.g. r'(.+) \1' matches "the the" or "55 55", but does not match "the end" as a whole.
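A sketch of a backreference used inside a pattern and inside a replacement string. re.fullmatch is used so that the whole string must match. With re.search, r'(.+) \1' would still find the substring "e e" inside "the end".

```python
import re

doubled = re.compile(r'(.+) \1')
results = [bool(doubled.fullmatch(s)) for s in ('the the', '55 55', 'the end')]

# With search, a repeated single char around a space is enough.
partial = doubled.search('the end').group(0)            # 'e e'

# Backreference in the replacement: collapse a doubled word.
collapsed = re.sub(r'(\w+) \1', r'\1', 'the the cat')   # 'the cat'
```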
(?P<name>…)
This form is similar to parentheses, but the captured substring is given the name name, so that it can be referred to by name as well as by \n.
Naming captured groups has advantages. In particular, when a complex regex is edited with captures added or deleted, references using the named form will remain stable.
To refer to a named captured group in a replacement string, use the form r'\g<name>'. To refer to a named captured group in a regex, use the form (?P=name).
In the following example, a file name and link string are extracted from an HTML document's link anchor:
import re
# named capture example
patternObj = re.compile(r'([^<]+)<a href="(?P<fileName>[^"]+)">(?P<linkStr>[^<]+)</a>')
matchObj = patternObj.search('look: <a href="some.jpg">my cat</a>.')
print(matchObj.expand(r'file name is: \g<fileName>, link string is: \g<linkStr>'))
# prints: file name is: some.jpg, link string is: my cat
(?P=name)
Matches whatever text was matched by the earlier group named name.
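Here is a sketch of (?P=name) inside a pattern and \g<name> inside a replacement. The group names quote, first and second are made up for this example.

```python
import re

# Match a quoted string whose closing quote is the same char as the opening one.
pattern = re.compile(r'(?P<quote>["\']).*?(?P=quote)')
m = pattern.search('say "hello" now')
quoted = m.group(0)        # '"hello"'
mark = m.group('quote')    # '"'

# Named groups referenced in a replacement string.
swapped = re.sub(r'(?P<first>\w+) (?P<second>\w+)',
                 r'\g<second> \g<first>', 'hello world')
```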
Other Advanced Constructs
(?…)
This is an extension notation. The first character after the ? determines the meaning of the construct. Extensions do not create a new group, except (?P<name>…).
(?iLmsux)
(One or more letters from the set "i", "L", "m", "s", "u", "x".) The group matches the empty string; the letters set the corresponding flags (re.I, re.L, re.M, re.S, re.U, re.X) for the entire regular expression. This is useful if you wish to include the flags as part of the regular expression, instead of passing a flag argument to the compile() function.
Note that the (?x) flag changes how the expression is parsed. It should be used first in the expression string, or after one or more whitespace characters. If there are non-whitespace characters before the flag, the results are undefined.
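A sketch comparing an inline flag with the equivalent flag argument, followed by a verbose pattern. The inline flag is placed at the very start of the pattern, which is the safest position. The sample strings are made up.

```python
import re

# Inline (?i) and the re.I argument give the same result.
inline = re.search(r'(?i)hello', 'Say HELLO').group(0)       # 'HELLO'
flagged = re.search(r'hello', 'Say HELLO', re.I).group(0)    # 'HELLO'

# (?x) lets the pattern contain whitespace and "#" comments.
verbose = re.search(r'''(?x)
    \d{3}   # first three digits
    -       # literal hyphen
    \d{4}   # last four digits
''', 'call 555-1234').group(0)
```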
(?:…)
A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.
(?#…)
A comment; the contents of the parentheses are simply ignored.
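A sketch showing that (?:…) groups without capturing, and that (?#…) is ignored by the matcher. The sample strings are made up.

```python
import re

# Regular parentheses create group 1 and group 2.
capturing = re.search(r'(ab)+(c)', 'ababc').groups()          # ('ab', 'c')

# (?:...) still groups for "+", but only (c) is captured.
non_capturing = re.search(r'(?:ab)+(c)', 'ababc').groups()    # ('c',)

# The comment has no effect on what is matched.
commented = re.search(r'\d+(?# one or more digits)px', 'width: 12px').group(0)
```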
(?=…)
Matches if …
matches next, but doesn't
consume any of the string. This is called a lookahead assertion. For
example, Isaac (?=Asimov)
will match 'Isaac '
only if it's
followed by 'Asimov'
.
(?!…)
Matches if …
doesn't match next. This
is a negative lookahead assertion. For example,
Isaac (?!Asimov)
will match 'Isaac '
only if it's not
followed by 'Asimov'
.
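The two lookahead examples can be run as follows. Note that the matched text is only 'Isaac ' in each case, because the lookahead itself consumes nothing. 'Isaac Newton' is a made-up counterexample.

```python
import re

pos_hit = re.search(r'Isaac (?=Asimov)', 'Isaac Asimov').group(0)   # 'Isaac '
pos_miss = re.search(r'Isaac (?=Asimov)', 'Isaac Newton')           # None

neg_hit = re.search(r'Isaac (?!Asimov)', 'Isaac Newton').group(0)   # 'Isaac '
neg_miss = re.search(r'Isaac (?!Asimov)', 'Isaac Asimov')           # None
```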
(?<=…)
Matches if the current position in the string is preceded by a match for … that ends at the current position. This is called a positive lookbehind assertion. (?<=abc)def will find a match in "abcdef", since the lookbehind will back up 3 characters and check if the contained pattern matches. The contained pattern must only match strings of some fixed length, meaning that abc or a|b are allowed, but a* and a{3,4} are not. Note that patterns which start with positive lookbehind assertions will never match at the beginning of the string being searched; you will most likely want to use the search() function rather than the match() function:
# -*- coding: utf-8 -*-
import re
m = re.search(r'(?<=abc)def', 'abcdef')
print(m.group(0))
# prints def
This example looks for a word following a hyphen:
# -*- coding: utf-8 -*-
import re
m = re.search(r'(?<=-)\w+', 'spam-egg')
print(m.group(0))
# prints egg
(?<!…)
Matches if the current position in the string
is not preceded by a match for …
. This is called a
negative lookbehind assertion. Similar to positive lookbehind
assertions, the contained pattern must only match strings of some
fixed length. Patterns which start with negative lookbehind
assertions may match at the beginning of the string being searched.
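A sketch of negative lookbehind on a made-up sentence: find numbers that are not prices. The \b is there so that the match cannot start in the middle of "40" after the lookbehind rejects the "4".

```python
import re

# Numbers not immediately preceded by "$".
non_prices = re.findall(r'(?<!\$)\b\d+', 'cost $40 for 3 items, 15 left')

# A negative lookbehind can succeed at the start of the string,
# because nothing precedes position 0.
at_start = re.search(r'(?<!-)\w+', 'spam-egg').group(0)

print(non_prices, at_start)
```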
(?(id/name)yes-pattern|no-pattern)
Will try to match
with yes-pattern
if the group with given id or name
exists, and with no-pattern
if it doesn't. |no-pattern
is optional and can be omitted. For example,
(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)
is a poor email matching
pattern, which will match with '<user@host.com>'
as well as
'user@host.com'
, but not with '<user@host.com'
.
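The conditional pattern above can be checked like this. The sketch uses re.fullmatch so that the whole string must match. With re.search, the unbalanced '<user@host.com' would still yield a match starting after the "<".

```python
import re

email = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)')

candidates = ('<user@host.com>', 'user@host.com', '<user@host.com')
results = [bool(email.fullmatch(s)) for s in candidates]

# Group 2 holds the address without the angle brackets.
address = email.fullmatch('<user@host.com>').group(2)

print(results, address)
```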