A regex pattern is a string used to match other strings. For
example, 'a+' matches repeated occurrences of the letter “a”, such as the “aaa” in "aaahh!".
For another example, r'\w+@[A-Za-z]+\.com' is a pattern that matches email addresses of the form xyz@somewhere.com.
In the example 'a+', the plus char “+” means “one or more”
of the previous char. In regex, many chars have special
meanings. Other chars that have
special meaning include: [ ] ( ) \ . ^ $ * ? and more.
This page documents the meaning of each special character and special construct.
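The two introductory patterns can be tried directly. This is a minimal sketch, assuming Python 3; the sample strings are made up for illustration.

```python
import re

# 'a+' means one or more "a"; search() returns the first such run.
run = re.search('a+', 'aaahh!')
print(run.group())  # aaa

# A simple pattern for addresses of the form name@domain.com.
email = re.search(r'\w+@[A-Za-z]+\.com', 'write to xyz@somewhere.com today')
print(email.group())  # xyz@somewhere.com
```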
\d Represents any digit. That is, any of 0 1 2 3 4 5 6 7 8 9.
\D Represents any non-digit character. That is, any char not \d.
\w Represents any alphanumeric character or the
underscore. That is, any of a to z, A to Z, 0 to 9, or '_'.
(\w is equivalent to the regex pattern r'[a-zA-Z0-9_]'.)
With the LOCALE flag set, it will match the set
[0-9_] plus whatever characters are defined as alphanumeric for
the current locale.
todo: NOTE TO DOC WRITERS: need an explicit example here about locale, illustrating exactly where or how to set the locale and how it affects the code. Also, possibly include a link to the doc about locale.
If the UNICODE flag is set, this will match the
characters [0-9_] plus whatever is classified as alphanumeric
in the Unicode character properties database.
todo: NOTE TO DOC WRITERS: need an explicit example here.
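One way to see the effect of Unicode matching on \w is sketched below. It assumes Python 3, where str patterns already behave as if UNICODE were set and the re.ASCII flag restricts \w to [a-zA-Z0-9_]; on Python 2 the re.UNICODE flag would have to be passed explicitly with unicode strings.

```python
import re

text = 'na\u00efve caf\u00e9'  # "naïve café", written with escapes

# Default for Python 3 str patterns: \w follows the Unicode database.
unicode_words = re.findall(r'\w+', text)

# re.ASCII limits \w to [a-zA-Z0-9_], so accented letters split the words.
ascii_words = re.findall(r'\w+', text, re.ASCII)

print(unicode_words)  # ['naïve', 'café']
print(ascii_words)    # ['na', 've', 'caf']
```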
\W Represents any char that is not matched by \w.
Note: The LOCALE and UNICODE
flags apply the same way as with \w.
\s Represents any whitespace character. They are:
space, tab, linefeed, carriage return, form feed, vertical tab.
Their ASCII values are: 32, 9, 10, 13, 12, 11.
These characters are represented in Python by ' \t\n\r\f\v'.
The regex \s is equivalent to r'[ \t\n\r\f\v]'.
\S Represents any non-whitespace character; this is
equivalent to the set [^ \t\n\r\f\v].
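The character classes above can be compared side by side on one string. A minimal sketch, assuming Python 3 and a made-up sample string:

```python
import re

sample = 'Room 42,\tfloor_3\n'

digits = re.findall(r'\d', sample)       # every single digit
words = re.findall(r'\w+', sample)       # runs of letters, digits, underscore
spaces = re.findall(r'\s', sample)       # every whitespace character
non_space = re.findall(r'\S+', sample)   # runs of non-whitespace
digits_only = re.sub(r'\D', '', sample)  # delete every non-digit

print(digits)       # ['4', '2', '3']
print(words)        # ['Room', '42', 'floor_3']
print(spaces)       # [' ', '\t', '\n']
print(non_space)    # ['Room', '42,', 'floor_3']
print(digits_only)  # 423
```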
. A period matches any
character except a newline. For example, .h matches both “oh” and “ah”. If the DOTALL flag has been
specified, this matches any character including a newline. See regex functions about setting flags.
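A short sketch of the period with and without DOTALL, assuming Python 3:

```python
import re

pairs = re.findall('.h', 'oh ah')
print(pairs)  # ['oh', 'ah']

# By default the period does not match a newline.
no_match = re.search('a.b', 'a\nb')
print(no_match)  # None

# With DOTALL the newline is matched as well.
with_dotall = re.search('a.b', 'a\nb', re.DOTALL)
print(with_dotall is not None)  # True
```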
[] Square brackets are used to represent a set of characters. For example, [aeiou] matches any one of the characters “a e i o u”.
[] can also be used to represent the set of characters not listed inside [], by placing a caret “^” at the beginning. For example, [^aeiou] will match any character that is not one of “a e i o u”.
A range of characters can be specified with a hyphen. Typically, [0-9] matches any digit, [a-z] matches any lowercase letter, and [A-Z] matches any capital letter. These can be combined, for example [A-Za-z] to mean any capital or lowercase letter.
Character classes such as \w or \s can also be used inside square brackets. For example, [\w,. ] will match any alphanumeric char or underscore, or one of comma, period, space.
Most characters that have special meanings in regex do not have special meanings when used inside []. For example, [b+] does not mean one or more b; it just matches “b” or “+”.
To include characters such as bracket “]” or dash “-” or backslash “\”, put a backslash before the char. For example, r'[\\a\-]' will match any of “\ a -”. For historical reasons, a “]” that appears as the first char in the brackets is treated literally, as is a “-” that appears as the first or last char. For example, '[]b]' is legal syntax. It will match “]” or “b”.
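The set forms described above can be checked one by one. A minimal sketch, assuming Python 3; the sample strings are made up for illustration.

```python
import re

vowels = re.findall('[aeiou]', 'regex')           # ['e', 'e']
consonants = re.findall('[^aeiou]', 'regex')      # ['r', 'g', 'x']
letters = re.findall('[A-Za-z]+', 'Hello, World 42')  # ['Hello', 'World']
mixed = re.findall(r'[\w,. ]+', 'a_b, c. (d)')    # ['a_b, c. ', 'd']

# Inside a set, "+" is just a plus sign.
literal_plus = re.findall('[b+]', 'b+c')          # ['b', '+']

# Backslash, "a", or dash; the subject string is: a-\z
escaped = re.findall(r'[\\a\-]', 'a-\\z')         # ['a', '-', '\\']

# "]" placed first in the set is taken literally.
bracket_first = re.findall('[]b]', 'a]b')         # [']', 'b']

for result in (vowels, consonants, letters, mixed,
               literal_plus, escaped, bracket_first):
    print(result)
```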
* The asterisk char represents zero or more repetitions of the preceding character or pattern.
For example, ah* will match any of “a”, “ah”, “ahh”.
Note that a pattern group can be used in front of “*” or any other repetition
qualifier such as “+” or “?”. For example: a(xy)*b will match any of “ab”, “axyb”, “axyxyb”, “axyxyxyb”.
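A small sketch of the asterisk, assuming Python 3. It uses re.fullmatch() so that each candidate string must match as a whole:

```python
import re

# Zero or more "h" after an "a".
ah_hits = [s for s in ['a', 'ah', 'ahh', 'h'] if re.fullmatch('ah*', s)]
print(ah_hits)  # ['a', 'ah', 'ahh']

# The asterisk applied to a whole group: zero or more "xy" between a and b.
group_hits = [s for s in ['ab', 'axyb', 'axyxyb', 'axb']
              if re.fullmatch('a(xy)*b', s)]
print(group_hits)  # ['ab', 'axyb', 'axyxyb']
```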
+ The plus char represents one or more repetitions of the preceding regex.
For example, ab+ will match the “ab” in “abc” or the “abb” in “abbc”, but will not match “ac”.
? A question mark represents 0 or 1 repetitions of the preceding regex. For example, ab? will match “a” or “ab”.
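The plus and question mark can be tried the same way. A minimal sketch, assuming Python 3:

```python
import re

# "+" needs at least one "b" after the "a".
plus_hits = [s for s in ['abc', 'abbc', 'ac'] if re.search('ab+', s)]
print(plus_hits)  # ['abc', 'abbc']
print(re.search('ab+', 'abbc').group())  # abb

# "?" allows the "b" to be present once or not at all.
optional_hits = [s for s in ['a', 'ab', 'abb'] if re.fullmatch('ab?', s)]
print(optional_hits)  # ['a', 'ab']
```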
*?, +?, ?? The repetition qualifiers * + ? are all “greedy”. That is, they will match as much as possible. Sometimes this behaviour isn't desired.
For example, if the regex '<.*>' is matched against '<H1>title</H1>', it will match the entire string, and not just '<H1>'.
This is not useful if you want to match a single tag, for instance as a step toward capturing the title.
One can specify non-greedy, minimal match behavior by
adding "?" after the qualifier. For example, '<.*?>' will now match only '<H1>' in '<H1>title</H1>'.
To capture the title in '<H1>title</H1>', one can use either '<.+?>(.+?)<.+?>' or '<[^>]+>([^<]+)<'.
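The difference between greedy and non-greedy matching on the heading string can be sketched as follows, assuming Python 3:

```python
import re

html = '<H1>title</H1>'

greedy = re.search('<.*>', html).group()
lazy = re.search('<.*?>', html).group()
print(greedy)  # <H1>title</H1>
print(lazy)    # <H1>

# Two ways to capture the text between the tags.
title_lazy = re.search('<.+?>(.+?)<.+?>', html).group(1)
title_negated = re.search('<[^>]+>([^<]+)<', html).group(1)
print(title_lazy, title_negated)  # title title
```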
{‹m›} Specifies that exactly ‹m› copies of the previous regex should be
matched; fewer matches cause the entire regex not to match. For example,
a{6} will match exactly six “a” characters, but
not five.
{‹m›,‹n›} Represents
‹m› to ‹n› repetitions of the preceding regex, attempting to
match as many repetitions as possible. For example, a{3,5}
will match from 3 to 5 "a" characters. Omitting ‹m›
specifies a lower bound of zero,
and omitting ‹n› specifies an infinite upper bound. As an
example, a{4,}b will match aaaab or a thousand
“a” characters followed by “b”, but not “aaab”.
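The counted forms can be checked with strings of repeated “a”. A minimal sketch, assuming Python 3:

```python
import re

# a{6}: only a run of exactly six matches as a whole.
exactly_six = [n for n in range(4, 9) if re.fullmatch('a{6}', 'a' * n)]
print(exactly_six)  # [6]

# a{3,5}: greedy, so it takes five of the seven available.
three_to_five = re.match('a{3,5}', 'a' * 7).group()
print(three_to_five)  # aaaaa

# a{4,}b: four or more "a", no upper bound.
candidates = ['aaab', 'aaaab', 'a' * 1000 + 'b']
at_least_four = [len(s) for s in candidates if re.fullmatch('a{4,}b', s)]
print(at_least_four)  # [5, 1001]

# a{,2}: the omitted lower bound is zero.
up_to_two = [n for n in range(4) if re.fullmatch('a{,2}', 'a' * n)]
print(up_to_two)  # [0, 1, 2]
```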
{‹m›,‹n›}? Represents ‹m› to ‹n› repetitions of the preceding regex,
attempting to match as few repetitions as possible. This is
the non-greedy version of the previous qualifier. For example, on the
6-character string 'aaaaaa', 'a{3,5}' will match 5
"a" characters, while 'a{3,5}?' will only match 3
characters.
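The six-character example above, written out as a short sketch (Python 3 assumed):

```python
import re

greedy_run = re.match('a{3,5}', 'aaaaaa').group()
lazy_run = re.match('a{3,5}?', 'aaaaaa').group()

print(greedy_run)  # aaaaa
print(lazy_run)    # aaa
```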
\ A backslash followed by a character will in general represent that character literally.
In regex, many chars have special meanings. For example: “()[]{}*+?.\” and more. Sometimes you want to search for these chars exactly. This can be done by adding a backslash in front of the char that has special meaning. For example, to match a string containing a question mark, use the regex r'\?'.
If a char does not have special meaning, adding a backslash in front may or may not represent the character literally. For example, many of the “character class” wildcards start with a backslash, such as \w \W \d \D. (See the Wildcards section above.) Also, a backslash followed by a number represents a captured pattern. (See the Captures section below.)
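Escaping can be sketched as below, assuming Python 3. The use of re.escape() is an addition for illustration; it builds a pattern that matches a given string literally, and its exact output varies between Python versions, so only its effect is shown.

```python
import re

# An escaped question mark matches a literal "?".
question = re.search(r'\?', 'what?').group()
print(question)  # ?

# Unescaped, the period matches anything; escaped, only a period.
any_char = re.findall('.', 'a.b')
only_dots = re.findall(r'\.', 'a.b')
print(any_char)   # ['a', '.', 'b']
print(only_dots)  # ['.']

# re.escape() adds the backslashes needed to match a string literally.
raw_text = '1+1=2?'
literal = re.fullmatch(re.escape(raw_text), raw_text)
print(literal is not None)  # True
```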
todo: NOTE TO DOC WRITERS: the following are not clear in meaning.
Most of the standard escapes supported by Python string literals are also accepted by the regular expression parser:
\a \b \f \n \r \t \v \x \\
Octal escapes are included in a limited form: If the first digit is a 0, or if there are three octal digits, it is considered an octal escape. Otherwise, it is a group reference. As for string literals, octal escapes are always at most three digits in length.
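The distinction between octal escapes and group references can be checked with a small sketch (Python 3 assumed; the sample patterns are made up for illustration):

```python
import re

# Three octal digits: an octal escape. Octal 101 is 65, the letter "A".
octal = re.fullmatch(r'\101', 'A')

# A leading zero also makes it an octal escape; \0 is the NUL character.
nul = re.fullmatch(r'\0', '\x00')

# \x followed by two hex digits works as in string literals.
hexa = re.fullmatch(r'\x41', 'A')

# A single non-zero digit is a group reference, not an octal escape.
backref = re.fullmatch(r'(a)\1', 'aa')

print(all(m is not None for m in (octal, nul, hexa, backref)))  # True
```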
^ The caret matches the start of the
string. For example, '^aha gocha' matches 'aha gocha!', but does not match 'haha gocha!'.
The ^ char is normally used as the first char in a pattern.
If the MULTILINE flag is set, the caret also matches immediately
after each newline.
For example, the following:
re.search(r'^aha', 'why not?\naha, i see.', re.MULTILINE)
returns a MatchObject because 'aha' appears at the beginning of the second line. If the re.MULTILINE flag is not given, then None is returned since no 'aha' appears at the beginning of the string.
$ The dollar sign matches the end of the string or just before the newline at the end of the string. For example, 'gocha$' matches 'aha gocha' and 'aha gocha\n' but does not match 'gocha!'.
If the re.MULTILINE flag is given, then it
also matches before any newline.
For example, re.compile('gocha$',re.MULTILINE) will now also match “gocha\nthis time”.
Regexes such as ^ and $ are called anchors, because they force a regex pattern to match at the start or end of a string or of the lines within a string.
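The effect of MULTILINE on both anchors can be seen on a two-line string. A minimal sketch, assuming Python 3:

```python
import re

text = 'one two\nthree four'

# Without MULTILINE the anchors only apply to the whole string.
first_word = re.findall(r'^\w+', text)
last_word = re.findall(r'\w+$', text)
print(first_word, last_word)  # ['one'] ['four']

# With MULTILINE they apply to every line.
line_starts = re.findall(r'^\w+', text, re.MULTILINE)
line_ends = re.findall(r'\w+$', text, re.MULTILINE)
print(line_starts, line_ends)  # ['one', 'three'] ['two', 'four']
```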
\A Matches only at the start of the string.
\Z Matches only at the end of the string.
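Unlike ^ and $, these two are not affected by MULTILINE, and \Z does not match before a trailing newline. A short sketch of the difference, assuming Python 3:

```python
import re

text = 'aha\naha\n'

# ^ matches at each line start under MULTILINE; \A only at the very start.
caret_count = len(re.findall(r'^aha', text, re.MULTILINE))
start_count = len(re.findall(r'\Aaha', text, re.MULTILINE))
print(caret_count, start_count)  # 2 1

# $ accepts the position before the final newline; \Z does not.
dollar_hit = re.search(r'aha$', text)
z_hit = re.search(r'aha\Z', text)
print(dollar_hit is not None, z_hit is not None)  # True False
```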
\b Matches the beginning or end of a word. A word is a sequence of
alphanumeric chars plus underscore “_”. More specifically, \b matches the boundary between the regex \w and \W, or between \w and the start or end of the string. This means the set of alphanumeric chars is affected by the UNICODE and LOCALE flags. For example, if the UNICODE flag is set, then Chinese chars are considered alphanumeric.
Inside a character set [], \b means the backspace character.
\B Matches the empty string, but only when it is not
at the beginning or end of a word. This is just the opposite of \b, so it is also subject to the settings of LOCALE and UNICODE.
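Word boundaries can be sketched as follows, assuming Python 3 and made-up sample strings:

```python
import re

# \b on both sides: "cat" only as a whole word.
whole_words = re.findall(r'\bcat\b', 'cat concat cat.')
print(whole_words)  # ['cat', 'cat']

# \B in front: "cat" only when it does not start a word.
inner_starts = [m.start() for m in re.finditer(r'\Bcat', 'cat concat')]
print(inner_starts)  # [7]

# Inside a set, \b is the backspace character (ASCII 8).
backspace = re.search(r'[\b]', 'a\x08c')
print(backspace is not None)  # True
```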
| The vertical bar separates alternatives: A|B matches whatever matches either A or B. Alternatives can be used inside capture groups as well (see Captures below).
To match the vertical bar | exactly, use \|.
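A minimal sketch of alternatives, both bare and inside a group, assuming Python 3:

```python
import re

# Either "cat" or "dog".
pets = re.findall('cat|dog', 'cat dog bird')
print(pets)  # ['cat', 'dog']

# Alternatives inside a group: "gray" or "grey".
spellings = [s for s in ['gray', 'grey', 'griy'] if re.fullmatch('gr(a|e)y', s)]
print(spellings)  # ['gray', 'grey']

# An escaped bar matches a literal "|".
fields = re.split(r'\|', 'a|b|c')
print(fields)  # ['a', 'b', 'c']
```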
() If a regex is enclosed in parentheses, then any string matching the enclosed regex will be “captured”, and can be referred to later using the form \number. This is often used in replacement.
In the following example, the quote is captured and referenced as \1, and the source after the double dash is captured and referenced as \2.
import re
newstr=re.sub(r'([^-]+)--(.+)$', r'\1--Me, not \2','"what do you mean?" --A Sage') # returns: "what do you mean?" --Me, not A Sage
To match parentheses literally, use \( and \).
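Captured groups can also be read from the match object. A minimal sketch, assuming Python 3; the sample strings are made up for illustration.

```python
import re

m = re.match(r'(\w+) (\w+)', 'Isaac Newton')
print(m.group(1))  # Isaac
print(m.group(2))  # Newton
print(m.groups())  # ('Isaac', 'Newton')

# Referring to the captures in a replacement string.
swapped = re.sub(r'(\w+) (\w+)', r'\2 \1', 'Isaac Newton')
print(swapped)  # Newton Isaac

# Escaped parentheses match literally; the inner pair still captures.
in_parens = re.search(r'\((\d+)\)', 'call (42) now').group(1)
print(in_parens)  # 42
```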
\‹number› A backslash followed by a number n represents the nth captured match.
Captures are numbered starting from 1. The form \‹number› is mostly used in the replacement string, but can also be used inside a regex pattern.
For example,
r'(.+) \1' matches the whole of “the the” or “55 55”, but not the whole of
“the end”.
Note: For historical reasons, the number in \‹number› can be from 1 to 99 only. That is to say, no more than 99 captures can be referenced with this form.
NOTE TO DOC WRITER: what happens with “\562” for example? And what happens with “\09” for example? or '\23456'? Needs a clear explanation on this here.
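A sketch of a backreference inside a pattern, assuming Python 3. It uses re.fullmatch() so the whole string has to fit the pattern; a plain search() on “the end” would still find the short repeat “e e”.

```python
import re

repeats = [s for s in ['the the', '55 55', 'the end']
           if re.fullmatch(r'(.+) \1', s)]
print(repeats)  # ['the the', '55 55']

# search() looks anywhere in the string, so it finds "e e" in "the end".
partial = re.search(r'(.+) \1', 'the end').group()
print(partial)  # e e

# Collapsing a doubled word with a backreference in the replacement too.
deduped = re.sub(r'\b(\w+) \1\b', r'\1', 'the the cat')
print(deduped)  # the cat
```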
(?P<‹name›>…) This form is similar to parentheses, but the captured substring is given the name ‹name›, so that it can be referred to by name as well as by \number.
Naming captured groups has advantages. In particular, when a complex regex is edited with captures added or deleted, references using the named form will remain stable. To refer to a named captured group in a replacement string, use the form r'\g<name>'. To refer to a named captured group in a regex, use the form “(?P=name)”.
In the following example, a file name and link string are extracted from an HTML document's link anchor:
import re
patternObj=re.compile(r'([^<]+)<a href="(?P<fileName>[^"]+)">(?P<linkStr>[^<]+)</a>')
matchObj=patternObj.search('look: <a href="some.jpg">my cat</a>.')
print(matchObj.expand(r'file name is: \g<fileName>, link string is: \g<linkStr>'))
# prints: file name is: some.jpg, link string is: my cat
(?P=‹name›) Matches whatever text was matched by the earlier group named ‹name›.
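Named groups and named backreferences can be sketched as follows, assuming Python 3. The group names word, first and last are made up for illustration.

```python
import re

# (?P=word) must repeat whatever the group named "word" matched.
m = re.search(r'(?P<word>\w+) (?P=word)', 'say hello hello again')
print(m.group('word'))  # hello
print(m.group(1))       # hello (the same group, by number)

# \g<name> refers to a named group in a replacement string.
reordered = re.sub(r'(?P<first>\w+) (?P<last>\w+)',
                   r'\g<last>, \g<first>', 'Isaac Asimov')
print(reordered)  # Asimov, Isaac
```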
(?…) This is an extension notation (a "?"
following a "(" is not meaningful otherwise). The first
character after the "?"
determines what the meaning and further syntax of the construct is.
Extensions usually do not create a new group;
(?P<‹name›>…) is the only exception to this rule.
Following are the currently supported extensions.
(?iLmsux) (One or more letters from the set "i",
"L", "m", "s", "u",
"x".) The group matches the empty string; the letters set
the corresponding flags (re.I, re.L,
re.M, re.S, re.U, re.X)
for the entire regular expression. This is useful if you wish to
include the flags as part of the regular expression, instead of
passing a ‹flag› argument to the compile() function.
Note that the (?x) flag changes how the expression is parsed.
It should be used first in the expression string, or after one or more
whitespace characters. If there are non-whitespace characters before
the flag, the results are undefined.
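Inline flags can be compared with the flag argument in a short sketch. It assumes Python 3, where the flag group is best placed at the very start of the pattern.

```python
import re

subject = 'Python PYTHON python'

# (?i) at the start of the pattern has the same effect as passing re.I.
inline = re.findall('(?i)python', subject)
by_argument = re.findall('python', subject, re.I)
print(inline)                 # ['Python', 'PYTHON', 'python']
print(inline == by_argument)  # True

# (?x) turns on verbose mode: whitespace and "#" comments are ignored.
decimal = re.fullmatch(r'(?x) \d+ \. \d+  # digits, a dot, digits', '3.14')
print(decimal is not None)  # True
```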
(?:…) A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.
(?#…) A comment; the contents of the parentheses are simply ignored.
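The difference between capturing and non-capturing parentheses shows up in groups(). A minimal sketch, assuming Python 3:

```python
import re

# Regular parentheses: both groups are captured.
capturing = re.match(r'(ab)+(c)', 'ababc').groups()
print(capturing)  # ('ab', 'c')

# (?:...) still groups for "+", but is not captured.
non_capturing = re.match(r'(?:ab)+(c)', 'ababc').groups()
print(non_capturing)  # ('c',)

# (?#...) is ignored entirely.
commented = re.fullmatch(r'a(?#this is only a comment)b', 'ab')
print(commented is not None)  # True
```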
(?=…) Matches if … matches next, but doesn't
consume any of the string. This is called a lookahead assertion. For
example, Isaac (?=Asimov) will match 'Isaac ' only if it's
followed by 'Asimov'.
(?!…) Matches if … doesn't match next. This
is a negative lookahead assertion. For example,
Isaac (?!Asimov) will match 'Isaac ' only if it's not
followed by 'Asimov'.
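Both lookahead forms can be tried on the two names. A minimal sketch, assuming Python 3:

```python
import re

# Positive lookahead: 'Asimov' must follow, but is not part of the match.
positive = re.search(r'Isaac (?=Asimov)', 'Isaac Asimov')
print(repr(positive.group()))  # 'Isaac '
print(re.search(r'Isaac (?=Asimov)', 'Isaac Newton'))  # None

# Negative lookahead: 'Asimov' must not follow.
negative = re.search(r'Isaac (?!Asimov)', 'Isaac Newton')
print(repr(negative.group()))  # 'Isaac '
print(re.search(r'Isaac (?!Asimov)', 'Isaac Asimov'))  # None
```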
(?<=…) Matches if the current position in the string
is preceded by a match for … that ends at the current
position. This is called a positive lookbehind assertion.
(?<=abc)def will find a match in "abcdef", since the
lookbehind will back up 3 characters and check if the contained
pattern matches. The contained pattern must only match strings of
some fixed length, meaning that abc or a|b are
allowed, but a* and a{3,4} are not. Note that
patterns which start with positive lookbehind assertions will never
match at the beginning of the string being searched; you will most
likely want to use the search() function rather than the
match() function:
import re
m = re.search('(?<=abc)def', 'abcdef')
print(m.group(0)) # prints def
This example looks for a word following a hyphen:
import re
m = re.search(r'(?<=-)\w+', 'spam-egg')
print(m.group(0)) # prints egg
(?<!…) Matches if the current position in the string
is not preceded by a match for …. This is called a
negative lookbehind assertion. Similar to positive lookbehind
assertions, the contained pattern must only match strings of some
fixed length. Patterns which start with negative lookbehind
assertions may match at the beginning of the string being searched.
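Negative lookbehind and the fixed-length restriction can be sketched as follows, assuming Python 3; the sample strings are made up for illustration.

```python
import re

# Numbers that are not directly preceded by a dollar sign.
plain_numbers = re.findall(r'(?<!\$)\b\d+', 'cost $30 for 12 items')
print(plain_numbers)  # ['12']

# A pattern starting with a negative lookbehind can match at the string start.
at_start = re.match(r'(?<!-)\w+', 'egg')
print(at_start.group())  # egg

# A variable-length pattern inside a lookbehind is rejected at compile time.
try:
    re.compile(r'(?<!a*)b')
    fixed_width_enforced = False
except re.error:
    fixed_width_enforced = True
print(fixed_width_enforced)  # True
```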
(?(‹id/name›)yes-pattern|no-pattern) Will try to match
with yes-pattern if the group with the given ‹id› or ‹name›
exists, and with no-pattern if it doesn't. |no-pattern
is optional and can be omitted. For example,
(<)?(\w+@\w+(?:\.\w+)+)(?(1)>) is a poor email matching
pattern, which, when used with match(), will match '<user@host.com>' as well as
'user@host.com', but not '<user@host.com'.
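The conditional pattern above can be checked against the three addresses. A minimal sketch, assuming Python 3 and match(), which only tries the start of the string:

```python
import re

pattern = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)')

bracketed = pattern.match('<user@host.com>')
bare = pattern.match('user@host.com')
unbalanced = pattern.match('<user@host.com')

print(bracketed.group(2))  # user@host.com
print(bare.group(2))       # user@host.com
print(unbalanced)          # None
```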