Python: Regex Flags
Many Python Regex Functions and Regex Methods take an optional argument called “flags”. The flags modify the meaning of the given regex pattern.
The flags can be any of:
syntax | long syntax | meaning |
---|---|---|
re.I | re.IGNORECASE | ignore case. |
re.M | re.MULTILINE | make {^ , $ } match at the beginning/end of each line, not just of the whole string. |
re.S | re.DOTALL | make . match newline too. |
re.U | re.UNICODE | make {\w , \W , \b , \B } follow Unicode rules. |
re.L | re.LOCALE | make {\w , \W , \b , \B } follow the current locale. |
re.X | re.VERBOSE | allow whitespace and comments in regex. |
To specify more than one flag, connect them with the | operator, e.g. re.search(pattern, string, flags=re.IGNORECASE|re.MULTILINE|re.UNICODE).
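As a minimal sketch of combining flags, the following compares a search with no flags against the same search with two flags joined by |. It is written for Python 3 (unlike the Python 2 examples on this page), and the sample text and variable names are made up for illustration.

```python
# Python 3 sketch: combining two flags with the | operator
import re

text = "Alpha\nbeta\nALPHA"

# No flags: ^ anchors only at the start of the string, and case matters
plain = re.findall(r"^alpha", text)

# IGNORECASE | MULTILINE: ^ anchors at the start of every line, case is ignored
combined = re.findall(r"^alpha", text, flags=re.IGNORECASE | re.MULTILINE)

print(plain)     # []
print(combined)  # ['Alpha', 'ALPHA']
```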
re.IGNORECASE or re.I
Indicates case-insensitive matching.
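A short illustration of the difference, written for Python 3 (unlike the Python 2 examples on this page); the sample text is invented for the example.

```python
# Python 3 sketch of the re.IGNORECASE flag
import re

text = "Python python PYTHON"

# Without the flag, only the exact lowercase spelling matches
case_sensitive = re.findall(r"python", text)

# With re.I, every capitalization matches
case_insensitive = re.findall(r"python", text, flags=re.I)

print(case_sensitive)    # ['python']
print(case_insensitive)  # ['Python', 'python', 'PYTHON']
```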
re.MULTILINE or re.M
When specified, the pattern character ^ matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character $ matches at the end of the string and at the end of each line (immediately preceding each newline).
Normally, ^ and $ only match at the beginning/end of the string. 〔see Python: Regex Syntax〕
    # -*- coding: utf-8 -*-
    # python 2
    # example of regex flag re.MULTILINE
    import re

    ss = """abc
    def
    ghi"""

    r1 = re.findall(r"^\w", ss)
    r2 = re.findall(r"^\w", ss, flags = re.MULTILINE)

    print r1 # ['a']
    print r2 # ['a', 'd', 'g']
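The same flag also changes $. The following sketch, written for Python 3 rather than the Python 2 used elsewhere on this page, applies the end-of-line anchor to the same three-line text.

```python
# Python 3 sketch: effect of re.MULTILINE on the $ anchor
import re

text = "abc\ndef\nghi"

# Without the flag, $ matches only at the end of the whole string
last_only = re.findall(r"\w$", text)

# With re.M, $ also matches just before each newline
each_line = re.findall(r"\w$", text, flags=re.M)

print(last_only)  # ['i']
print(each_line)  # ['c', 'f', 'i']
```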
re.DOTALL or re.S
Make the dot character . match any character, including a newline. Without this flag, a dot matches anything except a newline.
    # -*- coding: utf-8 -*-
    # python 2
    # example of regex flag re.DOTALL
    import re

    ss = """once upon a time,
    there lived a king"""

    r1 = re.findall(r".+", ss)
    r2 = re.findall(r".+", ss, re.DOTALL)

    print r1 # ['once upon a time,', 'there lived a king']
    print r2 # ['once upon a time,\nthere lived a king']
re.UNICODE or re.U
Make the pattern characters {\w, \W, \b, \B} dependent on the Unicode character properties database.
    # -*- coding: utf-8 -*-
    # example of regex re.UNICODE flag
    import re

    x1 = re.search(r"\w+", u"♥αβγ!", re.U)
    x2 = re.search(r"\w+", u"♥αβγ!")

    if x1:
        print x1.group().encode("utf8") # → 「αβγ」
    else:
        print "no match"

    print x2 # → 「None」
Note that the pattern string itself can contain Unicode characters. Just be sure to add the Unicode prefix u to the pattern string.
    # -*- coding: utf-8 -*-
    import re

    result = re.findall(ur"β", u"αβγ", re.U)
    print result[0].encode("utf8") # prints β
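The examples above are Python 2. In Python 3 the situation is reversed: str patterns follow Unicode rules by default, so re.UNICODE is redundant there, and the separate flag re.ASCII (not covered on this page) restores ASCII-only matching. A minimal Python 3 sketch of that difference, reusing the same sample text:

```python
# Python 3 sketch: Unicode matching is the default for str patterns
import re

text = "\u2665\u03b1\u03b2\u03b3!"  # the string ♥αβγ!

# Default in Python 3: \w follows Unicode rules, so the Greek letters match
default_match = re.search(r"\w+", text)

# re.ASCII limits \w to [a-zA-Z0-9_], so nothing in this text matches
ascii_match = re.search(r"\w+", text, re.ASCII)

print(ascii(default_match.group()))  # '\u03b1\u03b2\u03b3'
print(ascii_match)                   # None
```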
re.LOCALE or re.L
Make the word patterns {\w, \W} and the boundary patterns {\b, \B} dependent on the current locale. 〔see Python: Regex Syntax〕
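A hedged Python 3 sketch of using this flag. In Python 3, re.LOCALE is meant for bytes patterns, and recent versions reject it for str patterns. What counts as a word character depends on the locale in effect, so the sample below sticks to ASCII letters, which should be word characters under any locale.

```python
# Python 3 sketch of the re.LOCALE flag (bytes patterns only)
import re

# ASCII letters are word characters in any locale, so this result
# should not depend on the locale in effect
words = re.findall(rb"\w+", b"abc def", flags=re.LOCALE)
print(words)  # [b'abc', b'def']

# With a str pattern, recent Python 3 versions raise ValueError
try:
    re.compile(r"\w+", flags=re.LOCALE)
    str_pattern_rejected = False
except ValueError:
    str_pattern_rejected = True
print(str_pattern_rejected)
```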
re.VERBOSE or re.X
This flag changes the regex syntax to allow you to add annotations in the regex. Whitespace within the pattern is ignored, except when in a character class or preceded by an unescaped backslash. When a line contains a # that is neither in a character class nor preceded by an unescaped backslash, all characters from the leftmost such # through the end of the line are ignored.
    # -*- coding: utf-8 -*-
    import re

    # example of the regex re.VERBOSE flag
    # matching a decimal number

    p1 = re.compile(r"""\d +  # the integral part
                        \.    # the decimal point
                        \d *  # some fractional digits""", re.X)

    p2 = re.compile(r"\d+\.\d*")

    # pattern p2 is same as p1

    r1 = re.findall(p1, u"a3.45")
    r2 = re.findall(p2, u"a3.45")

    print r1[0].encode("utf8") # 3.45
    print r2[0].encode("utf8") # 3.45
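The exceptions to the whitespace and # rules can be checked directly. The following is a small Python 3 sketch (the page's own examples are Python 2); the pattern names are made up for illustration.

```python
# Python 3 sketch: what re.VERBOSE ignores and what it keeps
import re

# Unescaped whitespace is ignored, so this pattern is equivalent to "ab"
ignored = re.compile(r"a b", re.X)

# A space preceded by a backslash is kept as a literal space
escaped = re.compile(r"a\ b", re.X)

# A space inside a character class is kept as a literal space
in_class = re.compile(r"a[ ]b", re.X)

# A # inside a character class is a literal #; the later # starts a comment
hash_in_class = re.compile(r"a[#]b  # trailing comment", re.X)

print(bool(ignored.search("ab")), bool(ignored.search("a b")))  # True False
print(bool(escaped.search("a b")))                              # True
print(bool(in_class.search("a b")))                             # True
print(bool(hash_in_class.search("a#b")))                        # True
```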