Python Regex Functions


Here's a summary of regex functions.

regex functions summary

syntax                        meaning
re.search(p,str)              return a match object if found, else None.
re.match(p,str)               similar to re.search(), but the match must start at the beginning of the string.
re.split(p,str)               split str at each match of p. Returns a list.
re.findall(p,str)             return a list of all non-overlapping matches.
re.finditer(…)                similar to re.findall(), but returns an iterator.
re.sub(p,replacement,str)     does replacement. Returns the new string.
re.subn(…)                    similar to re.sub(), but returns a tuple. 1st element is the new string, 2nd is the number of replacements.
re.escape(str)                add backslashes to a string so it can be fed to regex as a literal pattern. Returns the new string.

Note: optional parameters not shown above. See below for detail.

re.search(…)

re.search(pattern, str) → Return a MatchObject if pattern matches anywhere in the string (part or whole), else return None. Note: a successful match may be empty, so it does not necessarily contain any character of the given string. For example, these patterns match any string: '' and 'y*'.

re.search(pattern, str, flags=flags) → same, with the given flags applied.
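The point about patterns that match any string can be checked directly. This is a minimal sketch; the variable names m1 and m2 are only for illustration, and print(…) is written with parentheses so that it also runs on Python 3.

```python
import re

# Both patterns can match zero characters, so they succeed on any string.
m1 = re.search(r"", "abc")
m2 = re.search(r"y*", "abc")

print(m1 is not None)    # True
print(repr(m1.group()))  # ''
print(repr(m2.group()))  # ''
```

A match object is always true in a boolean test, even when the matched text is empty, so "if m1:" still works as a found/not-found check.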

# -*- coding: utf-8 -*-
# python

# example of re.search()

import re

xx = re.search(r"\w+@\w+\.com", "from xyz@example.com address")

if xx:
    print("yes")
    print(xx.group()) # → xyz@example.com
else:
    print("no")

Note: the pattern string should be written as a raw string, like this r"…". Otherwise, each backslash in it must be escaped. For example, to search for a sequence of tabs, use re.search(r"\t+", str) or re.search("\\t+", str).
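As a small check of the note above, the two spellings produce the same pattern string. The names raw and escaped are illustrative only.

```python
import re

raw = r"\t+"      # backslash and t reach the regex engine exactly as written
escaped = "\\t+"  # the same two characters, with the backslash escaped by hand

print(raw == escaped)   # True

m = re.search(raw, "a\t\tb")
print(repr(m.group()))  # '\t\t'
```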

The optional parameter “flags” modifies the meaning of the given pattern. Flags are constants of the re module, such as re.IGNORECASE, re.MULTILINE, and re.UNICODE.

To specify more than one of them, use the | operator to connect them. For example, re.search(pattern,string,flags=re.IGNORECASE|re.MULTILINE|re.UNICODE).

For detail, see: Python Regex Flags
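Here is a short sketch of combining two flags with |. The sample text is made up for the illustration.

```python
import re

text = "first line\nSecond LINE"

# re.IGNORECASE lets "line" match "LINE"; re.MULTILINE lets $ match
# before each newline as well as at the very end of the string.
found = re.findall(r"line$", text, flags=re.IGNORECASE | re.MULTILINE)
print(found)          # ['line', 'LINE']

# Without flags, $ matches only at the end of the string and case matters.
plain = re.findall(r"line$", text)
print(plain)          # []
```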

re.match(…)

re.match(pattern, string) → Similar to re.search() except that the match must start at the beginning of string. For example, re.search('me','somestring') matches, but re.match('me','somestring') returns None.

re.match(pattern, string, flags=flags) → same, with the given flags applied.

# -*- coding: utf-8 -*-
# python

import re

my_result = re.match('so','somestring') # succeeds

if my_result is None:
    print("no match")
else:
    print("yes match")

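Note: re.match() is not exactly equivalent to re.search() with ^. With the re.M flag, ^ also matches right after each newline, while re.match() only ever tries the very start of the string. Example: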

re.search(r'^B', 'A\nB',re.M) # succeeds
re.match(r'B', 'A\nB',re.M)   # fails

re.split(…)

re.split(pattern, string) → Returns a list of substrings, split at each match of pattern.

# -*- coding: utf-8 -*-
# python

import re

print(re.split(r' +', 'what   do  you think'))
#                    ['what', 'do', 'you', 'think']

If the boundary pattern is enclosed in parentheses, then the matched boundary text is included in the returned list. For example:

# -*- coding: utf-8 -*-
# python

import re

print(re.split(r'( +)', 'what   do  you think'))
#     ['what', '   ', 'do', '  ', 'you', ' ', 'think']

If there is more than one capturing group in pattern, the text of each group is included in the returned list, in sequence. For example:

# -*- coding: utf-8 -*-
# python

import re

print(re.split(r'( +)(@+)', 'what   @@do  @@you @@think'))
# ⇒ ['what', '   ', '@@', 'do', '  ', '@@', 'you', ' ', '@@', 'think']

re.split(pattern, string, maxsplit = n) → split, at most n times.

# -*- coding: utf-8 -*-
# python
import re
print(re.split(r' ', 'a b c d e', maxsplit = 2))
# ['a', 'b', 'c d e']

re.findall(…)

re.findall(pattern, string) → Return a list of all non-overlapping matches of pattern in string.

re.findall(pattern, string, flags=flags)

# -*- coding: utf-8 -*-
# python
import re
print(re.findall(r'@+', 'what   @@@do  @@you @think')) # ['@@@', '@@', '@']

If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Example:

# -*- coding: utf-8 -*-
# python
import re
print(re.findall(r'( +)(@+)', 'what   @@@do  @@you @think'))
# ⇒ [('   ', '@@@'), ('  ', '@@'), (' ', '@')]

Empty matches are included in the result unless they touch the beginning of another match. Example:

# -*- coding: utf-8 -*-
# python
import re
print(re.findall(r'\b', 'what   @@@do  @@you @think'))
# ['', '', '', '', '', '', '', '']

TODO: need another example here showing what is meant by “unless they touch the beginning of another match.”

re.finditer(…)

re.finditer(pattern, string) → Similar to re.findall(), except an “iterator” is returned with MatchObject as members. This is to be used in a loop.

re.finditer(pattern, string, flags=flags)

# -*- coding: utf-8 -*-
# python
import re

for matched in re.finditer(r'(\w+)', 'what   do  you think'):
    print(matched.group())      # prints each word on its own line
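Because each item from re.finditer() is a match object rather than a plain string, the position of every match is available too. A minimal sketch, where the list name spans is only for illustration:

```python
import re

text = "what   do  you think"

# Collect (matched text, start index, end index) for every word.
spans = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+", text)]

print(spans)
# [('what', 0, 4), ('do', 7, 9), ('you', 11, 14), ('think', 15, 20)]
```

re.findall() would return only the four words, without their positions.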

re.sub(…)

re.sub(pattern, repl, string) → Substitute pattern in string by the replacement repl. If the pattern isn't found, string is returned unchanged. Returns a new string.

# -*- coding: utf-8 -*-
# python

# example of using re.sub( )

import re

# add alt to image tag
t1 = '<img src="cat.jpg">'
t2 = re.sub(r'src="([a-z]+)\.jpg">', r'src="\1.jpg" alt="\1">', t1)

print(t1)                   # <img src="cat.jpg">
print(t2)                   # <img src="cat.jpg" alt="cat">

repl can also be a function for more complicated replacement. The function must take a MatchObject as argument. For each occurrence of match, the function is called and its return value used as the replacement string. Example:

# -*- coding: utf-8 -*-
# python

# example of using re.sub(pattern, rep, str ) where rep is a function

import re

def ff(xx):
    if xx.group(0) == "ea":
        return "æ"
    elif xx.group(0) == "oo":
        return "u"
    else:
        return xx.group(0)

print(re.sub(r"[aeiou]+", ff, "encyclopeadia")) # encyclopædia
print(re.sub(r"[aeiou]+", ff, "book")) # buk
print(re.sub(r"[aeiou]+", ff, "geek")) # geek

pattern may be a string or a regex object. If you need to specify regular expression flags, you can use a regex object. Alternatively, you can embed a flag in your regex pattern by (?iLmsux) at the beginning of your pattern. For example, re.sub("(?i)b+", "x", "bbbb BBBB") returns 'x x'. (See: regex pattern syntax for detail.)

re.sub(pattern, repl, string, count=n) → replace at most n occurrences of pattern, counting from the left.
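A short sketch of limiting the number of replacements. The sample string is made up, and count is passed by keyword here.

```python
import re

# Replace only the first two runs of digits; later ones are left alone.
result = re.sub(r"\d+", "N", "a1 b22 c333 d4444", count=2)

print(result)   # aN bN c333 d4444
```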

In addition to character escapes and backreferences as described above, \g<name> will use the substring matched by the group named name, as defined by the (?P<name>…) syntax. \g<number> uses the corresponding group number; \g<2> is therefore equivalent to \2, but isn't ambiguous in a replacement such as \g<2>0. \20 would be interpreted as a reference to group 20, not a reference to group 2 followed by the literal character 0. The backreference \g<0> substitutes in the entire substring matched by the pattern.
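The three \g<…> forms described above can be sketched as follows. The group names key and val and the sample strings are assumptions made up for this illustration.

```python
import re

# Named groups: turn "key=value" into "value=key".
swapped = re.sub(r"(?P<key>\w+)=(?P<val>\w+)", r"\g<val>=\g<key>", "color=red")
print(swapped)   # red=color

# \g<2>0 means group 2 followed by a literal 0; \20 would mean group 20.
padded = re.sub(r"(\w)(\d)", r"\1\g<2>0", "x7")
print(padded)    # x70

# \g<0> stands for the whole match.
quoted = re.sub(r"\w+", r"[\g<0>]", "ab cd")
print(quoted)    # [ab] [cd]
```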

re.subn(…)

re.subn(…) → Same as re.sub(…), except it returns a tuple: (new string, number of substitutions made).
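A minimal sketch of unpacking the tuple returned by re.subn(). The names new_text and n are only for illustration.

```python
import re

# Collapse each run of spaces to a single space and count the replacements.
new_text, n = re.subn(r" +", " ", "what   do  you think")

print(new_text)  # what do you think
print(n)         # 3
```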

re.escape(…)

re.escape(string) → Return a string with a backslash character 「\」 inserted in front of every non-alphanumeric character. This is useful if you want to use a given string as a pattern for exact match.
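Here is a sketch of why escaping matters when the text to find contains regex operators. The sample string is made up; the exact set of characters that re.escape() backslashes differs between Python versions, so the escaped pattern itself is not printed here.

```python
import re

literal = "1+1=2 (really)"

# Used directly, "+" and the parentheses would act as regex operators.
pattern = re.escape(literal)

m = re.search(pattern, "proof: 1+1=2 (really).")
print(m.group())       # 1+1=2 (really)

# Unescaped, the same text does not even match itself.
unescaped = re.search(literal, literal)
print(unescaped)       # None
```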

Exception Error

re.error → Exception raised when a string passed to one of the functions here is not a valid regular expression (for example, it might contain unmatched parentheses) or when some other error occurs during compilation or matching. It is never an error if a string contains no match for a pattern.
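A minimal sketch of the difference between an invalid pattern and a pattern that simply finds nothing. The variable outcome is only for illustration.

```python
import re

try:
    re.search(r"(abc", "abc")    # unmatched parenthesis: invalid pattern
    outcome = "pattern accepted"
except re.error:
    outcome = "invalid pattern"

print(outcome)                   # invalid pattern

# No match is not an error: the call just returns None.
no_match = re.search(r"xyz", "abc")
print(no_match)                  # None
```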
