Python: Regex
What is Regex
A regular expression (aka regex) is a character sequence that represents a text pattern.
For example, you can use it to find all email addresses in a file by matching the email address pattern.
Regex is used by many functions to check if a string contains a certain pattern, to extract it, or to replace it with another string.
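As a rough illustration of the email idea above, here is a minimal sketch using re.findall. The sample text and the simplified pattern are assumptions made for this example; real email addresses allow many more characters than this pattern covers.

```python
import re

# hypothetical sample text standing in for the contents of a file
sample = "contact alice@example.com or bob@example.org, not alice at example"

# simplified email pattern: word chars, "@", word chars, ".", word chars
emails = re.findall(r"\w+@\w+\.\w+", sample)
print(emails)  # ['alice@example.com', 'bob@example.org']
```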
Check If String Match
To use regex in Python, first you need to import re.
To check if a pattern is in a string, use:
re.search(pattern, str, flags)
If pattern matches (part or whole of the string), a Match Object is returned. Otherwise, None is returned. (A Match Object evaluates to True.) 〔see Python: Regex Match Object〕
For regex flags, see: Python: Regex Flags.
# regex matching email address
import re
xtext = "this xyz@example.com that"
xx = re.search(r" (\w+@\w+\.com) ", xtext)
if xx:
    print("yes")
    print(xx.group(1))
else:
    print("no")
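The following sketch shows both outcomes of re.search, the None result and the Match Object, and how the optional flags argument can change which one you get. The sample text is made up for illustration; re.IGNORECASE is one of the standard flags in the re module.

```python
import re

text = "Contact: XYZ@Example.COM today"

# without a flag, the lowercase-only pattern fails, so search returns None
no_flag = re.search(r"[a-z]+@[a-z]+\.com", text)
print(no_flag)  # None

# re.IGNORECASE lets the same pattern match regardless of letter case
with_flag = re.search(r"[a-z]+@[a-z]+\.com", text, re.IGNORECASE)
if with_flag:
    print(with_flag.group(0))  # XYZ@Example.COM
```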
Find and Replace
re.sub(pattern, repl, string)
Substitute pattern in string by the replacement repl. If the pattern isn't found, string is returned unchanged. Returns a new string.
An optional 4th argument, count, is the maximum number of replacements to make. If omitted, all occurrences of the pattern are replaced.
# example of regex replace
import re
x = "123"
x2 = re.sub(r"2", r"8", x)
print(x2)
# 183
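To illustrate the optional replacement limit, here is a small sketch comparing the default behavior with count=1. The input string is an arbitrary example; passing count by keyword is a choice made here for readability.

```python
import re

s = "a-b-c-d"

# default: every match is replaced
all_replaced = re.sub(r"-", "+", s)

# count=1: only the first match is replaced
first_only = re.sub(r"-", "+", s, count=1)

print(all_replaced)  # a+b+c+d
print(first_only)    # a+b-c-d
```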
Here's a more complex example, replacing all “gif” image paths with “png” in an HTML file.
# regex example of replacing gif to png in html img tag
import re
myText = r"""<p><img src="rabbits.gif" width="30" height="20">
and <img class="xyz" src="../cats.gif">,
but <img src ="tigers.gif">,
<img src= "bird.gif">!</p>"""
newText = re.sub(r'src\s*=\s*"([^"]+)\.gif"', r'src="\1.png"', myText)
print(newText)
# <p><img src="rabbits.png" width="30" height="20">
# and <img class="xyz" src="../cats.png">,
# but <img src="tigers.png">,
# <img src="bird.png">!</p>
Tip
Note: A successful match does not necessarily mean the match contains any part of the given string. For example, these patterns match any string: '' and 'y*', because both can match zero characters.
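A quick way to see this is to inspect what was actually matched. In this sketch (the string "abc" is an arbitrary example), both patterns succeed but the matched text is empty, while a pattern that requires at least one character fails.

```python
import re

m1 = re.search(r"", "abc")
m2 = re.search(r"y*", "abc")

# both searches succeed, yet each matched the empty string
print(m1 is not None, repr(m1.group(0)))  # True ''
print(m2 is not None, repr(m2.group(0)))  # True ''

# requiring at least one "y" with + makes the search fail here
m3 = re.search(r"y+", "abc")
print(m3)  # None
```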
Note: The pattern string should be written as a raw string, like this r"…".
Otherwise, backslashes in it must be escaped. For example, to search for a sequence of tabs in str, use re.search(r"\t+", str) or re.search("\\t+", str).
〔see Python: Quote String〕
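The sketch below illustrates the raw-string note. The first part checks that the raw and escaped spellings give the same pattern. The second part is an added illustration, not from the text above: it uses \b, which means backspace in a normal Python string but word boundary in a regex, to show a case where forgetting the raw prefix silently changes the pattern.

```python
import re

text = "a\t\t\tb"  # three literal tab characters between a and b

raw = re.search(r"\t+", text)      # raw string: the regex engine sees \t
escaped = re.search("\\t+", text)  # escaped backslash: the same pattern
print(r"\t+" == "\\t+")            # True
print(len(raw.group(0)), len(escaped.group(0)))  # 3 3

# "\b" in a normal string is a backspace character, so this finds nothing
not_raw = re.search("\bcat\b", "a cat here")
print(not_raw)  # None

# r"\b" reaches the regex engine as a word boundary
is_raw = re.search(r"\bcat\b", "a cat here")
print(is_raw.group(0))  # cat
```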