The Python re module comes with functions that can search for patterns right
away or make compiled pattern objects for running matches later. Pattern
objects (and module search calls) in turn generate match objects, which
contain information about successful matches and matched substrings. For
reference, the next few sections describe the module’s interfaces and
some of the operators you can use to code patterns.
The top level of the module provides functions for matching,
substitution, precompiling, and so on:
compile(pattern [, flags])
Compile a regular expression pattern string into a regular expression
pattern object, for later matching. See the reference manual or Python
Pocket Reference for the flags argument’s meaning.
match(pattern, string [, flags])
If zero or more characters at the start of string match the pattern
string, return a corresponding match object, or None if no match is
found. Roughly like a search for a pattern that begins with the ^
operator.
search(pattern, string [, flags])
Scan through string for a location matching pattern, and return a
corresponding match object, or None if no match is found.
findall(pattern, string [, flags])
Return a list of strings giving all nonoverlapping matches of pattern
in string. If the pattern contains a group, returns a list of that
group's matches instead, or a list of tuples if the pattern has more
than one group.
finditer(pattern, string [, flags])
Return an iterator over all nonoverlapping matches of pattern in
string; each item produced is a match object.
split(pattern, string [, maxsplit, flags])
Split string by occurrences of pattern. If capturing parentheses (())
are used in the pattern, the text of all groups in the pattern is also
returned in the resulting list.
sub(pattern, repl, string [, count, flags])
Return the string obtained by replacing the (first count) leftmost
nonoverlapping occurrences of pattern (a string or a pattern object) in
string by repl (which may be a string with backslash escapes that may
back-reference a matched group, or a function that is passed a single
match object and returns the replacement string).
subn(pattern, repl, string [, count, flags])
Same as sub, but returns a tuple: (new-string,
number-of-substitutions-made).
escape(string)
Return string with the characters that are special in patterns
backslashed (older Pythons backslash all nonalphanumeric characters),
such that the result can be compiled to match string as a literal.
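As a quick illustration of the module-level functions just listed, here is a minimal sketch. The sample string and all variable names are made up for this example; only the re calls themselves come from the module.

```python
import re

text = "spam=1, eggs=22, ham=333"   # hypothetical sample data

# search scans the whole string; match only tries at the very start.
found = re.search(r"eggs=(\d+)", text)
at_start = re.match(r"eggs", text)           # None: text starts with "spam"

# findall returns strings, or tuples when the pattern has several groups.
numbers = re.findall(r"\d+", text)
pairs = re.findall(r"(\w+)=(\d+)", text)

# finditer produces match objects rather than strings.
starts = [m.start() for m in re.finditer(r"\d+", text)]

# split drops the separators unless the pattern captures them.
plain = re.split(r", ", text)
kept = re.split(r"(, )", text, maxsplit=1)

# sub and subn replace matches; subn also reports how many were made.
masked = re.sub(r"\d+", "N", text, count=2)
masked_all, total = re.subn(r"\d+", "N", text)

# escape protects special characters so they match literally.
literal = re.compile(re.escape("a.b*c"))

print(found.group(1), at_start)     # 22 None
print(numbers, pairs, starts)
print(plain, kept)
print(masked, masked_all, total)
print(bool(literal.fullmatch("a.b*c")), bool(literal.fullmatch("aXbc")))
```

Passing maxsplit and count by keyword, as here, keeps the calls readable and avoids confusing them with the flags argument.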
At the next level, pattern objects provide similar attributes, but the
pattern string is implied. The re.compile function in the previous section
is useful to optimize patterns that may be matched more than once
(a precompiled pattern skips the compile or cache-lookup step that the
module functions perform on each call). Pattern objects returned by
re.compile have these sorts of attributes:
match(string [, pos] [, endpos])
search(string [, pos] [, endpos])
findall(string [, pos [, endpos]])
finditer(string [, pos [, endpos]])
split(string [, maxsplit])
sub(repl, string [, count])
subn(repl, string [, count])
These are the same as the re module functions, but the pattern is
implied, and pos and endpos give start/end string indexes for the
match.
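The following sketch shows a precompiled pattern object in use, including the pos and endpos arguments. The pattern, the sample line, and the variable names are assumptions chosen only for illustration.

```python
import re

word = re.compile(r"[a-z]+")        # compile once, reuse below
line = "alpha beta gamma"           # hypothetical sample text

first = word.match(line)            # tries at index 0
from_six = word.match(line, 6)      # tries at index 6 instead
no_match = word.match(line, 5)      # index 5 is a space, so None
window = word.findall(line, 6, 13)  # scans only indexes 6 through 12

replaced = word.sub("X", line, count=2)

print(first.group(), from_six.group(), no_match)   # alpha beta None
print(window)                                      # ['beta', 'ga']
print(replaced)                                    # X X gamma
```

Note that endpos truncates the text being scanned, which is why the last word in the window comes back as only its first two letters.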
Finally, when a match or search function or method is successful, you
get back a match object (None comes back on failed matches). Match
objects export a set of attributes of their own, including:
group(g)
group(g1, g2, ...)
Return the substring that matched a parenthesized group
(or groups) in the pattern. Accepts group numbers or names. Group
numbers start at 1; group 0 is the entire string matched by the
pattern. Returns a tuple when passed multiple group numbers, and
group number defaults to 0 if omitted.
groups()
Returns a tuple of all groups’ substrings of the match
(for group numbers 1 and higher).
groupdict()
Returns a dictionary containing all named groups of the
match (see the (?P<name>R) syntax ahead).
start([group])
end([group])
Indices of the start and end of the substring matched by group
(or the entire matched string, if no group is passed).
span([group])
Returns the two-item tuple: (start(group), end(group)).
expand(template)
Performs backslash group substitutions on the template string; see the
Python library manual.
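To make the match object interface concrete, here is a small sketch that exercises each of the methods above. The email-like pattern and the sample text are hypothetical and deliberately simplistic; they are not a recommended way to validate addresses.

```python
import re

# Two named groups; the names "user" and "host" are illustrative only.
m = re.search(r"(?P<user>\w+)@(?P<host>[\w.]+)",
              "mail to: guido@python.org now")

whole = m.group()                 # same as m.group(0)
user = m.group(1)                 # fetch by number
host = m.group("host")            # fetch by name
both = m.group(1, 2)              # several groups give a tuple
everything = m.groups()           # all groups from 1 up
named = m.groupdict()             # named groups as a dictionary
where = m.span()                  # (start, end) of the whole match
host_span = (m.start("host"), m.end("host"))
rewritten = m.expand(r"\g<host>!\1")   # template with group references

print(whole, user, host)          # guido@python.org guido python.org
print(both, everything, named)
print(where, host_span)           # (9, 25) (15, 25)
print(rewritten)                  # python.org!guido
```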
Regular expression strings are built up by concatenating
single-character regular expression forms, shown in Table 19-1. The
longest-matching string is usually matched by each form, except for
the nongreedy operators. In the table, R means any regular expression
form, C is a character, and N denotes a digit.
Table 19-1. re pattern syntax
Operator | Interpretation |
---|---|
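. | Matches any character (including newline if the DOTALL flag is specified) |
^ | Matches start of the string (of every line in MULTILINE mode) |
$ | Matches end of the string (of every line in MULTILINE mode) |
C | Any nonspecial (or backslash-escaped) character matches itself |
R* | Zero or more of preceding regular expression R (as many as possible) |
R+ | One or more of preceding regular expression R (as many as possible) |
R? | Zero or one occurrence of preceding regular expression R (optional) |
R{m} | Matches exactly m copies of preceding R: a{5} matches 'aaaaa' |
R{m,n} | Matches from m to n repetitions of preceding regular expression R |
R*?, R+?, R??, R{m,n}? | Same as *, +, ?, and {m,n}, but matches as few characters/times as possible (the nongreedy operators) |
[...] | Defines character set: e.g., [a-zA-Z] matches all ASCII letters (with - for ranges) |
[^...] | Defines complemented character set: matches if the character is not in the set |
\ | Escapes special characters (e.g., * ? + ( )) and introduces the special sequences in Table 19-2 |
\\ | Matches a literal \ (write as '\\\\' in a normal string, or as r'\\') |
\N | Matches the contents of the group of the same number N: '(.+) \1' matches '42 42' |
R\|R | Alternative: matches left or right R |
RR | Concatenation: match both Rs |
(R) | Matches any regular expression inside (), and delimits a group (retains matched substring) |
(?:R) | Same as (R), but simply delimits part R and does not denote a saved group |
(?=R) | Look-ahead assertion: matches if R matches next, but doesn't consume any of the string |
(?!R) | Matches if R doesn't match next; negative of (?=R) |
(?P<name>R) | Matches any regular expression inside (), and delimits a named group |
(?P=name) | Matches whatever text was matched by the earlier group named name |
(?#...) | A comment; ignored |
(?letter) | Set mode flag; letter is one of a, i, L, m, s, u, x (see the library manual) |
(?<=R) | Look-behind assertion: matches if the current position in the string is preceded by a match of R that ends at the current position |
(?<!R) | Matches if the current position in the string is not preceded by a match for R; negative of (?<=R) |
(?(id/name)yespattern\|nopattern) | Will try to match with yespattern if the group with the given id or name exists, else with the optional nopattern |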
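A few of the operators in Table 19-1 are easier to grasp when seen side by side. The sketch below contrasts greedy and nongreedy repetition and tries back-references, noncapturing groups, and the look-around assertions; all sample strings and variable names are invented for this example.

```python
import re

html = "<b>bold</b> and <i>italic</i>"    # hypothetical sample text

# Greedy .* runs to the last '>'; nongreedy .*? stops at the first one.
greedy = re.findall(r"<.*>", html)
nongreedy = re.findall(r"<.*?>", html)

# \1 must repeat whatever text group 1 matched.
repeated = re.search(r"(\w+) \1", "it was was late").group()

# (?:...) groups for repetition without saving a group.
noncapturing = re.findall(r"(?:ab)+", "ababab ab")

# Look-ahead and look-behind test the context without consuming it.
lookahead = re.findall(r"\w+(?=!)", "wow! ok yes!")
lookbehind = re.findall(r"(?<=\$)\d+", "cost $40, not 40")

# Alternation and counted repetition.
either = re.findall(r"spam|eggs", "eggs and spam")
counted = re.fullmatch(r"a{5}", "aaaaa") is not None

print(greedy)         # ['<b>bold</b> and <i>italic</i>']
print(nongreedy)      # ['<b>', '</b>', '<i>', '</i>']
print(repeated, noncapturing)
print(lookahead, lookbehind, either, counted)
```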
Within patterns, ranges and selections can be combined. For instance,
[a-zA-Z0-9_]+ matches the longest possible string of one or more
letters, digits, or underscores. Special characters can be escaped as
usual in Python strings: [\t ]* matches zero or more tabs and spaces
(i.e., it skips such whitespace).
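The two character sets just described can be tried directly. This is a minimal sketch with made-up input strings; the names identifier and skip_blanks are illustrative only.

```python
import re

identifier = re.compile(r"[a-zA-Z0-9_]+")
# In a normal string, Python turns "\t" into a real tab before re sees it.
skip_blanks = re.compile("[\t ]*")

name = identifier.match("max_len2 = 10").group()

rest = "\t  \tvalue"
end = skip_blanks.match(rest).end()    # index just past the leading blanks
remainder = rest[end:]

# Zero repetitions is a valid match too, so this yields an empty string.
empty = skip_blanks.match("value").group()

print(name)                 # max_len2
print(end, remainder)       # 4 value
print(repr(empty))          # ''
```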
The parenthesized grouping construct, (R), lets you extract matched
substrings after a successful match. The portion of the string matched
by the expression in parentheses is retained in a numbered register.
It’s available through the group method of a match object after a
successful match.
In addition to the entries in this table, special sequences in
Table 19-2 can be used in patterns, too. Because of Python string
rules, you sometimes must double up on backslashes (\\) or use Python
raw strings (r'...') to retain backslashes in the pattern verbatim.
Python keeps a backslash in a normal string as is when the character
following it is not recognized as an escape code, though recent Python
versions issue a warning in that case, so raw strings are the safer
choice. Some of the entries in Table 19-2 are affected by Unicode when
matching str instead of bytes, and an ASCII flag may be set to emulate
the behavior for bytes; see Python’s manuals for more details.
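The backslash and Unicode points above can be checked with a short sketch. The sample strings are assumptions for illustration, and the non-ASCII digits are written with \u escapes only to keep the source plain ASCII.

```python
import re

# Doubled backslashes and a raw string spell the same pattern.
doubled = "\\d+\\.\\d+"
raw = r"\d+\.\d+"
same = doubled == raw

# In a normal string "\b" is a backspace character, not a word boundary.
normal_b = re.findall("\bcat\b", "cat concat cat")
raw_b = re.findall(r"\bcat\b", "cat concat cat")

# For str patterns, \d also matches non-ASCII decimal digits
# (here the Arabic-Indic digits four and two) unless re.ASCII is set.
mixed = "42 \u0664\u0662"
unicode_digits = re.findall(r"\d+", mixed)
ascii_digits = re.findall(r"\d+", mixed, re.ASCII)
bytes_digits = re.findall(rb"\d+", mixed.encode("utf-8"))

print(same)                  # True
print(normal_b, raw_b)       # [] ['cat', 'cat']
print(len(unicode_digits), ascii_digits, bytes_digits)   # 2 ['42'] [b'42']
```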
Table 19-2. re special sequences
Sequence | Interpretation |
---|---|
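\number | Matches text of group number (numbered from 1) |
\A | Matches only at the start of the string |
\b | Empty string at word boundaries |
\B | Empty string not at word boundaries |
\d | Any decimal digit character (like [0-9] for ASCII) |
\D | Any nondecimal digit character (like [^0-9] for ASCII) |
\s | Any whitespace character (like [ \t\n\r\f\v] for ASCII) |
\S | Any nonwhitespace character (like [^ \t\n\r\f\v] for ASCII) |
\w | Any alphanumeric character or underscore (like [a-zA-Z0-9_] for ASCII) |
\W | Any nonalphanumeric character (like [^a-zA-Z0-9_] for ASCII) |
\Z | Matches only at the end of the string |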
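Here is a brief sketch of the special sequences in Table 19-2 at work. The sample strings and variable names are hypothetical.

```python
import re

text = "Order 66: ship 3 crates_A to Bay-7"   # made-up sample text

digits = re.findall(r"\d+", text)
words = re.findall(r"\w+", text)       # \w includes the underscore

# \b anchors at word edges; \B requires being inside a word.
whole_word = re.search(r"\bship\b", text) is not None
inside_word = re.findall(r"\Bat\B", "at attic rate")

# \A and \Z anchor at the very start and end of the string.
starts_with = re.search(r"\AOrder", text) is not None
ends_with = re.search(r"\d\Z", text) is not None

# \s and \D are handy for normalizing and splitting.
collapsed = re.sub(r"\s+", " ", "a \t b\n\nc")
fields = re.split(r"\D+", "10-20_30")

print(digits)                 # ['66', '3', '7']
print(words)
print(whole_word, inside_word, starts_with, ends_with)
print(collapsed, fields)      # a b c ['10', '20', '30']
```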
Most of the standard escapes supported by Python string literals
are also accepted by the regular expression parser: \a, \b, \f, \n,
\r, \t, \v, \x, and \\. The Python library manual gives these escapes’
interpretation and additional details on pattern syntax in general.
But to further demonstrate how the re pattern syntax is typically used
in scripts, let’s go back to writing some code.