Programming Python (187 page)

Using the re Module

The Python re module comes with functions that can search for patterns right
away or make compiled pattern objects for running matches later. Pattern
objects (and module search calls) in turn generate match objects, which
contain information about successful matches and matched substrings. For
reference, the next few sections describe the module’s interfaces and
some of the operators you can use to code patterns.

Module functions

The top level of the module provides functions for matching,
substitution, precompiling, and so on:

compile(pattern [, flags]): Compile a regular expression pattern string
into a regular expression pattern object, for later matching. See the
reference manual or Python Pocket Reference for the flags argument’s
meaning.
match(pattern, string [, flags]): If zero or more characters at the
start of string match the pattern string, return a corresponding match
object, or None if no match is found. Roughly like a search for a
pattern that begins with the ^ operator.
search(pattern, string [, flags]): Scan through string for a location
matching pattern, and return a corresponding match object, or None if no
match is found.
findall(pattern, string [, flags]): Return a list of strings giving all
nonoverlapping matches of pattern in string. If the pattern contains a
group, returns a list of that group’s matches instead, and a list of
tuples if the pattern has more than one group.
finditer(pattern, string [, flags]): Return an iterator over all
nonoverlapping matches of pattern in string; the iterator yields match
objects.
split(pattern, string [, maxsplit, flags]): Split string by occurrences
of pattern. If capturing parentheses (()) are used in the pattern, the
text of all groups in the pattern is also returned in the resulting
list.
sub(pattern, repl, string [, count, flags]): Return the string obtained
by replacing the (first count) leftmost nonoverlapping occurrences of
pattern (a string or a pattern object) in string by repl (which may be a
string with backslash escapes that may back-reference a matched group,
or a function that is passed a single match object and returns the
replacement string).
subn(pattern, repl, string [, count, flags]): Same as sub, but returns a
tuple: (new-string, number-of-substitutions-made).
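escape(string): Return string with nonalphanumeric characters
backslashed (recent Python versions escape only the characters that are
special in patterns), such that the result can be compiled as a pattern
that matches the original text literally.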
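
As a rough illustration of the functions just listed, the following sketch runs each one against a small made-up string. The sample text and variable names are assumptions chosen for the demo, not taken from the book's examples.

```python
import re

text = "spam=1, eggs=22, ham=333"   # hypothetical sample data

# match anchors at the start of the string; search scans for the first hit
at_start = re.match(r"[a-z]+", text)     # matches 'spam' at index 0
no_match = re.match(r"\d+", text)        # None: text does not start with a digit
first_num = re.search(r"\d+", text)      # finds '1', the first run of digits

# findall: plain strings with no groups, tuples with more than one group
numbers = re.findall(r"\d+", text)               # ['1', '22', '333']
pairs = re.findall(r"([a-z]+)=(\d+)", text)      # [('spam', '1'), ...]

# finditer yields match objects rather than strings
spans = [m.span() for m in re.finditer(r"\d+", text)]

# split drops the separators unless the pattern captures them
fields = re.split(r",\s*", text)
fields_with_seps = re.split(r"(,\s*)", text)

# sub and subn replace matches; repl may be a string or a function
masked = re.sub(r"\d+", "N", text)
doubled, count = re.subn(r"\d+", lambda m: str(int(m.group()) * 2), text)

# escape protects characters that would otherwise act as pattern operators
literal = re.escape("1+1=2?")
found_literal = re.search(literal, "is 1+1=2? yes") is not None
```

Note that the exact string returned by escape differs between Python versions, so the sketch checks only that the escaped pattern matches the original text.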

Compiled pattern objects

At the next level, pattern objects provide similar attributes, but the
pattern string is implied. The re.compile function in the previous
section is useful to optimize patterns that may be matched more than
once (a precompiled pattern avoids the per-call compile or cache lookup
that the module functions perform). Pattern objects returned by
re.compile have these sorts of attributes:

match(string [, pos [, endpos]])

search(string [, pos [, endpos]])

findall(string [, pos [, endpos]])

finditer(string [, pos [, endpos]])

split(string [, maxsplit])

sub(repl, string [, count])

subn(repl, string [, count])

These are the same as the re module functions, but the pattern is
implied, and pos and endpos give start/end string indexes for the
match.
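
To make the pos and endpos arguments concrete, here is a minimal sketch that compiles one pattern and reuses it with different index windows. The pattern, the sample line, and the variable names are illustrative assumptions.

```python
import re

word = re.compile(r"[a-z]+")     # compile once, reuse for every call below
line = "alpha beta gamma"        # hypothetical sample data

first = word.search(line).group()          # 'alpha'
from_six = word.search(line, 6).group()    # scanning starts at index 6: 'beta'
window = word.findall(line, 6, 13)         # only indexes 6..12: ['beta', 'ga']
anchored = word.match(line, 5)             # None: index 5 holds a space
replaced = word.sub("X", line, count=2)    # 'X X gamma'
```

With endpos set to 13, the final word is cut off after 'ga', which shows that the window limits what the pattern can see rather than filtering complete matches afterward.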

Match objects

Finally, when a match or search function or method is successful, you
get back a match object (None comes back on failed matches). Match
objects export a set of attributes of their own, including:

group(g), group(g1, g2, ...): Return the substring that matched a parenthesized group
(or groups) in the pattern. Accepts group numbers or names. Group
numbers start at 1; group 0 is the entire string matched by the
pattern. Returns a tuple when passed multiple group numbers, and
group number defaults to 0 if omitted.
groups(): Returns a tuple of all groups’ substrings of the match
(for group numbers 1 and higher).
groupdict(): Returns a dictionary containing all named groups of the
match (see (?P<name>R) syntax ahead).
start([group]), end([group]): Indices of the start and end of the substring matched by
group (or the entire matched string, if no group is passed).
span([group]): Returns the two-item tuple:
(start(group), end(group)).
expand(template): Performs backslash group substitutions on the template
string; see the Python library manual.
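
The match object attributes above can be sketched with a single search. The pattern, the group names user and host, and the sample string are all assumptions made up for this demo.

```python
import re

m = re.search(r"(?P<user>\w+)@(?P<host>[\w.]+)",
              "mail to: guido@python.org now")   # hypothetical sample data

whole = m.group()                # group 0: 'guido@python.org'
user = m.group(1)                # by number: 'guido'
host = m.group("host")           # by name: 'python.org'
both = m.group(1, 2)             # tuple: ('guido', 'python.org')
all_groups = m.groups()          # groups 1 and higher, as a tuple
named = m.groupdict()            # {'user': 'guido', 'host': 'python.org'}
where = m.span()                 # (start, end) of the whole match: (9, 25)
host_at = (m.start("host"), m.end("host"))   # (15, 25)
swapped = m.expand(r"\g<host>!\1")           # 'python.org!guido'
```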

Regular expression patterns

Regular expression strings are built up by concatenating
single-character regular expression forms, shown in Table 19-1. The
longest-matching string is usually matched by each form, except for
the nongreedy operators. In the table, R means any regular expression
form, C is a character, and N denotes a digit.

Table 19-1. re pattern syntax

Operator	Interpretation
`.`	Matches any character (including newline if `DOTALL` flag is specified or `(?s)` at pattern front)
`^`	Matches start of the string (of every line in `MULTILINE` mode)
`$`	Matches end of the string (of every line in `MULTILINE` mode)
`C`	Any nonspecial (or backslash-escaped) character matches itself
`R*`	Zero or more of preceding regular expression `R` (as many as possible)
`R+`	One or more of preceding regular expression `R` (as many as possible)
`R?`	Zero or one occurrence of preceding regular expression `R` (optional)
`R{m}`	Matches exactly `m` copies of preceding `R`: `a{5}` matches `'aaaaa'`
`R{m,n}`	Matches from `m` to `n` repetitions of preceding regular expression `R`
`R*?, R+?, R??, R{m,n}?`	Same as `*`, `+`, `?`, and `{m,n}`, but these nongreedy match operators match and consume as few characters as possible
`[...]`	Defines character set: e.g., `[a-zA-Z]` to match all letters (alternatives, with `-` for ranges)
`[^...]`	Defines complemented character set: matches if `char` is not in set
`\`	Escapes special `char`s (e.g., `*?+|()`) and introduces special sequences in Table 19-2
`\\`	Matches a literal `\` (write as `\\\\` in pattern, or use `r'\\'`)
`\N`	Matches the contents of the group of the same number `N`: `'(.+) \1'` matches “42 42”
`R|R`	Alternative: matches left or right `R`
`RR`	Concatenation: match both `R`s
`(R)`	Matches any regular expression inside `()`, and delimits a group (retains matched substring)
`(?:R)`	Same as `(R)` but simply delimits part `R` and does not denote a saved group
`(?=R)`	Look-ahead assertion: matches if `R` matches next, but doesn’t consume any of the string (e.g., `X(?=Y)` matches `X` only if followed by `Y`)
`(?!R)`	Matches if `R` doesn’t match next; negative of `(?=R)`
`(?P<name>R)`	Matches any regular expression inside `()`, and delimits a named group
`(?P=name)`	Matches whatever text was matched by the earlier group named `name`
`(?#...)`	A comment; ignored
`(?letter)`	Set mode flag; `letter` is one of `aiLmsux` (see the library manual)
`(?<=R)`	Look-behind assertion: matches if the current position in the string is preceded by a match of `R` that ends at the current position
`(?<!R)`	Matches if the current position in the string is not preceded by a match for `R`; negative of `(?<=R)`
`(?(id/name)yespattern|nopattern)`	Will try to match with `yespattern` if the group with given `id` or `name` exists, else with optional `nopattern`
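
A few of the operators in Table 19-1 are easier to tell apart when run side by side. The sketch below contrasts greedy and nongreedy repetition, look-ahead and look-behind assertions, back-references, and grouping forms. All sample strings and variable names are assumptions invented for the demo.

```python
import re

html = "<b>bold</b> and <i>italic</i>"     # hypothetical sample data

# Greedy .* runs to the last '>'; nongreedy .*? stops at the first one
greedy = re.findall(r"<.*>", html)     # one match spanning the whole string
lazy = re.findall(r"<.*?>", html)      # ['<b>', '</b>', '<i>', '</i>']

# Look-ahead and look-behind test the context without consuming it
prices = "cost: $42, tax: $7, items: 3"
after_dollar = re.findall(r"(?<=\$)\d+", prices)         # ['42', '7']
not_after_dollar = re.findall(r"(?<![$\d])\d+", prices)  # ['3']
before_comma = re.findall(r"\d+(?=,)", prices)           # ['42', '7']

# Back-references by number and by name match the same text again
repeated = re.search(r"(\w+) \1", "it was was fine").group()
named_repeat = re.search(r"(?P<w>\w+) (?P=w)", "it was was fine").group()

# (?:R) groups without saving, so findall reports whole matches
whole_runs = re.findall(r"(?:ab)+", "ababab ab")   # ['ababab', 'ab']
last_only = re.findall(r"(ab)+", "ababab ab")      # ['ab', 'ab']

# Conditional: require '>' only if the optional '<' in group 1 matched
bracketed = re.compile(r"(<)?\w+(?(1)>)")
ok_pair = bracketed.fullmatch("<x>") is not None    # True
ok_bare = bracketed.fullmatch("x") is not None      # True
unclosed = bracketed.fullmatch("<x") is not None    # False
```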

Within patterns, ranges and selections can be combined. For
instance, [a-zA-Z0-9_]+ matches the longest possible string of one or
more letters, digits, or underscores. Special characters can be escaped
as usual in Python strings: [\t ]* matches zero or more tabs and spaces
(i.e., it skips such whitespace).
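
The two character-set patterns mentioned here can be tried directly. This is a small sketch with made-up input strings; it also shows that the tab escape works whether Python or the re module interprets it.

```python
import re

# One or more letters, digits, or underscores: the longest run is taken
identifier = re.compile(r"[a-zA-Z0-9_]+")
name = identifier.search("  total_count2 = 10").group()   # 'total_count2'

# [\t ]* skips leading tabs and spaces; both spellings behave the same,
# since the tab is expanded either by Python or by the pattern parser
indented = " \t  value"                                   # hypothetical input
skipped_plain = re.match("[\t ]*", indented).end()        # 4
skipped_raw = re.match(r"[\t ]*", indented).end()         # 4

# A complemented set: runs of characters that are neither vowels nor whitespace
non_vowels = re.findall(r"[^aeiou\s]+", "regular expression")
```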

The parenthesized grouping construct, (R), lets you extract matched
substrings after a successful match. The portion of the string matched
by the expression in parentheses is retained in a numbered register.
It’s available through the group method of a match object after a
successful match.

In addition to the entries in this table, special sequences in
Table 19-2 can be used in patterns, too. Because of Python string
rules, you sometimes must double up on backslashes (\\) or use Python
raw strings (r'...') to retain backslashes in the pattern verbatim.
Python keeps a backslash in a normal string if the character following
it is not recognized as an escape code, but recent Python versions warn
about such unrecognized escapes, so raw strings are the safer habit.
Some of the entries in Table 19-2 are affected by Unicode when matching
str instead of bytes, and an ASCII flag may be set to emulate the
behavior for bytes; see Python’s manuals for more details.
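
The backslash rules are easiest to see by comparing a normal string with its raw equivalent. The following sketch uses invented sample strings; it also shows one case where forgetting the raw prefix silently changes the pattern's meaning.

```python
import re

# The same pattern written two ways: doubled backslashes vs. a raw string
doubled = "\\d+\\.\\d+"
raw = r"\d+\.\d+"
same_pattern = (doubled == raw)                        # True
price = re.search(raw, "total: 19.99 USD").group()     # '19.99'

# Pitfall: in a normal string "\b" is a backspace character, so the
# pattern looks for backspaces instead of word boundaries
plain_b = "\bcat\b"
raw_b = r"\bcat\b"
plain_hit = re.search(plain_b, "a cat sat")            # None
raw_hit = re.search(raw_b, "a cat sat").group()        # 'cat'

# Matching one literal backslash takes two backslashes in the pattern
path = "C:\\temp"                                      # the text C:\temp
backslash = re.search(r"\\", path).group()             # a single backslash
four_equals_raw_two = ("\\\\" == r"\\")                # True
```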

Table 19-2. re special sequences

Sequence	Interpretation
`\number`	Matches text of group `number` (numbered from 1)
`\A`	Matches only at the start of the string
`\b`	Empty string at word boundaries
`\B`	Empty string not at word boundaries
`\d`	Any decimal digit character (`[0-9]` for ASCII)
`\D`	Any character that is not a decimal digit (`[^0-9]` for ASCII)
`\s`	Any whitespace character (`[ \t\n\r\f\v]` for ASCII)
`\S`	Any nonwhitespace character (`[^ \t\n\r\f\v]` for ASCII)
`\w`	Any alphanumeric character (`[a-zA-Z0-9_]` for ASCII)
`\W`	Any nonalphanumeric character (`[^a-zA-Z0-9_]` for ASCII)
`\Z`	Matches only at the end of the string
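
Here is a brief sketch of several Table 19-2 sequences in use, including the effect of the ASCII flag on a str pattern. The sample strings are assumptions for illustration; the accented letter is written as an escape so the code does not depend on source file encoding.

```python
import re

sample = "Room 42b, caf\u00e9 7"     # hypothetical text containing 'é'

digits = re.findall(r"\d+", sample)                   # ['42', '7']
words = re.findall(r"\w+", sample)                    # 'é' counts as a word char for str
ascii_words = re.findall(r"\w+", sample, re.ASCII)    # 'é' is excluded: 'caf'

# \b needs a word/nonword edge: '42' is followed by 'b', so only '7' qualifies
whole_numbers = re.findall(r"\b\d+\b", sample)        # ['7']

# \A and \Z stay tied to the whole string even in MULTILINE mode
lines = "one\ntwo"
line_starts = re.findall(r"^\w+", lines, re.MULTILINE)      # ['one', 'two']
string_start = re.findall(r"\A\w+", lines, re.MULTILINE)    # ['one']
string_end = re.findall(r"\w+\Z", lines, re.MULTILINE)      # ['two']
```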

Most of the standard escapes supported by Python string literals
are also accepted by the regular expression parser: \a, \b, \f, \n, \r,
\t, \v, \x, and \\ (though \b means backspace only inside a character
set; elsewhere in a pattern it is the word-boundary assertion of
Table 19-2). The Python library manual gives these escapes’
interpretation and additional details on pattern syntax in general. But
to further demonstrate how the re pattern syntax is typically used in
scripts, let’s go back to writing some code.
