In one form or another, processing text-based information is one of
the more common tasks that applications need to perform. This can include
anything from scanning a text file by columns to analyzing statements in a
language defined by a formal grammar. Such processing is usually called
parsing: analyzing the structure of a
text string. In this chapter, we’ll explore ways to handle language and
text-based information and summarize some Python development concepts in
sidebars along the way. In the process, we’ll meet string methods, text
pattern matching, XML and HTML parsers, and other tools.
Some of this material is advanced, but the examples are small to
keep this chapter short. For instance, recursive descent parsing is
illustrated with a simple example to show how it can be implemented in
Python. We’ll also see that it’s often unnecessary to write custom parsers
for each language processing task in Python. They can usually be replaced
by exporting APIs for use in Python programs, and sometimes by a single
built-in function call. Finally, this chapter closes by presenting
PyCalc—a calculator GUI written in Python, and the last major Python
coding example in this text. As we’ll see, writing calculators isn’t much
more difficult than juggling stacks while scanning text.
In the grand
scheme of things, there are a variety of ways to handle text
processing and language analysis in Python:
Built-in string object expressions
Built-in string object method calls
Regular expression pattern matching
XML and HTML text parsing
Custom language parsers, both handcoded and generated
Running Python code with the eval and exec built-ins
Natural language processing
For simpler tasks, Python’s built-in
string object is often all we really need. Python strings
can be indexed, concatenated, sliced, and processed with both string
method calls and built-in functions. Our main emphasis in this chapter is
on higher-level tools and techniques for analyzing textual
information and language, but we’ll briefly explore each of these
techniques in turn. Let’s get started.
Some readers may have come to this chapter seeking coverage of
Unicode text, too, but this topic is not presented here. For a look at
Python’s Unicode support, see Chapter 2’s discussion of string tools,
Chapter 4’s discussion of text and binary file distinctions and
encodings, and Chapter 9’s coverage of text in tkinter
GUIs. Unicode also appears in various Internet and database topics
throughout this book (e.g., email encodings).
Because Unicode is a core language topic, all these chapters will
also refer you to the fuller coverage of Unicode in Learning Python,
Fourth Edition. Most of the topics in this
chapter, including string methods and pattern matching, apply to Unicode
automatically simply because the Python 3.X str string type is
Unicode, whether ASCII or wider.
The first stop on
our text and language tour is the most basic: Python’s
string objects come with an array of text processing tools, and serve as
your first line of defense in this domain. As you undoubtedly know by now,
concatenation, slicing, formatting, and other string expressions
are workhorses of most programs (I’m including the newer format method
in this category, as it’s really just an alternative to the %
expression):
>>> 'spam eggs ham'[5:10]                          # slicing: substring extraction
'eggs '
>>> 'spam ' + 'eggs ham'                           # concatenation (and *, len(), [ix])
'spam eggs ham'
>>> 'spam %s %s' % ('eggs', 'ham')                 # formatting expression: substitution
'spam eggs ham'
>>> 'spam {} {}'.format('eggs', 'ham')             # formatting method: % alternative
'spam eggs ham'
>>> 'spam = "%-5s", %+06d' % ('ham', 99)           # more complex formatting
'spam = "ham  ", +00099'
>>> 'spam = "{0:<5}", {1:+06}'.format('ham', 99)
'spam = "ham  ", +00099'
These operations are covered in core language resources such as
Learning Python. For the purposes of this chapter, though, we’re
interested in more powerful tools: Python’s string object methods
include a wide variety of text-processing
utilities that go above and beyond string expression operators. We saw
some of these in action early on in Chapter 2, and
have been using them ever since. For instance, given an instance str
of the built-in string object type,
operations like the following are provided as object method calls:
str.find(substr)
Performs substring searches
str.replace(old, new)
Performs substring substitutions
str.split(delimiter)
Chops up a string around a delimiter or whitespace
str.join(iterable)
Puts substrings together with delimiters between
str.strip()
Removes leading and trailing whitespace
str.rstrip()
Removes trailing whitespace only, if any
str.rjust(width)
Right-justifies a string in a fixed-width field
str.upper()
Converts to uppercase
str.isupper()
Tests whether the string is uppercase
str.isdigit()
Tests whether the string is all digit characters
str.endswith(substr-or-tuple)
Tests for a substring (or a tuple of alternatives) at the end
str.startswith(substr-or-tuple)
Tests for a substring (or a tuple of alternatives) at the front
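To make the list above a bit more concrete, here is a brief sketch that chains several of these methods together. The input string and the variable names are invented for illustration only; they are not part of this book's examples package.

```python
# Illustrative data only: this line of text is made up for the demo
line = '  spam,Eggs,HAM\n'

fields = line.strip().split(',')            # drop outer whitespace, split on commas
upper = [field.upper() for field in fields] # normalize case
joined = '|'.join(upper)                    # put back together with a new delimiter
pos = joined.find('EGGS')                   # offset of substring, or -1 if absent
swapped = joined.replace('HAM', 'TOAST')    # substring substitution
padded = '42'.rjust(6)                      # right-justify in a 6-character field
is_code = 'summer.py'.endswith(('.py', '.pyw'))   # tuple of alternatives

print(fields, joined, pos, swapped, repr(padded), is_code)
```

Each call returns a new object rather than changing the subject string in place, which is why the results are assigned to names here.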
This list is representative but partial, and some of these methods
take additional optional arguments. For the full list of string methods,
run a dir(str) call at the Python
interactive prompt and run help(str.method)
on any method for some quick documentation.
The Python library manual and reference books such as
Python Pocket Reference also include an exhaustive list.
Moreover, in Python today nearly all normal string methods apply to both bytes
and str strings (a few text-only methods, such as format, are available on str only).
The latter makes them applicable to
arbitrarily encoded Unicode text, simply because the str
type is Unicode text, even if it’s only
ASCII. These methods originally appeared as functions in the string
module, but are
only object methods today; the string
module is still present because it contains predefined constants (e.g., string.ascii_uppercase),
as well as the Template
substitution interface in 2.4 and
later, one of the techniques discussed in the next section.
By way of
review, let’s take a quick look at string methods in the
context of some of their most common use cases. As we saw when
generating HTML forwarding pages in Chapter 6, the string replace
method is often adequate by itself as a string templating
tool: we can compute values and insert them at fixed positions in a
string with simple replacement calls:
>>> template = '---$target1---$target2---'
>>> val1 = 'Spam'
>>> val2 = 'shrubbery'
>>> template = template.replace('$target1', val1)
>>> template = template.replace('$target2', val2)
>>> template
'---Spam---shrubbery---'
As we also saw when generating HTML reply pages in the CGI scripts
of Chapters 15 and 16, the string % formatting operator is also
a powerful templating tool, especially when combined with
dictionaries: simply fill out a dictionary with values and apply multiple
substitutions to the HTML string all at once:
>>> template = """
... ---
... ---%(key1)s---
... ---%(key2)s---
... """
>>>
>>> vals = {}
>>> vals['key1'] = 'Spam'
>>> vals['key2'] = 'shrubbery'
>>> print(template % vals)
---
---Spam---
---shrubbery---
Available since Python 2.4, the string module’s Template feature
is essentially a simplified and limited variation of the
dictionary-based format scheme just shown, but it allows some additional
call patterns which some may consider simpler:
>>> vals
{'key2': 'shrubbery', 'key1': 'Spam'}
>>> import string
>>> template = string.Template('---$key1---$key2---')
>>> template.substitute(vals)
'---Spam---shrubbery---'
>>> template.substitute(key1='Brian', key2='Loretta')
'---Brian---Loretta---'
See the library manual for more on this extension. Although the
string datatype does not itself support the pattern-directed text
processing that we’ll meet later in this chapter, its tools are powerful
enough for many tasks.
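As a small sketch of the additional Template call patterns, the following combines a dictionary with keyword arguments and shows the more forgiving safe_substitute method. The replacement values are arbitrary, and the names mixed, partial, and missing are chosen here just for the demo.

```python
import string

template = string.Template('---$key1---$key2---')

# A mapping and keywords can be combined; keywords win on duplicate names
mixed = template.substitute({'key1': 'Spam', 'key2': 'shrubbery'}, key2='Ni')

# safe_substitute leaves unknown placeholders in place instead of raising
partial = template.safe_substitute(key1='Spam')

# substitute itself raises KeyError when a placeholder has no value
try:
    template.substitute(key1='Spam')
except KeyError as exc:
    missing = exc.args[0]       # name of the placeholder that was not supplied

print(mixed)
print(partial)
print('missing:', missing)
```

The safe_substitute form can be handy when a template is filled in over several passes, since unfilled targets survive intact for a later step.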
In terms of this
chapter’s main focus, Python’s built-in tools for
splitting and joining strings around tokens turn out to be especially
useful when it comes to parsing text:
str.split(delimiter?, maxsplits?)
Splits a string into a list of substrings, using either whitespace
(tabs, spaces, newlines) or an explicitly passed string as a
delimiter. maxsplits limits the number of splits performed, if passed.
delimiter.join(iterable)
Concatenates a sequence or other iterable of substrings (e.g.,
list, tuple, generator), adding the subject separator string
between each.
These two are among the most powerful of string methods. As we saw
in Chapter 2, split chops a string into a list of substrings
and join puts them back together:
>>> 'A B C D'.split()
['A', 'B', 'C', 'D']
>>> 'A+B+C+D'.split('+')
['A', 'B', 'C', 'D']
>>> '--'.join(['a', 'b', 'c'])
'a--b--c'
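The optional split limit and the iterable argument of join are not exercised by the calls just shown, so here is a minimal sketch of both. The record text is hypothetical, and the limit is passed by position.

```python
# Hypothetical colon-delimited record: only the first two colons are field breaks
record = 'bob:developer:likes: colons'
name, job, note = record.split(':', 2)       # stop after 2 splits
print(name, job, repr(note))

# None as the delimiter means runs of whitespace, so a limit can still be passed
head, tail = 'spam eggs ham'.split(None, 1)
print(head, repr(tail))

# join accepts any iterable of strings, including a generator expression
line = '--'.join(str(n) for n in range(4))
print(line)
```

Limiting the number of splits is a simple way to protect trailing free-form text that may itself contain the delimiter.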
Despite their simplicity, they can handle surprisingly complex
text-parsing tasks. Moreover, string method calls are very fast because
they are implemented in C language code. For instance, to quickly
replace all tabs in a file with four periods, pipe the file into a
script that looks like this:
from sys import *
stdout.write(('.' * 4).join(stdin.read().split('\t')))
The split call here divides
input around tabs, and the join puts
it back together with periods where tabs had been. In this case, the
combination of the two calls is equivalent to using the simpler global
replacement string method call as follows:
stdout.write(stdin.read().replace('\t', '.' * 4))
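The equivalence of the two codings can be checked on an in-memory string instead of piped input. This is only a sketch: the sample text is made up, and the last line adds a related idiom (splitting on whitespace with no argument) for contrast.

```python
text = 'a\tb\t\tc\n'                       # made-up sample with adjacent tabs

via_split_join = ('.' * 4).join(text.split('\t'))
via_replace = text.replace('\t', '.' * 4)
print(repr(via_split_join), via_split_join == via_replace)

# By contrast, split() with no argument collapses runs of any whitespace,
# so joining its result normalizes spacing rather than replacing tab-for-tab
collapsed = ' '.join(text.split())
print(repr(collapsed))
```

An explicit delimiter yields an empty string between adjacent tabs, which is what keeps the split and join coding in step with replace.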
As we’ll see in the next section, splitting strings is sufficient
for many text-parsing goals.
Let’s look next at
some practical applications of string splits and joins. In
many domains, scanning files by columns is a fairly common task. For
instance, suppose you have a file containing columns of numbers output
by another system, and you need to sum each column’s numbers. In Python,
string splitting is the core operation behind solving this problem, as
demonstrated by Example 19-1. As
an added bonus, it’s easy to make the solution a reusable tool in Python
by packaging it as an importable function.
Example 19-1. PP4E\Lang\summer.py
#!/usr/local/bin/python

def summer(numCols, fileName):
    sums = [0] * numCols                             # make list of zeros
    for line in open(fileName):                      # scan file's lines
        cols = line.split()                          # split up columns
        for i in range(numCols):                     # around blanks/tabs
            sums[i] += eval(cols[i])                 # add numbers to sums
    return sums

if __name__ == '__main__':
    import sys
    print(summer(eval(sys.argv[1]), sys.argv[2]))    # '% summer.py cols file'
Notice that we use file iterators here to read line by line,
instead of calling the file readlines method
explicitly (recall from Chapter 4 that
iterators avoid loading the entire file into memory all at once). The
file itself is a temporary object, which will be automatically closed
when garbage collected.
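If relying on garbage collection to close the file is a concern, a with statement closes it explicitly. The following is a sketch under stated assumptions, not the book's example: the function name summer_closing is hypothetical, it converts with float rather than eval, and the demo file is a throwaway created only for this run.

```python
import os
import tempfile

def summer_closing(numCols, fileName):
    """Hypothetical variant of summer: same column sums, explicit file close."""
    sums = [0] * numCols
    with open(fileName) as file:              # closed on block exit, even on errors
        for line in file:                     # still a line-by-line file iterator
            cols = line.split()
            for i in range(numCols):
                sums[i] += float(cols[i])     # assumption: float, not eval
    return sums

# Throwaway demo file with made-up data
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('1 5 10\n2 10 20\n')

demo = summer_closing(3, tmp.name)
os.remove(tmp.name)
print(demo)
```

The iteration logic is unchanged; only the lifetime of the file object is made explicit.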
As usual for properly architected scripts, you
can both import this module and call
its function, and run it as a shell tool from the
command line. The summer.py script calls split
to make a list of strings representing
the line’s columns, and eval to
convert column strings to numbers. Here’s an input file that uses both
blanks and tabs to separate columns, and the result of turning our
script loose on it:
C:\...\PP4E\Lang>type table1.txt
1 5 10 2 1.0
2 10 20 4 2.0
3 15 30 8 3
4 20 40 16 4.0
C:\...\PP4E\Lang>python summer.py 5 table1.txt
[10, 50, 100, 30, 10.0]
Also notice that because the summer script uses eval
to convert file text to numbers, you
could really store arbitrary Python expressions in the file. Here, for
example, it’s run on a file of Python code snippets:
C:\...\PP4E\Lang>type table2.txt
2 1+1 1<<1 eval("2")
16 2*2*2*2 pow(2,4) 16.0
3 len('abc') [1,2,3][2] {'spam':3}['spam']
C:\...\PP4E\Lang>python summer.py 4 table2.txt
[21, 21, 21, 21.0]
We’ll revisit eval later
in this chapter, when we explore expression evaluators.
Sometimes this is more than we want: if we can’t be sure that the
strings that we run this way won’t contain malicious code, for
instance, it may be necessary to run them with limited machine access
or use more restrictive conversion tools. Consider the following
recoding of the summer function
(this is in file summer2.py in
the examples package if you care to experiment with it):
def summer(numCols, fileName):
    sums = [0] * numCols
    for line in open(fileName):                      # use file iterators
        cols = line.split(',')                       # assume comma-delimited
        nums = [int(x) for x in cols]                # use limited converter
        both = zip(sums, nums)                       # avoid nested for loop
        sums = [x + y for (x, y) in both]            # 3.X: zip is an iterable
    return sums
This version uses int for its
conversions from strings to support only numbers, and not arbitrary
and possibly unsafe expressions. Although the first four lines of this
coding are similar to the original, for variety this version also
assumes the data is separated by commas rather than whitespace, and
runs list comprehensions and zip to
avoid the nested for loop
statement. This version is also substantially trickier than the
original and so might be less desirable from a maintenance
perspective. If its code is confusing, try adding print
call statements after each step to
trace the results of each operation. Here is its handiwork:
C:\...\PP4E\Lang>type table3.txt
1,5,10,2,1
2,10,20,4,2
3,15,30,8,3
4,20,40,16,4
C:\...\PP4E\Lang>python summer2.py 5 table3.txt
[10, 50, 100, 30, 10]
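One middle ground between eval and int is the standard library’s ast.literal_eval, which accepts Python literals but refuses function calls and other arbitrary code. The sketch below is an assumption-laden variation rather than a file in the examples package: the names to_number and sum_columns are hypothetical, rows are passed in as a list of strings, and all rows are assumed to have the same number of columns.

```python
import ast

def to_number(text):
    """Hypothetical helper: accept int or float literals, reject everything else."""
    value = ast.literal_eval(text.strip())
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError('not a number: %r' % text)
    return value

def sum_columns(lines, delimiter=','):
    """Sum comma-delimited columns; assumes every row has the same width."""
    sums = []
    for line in lines:
        nums = [to_number(col) for col in line.split(delimiter)]
        if not sums:
            sums = [0] * len(nums)            # size the list from the first row
        sums = [x + y for (x, y) in zip(sums, nums)]
    return sums

rows = ['1,5,10,2,1.0', '2,10,20,4,2.0']      # made-up data with mixed int and float
print(sum_columns(rows))

try:
    to_number("__import__('os').getcwd()")    # code, not a literal
except ValueError:
    print('rejected non-literal text')
```

Unlike int, this converter handles floating-point columns too, while still declining to run code embedded in the data.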
The summer
logic so far works, but it can be even more general: by
making the column numbers a key of a dictionary rather than an offset
in a list, we can remove the need to pass in a number-of-columns value
altogether. Besides allowing us to associate meaningful labels with
data rather than numeric positions, dictionaries are often more
flexible than lists in general, especially when there isn’t a fixed
size to our problem. For instance, suppose you need to sum up columns
of data stored in a text file where the number of columns is not known
or fixed:
C:\...\PP4E\Lang>python
>>> print(open('table4.txt').read())
001.1 002.2 003.3
010.1 020.2 030.3 040.4
100.1 200.2 300.3
Here, we cannot preallocate a fixed-length list of sums because
the number of columns may vary. Splitting on whitespace extracts the
columns, and float converts to
numbers, but a fixed-size list won’t easily accommodate a set of sums
(at least, not without extra code to manage its size). Dictionaries
are more convenient here because we can use column positions as keys
instead of using absolute offsets. The following code demonstrates
interactively (it’s also in file summer3.py
in the examples package):
>>> sums = {}
>>> for line in open('table4.txt'):
...     cols = [float(col) for col in line.split()]
...     for pos, val in enumerate(cols):
...         sums[pos] = sums.get(pos, 0.0) + val
...
>>> for key in sorted(sums):
...     print(key, '=', sums[key])
...
0 = 111.3
1 = 222.6
2 = 333.9
3 = 40.4
>>> sums
{0: 111.3, 1: 222.6, 2: 333.90000000000003, 3: 40.4}
Interestingly, most of this code uses tools added to Python over
the years: file and dictionary iterators, comprehensions, dict.get,
and the enumerate and sorted
built-ins were not yet formed when
Python was new. For a related example, see the tkinter grid
examples in Chapter 9 for another
case of eval table magic at work.
That chapter’s table sums logic is a variation on this theme, which
obtains the number of columns from the first line of a data file and
tailors its summations for display in a
GUI.
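As one more variation on the dictionary-based sums, the standard library’s collections.defaultdict can stand in for the dict.get call. This is a swapped-in technique shown only for comparison, not the coding in summer3.py: the function name sum_ragged is hypothetical, and the rows are made-up whole numbers passed as a list of strings so that the totals are exact.

```python
from collections import defaultdict

def sum_ragged(lines):
    """Hypothetical helper: sum columns by position when row widths vary."""
    sums = defaultdict(float)                 # missing keys start out at 0.0
    for line in lines:
        for pos, col in enumerate(line.split()):
            sums[pos] += float(col)
    return dict(sums)                         # hand back a plain dictionary

rows = ['1 2 3', '10 20 30 40', '100 200 300']    # note the longer middle row
totals = sum_ragged(rows)
for key in sorted(totals):
    print(key, '=', totals[key])
```

Whether this reads better than dict.get is a matter of taste; both avoid having to know the number of columns ahead of time.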