This part of the book presents a collection of additional Python
application topics. Most of the tools presented along the way can be used
in a wide variety of application domains. You’ll find the following
chapters here:
This chapter covers commonly used and advanced Python
techniques for storing information between program executions—DBM
files, object pickling, object shelves, and Python’s SQL database
API—and briefly introduces full-blown OODBs such as ZODB, as well as
ORMs such as SQLObject and SQLAlchemy. The Python standard library’s
SQLite support is used for the SQL examples, but the API is portable
to enterprise-level systems such as MySQL.
This chapter explores techniques for implementing more
advanced data structures in Python—stacks, sets, binary search
trees, graphs, and the like. In Python, these take the form of
object implementations.
This chapter addresses Python tools and techniques for parsing
text-based information—string splits and joins, regular expression
matching, XML parsing, recursive descent parsing, and more advanced
language-based topics.
This chapter introduces integration techniques—both extending
Python with compiled libraries and embedding Python code in other
applications. While the main focus here is on linking Python with
compiled C code, we’ll also investigate integration with Java, .NET,
and more. This chapter assumes that you know how to read C programs,
and it is intended mostly for developers responsible for
implementing application integration layers.
This is the last technical part of the book, and it makes heavy use
of tools presented earlier in the text to help underscore the notion of
code reuse. For instance, a calculator GUI (PyCalc) serves to demonstrate
language processing and code reuse concepts.
So far in this book, we’ve used Python in the system programming,
GUI development, and Internet scripting domains—three of Python’s most
common applications, and representative of its use as an application
programming language at large. In the next four chapters, we’re going to
take a quick look at other major Python programming topics: persistent
data, data structure techniques, text and language processing, and
Python/C integration.
These four topics are not really application areas themselves, but
they are techniques that span domains. The database topics in this
chapter, for instance, can be applied on the Web, in desktop GUI
applications, and so on. Text processing is a similarly general tool.
Moreover, none of these final four topics is covered exhaustively (each
could easily fill a book alone), but we’ll sample Python in action in
these domains and highlight their core concepts and tools. If any of these
chapters spark your interest, additional resources are readily available
in the Python world.
In this chapter, our focus is on persistent data: the kind that outlives
the program that creates it. That’s not true by default for objects a
script constructs, of course; things like lists, dictionaries, and even
class instance objects live in your computer’s memory and are lost as soon
as the script ends. To make data live longer, we need to do something
special. In Python programming, there are today at least seven traditional
ways to save information in between program executions:
Flat files: text and bytes stored directly on your computer
DBM keyed files: keyed access to strings stored in dictionary-like files
Pickled objects: serialized Python objects saved to files and streams
Shelve files: pickled Python objects saved in DBM keyed files
Object-oriented databases (OODBs): persistent Python objects stored in persistent dictionaries (ZODB, Durus)
SQL relational databases: table-based storage that supports SQL queries (SQLite, MySQL, PostgreSQL, etc.)
Object relational mappers (ORMs): mediators that map Python classes to relational tables (SQLObject, SQLAlchemy)
In some sense, Python’s interfaces to network-based object
transmission protocols such as
SOAP, XML-RPC, and CORBA also offer persistence options, but
they are beyond the scope of this chapter. Here, our interest is in
techniques that allow a program to store its data directly and, usually,
on the local machine. Although some database servers may operate on a
physically remote machine on a network, this is largely transparent to
most of the techniques we’ll study here.
We studied Python’s simple (or “flat”) file interfaces in earnest in
Chapter 4, and we have been using them ever since. Python provides
standard access to both the stdio filesystem (through the built-in open
function) and lower-level descriptor-based files (through the standard
library’s os module). For simple data storage tasks, these are all that
many scripts need. To save data for use in a future program run, simply
write it out to a newly opened file on your computer in text or binary
mode, and read it back from that file later. As we’ve seen, for more
advanced tasks, Python also supports other file-like interfaces such as
pipes, fifos, and sockets.
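As a quick refresher on the save-and-reload pattern just described, here is a minimal sketch. The file names and the throwaway directory are made up for illustration; only the built-in open call and its text and binary modes come from the discussion above.

```python
import os
import tempfile

# Hypothetical file names, created in a throwaway directory
workdir = tempfile.mkdtemp()
text_path = os.path.join(workdir, 'notes.txt')
data_path = os.path.join(workdir, 'notes.bin')

# Text mode: str objects are encoded on output and decoded on input
with open(text_path, 'w', encoding='utf-8') as f:
    f.write('Batman: Pow!\n')
with open(text_path, 'r', encoding='utf-8') as f:
    text = f.read()

# Binary mode: bytes objects are transferred unchanged
with open(data_path, 'wb') as f:
    f.write(b'\x00\x01Pow!')
with open(data_path, 'rb') as f:
    data = f.read()

print(repr(text), data)
```

Anything more structured than this (records, keys, nested objects) is up to the script to encode and decode by hand, which is the gap the remaining tools in this chapter fill.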
Since we’ve already explored flat files, I won’t say more about them
here. The rest of this chapter introduces the remaining topics on the
preceding list. At the end, we’ll also meet a GUI program for browsing the
contents of things such as shelves and DBM files. Before that, though, we
need to learn what manner of beast these are.
Fourth edition coverage note: The prior edition of this book used the
mysql-python interface to the MySQL relational
database system, as well as the ZODB object database system. As I update
this chapter in June 2010, neither of these is yet available for Python
3.X, the version of Python used in this edition. Because of that, most
ZODB information has been trimmed, and the SQL database examples here
were changed to use the SQLite in-process database system that ships
with Python 3.X as part of its standard library. The prior edition’s
ZODB and MySQL examples and overviews are still available in the
examples package, as described later. Because Python’s SQL database API
is portable, though, the SQLite code here should work largely unchanged
on most other systems.
Flat files are handy for simple persistence tasks, but they are generally
geared toward a sequential processing mode. Although it is possible to
jump around to arbitrary locations with seek calls, flat files don’t
provide much structure to data beyond the notion of bytes and text lines.
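To see what jumping around with seek looks like in practice, consider the following sketch. It assumes a record layout invented for this example (fixed-width, space-padded, 8-byte records) and a made-up file name; the point is that the script itself must impose and remember that structure.

```python
import os
import tempfile

RECORD_SIZE = 8   # assumed fixed record width in bytes (not from the text)

path = os.path.join(tempfile.mkdtemp(), 'records.bin')   # hypothetical file name

with open(path, 'wb') as f:
    for word in (b'Pow!', b'Bang!', b'Splat!'):
        f.write(word.ljust(RECORD_SIZE))     # pad each record with spaces

with open(path, 'rb') as f:
    f.seek(2 * RECORD_SIZE)                  # jump straight to the third record
    third = f.read(RECORD_SIZE).rstrip()
    f.seek(0)                                # and back to the first one
    first = f.read(RECORD_SIZE).rstrip()

print(first, third)
```

Random access works, but only by byte offset; there is no way to ask the file for “the Joker record” without building an index of your own.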
DBM files, a standard tool in the Python library for database
management, improve on that by providing key-based access to stored text
strings. They implement a random-access, single-key view on stored data.
For instance, information related to objects can be stored in a DBM file
using a unique key per object and later can be fetched back directly with
the same key. DBM files are implemented by a variety of underlying modules
(including one coded in Python), but if you have Python, you have a
DBM.
Although DBM filesystems have to do a bit of work to map chunks of
stored data to keys for fast retrieval (technically, they generally use
a technique called hashing to store data in files), your scripts don’t
need to care about the action going on behind the scenes. In fact, DBM is
one of the easiest ways to save information in Python: DBM files behave so
much like in-memory dictionaries that you may forget you’re actually
dealing with a file at all. For instance, given a DBM file object:
Indexing by key fetches data from the file.
Assigning to an index stores data in the file.
DBM file objects also support common dictionary operations such as
key-list fetches, key membership tests, and key deletions. The DBM library itself is
hidden behind this simple model. Since it is so simple, let’s jump right
into an interactive example that creates a DBM file and shows how the
interface works:
C:\...\PP4E\Dbase> python
>>> import dbm                           # get interface: bsddb, gnu, ndbm, dumb
>>> file = dbm.open('movie', 'c')        # make a DBM file called 'movie'
>>> file['Batman'] = 'Pow!'              # store a string under key 'Batman'
>>> file.keys()                          # get the file's key directory
[b'Batman']
>>> file['Batman']                       # fetch value for key 'Batman'
b'Pow!'
>>> who  = ['Robin', 'Cat-woman', 'Joker']
>>> what = ['Bang!', 'Splat!', 'Wham!']
>>> for i in range(len(who)):
...     file[who[i]] = what[i]           # add 3 more "records"
...
>>> file.keys()
[b'Cat-woman', b'Batman', b'Joker', b'Robin']
>>> len(file), 'Robin' in file, file['Joker']
(4, True, b'Wham!')
>>> file.close()                         # close sometimes required
Internally, importing the dbm standard library module automatically
loads whatever DBM interface is available in your Python interpreter
(attempting alternatives in a fixed order), and opening the new DBM file
creates one or more external files with names that start with the string
'movie' (more on the details in a moment). But after the import and open,
a DBM file is virtually indistinguishable from a dictionary.
In effect, the object called file here can be thought of as a dictionary
mapped to an external file called movie; the only obvious differences are
that keys must be strings (not arbitrary immutables), and we need to
remember to open to access and close after changes.
Unlike normal dictionaries, though, the contents of file are retained
between Python program runs. If we come back later and restart Python, our
dictionary is still available. Again, DBM files are like dictionaries that
must be opened:
C:\...\PP4E\Dbase> python
>>> import dbm
>>> file = dbm.open('movie', 'c')        # open existing DBM file
>>> file['Batman']
b'Pow!'
>>> file.keys()                          # keys gives an index list
[b'Cat-woman', b'Batman', b'Joker', b'Robin']
>>> for key in file.keys(): print(key, file[key])
...
b'Cat-woman' b'Splat!'
b'Batman' b'Pow!'
b'Joker' b'Wham!'
b'Robin' b'Bang!'
Notice how DBM files return a real list for the keys call; not shown
here, their values method instead returns an iterable view like
dictionaries. Further, DBM files always store both keys and values as
bytes objects; interpretation as arbitrary types of Unicode text is left
to the client application. We can use either bytes or str strings in our
code when accessing or storing keys and values. Using bytes allows your
keys and values to retain arbitrary Unicode encodings, but str objects in
our code will be encoded to bytes internally using the UTF-8 Unicode
encoding by Python’s DBM implementation.
Still, we can always decode to Unicode str strings to display in a more
friendly fashion if desired, and DBM files have a keys iterator just like
dictionaries. Moreover, assigning and deleting keys updates the DBM file,
and we should close after making changes (this ensures that changes are
flushed to disk):
>>> for key in file: print(key.decode(), file[key].decode())
...
Cat-woman Splat!
Batman Pow!
Joker Wham!
Robin Bang!
>>> file['Batman'] = 'Ka-Boom!'          # change Batman slot
>>> del file['Robin']                    # delete the Robin entry
>>> file.close()                         # close it after changes
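The str versus bytes behavior described before this session can be checked with a short script. This is only a sketch: it opens the all-Python dbm.dumb implementation directly so that the result does not depend on which DBM systems are installed, and the file name and the Latin-1 and UTF-16 choices are assumptions made for illustration.

```python
import dbm.dumb
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo')   # hypothetical DBM name

db = dbm.dumb.open(path, 'c')
db['Joker'] = 'Wham!'                    # str key and value: encoded as UTF-8

latin_key = 'Ping\xfcino'.encode('latin-1')        # bytes we encoded ourselves
db[latin_key] = 'Quack!'.encode('utf-16')          # bytes are stored as given

joker = db[b'Joker']                     # the str key is reachable as bytes too
raw_keys = sorted(db.keys())             # keys always come back as bytes
latin_value = db[latin_key].decode('utf-16')       # decoding is the client's job
db.close()

print(joker, raw_keys, latin_value)
```

In other words, str is a convenience for the common UTF-8 case; if you need any other encoding, pass bytes in and decode them yourself on the way out.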
Apart from having to import the interface and open and close the
DBM file, Python programs don’t have to know anything about DBM itself.
DBM modules achieve this integration by overloading the indexing
operations and routing them to more primitive library tools. But you’d
never know that from looking at this Python code—DBM files look like
normal Python dictionaries, stored on external files. Changes made to
them are retained indefinitely:
C:\...\PP4E\Dbase> python
>>> import dbm                           # open DBM file again
>>> file = dbm.open('movie', 'c')
>>> for key in file: print(key.decode(), file[key].decode())
...
Cat-woman Splat!
Batman Ka-Boom!
Joker Wham!
As you can see, this is about as simple as it can be. Table 17-1 lists
the most commonly used DBM file operations. Once such a file is opened, it
is processed just as though it were an in-memory Python dictionary. Items
are fetched by indexing the file object by key and are stored by assigning
to a key.
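Table 17-1. DBM file operations
| Python code | Action | Description |
|---|---|---|
| import dbm | Import | Get DBM implementation |
| file = dbm.open('filename', 'c') | Open | Create or open an existing DBM file |
| file['key'] = 'value' | Store | Create or change the entry for key |
| value = file['key'] | Fetch | Load the value for the entry key |
| count = len(file) | Size | Return the number of entries stored |
| index = file.keys() | Index | Fetch the stored keys list |
| found = 'key' in file | Query | See if there’s an entry for key |
| del file['key'] | Delete | Remove the entry for key |
| for key in file: | Iterate | Iterate over stored keys |
| file.close() | Close | Manual close, not always needed |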
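The operations in the table can be strung together in a script as well as at the interactive prompt. The sketch below uses dbm.dumb explicitly (an assumption made so the run is reproducible on any Python) and a made-up temporary path; with the generic dbm.open call the same statements apply, though some underlying implementations may not support every operation.

```python
import dbm.dumb
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'movie')   # hypothetical location

file = dbm.dumb.open(path, 'c')          # Open: create or open
file['Batman'] = 'Pow!'                  # Store
file['Robin'] = 'Bang!'
value = file['Batman']                   # Fetch
count = len(file)                        # Size
index = sorted(file.keys())              # Index: a real list, sorted here
found = 'Robin' in file                  # Query
del file['Robin']                        # Delete
remaining = [key for key in file]        # Iterate over the keys that are left
file.close()                             # Close: flush changes to disk

print(value, count, index, found, remaining)
```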
Despite the dictionary-like interface, DBM files really do map to
one or more external files. For instance, the underlying default dbm
interface used by Python 3.1 on Windows writes two files, movie.dir and
movie.dat, when a DBM file called movie is made, and saves a movie.bak on
later opens. If your Python has access to a different underlying
keyed-file interface, different external files might show up on your
computer.
Technically, the module dbm is really an interface to whatever DBM-like
filesystem you have available in your Python:
When opening an already existing DBM file, dbm tries to determine the
system that created it with the dbm.whichdb function. This
determination is based upon the content of the database itself.
When creating a new file, dbm today tries a set of keyed-file interface
modules in a fixed order. According to its documentation, it attempts to
import the interfaces dbm.bsd, dbm.gnu, dbm.ndbm, or dbm.dumb, and uses
the first that succeeds. Pythons without any of the others automatically
fall back on an all-Python and always-present implementation called
dbm.dumb, which is not really “dumb,” of course, but may not be as fast
or robust as other options.
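The detection step for existing files can be observed directly. In this sketch the file is created with dbm.dumb on purpose, so the answer is known in advance; the path is a made-up temporary location. The generic dbm.open call then consults dbm.whichdb behind the scenes and dispatches to the matching module.

```python
import dbm
import dbm.dumb
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'movie')   # hypothetical location

db = dbm.dumb.open(path, 'c')     # force the all-Python implementation
db['Batman'] = 'Pow!'
db.close()

kind = dbm.whichdb(path)          # inspect the files to guess their creator
db = dbm.open(path, 'r')          # the generic open dispatches on that guess
value = db['Batman']
db.close()

missing = dbm.whichdb(path + '-no-such-file')   # None when nothing is found
print(kind, value, missing)
```

When you receive a DBM file from another machine, calling dbm.whichdb on it first is a cheap way to find out whether your own Python has the module needed to open it.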
Future Pythons are free to change this selection order, and may even add
additional alternatives to it. You normally don’t need to care about any
of this, though, unless you delete any of the files your DBM creates, or
transfer them between machines with different configurations. If you need
to care about the portability of your DBM files (and as we’ll see later,
by proxy, that of your shelve files), you should configure machines such
that all have the same DBM interface installed or rely upon the dumb
fallback. For example, the Berkeley DB package (a.k.a. bsddb) used by
dbm.bsd is widely available and portable.
Note that DBM files may or may not need to be explicitly closed, per the
last entry in Table 17-1. Some DBM files don’t require a close call, but
some depend on it to flush changes out to disk. On such systems, your file
may be corrupted if you omit the close call. Unfortunately, the default
DBM in some older Windows Pythons, dbhash (a.k.a. bsddb), is one of the
DBM systems that requires a close call to avoid data loss. As a rule of
thumb, always close your DBM files explicitly after making changes and
before your program exits to avoid potential problems; it’s essentially a
“commit” operation for these files. This rule extends by proxy to shelves,
a topic we’ll meet later in this chapter.
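One way to make the always-close rule hard to forget is to let a with statement issue the close call. The sketch below wraps the DBM object in contextlib.closing from the standard library, which simply calls the object’s close method on exit, even if an exception is raised in the block. The path is a made-up temporary location.

```python
import contextlib
import dbm
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'movie')   # hypothetical location

# closing() calls db.close() when the block exits, even on an exception
with contextlib.closing(dbm.open(path, 'c')) as db:
    db['Batman'] = 'Ka-Boom!'

# Reopen to confirm that the change was flushed to the external file
with contextlib.closing(dbm.open(path, 'r')) as db:
    saved = db['Batman']

print(saved)
```

An explicit try/finally around the updates achieves the same effect; the with form is just more compact.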
Recent changes: Be sure to also pass a string 'c' as a second argument
when calling dbm.open, to force Python to create the file if it does not
yet exist and to simply open it for reads and writes otherwise. This used
to be the default behavior but is no longer. You do not need the 'c'
argument when opening shelves, discussed ahead; they still use an “open or
create” 'c' mode by default if passed no open mode argument. Other open
mode strings can be passed to dbm, including n to always create the file,
and r for read-only access to an existing file, which is the new default.
See the Python library manual for more details.
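The effect of the open mode strings is easy to try out. The following sketch relies on whichever DBM implementation your Python selects by default and uses a made-up temporary path; the variable names are for illustration only.

```python
import dbm
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'movie')   # hypothetical location

try:
    dbm.open(path)                 # default mode 'r': the file must already exist
    opened_missing = True
except dbm.error:
    opened_missing = False         # no file yet, so the read-only open fails

db = dbm.open(path, 'c')           # 'c': create if needed, else open read/write
db['Batman'] = 'Pow!'
db.close()

db = dbm.open(path, 'r')           # 'r': read-only access to the existing file
value = db['Batman']
db.close()

db = dbm.open(path, 'n')           # 'n': always start over with an empty file
emptied = 'Batman' not in db
db.close()

print(opened_missing, value, emptied)
```

Because 'n' silently discards existing data, 'c' is usually the safer choice for scripts that may be run more than once.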
In addition, Python 3.X stores both key and value strings as bytes
instead of str as we’ve seen (which turns out to be convenient for
pickled data in shelves, discussed ahead) and no longer ships with bsddb
as a standard component. It’s available independently on the Web as a
third-party extension, but in its absence Python falls back on its own
DBM file implementation. Since the underlying DBM implementation rules
are prone to change with time, you should always consult Python’s library
manuals as well as the dbm module’s standard library source code for more
information.