Technically,
Python reimports a class to re-create its stored instances
as they are fetched and unpickled. Here’s how this works:
When Python pickles a class instance to store it in a
shelve, it saves the instance’s attributes plus a reference to the
instance’s class. In effect, pickled class instances in the prior
example record the self attributes assigned in the class. Really, Python serializes and stores the instance’s __dict__ attribute dictionary along with enough source file information to be able to locate the class’s module later: the names of the instance’s class as well as its class’s enclosing module.
When Python unpickles a class instance fetched from a shelve, it re-creates the instance object in memory by reimporting the class using the saved class and module name strings, assigning the saved attribute dictionary to a new empty instance, and linking the instance back to the class. This is the default behavior; it can be tailored by defining special methods that pickle calls to fetch and store instance state (see the Python library manual for details).
The key point in this is that the class and stored instance data
are separate. The class itself is not stored with its instances, but is
instead located in the Python source file and reimported later when
instances are fetched.
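In fact, you can see what is recorded by disassembling a pickled instance’s byte stream. Here is a minimal sketch, assuming the Person class of the prior section lives in an importable person.py file (the pay value is illustrative):

import pickle, pickletools
from person import Person               # class must be importable

bob = Person('bob', 'devel', 70000)
data = pickle.dumps(bob)                # serialize the instance

print(bob.__dict__)                     # the attribute dictionary that is stored
pickletools.dis(data)                   # opcodes name 'person' and 'Person'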
The downside of this model is that the class must be importable to
load instances off a shelve (more on this in a moment). The upside is
that by modifying external classes in module files, we can change the
way stored objects’ data is interpreted and used without actually having
to change those stored objects. It’s as if the class is a program that
processes stored records.
To illustrate, suppose the Person class from the previous section was changed to the source code in Example 17-3.

Example 17-3. PP4E\Dbase\person.py (version 2)
"""
a person object: fields + behavior
change: the tax method is now a computed attribute
"""
class Person:
    def __init__(self, name, job, pay=0):
        self.name = name
        self.job  = job
        self.pay  = pay                         # real instance data
    def __getattr__(self, attr):                # on person.attr
        if attr == 'tax':
            return self.pay * 0.30              # computed on access
        else:
            raise AttributeError()              # other unknown names
    def info(self):
        return self.name, self.job, self.pay, self.tax
This revision has a new tax rate (30 percent), introduces a __getattr__ qualification overload method, and deletes the original tax method. Because this new version of the class is reimported when its existing instances are loaded from the shelve file, they acquire the new behavior automatically; their tax attribute references are now intercepted and computed when accessed:
C:\...\PP4E\Dbase> python
>>> import shelve
>>> dbase = shelve.open('cast')             # reopen shelve
>>>
>>> print(list(dbase.keys()))               # both objects are here
['bob', 'emily']
>>> print(dbase['emily'])
<person.Person object at 0x...>
>>>
>>> print(dbase['bob'].tax)                 # no need to call tax()
21000.0
Because the class has changed, tax is now simply qualified, not called. In
addition, because the tax rate was changed in the class, Bob pays more
this time around. Of course, this example is artificial, but when used
well, this separation of classes and persistent instances can eliminate
many traditional database update programs. In most cases, you can simply
change the class, not each stored instance, for new behavior.
Although
shelves are generally straightforward to use, there are a
few rough edges worth knowing about.
First, although
they can store arbitrary objects, keys must still be
strings. The following fails, unless you convert the integer 42 to the string '42' manually first:

dbase[42] = value                 # fails, but str(42) will work
This is different from in-memory dictionaries, which allow any
immutable object to be used as a key, and derives from the shelve’s
use of DBM files internally. As we’ve seen, keys must further be str strings in Python 3.X, not bytes, because the shelve will attempt to encode them in all cases.
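For example, a quick sketch of the manual conversion (the shelve file name here is arbitrary):

import shelve
db = shelve.open('temp')                # an arbitrary shelve file
db[str(42)] = 'spam'                    # convert nonstring keys manually
print(db['42'])                         # and fetch with the string form
db.close()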
Although the shelve module is smart enough to detect multiple occurrences of a nested object and re-create only one copy when fetched, this holds true only within a given slot:
dbase[key] = [object, object] # OK: only one copy stored and fetched
dbase[key1] = object
dbase[key2] = object # bad?: two copies of object in the shelve
When key1 and key2 are fetched, they reference independent copies of the original shared object; if that object is mutable, changes from one won’t be reflected in the other. This really stems from the fact that each key assignment runs an independent pickle operation: the pickler detects repeated objects, but only within each pickle call. This may or may not be a concern in your practice, and it can be avoided with extra support logic, but an object can be duplicated if it spans keys.
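The difference is easy to demonstrate. A short sketch, again with an arbitrary shelve file name:

import shelve
db = shelve.open('temp')
shared = ['data']                       # a mutable object

db['pair'] = [shared, shared]           # one pickle call: sharing preserved
db['k1'] = shared                       # two separate pickle calls:
db['k2'] = shared                       # two independent copies stored

a, b = db['pair']
print(a is b)                           # True: one copy re-created in this slot

x = db['k1']
x.append('more')                        # change one key's fetched copy
db['k1'] = x                            # and store it back
print(db['k2'])                         # ['data']: the other key is unaffected
db.close()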
Because objects fetched from a shelve don’t know that they came
from a shelve, operations that change components of a fetched object
change only the in-memory copy, not the data on a shelve:
dbase[key].attr = value # shelve unchanged
To really change an object stored on a shelve, fetch it into
memory, change its parts, and then write it back to the shelve as a
whole by key assignment:
object = dbase[key] # fetch it
object.attr = value # modify it
dbase[key] = object                # store back: shelve changed (unless writeback)
As noted earlier, the shelve.open call’s optional writeback argument can be used to avoid the last step here, by automatically caching fetched objects and writing them to disk when the shelve is closed, but this can require substantial memory resources and make close operations slow.
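In sketch form, the writeback alternative looks like this, reusing the cast shelve from this section:

import shelve
db = shelve.open('cast', writeback=True)    # cache objects as they are fetched
db['bob'].pay *= 1.10                       # change the cached copy in place
db.close()                                  # cached objects written back here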
The shelve module does not currently support simultaneous updates. Simultaneous readers are OK, but writers must be given exclusive access to the shelve. You can trash a shelve if multiple processes write to it at the same time, which is a common potential in things such as server-side scripts on the Web. If your shelves may be updated by multiple processes, be sure to wrap updates in calls to the os.open standard library function to lock files and provide exclusive access.
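The details of the locking scheme are up to you. One portable sketch creates a lock file atomically with os.open’s O_EXCL flag; the lock file name and retry delay here are illustrative assumptions, not part of the shelve API:

import os, time, shelve

def locked_store(key, obj, lockname='cast.lck'):
    while True:                                 # spin until the lock is ours
        try:
            fd = os.open(lockname, os.O_CREAT | os.O_EXCL | os.O_RDWR)
            break                               # atomic: fails if already held
        except OSError:
            time.sleep(0.1)                     # another writer holds the lock
    try:
        db = shelve.open('cast')                # exclusive access while locked
        db[key] = obj
        db.close()
    finally:
        os.close(fd)
        os.unlink(lockname)                     # release for other writers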
With shelves, the files created by an underlying DBM system used to store your persistent objects are not necessarily compatible with all possible DBM implementations or Pythons. For instance, a file generated by gdbm on Linux, or by the bsddb library on Windows, may not be readable by a Python with other DBM modules installed.
This is really the same portability issue we discussed for DBM files earlier. As you’ll recall, when a DBM file (or by proxy, a shelve) is created, the dbm module tries to import all possible DBM system modules in a predefined order and uses the first that it finds. When dbm later opens an existing file, it attempts to determine which DBM system created it by inspecting the file(s). Because the bsddb system is tried first at file creation time and is available on both Windows and many Unix-like systems, your DBM file is portable as long as your Pythons support BSD on both platforms. This is also true if all platforms you’ll use fall back on Python’s own dbm.dumb implementation. If the system used to create a DBM file is not available on the underlying platform, though, the DBM file cannot be used.
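To find out which DBM system a given file requires, use the standard library’s dbm.whichdb call:

import dbm
print(dbm.whichdb('cast'))      # e.g. 'dbm.gnu', 'dbm.ndbm', or 'dbm.dumb';
                                # '' if unrecognized, None if unopenable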
If DBM file portability is a concern, make sure that all the Pythons that will read your data use compatible DBM modules. If that is not an option, use the pickle module directly and flat files for storage (thereby bypassing both shelve and dbm), or use the OODB systems we’ll meet later in this chapter. Such systems may also offer a more complete answer to transaction processing, with calls to commit changes, and automatic rollback to prior commit points on errors.
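In sketch form, the flat-file alternative looks like this; the file name is arbitrary, and the class must still be importable to reload instances:

import pickle
from person import Person

bob = Person('bob', 'devel', 70000)
with open('bob.pkl', 'wb') as f:        # any flat binary file
    pickle.dump(bob, f)                 # no dbm system involved

with open('bob.pkl', 'rb') as f:
    clone = pickle.load(f)              # class reimported from person.py
print(clone.name, clone.tax)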
In addition to these shelve constraints, storing class instances in a shelve adds a set of additional rules you need to be aware of. Really, these are imposed by the pickle module, not by shelve, so be sure to follow these if you store class instance objects with pickle directly too:
As we’ve seen, the Python pickler stores instance attributes only when pickling an instance object, and it reimports the class later to re-create the instance. Because of that, the classes of stored objects must be importable when objects are unpickled: they must be coded unnested at the top level of a module file that is accessible on the module import search path at load time (e.g., named in PYTHONPATH or in a .pth file, or the current working directory or that of the top-level script).
Further, the class usually must be associated with a real imported module when instances are pickled, not with a top-level script (with the module name __main__), unless they will only ever be used in the top-level script. You also need to be careful about moving class modules after instances are stored. When an instance is unpickled, Python must find its class’s module on the module search path using the original module name (including any package path prefixes) and fetch the class from that module using the original class name. If the module or class has been moved or renamed, it might not be found.
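The __main__ pitfall is easy to observe, because the module name is recorded in the pickled bytes themselves; a small sketch:

import pickle

class Local:                       # defined in the top-level script
    pass

blob = pickle.dumps(Local())
print(b'__main__' in blob)         # True: the recorded module is '__main__',
                                   # so other programs cannot import the class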
In applications where pickled objects are shipped over
network sockets, it’s possible to satisfy this constraint by
shipping the text of the class along with stored instances;
recipients may simply store the class in a local module file on
the import search path prior to unpickling received instances.
Where this is inconvenient or impossible, simpler pickled objects
such as lists and dictionaries with nesting may be transferred
instead, as they require no source file to be
reconstructed.
Although Python lets you change a class while instances of
it are stored on a shelve, those changes must be backward
compatible with the objects already stored. For instance, you
cannot change the class to expect an attribute not associated with
already stored persistent instances unless you first manually
update those stored instances or provide extra conversion
protocols on the class.
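One simple conversion protocol supplies defaults in the class itself for attributes that older stored instances lack, much like Example 17-3. A hedged sketch (the email field is hypothetical):

class Person:
    def __init__(self, name, job, pay=0):
        self.name = name
        self.job = job
        self.pay = pay
        self.email = ''                  # new field: set in new instances
    def __getattr__(self, attr):         # old pickles lack self.email...
        if attr == 'email':
            return ''                    # ...so default it on access
        if attr == 'tax':
            return self.pay * 0.30
        raise AttributeError(attr)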
Shelves also inherit the pickling system’s nonclass limitations. As discussed earlier, some types of objects (e.g., open files and sockets) cannot be pickled and thus cannot be stored in a shelve.
In an early Python release, persistent object classes also had to
either use constructors with no arguments or provide defaults for all
constructor arguments (much like the notion of a C++ copy constructor).
This constraint was dropped as of Python 1.5.2; classes with nondefaulted constructor arguments now work as is in the pickling system.[69]
Finally, although shelves store objects persistently, they are not
really object-oriented database systems. Such systems also implement
features such as immediate automatic write-through on changes,
transaction commits and rollbacks, safe concurrent updates, and object
decomposition and delayed (“lazy”) component fetches based on generated
object IDs. Parts of larger objects may be loaded into memory only as
they are accessed. It’s possible to extend shelves to support such
features manually, but you don’t need to—the ZODB system, among others,
provides an implementation of a more complete object-oriented database
system. It is constructed on top of Python’s built-in pickling
persistence support, but it offers additional features for advanced data
stores. For more on ZODB, let’s move on to the next
section.
[69] Interestingly, Python avoids calling the class to re-create a pickled instance and instead simply makes a class object generically, inserts instance attributes, and sets the instance’s __class__ pointer to the original class directly. This avoids the need for defaults, but it also means that class __init__ constructors are no longer called as objects are unpickled, unless you provide extra methods to force the call. See the library manual for more details, and see the pickle module’s source code (pickle.py in the source library) if you’re curious about how this works. Or see the PyForm example later in this chapter; it does something very similar with __class__ links to build an instance object from a class and dictionary of attributes, without calling the class’s __init__ constructor. This makes constructor argument defaults unnecessary in classes used for records browsed by PyForm, but it’s the same idea.
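In rough outline, the following sketch emulates the idea; it is not pickle’s actual code, but it re-creates an instance the same general way:

class Person:
    def __init__(self, name):
        print('__init__ called')          # never runs in this emulation
        self.name = name

obj = Person.__new__(Person)              # make an empty instance generically
obj.__dict__.update({'name': 'bob'})      # insert the saved attributes
print(obj.name, obj.__class__.__name__)   # bob Person: no constructor call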