Authors: Mark Lutz

Tags: #COMPUTERS / Programming Languages / Python

BOOK: Programming Python

Parsing packed binary data with the struct module

By using the letter b in the open call, you can open binary datafiles in
a platform-neutral way and read and write their content with normal
file object methods. But how do you process binary data once it has
been read? It will be returned to your script as a simple string of
bytes, most of which are probably not printable characters.

If you just need to pass binary data along to another file or
program, your work is done: for instance, simply pass the byte string
to another file opened in binary mode. And if you just need to extract
a number of bytes from a specific position, string slicing will do the
job; you can even follow up with bitwise operations if you need to. To
get at the contents of binary data in a structured way, though, as well
as to construct its contents, the standard library struct module is a
more powerful alternative.

The struct module provides calls to pack and unpack binary data, as
though the data were laid out in a C-language struct declaration. It
can also compose and decompose data using any endianness you desire
(endianness determines whether the most significant byte of a
multibyte number is stored first or last). Building a binary datafile,
for instance, is straightforward: pack Python values into a byte
string and write them to a file. The format string in the pack call
here means big-endian (>), with an integer, a four-character string, a
short (two-byte) integer, and a floating-point number:

>>> import struct
>>> data = struct.pack('>i4shf', 2, b'spam', 3, 1.234)
>>> data
b'\x00\x00\x00\x02spam\x00\x03?\x9d\xf3\xb6'
>>> file = open('data.bin', 'wb')
>>> file.write(data)
14
>>> file.close()

Notice how the struct module returns a bytes string: we’re in the
realm of binary data here, not text, and must use binary mode files to
store it. As usual, Python displays most of the packed binary data’s
bytes here with \xNN hexadecimal escape sequences, because the bytes
are not printable characters. To parse data like that which we just
produced, read it off the file and pass it to the struct module with
the same format string; you get back a tuple containing the values
parsed out of the string and converted to Python objects:

>>> import struct
>>> file   = open('data.bin', 'rb')
>>> bytes  = file.read()
>>> values = struct.unpack('>i4shf', bytes)
>>> values
(2, b'spam', 3, 1.2339999675750732)

Parsed-out strings are byte strings again, and we can apply
string and bitwise operations to probe deeper:

>>> bin(values[0] | 0b1)                       # accessing bits and bytes
'0b11'
>>> values[1], list(values[1]), values[1][0]
(b'spam', [115, 112, 97, 109], 115)

Also note that slicing comes in handy in this domain; to grab
just the four-character string in the middle of the packed binary data
we just read, we can simply slice it out. Numeric values could
similarly be sliced out and then passed to struct.unpack for
conversion:

>>> bytes
b'\x00\x00\x00\x02spam\x00\x03?\x9d\xf3\xb6'
>>> bytes[4:8]
b'spam'
>>> number = bytes[8:10]
>>> number
b'\x00\x03'
>>> struct.unpack('>h', number)
(3,)

Packed binary data crops up in many contexts, including some
networking tasks, and in data produced by other programming languages.
Because it’s not part of every programming job’s description, though,
we’ll defer to the struct module’s entry in the Python library manual
for more details.

Random access files

Binary files also typically see action in random access processing.
Earlier, we mentioned that adding a + to the open mode string allows a
file to be both read and written. This mode is typically used in
conjunction with the file object’s seek method to support random
read/write access. Such flexible file processing modes allow us to
read bytes from one location, write to another, and so on. When
scripts combine this with binary file modes, they may fetch and update
arbitrary bytes within a file.

We used seek earlier to rewind files instead of closing and
reopening. As mentioned, read and write operations always take place
at the current position in the file; files normally start at offset 0
when opened and advance as data is transferred. The seek call lets us
move to a new position for the next transfer operation by passing in a
byte offset.

Python’s seek method also accepts an optional second argument that
has one of three values: 0 for absolute file positioning (the
default); 1 to seek relative to the current position; and 2 to seek
relative to the file’s end. That’s why passing just an offset of 0 to
seek is roughly a file rewind operation: it repositions the file to
its absolute start. In general, seek supports random access on a
byte-offset basis. Seeking to a multiple of a record’s size in a
binary file, for instance, allows us to fetch a record by its relative
position.
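To make the three seek modes concrete, here is a minimal sketch that uses a temporary file and the symbolic names os.SEEK_SET, os.SEEK_CUR, and os.SEEK_END in place of the raw values 0, 1, and 2. The 8-byte record size and the data are assumptions chosen for illustration.

```python
import os
import tempfile

RECLEN = 8  # assumed fixed record size, in bytes

with tempfile.TemporaryFile('w+b') as file:
    file.write(b'ssssssssppppppppaaaaaaaammmmmmmm')

    file.seek(RECLEN * 2)                  # absolute: mode 0 is the default
    third = file.read(RECLEN)

    file.seek(-RECLEN, os.SEEK_END)        # mode 2: relative to end of file
    last = file.read(RECLEN)

    file.seek(RECLEN, os.SEEK_SET)         # explicit mode 0: second record
    file.seek(RECLEN, os.SEEK_CUR)         # mode 1: skip ahead one record
    after_skip = file.read(RECLEN)
    position = file.tell()                 # offset after the last read

print(third, last, after_skip, position)
```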

Although you can use seek without + modes in open (e.g., to just
read from random locations), it’s most flexible when combined with
input/output files. And while you can perform random access in text
mode, too, the fact that text modes perform Unicode encodings and
line-end translations makes them difficult to use when absolute byte
offsets and lengths are required for seeks and reads; your data may
look very different when stored in files. Text mode may also make your
data nonportable to platforms with different default encodings, unless
you’re willing to always specify an explicit encoding for opens.
Except for simple unencoded ASCII text without line-ends, seek tends
to work best with binary mode files.

To demonstrate, let’s create a file in w+b mode (equivalent to wb+)
and write some data to it; this mode allows us to both read and write,
but initializes the file to be empty if it’s already present (all w
modes do). After writing some data, we seek back to file start to read
its content (some integer return values are omitted in this example
again for brevity):

>>> records = [bytes([char] * 8) for char in b'spam']
>>> records
[b'ssssssss', b'pppppppp', b'aaaaaaaa', b'mmmmmmmm']
>>> file = open('random.bin', 'w+b')
>>> for rec in records:                        # write four records
...     size = file.write(rec)                 # bytes for binary mode
...
>>> file.flush()
>>> pos = file.seek(0)                         # read entire file
>>> print(file.read())
b'ssssssssppppppppaaaaaaaammmmmmmm'

Now, let’s reopen our file in r+b mode; this mode allows both reads
and writes again, but does not initialize the file to be empty. This
time, we seek and read in multiples of the size of data items
(“records”) stored, to both fetch and update them at random:

c:\temp> python
>>> file = open('random.bin', 'r+b')
>>> print(file.read())                         # read entire file
b'ssssssssppppppppaaaaaaaammmmmmmm'
>>> record = b'X' * 8
>>> file.seek(0)                               # update first record
>>> file.write(record)
>>> file.seek(len(record) * 2)                 # update third record
>>> file.write(b'Y' * 8)
>>> file.seek(8)
>>> file.read(len(record))                     # fetch second record
b'pppppppp'
>>> file.read(len(record))                     # fetch next (third) record
b'YYYYYYYY'
>>> file.seek(0)                               # read entire file
>>> file.read()
b'XXXXXXXXppppppppYYYYYYYYmmmmmmmm'
c:\temp> type random.bin                       # the view outside Python
XXXXXXXXppppppppYYYYYYYYmmmmmmmm
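Record-based seeks like these combine naturally with the struct module. The sketch below is one possible arrangement, not code from this book: the record layout (an integer plus a four-byte tag) and the helper names write_record and read_record are hypothetical, chosen only to show how an index maps to a byte offset.

```python
import struct
import tempfile

# Hypothetical fixed-length record: big-endian int plus a 4-byte tag.
RECORD = struct.Struct('>i4s')

def write_record(file, index, ident, tag):
    """Store one packed record at the slot numbered index."""
    file.seek(index * RECORD.size)
    file.write(RECORD.pack(ident, tag))

def read_record(file, index):
    """Fetch and unpack the record at the slot numbered index."""
    file.seek(index * RECORD.size)
    return RECORD.unpack(file.read(RECORD.size))

with tempfile.TemporaryFile('w+b') as file:
    for index, tag in enumerate([b'spam', b'eggs', b'ham!']):
        write_record(file, index, index, tag)

    write_record(file, 1, 99, b'SPAM')         # update second record in place
    results = [read_record(file, index) for index in range(3)]

print(results)
```

Because every record has the same packed size, updating one slot never disturbs its neighbors.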

Finally, keep in mind that seek can be used to achieve random
access, even if it’s just for input. The following seeks in multiples
of record size to read (but not write) fixed-length records at random.
Notice that it also uses r text mode: since this data is simple ASCII
text bytes and has no line-ends, text and binary modes work the same
on this platform:

c:\temp> python
>>> file = open('random.bin', 'r')             # text mode ok if no encoding/endlines
>>> reclen = 8
>>> file.seek(reclen * 3)                      # fetch record 4
>>> file.read(reclen)
'mmmmmmmm'
>>> file.seek(reclen * 1)                      # fetch record 2
>>> file.read(reclen)
'pppppppp'
>>> file = open('random.bin', 'rb')            # binary mode works the same here
>>> file.seek(reclen * 2)                      # fetch record 3
>>> file.read(reclen)                          # returns byte strings
b'YYYYYYYY'

But unless your file’s content is always a simple unencoded text
form like ASCII and has no translated line-ends, text mode should not
generally be used if you are going to seek—line-ends may be translated
on Windows and Unicode encodings may make arbitrary transformations,
both of which can make absolute seek offsets difficult to use. In the
following, for example, the positions of characters after the first
non-ASCII no longer match between the string in Python and its encoded
representation on the file:

>>> data = 'sp\xe4m'                           # data to your script
>>> data, len(data)                            # 4 unicode chars, 1 nonascii
('späm', 4)
>>> data.encode('utf8'), len(data.encode('utf8'))   # bytes written to file
(b'sp\xc3\xa4m', 5)
>>> f = open('test', mode='w+', encoding='utf8')    # use text mode, encoded
>>> f.write(data)
>>> f.flush()
>>> f.seek(0); f.read(1)                       # ascii bytes work
's'
>>> f.seek(2); f.read(1)                       # as does 2-byte nonascii
'ä'
>>> data[3]                                    # but offset 3 is not 'm' !
'm'
>>> f.seek(3); f.read(1)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa4 in position 0:
unexpected code byte
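If you really must locate characters within encoded data, one workaround is to compute byte offsets yourself and seek in binary mode. The following is only a sketch of that idea: it uses an in-memory io.BytesIO stream as a stand-in for a binary mode file, and the offsets list is a hypothetical index built for this example.

```python
import io

text = 'sp\xe4m'
encoded = text.encode('utf8')

# Map each character index to the byte offset where its encoding starts.
offsets = []
position = 0
for char in text:
    offsets.append(position)
    position += len(char.encode('utf8'))

# Seek by byte offset in binary mode, then decode what was read.
stream = io.BytesIO(encoded)                   # stands in for a binary file
stream.seek(offsets[3])
last_char = stream.read(1).decode('utf8')

print(offsets, last_char)
```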

As you can see, Python’s file modes provide flexible file
processing for programs that require it. In fact, the os module offers
even more file processing options, as the next section describes.

Lower-Level File Tools in the os Module

The os module contains an additional set of file-processing
functions that are distinct from the built-in file object tools
demonstrated in previous examples. For instance, here is a partial
list of os file-related calls:

os.open(path, flags, mode): Opens a file and returns its descriptor
os.read(descriptor, N): Reads at most N bytes and returns a byte string
os.write(descriptor, string): Writes bytes in byte string string to the file
os.lseek(descriptor, position, how): Moves to position in the file

Technically, os calls process files by their descriptors, which are
integer codes or “handles” that identify files in the operating system.
Descriptor-based files deal in raw bytes, and have no notion of the
line-end or Unicode translations for text that we studied in the prior
section. In fact, apart from extras like buffering, descriptor-based
files generally correspond to binary mode file objects, and we similarly
read and write bytes strings, not str strings. However, because the
descriptor-based file tools in os are lower level and more complex than
the built-in file objects created with the built-in open function, you
should generally use the latter for all but very special file-processing
needs. [9]
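As a quick sketch of how the calls listed above fit together, the following exercises os.write, os.lseek, and os.read on a scratch file. Using tempfile.mkstemp to obtain an already open descriptor is a choice made here to keep the example self-contained; it is not something the text above prescribes.

```python
import os
import tempfile

# mkstemp returns an open binary-mode descriptor plus the file's path.
fd, path = tempfile.mkstemp()
try:
    written = os.write(fd, b'Hello descriptor world\n')   # count of bytes
    os.lseek(fd, 0, os.SEEK_SET)                          # back to the start
    first = os.read(fd, 5)                                # at most 5 bytes
    rest = os.read(fd, 100)                               # whatever remains
finally:
    os.close(fd)                                          # no file object here
    os.remove(path)

print(written, first, rest)
```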

Using os.open files

To give you the general flavor of this tool set, though, let’s run
a few interactive experiments. Although built-in file objects and os
module descriptor files are processed with distinct tool sets, they
are in fact related: the file system used by file objects simply adds
a layer of logic on top of descriptor-based files.

In fact, the fileno file object method returns the integer
descriptor associated with a built-in file object. For instance, the
standard stream file objects have descriptors 0, 1, and 2; calling the
os.write function to send data to stdout by descriptor has the same
effect as calling the sys.stdout.write method:

>>> import sys
>>> for stream in (sys.stdin, sys.stdout, sys.stderr):
...     print(stream.fileno())
...
0
1
2
>>> sys.stdout.write('Hello stdio world\n')    # write via file method
Hello stdio world
18
>>> import os
>>> os.write(1, b'Hello descriptor world\n')   # write via os module
Hello descriptor world
23

Because file objects we open explicitly behave the same way,
it’s also possible to process a given real external file on the
underlying computer through the built-in open function, tools in the
os module, or both (some integer return values are omitted here for
brevity):

>>> file = open(r'C:\temp\spam.txt', 'w')      # create external file, object
>>> file.write('Hello stdio file\n')           # write via file object method
>>> file.flush()                               # else os.write to disk first!
>>> fd = file.fileno()                         # get descriptor from object
>>> fd
3
>>> import os
>>> os.write(fd, b'Hello descriptor file\n')   # write via os module
>>> file.close()
C:\temp> type spam.txt                         # lines from both schemes
Hello stdio file
Hello descriptor file

os.open mode flags

So why the extra file tools in os? In short, they give more
low-level control over file processing. The built-in open function is
easy to use, but it may be limited by the underlying filesystem that
it uses, and it adds extra behavior that we may not want. The os
module lets scripts be more specific. For example, the following opens
a descriptor-based file in read-write and binary modes by performing a
bitwise “or” on two mode flags exported by os (the O_BINARY flag is
available on Windows only):

>>> fdfile = os.open(r'C:\temp\spam.txt', (os.O_RDWR | os.O_BINARY))
>>> os.read(fdfile, 20)
b'Hello stdio file\r\nHe'
>>> os.lseek(fdfile, 0, 0)                     # go back to start of file
>>> os.read(fdfile, 100)                       # binary mode retains "\r\n"
b'Hello stdio file\r\nHello descriptor file\n'
>>> os.lseek(fdfile, 0, 0)
>>> os.write(fdfile, b'HELLO')                 # overwrite first 5 bytes
5
C:\temp> type spam.txt
HELLO stdio file
Hello descriptor file

In this case, binary mode strings rb+ and r+b in the basic open
call are equivalent:

>>> file = open(r'C:\temp\spam.txt', 'rb+')    # same but with open/objects
>>> file.read(20)
b'HELLO stdio file\r\nHe'
>>> file.seek(0)
>>> file.read(100)
b'HELLO stdio file\r\nHello descriptor file\n'
>>> file.seek(0)
>>> file.write(b'Jello')
5
>>> file.seek(0)
>>> file.read()
b'Jello stdio file\r\nHello descriptor file\n'

But on some systems, os.open flags let us specify more advanced
things like exclusive access (O_EXCL) and nonblocking modes
(O_NONBLOCK) when a file is opened. Some of these flags are not
portable across platforms (another reason to use built-in file objects
most of the time); see the library manual or run a dir(os) call on
your machine for an exhaustive list of other open flags available.

One final note here: using os.open with the O_EXCL flag (combined
with O_CREAT, so that the open fails if the file already exists) is
the most portable way to lock files for concurrent updates or other
process synchronization in Python today. We’ll see contexts where this
can matter in the next chapter, when we begin to explore
multiprocessing tools. Programs running in parallel on a server
machine, for instance, may need to lock files before performing
updates, if multiple threads or processes might attempt such updates
at the same time.
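One common way to apply this idea is a separate lock file that exists only while an update is in progress. The following is a minimal sketch under that assumption; the acquire_lock and release_lock helpers and the update.lock filename are hypothetical, and the scheme is advisory only, since it works just among programs that agree to check the same lock file.

```python
import os
import tempfile

def acquire_lock(lockpath):
    """Create the lock file; return its descriptor, or None if already held."""
    try:
        # O_CREAT | O_EXCL fails if the file exists, so one caller wins.
        return os.open(lockpath, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return None

def release_lock(fd, lockpath):
    """Close and delete the lock file so that others may acquire it."""
    os.close(fd)
    os.remove(lockpath)

with tempfile.TemporaryDirectory() as folder:
    lockpath = os.path.join(folder, 'update.lock')
    first = acquire_lock(lockpath)             # succeeds: no lock file yet
    second = acquire_lock(lockpath)            # fails: lock already held
    release_lock(first, lockpath)
    third = acquire_lock(lockpath)             # succeeds again after release
    release_lock(third, lockpath)

print(first is not None, second is None, third is not None)
```

A real program would also need to decide what to do about stale lock files left behind by a crashed process, which this sketch ignores.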

Wrapping descriptors in file objects

We saw earlier how to go from file object to file descriptor with
the fileno file object method; given a descriptor, we can use os
module tools for lower-level file access to the underlying file. We
can also go the other way: the os.fdopen call wraps a file descriptor
in a file object. Because conversions work both ways, we can generally
use either tool set, file object or os module:

>>> fdfile = os.open(r'C:\temp\spam.txt', (os.O_RDWR | os.O_BINARY))
>>> fdfile
3
>>> objfile = os.fdopen(fdfile, 'rb')
>>> objfile.read()
b'Jello stdio file\r\nHello descriptor file\n'

In fact, we can wrap a file descriptor in either a binary or
text-mode file object: in text mode, reads and writes perform the
Unicode encodings and line-end translations we studied earlier and
deal in str strings instead of bytes:

C:\...\PP4E\System> python
>>> import os
>>> fdfile = os.open(r'C:\temp\spam.txt', (os.O_RDWR | os.O_BINARY))
>>> objfile = os.fdopen(fdfile, 'r')
>>> objfile.read()
'Jello stdio file\nHello descriptor file\n'

In Python 3.X, the built-in open call also accepts a file
descriptor instead of a file name string; in this mode it works much
like os.fdopen, but gives you greater control. For example, you can
use additional arguments to specify a nondefault Unicode encoding for
text and suppress the default descriptor close. Really, though,
os.fdopen accepts the same extra-control arguments in 3.X, because it
has been redefined to do little but call back to the built-in open
(see os.py in the standard library):

C:\...\PP4E\System> python
>>> import os
>>> fdfile = os.open(r'C:\temp\spam.txt', (os.O_RDWR | os.O_BINARY))
>>> fdfile
3
>>> objfile = open(fdfile, 'r', encoding='latin1', closefd=False)
>>> objfile.read()
'Jello stdio file\nHello descriptor file\n'
>>> objfile = os.fdopen(fdfile, 'r', encoding='latin1', closefd=True)
>>> objfile.seek(0)
>>> objfile.read()
'Jello stdio file\nHello descriptor file\n'

We’ll make use of this file object wrapper technique to simplify
text-oriented pipes and other descriptor-like objects later in this
book (e.g., sockets have a makefile method which achieves similar
effects).
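For a portable variation on the wrapping sessions above, the following sketch writes raw Latin-1 bytes through a descriptor and then reads them back as text through a wrapper object. The scratch file from tempfile.mkstemp and the sample data are assumptions made for this example.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
try:
    os.write(fd, b'sp\xe4m\n')                 # raw latin-1 bytes by descriptor
    os.lseek(fd, 0, os.SEEK_SET)

    # closefd=False leaves the descriptor open when the wrapper is closed.
    with open(fd, 'r', encoding='latin1', closefd=False) as objfile:
        text = objfile.read()                  # decoded to a str
        same_fd = objfile.fileno() == fd

    os.lseek(fd, 0, os.SEEK_SET)
    raw = os.read(fd, 100)                     # descriptor is still usable
finally:
    os.close(fd)
    os.remove(path)

print(repr(text), raw, same_fd)
```

With the default closefd=True, closing the wrapper would also close the descriptor, and the final os.lseek call would fail.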

Other os module file tools

The os module also includes an assortment of file tools that accept
a file pathname string and accomplish file-related tasks such as
renaming (os.rename), deleting (os.remove), and changing the file’s
owner and permission settings (os.chown, os.chmod). Let’s step through
a few examples of these tools in action:

>>> os.chmod('spam.txt', 0o777)                # enable all accesses

This os.chmod file permissions call passes a 9-bit string composed
of three sets of three bits each. From left to right, the three sets
represent the file’s owning user, the file’s group, and all others.
Within each set, the three bits reflect read, write, and execute
access permissions. When a bit is “1” in this string, it means that
the corresponding operation is allowed for the accessor. For instance,
octal 777 (written 0o777 in Python 3.X) is a string of nine “1” bits
in binary, so it enables all three kinds of accesses for all three
classes of users; octal 600 means that the file can be read and
written only by the user that owns it (when written in binary, 600
octal is really bits 110 000 000).
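The bit arithmetic described here is easy to check interactively. As a small sketch, the following uses symbolic constants and the filemode function from the standard stat module; the 0o754 value is just an arbitrary example setting.

```python
import stat

# Owner read plus owner write is the same value as octal 600.
owner_rw = stat.S_IRUSR | stat.S_IWUSR

# Show the nine permission bits, leftmost set first (user, group, others).
bits_600 = format(0o600, '09b')
bits_777 = format(0o777, '09b')

# filemode renders a mode value the way a Unix "ls -l" listing would.
listing = stat.filemode(stat.S_IFREG | 0o754)

print(oct(owner_rw), bits_600, bits_777, listing)
```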

This scheme stems from Unix file permission settings, but the
call works on Windows as well, where it can only set or clear the
file’s read-only flag. If it’s puzzling, see your system’s
documentation (e.g., a Unix manpage) for chmod. Moving on:

>>> os.rename(r'C:\temp\spam.txt', r'C:\temp\eggs.txt')    # from, to
>>> os.remove(r'C:\temp\spam.txt')                         # delete file?
WindowsError: [Error 2] The system cannot find the file specified: 'C:\\temp\\...'
>>> os.remove(r'C:\temp\eggs.txt')

The os.rename call used here changes a file’s name; the os.remove
file deletion call deletes a file from your system and is synonymous
with os.unlink (the latter reflects the call’s name on Unix but was
obscure to users of other platforms). [10] The os module also exports
the stat system call:

>>> open('spam.txt', 'w').write('Hello stat world\n')      # +1 for \r added
17
>>> import os
>>> info = os.stat(r'C:\temp\spam.txt')
>>> info
nt.stat_result(st_mode=33206, st_ino=0, st_dev=0, st_nlink=0, st_uid=0, st_gid=0,
st_size=18, st_atime=1267645806, st_mtime=1267646072, st_ctime=1267645806)
>>> info.st_mode, info.st_size                 # via named-tuple item attr names
(33206, 18)
>>> import stat
>>> info[stat.ST_MODE], info[stat.ST_SIZE]     # via stat module presets
(33206, 18)
>>> stat.S_ISDIR(info.st_mode), stat.S_ISREG(info.st_mode)
(False, True)

The os.stat call returns a tuple of values (really, in 3.X a
special kind of tuple with named items) giving low-level information
about the named file, and the stat module exports constants and
functions for querying this information in a portable way. For
instance, indexing an os.stat result on offset stat.ST_SIZE returns
the file’s size, and calling stat.S_ISDIR with the mode item from an
os.stat result checks whether the file is a directory. As shown
earlier, though, both of these operations are available in the os.path
module, too, so it’s rarely necessary to use os.stat except for
low-level file queries:

>>> path = r'C:\temp\spam.txt'
>>> os.path.isdir(path), os.path.isfile(path), os.path.getsize(path)
(False, True, 18)
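To see that the two interfaces agree without depending on a C:\temp folder, the following sketch repeats the comparison on a scratch file in a temporary directory. The file is written in binary mode here so that its size is 17 bytes on every platform; the file and variable names are illustrative only.

```python
import os
import stat
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'spam.txt')
    with open(path, 'wb') as file:             # binary: no \r added anywhere
        file.write(b'Hello stat world\n')

    info = os.stat(path)
    size = info.st_size
    size_matches = size == info[stat.ST_SIZE] == os.path.getsize(path)
    file_matches = stat.S_ISREG(info.st_mode) == os.path.isfile(path)
    dir_matches = stat.S_ISDIR(os.stat(folder).st_mode) == os.path.isdir(folder)

print(size, size_matches, file_matches, dir_matches)
```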
