By using the letter b in the open call, you can open binary data
files in a platform-neutral way and read and write their content with
normal file object methods. But how do you process binary data once it
has been read? It will be returned to your script as a simple string of
bytes, most of which are probably not printable characters.
If you just need to pass binary data along to another file or
program, your work is done: for instance, simply pass the byte string
to another file opened in binary mode. And if you just need to extract
a number of bytes from a specific position, string slicing will do the
job; you can even follow up with bitwise operations if you need to. To
get at the contents of binary data in a structured way, though, as well
as to construct its contents, the standard library struct module is a
more powerful alternative.
The struct module provides calls to pack and unpack binary data, as
though the data were laid out in a C-language struct declaration. It
can also compose and decompose data using any endianness you desire
(endianness determines whether the most significant byte of a
multibyte number is stored first or last). Building a binary data
file, for instance, is straightforward: pack Python values into a byte
string and write it to a file. The format string in the pack call here
means big-endian (>), with an integer, a four-character string, a
short (two-byte) integer, and a floating-point number:
>>> import struct
>>> data = struct.pack('>i4shf', 2, b'spam', 3, 1.234)
>>> data
b'\x00\x00\x00\x02spam\x00\x03?\x9d\xf3\xb6'
>>> file = open('data.bin', 'wb')
>>> file.write(data)
14
>>> file.close()
Notice how the struct module returns a bytes string: we’re in the
realm of binary data here, not text, and must use binary mode files to
store it. As usual, Python displays most of the packed binary data’s
bytes here with \xNN hexadecimal escape sequences, because the bytes
are not printable characters. To parse data like that which we just
produced, read it off the file and pass it to the struct module with
the same format string; you get back a tuple containing the values
parsed out of the string and converted to Python objects:
>>> import struct
>>> file = open('data.bin', 'rb')
>>> bytes = file.read()
>>> values = struct.unpack('>i4shf', bytes)
>>> values
(2, b'spam', 3, 1.2339999675750732)
Parsed-out strings are byte strings again, and we can apply
string and bitwise operations to probe deeper:
>>> bin(values[0] | 0b1)                         # accessing bits and bytes
'0b11'
>>> values[1], list(values[1]), values[1][0]
(b'spam', [115, 112, 97, 109], 115)
Also note that slicing comes in handy in this domain; to grab
just the four-character string in the middle of the packed binary data
we just read, we can simply slice it out. Numeric values could
similarly be sliced out and then passed to struct.unpack for
conversion:
>>> bytes
b'\x00\x00\x00\x02spam\x00\x03?\x9d\xf3\xb6'
>>> bytes[4:8]
b'spam'
>>> number = bytes[8:10]
>>> number
b'\x00\x03'
>>> struct.unpack('>h', number)
(3,)
Packed binary data crops up in many contexts, including some
networking tasks, and in data produced by other programming languages.
Because it’s not part of every programming job’s description, though,
we’ll defer to the struct module’s entry in the Python library manual
for more details.
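As a brief supplement to the session above, the following sketch shows
two more struct calls that tend to be useful when picking packed data
apart: calcsize, which reports how many bytes a format occupies, and
unpack_from, which parses at a byte offset without slicing first. It
also packs one integer under both byte orders for comparison. The
variable names here are illustrative only:

```python
import struct

FORMAT = '>i4shf'                      # same layout as the example above
record = struct.pack(FORMAT, 2, b'spam', 3, 1.234)

# calcsize reports how many bytes a format occupies
size = struct.calcsize(FORMAT)         # 4 + 4 + 2 + 4 bytes, no padding with '>'

# the same integer packed under both byte orders
big = struct.pack('>i', 2)             # most significant byte first
little = struct.pack('<i', 2)          # least significant byte first

# unpack_from parses at a byte offset without slicing first
(short_value,) = struct.unpack_from('>h', record, 8)

print(size, big, little, short_value)
```

Because the > prefix selects standard sizes with no alignment padding,
the computed size matches the length of the packed string exactly.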
Binary files also typically see action in random access processing.
Earlier, we mentioned that adding a + to the open mode string allows a
file to be both read and written. This mode is typically used in
conjunction with the file object’s seek method to support random
read/write access. Such flexible file processing modes allow us to read
bytes from one location, write to another, and so on. When scripts
combine this with binary file modes, they may fetch and update
arbitrary bytes within a file.
We used seek earlier to rewind files instead of closing and
reopening. As mentioned, read and write operations always take place at
the current position in the file; files normally start at offset 0 when
opened and advance as data is transferred. The seek call lets us move
to a new position for the next transfer operation by passing in a byte
offset.
Python’s seek method also accepts an optional second argument that
has one of three values: 0 for absolute file positioning (the default),
1 to seek relative to the current position, and 2 to seek relative to
the file’s end. That’s why passing just an offset of 0 to seek is
roughly a file rewind operation: it repositions the file to its
absolute start. In general, seek supports random access on a
byte-offset basis. Seeking to a multiple of a record’s size in a binary
file, for instance, allows us to fetch a record by its relative
position.
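The three positioning modes can be sketched as follows. To keep the
example self-contained, an in-memory io.BytesIO object stands in for a
real binary file (an assumption of this sketch, not something the text
requires), and the os.SEEK_CUR and os.SEEK_END names are used in place
of the literal 1 and 2:

```python
import io
import os

# An in-memory binary file stands in for a real one here
file = io.BytesIO(b'ssssssssppppppppaaaaaaaammmmmmmm')

file.seek(8)                    # absolute: byte 8 from the start (mode 0)
second = file.read(8)           # position is now 16

file.seek(8, os.SEEK_CUR)       # relative: skip 8 bytes past position 16
fourth = file.read(8)

file.seek(-8, os.SEEK_END)      # relative to the end: the last 8 bytes
last = file.read(8)

file.seek(0)                    # rewind to the absolute start
position = file.tell()

print(second, fourth, last, position)
```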
Although you can use seek without + modes in open (e.g., to just
read from random locations), it’s most flexible when combined with
input/output files. And while you can perform random access in text
mode, too, the fact that text modes perform Unicode encodings and
line-end translations makes them difficult to use when absolute byte
offsets and lengths are required for seeks and reads; your data may
look very different when stored in files. Text mode may also make your
data nonportable to platforms with different default encodings, unless
you’re willing to always specify an explicit encoding for opens. Except
for simple unencoded ASCII text without line-ends, seek tends to work
best with binary mode files.
To demonstrate, let’s create a file in w+b mode (equivalent to wb+)
and write some data to it; this mode allows us to both read and write,
but initializes the file to be empty if it’s already present (all w
modes do). After writing some data, we seek back to file start to read
its content (some integer return values are omitted in this example
again for brevity):
>>> records = [bytes([char] * 8) for char in b'spam']
>>> records
[b'ssssssss', b'pppppppp', b'aaaaaaaa', b'mmmmmmmm']
>>> file = open('random.bin', 'w+b')
>>> for rec in records:                          # write four records
...     size = file.write(rec)                   # bytes for binary mode
...
>>> file.flush()
>>> pos = file.seek(0)                           # read entire file
>>> print(file.read())
b'ssssssssppppppppaaaaaaaammmmmmmm'
Now, let’s reopen our file in r+b mode; this mode allows both reads
and writes again, but does not initialize the file to be empty. This
time, we seek and read in multiples of the size of data items
(“records”) stored, to both fetch and update them at random:
c:\temp> python
>>> file = open('random.bin', 'r+b')
>>> print(file.read())                           # read entire file
b'ssssssssppppppppaaaaaaaammmmmmmm'
>>> record = b'X' * 8
>>> file.seek(0)                                 # update first record
>>> file.write(record)
>>> file.seek(len(record) * 2)                   # update third record
>>> file.write(b'Y' * 8)
>>> file.seek(8)
>>> file.read(len(record))                       # fetch second record
b'pppppppp'
>>> file.read(len(record))                       # fetch next (third) record
b'YYYYYYYY'
>>> file.seek(0)                                 # read entire file
>>> file.read()
b'XXXXXXXXppppppppYYYYYYYYmmmmmmmm'

c:\temp> type random.bin                         # the view outside Python
XXXXXXXXppppppppYYYYYYYYmmmmmmmm
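The seek-then-transfer pattern in this session is easy to package up.
The following is only a sketch: read_record and write_record are
hypothetical helper names, not standard library calls, and the demo
runs on a scratch file in a temporary directory so that it does not
disturb random.bin:

```python
import os
import tempfile

RECLEN = 8  # fixed record size in bytes, as in the session above

def read_record(file, index, reclen=RECLEN):
    """Fetch the record at a zero-based index."""
    file.seek(index * reclen)
    return file.read(reclen)

def write_record(file, index, record, reclen=RECLEN):
    """Overwrite the record at a zero-based index in place."""
    if len(record) != reclen:
        raise ValueError('record must be exactly %d bytes' % reclen)
    file.seek(index * reclen)
    file.write(record)

# demo on a scratch file rather than random.bin
path = os.path.join(tempfile.mkdtemp(), 'records.bin')
with open(path, 'w+b') as file:
    for char in b'spam':
        file.write(bytes([char] * RECLEN))       # four 8-byte records
    write_record(file, 2, b'Y' * RECLEN)         # update third record
    second = read_record(file, 1)
    third = read_record(file, 2)

print(second, third)
```

Checking the record length before writing keeps an update from
spilling into the neighboring record, which a bare seek and write
would not prevent.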
Finally, keep in mind that seek can be used to achieve random
access, even if it’s just for input. The following seeks in multiples
of record size to read (but not write) fixed-length records at random.
Notice that it also uses r text mode: since this data is simple ASCII
text bytes and has no line-ends, text and binary modes work the same on
this platform:
c:\temp> python
>>> file = open('random.bin', 'r')               # text mode ok if no encoding/endlines
>>> reclen = 8
>>> file.seek(reclen * 3)                        # fetch record 4
>>> file.read(reclen)
'mmmmmmmm'
>>> file.seek(reclen * 1)                        # fetch record 2
>>> file.read(reclen)
'pppppppp'
>>> file = open('random.bin', 'rb')              # binary mode works the same here
>>> file.seek(reclen * 2)                        # fetch record 3
>>> file.read(reclen)                            # returns byte strings
b'YYYYYYYY'
But unless your file’s content is always a simple unencoded text
form like ASCII and has no translated line-ends, text mode should not
generally be used if you are going to seek: line-ends may be translated
on Windows and Unicode encodings may make arbitrary transformations,
both of which can make absolute seek offsets difficult to use. In the
following, for example, the positions of characters after the first
non-ASCII character no longer match between the string in Python and
its encoded representation in the file:
>>> data = 'sp\xe4m'                             # data to your script
>>> data, len(data)                              # 4 unicode chars, 1 nonascii
('späm', 4)
>>> data.encode('utf8'), len(data.encode('utf8'))     # bytes written to file
(b'sp\xc3\xa4m', 5)
>>> f = open('test', mode='w+', encoding='utf8') # use text mode, encoded
>>> f.write(data)
>>> f.flush()
>>> f.seek(0); f.read(1)                         # ascii bytes work
's'
>>> f.seek(2); f.read(1)                         # as does 2-byte nonascii
'ä'
>>> data[3]                                      # but offset 3 is not 'm' !
'm'
>>> f.seek(3); f.read(1)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa4 in position 0:
unexpected code byte
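One way to see where the character really lives in the file is to
encode everything that precedes it and count the bytes. The sketch
below does this for the same string; byte_offset is a hypothetical
helper, and the approach assumes an encoding such as UTF-8 that adds no
byte order marker to the front of encoded text:

```python
data = 'sp\xe4m'
encoded = data.encode('utf8')          # b'sp\xc3\xa4m', 5 bytes

def byte_offset(text, index, encoding='utf8'):
    """Byte position of text[index] once text is encoded."""
    return len(text[:index].encode(encoding))

offset = byte_offset(data, 3)          # the 'ä' before it takes two bytes
char = encoded[offset:offset + 1].decode('utf8')

print(offset, char)
```

Character 3 of the string sits at byte 4 of the encoded data, which is
why seeking to offset 3 lands in the middle of the two-byte character.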
As you can see, Python’s file modes provide flexible file
processing for programs that require it. In fact, the os module offers
even more file processing options, as the next section describes.
The os module contains an additional set of file-processing
functions that are distinct from the built-in file object tools
demonstrated in previous examples. For instance, here is a partial list
of os file-related calls:
os.open(path, flags, mode)
    Opens a file and returns its descriptor
os.read(descriptor, N)
    Reads at most N bytes and returns a byte string
os.write(descriptor, string)
    Writes bytes in byte string string to the file
os.lseek(descriptor, position, how)
    Moves to position in the file
Technically, os calls process files by their descriptors, which are
integer codes or “handles” that identify files in the operating system.
Descriptor-based files deal in raw bytes, and have no notion of the
line-end or Unicode translations for text that we studied in the prior
section. In fact, apart from extras like buffering, descriptor-based
files generally correspond to binary mode file objects, and we
similarly read and write bytes strings, not str strings. However,
because the descriptor-based file tools in os are lower level and more
complex than the built-in file objects created with the built-in open
function, you should generally use the latter for all but very special
file-processing needs. [9]
To give you the general flavor of this tool set, though, let’s run
a few interactive experiments. Although built-in file objects and os
module descriptor files are processed with distinct tool sets, they are
in fact related: the I/O layer used by file objects simply adds logic
on top of descriptor-based files.
In fact, the fileno file object method returns the integer
descriptor associated with a built-in file object. For instance, the
standard stream file objects have descriptors 0, 1, and 2; calling the
os.write function to send data to stdout by descriptor has the same
effect as calling the sys.stdout.write method:
>>> import sys
>>> for stream in (sys.stdin, sys.stdout, sys.stderr):
...     print(stream.fileno())
...
0
1
2
>>> sys.stdout.write('Hello stdio world\n')      # write via file method
Hello stdio world
18
>>> import os
>>> os.write(1, b'Hello descriptor world\n')     # write via os module
Hello descriptor world
23
Because file objects we open explicitly behave the same way, it’s
also possible to process a given real external file on the underlying
computer through the built-in open function, tools in the os module, or
both (some integer return values are omitted here for brevity):
>>> file = open(r'C:\temp\spam.txt', 'w')        # create external file, object
>>> file.write('Hello stdio file\n')             # write via file object method
>>> file.flush()                                 # else os.write to disk first!
>>> fd = file.fileno()                           # get descriptor from object
>>> fd
3
>>> import os
>>> os.write(fd, b'Hello descriptor file\n')     # write via os module
>>> file.close()

C:\temp> type spam.txt                           # lines from both schemes
Hello stdio file
Hello descriptor file
So why the extra file tools in os? In short, they give more
low-level control over file processing. The built-in open function is
easy to use, but it may be limited by the underlying filesystem that it
uses, and it may add extra behavior that we do not want. The os module
lets scripts be more specific; for example, the following opens a
descriptor-based file in read-write and binary modes by performing a
bitwise “or” on two mode flags exported by os:
>>> fdfile = os.open(r'C:\temp\spam.txt', (os.O_RDWR | os.O_BINARY))
>>> os.read(fdfile, 20)
b'Hello stdio file\r\nHe'
>>> os.lseek(fdfile, 0, 0)                       # go back to start of file
>>> os.read(fdfile, 100)                         # binary mode retains "\r\n"
b'Hello stdio file\r\nHello descriptor file\n'
>>> os.lseek(fdfile, 0, 0)
>>> os.write(fdfile, b'HELLO')                   # overwrite first 5 bytes
5

C:\temp> type spam.txt
HELLO stdio file
Hello descriptor file
In this case, binary mode strings rb+ and r+b in the basic open
call are equivalent:
>>> file = open(r'C:\temp\spam.txt', 'rb+')      # same but with open/objects
>>> file.read(20)
b'HELLO stdio file\r\nHe'
>>> file.seek(0)
>>> file.read(100)
b'HELLO stdio file\r\nHello descriptor file\n'
>>> file.seek(0)
>>> file.write(b'Jello')
5
>>> file.seek(0)
>>> file.read()
b'Jello stdio file\r\nHello descriptor file\n'
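One portability caveat about the descriptor sessions above: the
os.O_BINARY flag is defined only on Windows. A sketch of the same
descriptor-level steps that should run on other platforms as well
follows; the getattr fallback and the scratch file in a temporary
directory are choices made for this illustration, not part of the
original sessions:

```python
import os
import tempfile

# O_BINARY is defined only on Windows; fall back to 0 elsewhere
BINARY = getattr(os, 'O_BINARY', 0)

path = os.path.join(tempfile.mkdtemp(), 'spam.txt')   # scratch file
fd = os.open(path, os.O_RDWR | os.O_CREAT | BINARY)
try:
    os.write(fd, b'Hello descriptor file\n')
    os.lseek(fd, 0, os.SEEK_SET)       # go back to start of file
    os.write(fd, b'HELLO')             # overwrite first 5 bytes
    os.lseek(fd, 0, os.SEEK_SET)
    content = os.read(fd, 100)         # read at most 100 bytes
finally:
    os.close(fd)                       # descriptors must be closed by hand

print(content)
```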
But on some systems, os.open flags let us specify more advanced
things like exclusive access (O_EXCL) and nonblocking modes
(O_NONBLOCK) when a file is opened. Some of these flags are not
portable across platforms (another reason to use built-in file objects
most of the time); see the library manual or run a dir(os) call on your
machine for an exhaustive list of other open flags available.
One final note here: using os.open with the O_EXCL flag (combined
with O_CREAT, so that the open fails if the file already exists) is the
most portable way to lock files for concurrent updates or other process
synchronization in Python today. We’ll see contexts where this can
matter in the next chapter, when we begin to explore multiprocessing
tools. Programs running in parallel on a server machine, for instance,
may need to lock files before performing updates, if multiple threads
or processes might attempt such updates at the same time.
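A lock-file scheme along these lines might be sketched as follows. The
acquire_lock and release_lock helpers and the update.lock filename are
hypothetical, and the lock is purely cooperative: it protects only
against other programs that follow the same convention:

```python
import os
import tempfile

def acquire_lock(lockpath):
    """Try to create the lock file; return a descriptor, or None if held."""
    try:
        # O_CREAT | O_EXCL makes the open fail if the file already exists
        return os.open(lockpath, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return None

def release_lock(fd, lockpath):
    """Close the descriptor and delete the lock file."""
    os.close(fd)
    os.remove(lockpath)

lockpath = os.path.join(tempfile.mkdtemp(), 'update.lock')  # hypothetical name
first = acquire_lock(lockpath)         # succeeds: no lock file yet
second = acquire_lock(lockpath)        # None: the lock is already held
release_lock(first, lockpath)
third = acquire_lock(lockpath)         # succeeds again after release
release_lock(third, lockpath)
```

A real program would also need a policy for stale lock files left
behind by a process that exits without releasing its lock.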
We saw earlier how to go from file object to file descriptor with
the fileno file object method; given a descriptor, we can use os module
tools for lower-level file access to the underlying file. We can also
go the other way: the os.fdopen call wraps a file descriptor in a file
object. Because conversions work both ways, we can generally use either
tool set, file object or os module:
>>> fdfile = os.open(r'C:\temp\spam.txt', (os.O_RDWR | os.O_BINARY))
>>> fdfile
3
>>> objfile = os.fdopen(fdfile, 'rb')
>>> objfile.read()
b'Jello stdio file\r\nHello descriptor file\n'
In fact, we can wrap a file descriptor in either a binary or
text-mode file object: in text mode, reads and writes perform the
Unicode encodings and line-end translations we studied earlier and
deal in str strings instead of bytes:
C:\...\PP4E\System> python
>>> import os
>>> fdfile = os.open(r'C:\temp\spam.txt', (os.O_RDWR | os.O_BINARY))
>>> objfile = os.fdopen(fdfile, 'r')
>>> objfile.read()
'Jello stdio file\nHello descriptor file\n'
In Python 3.X, the built-in open call also accepts a file
descriptor instead of a file name string; in this mode it works much
like os.fdopen, but gives you greater control: for example, you can use
additional arguments to specify a nondefault Unicode encoding for text
and suppress the default descriptor close. Really, though, os.fdopen
accepts the same extra-control arguments in 3.X, because it has been
redefined to do little but call back to the built-in open (see os.py in
the standard library):
C:\...\PP4E\System> python
>>> import os
>>> fdfile = os.open(r'C:\temp\spam.txt', (os.O_RDWR | os.O_BINARY))
>>> fdfile
3
>>> objfile = open(fdfile, 'r', encoding='latin1', closefd=False)
>>> objfile.read()
'Jello stdio file\nHello descriptor file\n'
>>> objfile = os.fdopen(fdfile, 'r', encoding='latin1', closefd=True)
>>> objfile.seek(0)
>>> objfile.read()
'Jello stdio file\nHello descriptor file\n'
We’ll make use of this file object wrapper technique to simplify
text-oriented pipes and other descriptor-like objects later in this
book (e.g., sockets have a makefile method that achieves similar
effects).
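As a small preview of the pipe case, the sketch below wraps both ends
of an os.pipe in text-mode file objects, so that the program reads and
writes str strings while raw bytes travel through the descriptors. The
variable names and the UTF-8 encoding are choices made for this
illustration:

```python
import os

readfd, writefd = os.pipe()            # two raw descriptors

# wrap each end in a text-mode file object
writer = os.fdopen(writefd, 'w', encoding='utf8')
reader = os.fdopen(readfd, 'r', encoding='utf8')

writer.write('sp\xe4m\n')              # str in, encoded on the way out
writer.close()                         # flush and close the write end

line = reader.readline()               # bytes decoded back to str
reader.close()

print(repr(line))
```

Closing each wrapper also closes its underlying descriptor, which is
the default behavior that the closefd argument can suppress.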
The os module also includes an assortment of file tools that accept
a file pathname string and accomplish file-related tasks such as
renaming (os.rename), deleting (os.remove), and changing the file’s
owner and permission settings (os.chown, os.chmod). Let’s step through
a few examples of these tools in action:
>>> os.chmod('spam.txt', 0o777)                  # enable all accesses
This os.chmod file permissions call passes a 9-bit string composed
of three sets of three bits each. From left to right, the three sets
represent the file’s owning user, the file’s group, and all others.
Within each set, the three bits reflect read, write, and execute
access permissions. When a bit is “1” in this string, it means that
the corresponding operation is allowed for the accessor. For instance,
octal 0o777 is a string of nine “1” bits in binary, so it enables all
three kinds of accesses for all three user groups; octal 0o600 means
that the file can be read and written only by the user that owns it
(when written in binary, 0o600 octal is really bits 110 000 000).
This scheme stems from Unix file permission settings. The call
also runs on Windows, though there it can set only the file’s
read-only flag. If the scheme is puzzling, see your system’s
documentation (e.g., a Unix manpage) for chmod.
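To make the bit layout concrete, the following sketch prints a
permission value as three groups of three bits and rebuilds the
owner-only setting from the named bit constants in the stat module.
The show_bits helper is hypothetical, written just for this
illustration:

```python
import stat

def show_bits(mode):
    """Render the low nine permission bits as three groups of three."""
    bits = format(mode & 0o777, '09b')
    return ' '.join(bits[i:i + 3] for i in (0, 3, 6))

all_access = show_bits(0o777)          # user, group, and others: rwx each
owner_only = show_bits(0o600)          # read and write for the owner only

# the stat module names the individual bits
same = (stat.S_IRUSR | stat.S_IWUSR) == 0o600

print(all_access)
print(owner_only)
print(same)
```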
Moving on:
>>> os.rename(r'C:\temp\spam.txt', r'C:\temp\eggs.txt')       # from, to
>>> os.remove(r'C:\temp\spam.txt')                            # delete file?
WindowsError: [Error 2] The system cannot find the file specified: 'C:\\temp\\...'
>>> os.remove(r'C:\temp\eggs.txt')
The os.rename call used here changes a file’s name; the os.remove
file deletion call deletes a file from your system and is synonymous
with os.unlink (the latter reflects the call’s name on Unix but was
obscure to users of other platforms). [10]
The os module also exports the stat system call:
>>> open('spam.txt', 'w').write('Hello stat world\n')      # +1 for \r added
17
>>> import os
>>> info = os.stat(r'C:\temp\spam.txt')
>>> info
nt.stat_result(st_mode=33206, st_ino=0, st_dev=0, st_nlink=0, st_uid=0, st_gid=0,
st_size=18, st_atime=1267645806, st_mtime=1267646072, st_ctime=1267645806)
>>> info.st_mode, info.st_size                   # via named-tuple item attr names
(33206, 18)
>>> import stat
>>> info[stat.ST_MODE], info[stat.ST_SIZE]       # via stat module presets
(33206, 18)
>>> stat.S_ISDIR(info.st_mode), stat.S_ISREG(info.st_mode)
(False, True)
The os.stat call returns a tuple of values (really, in 3.X a
special kind of tuple with named items) giving low-level information
about the named file, and the stat module exports constants and
functions for querying this information in a portable way. For
instance, indexing an os.stat result on offset stat.ST_SIZE returns the
file’s size, and calling stat.S_ISDIR with the mode item from an
os.stat result checks whether the file is a directory. As shown
earlier, though, both of these operations are available in the os.path
module, too, so it’s rarely necessary to use os.stat except for
low-level file queries:
>>> path = r'C:\temp\spam.txt'
>>> os.path.isdir(path), os.path.isfile(path), os.path.getsize(path)
(False, True, 18)
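The equivalence between the two tool sets can be checked directly. The
sketch below runs both on a scratch file in a temporary directory (an
assumption made so that the example runs anywhere) and writes in binary
mode, so no \r is added and the size is the same on every platform:

```python
import os
import stat
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'spam.txt')   # scratch file
with open(path, 'wb') as file:                        # binary: no \r added
    file.write(b'Hello stat world\n')

info = os.stat(path)

# the same facts through os.stat and through os.path
size_matches = info.st_size == os.path.getsize(path) == 17
is_file = stat.S_ISREG(info.st_mode) and os.path.isfile(path)
is_dir = stat.S_ISDIR(info.st_mode) or os.path.isdir(path)

print(size_matches, is_file, is_dir)
```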