[
8
]
Notice thatinput
raises an
exception to signal end-of-file, but file read methods simply return
an empty string for this condition. Becauseinput
also strips the end-of-line
character at the end of lines, an empty string result means an empty
line, so an exception is necessary to specify the end-of-file
condition. File read methods retain the end-of-line character and
denote an empty line as"\n"
instead of""
. This is one way in
which readingsys.stdin
directly
differs frominput
. The latter
also accepts a prompt string that is automatically printed before
input is accepted.
This chapter continues our look at system interfaces in Python by
focusing on file and directory-related tools. As you’ll see, it’s easy to
process files and directory trees with Python’s built-in and standard
library support. Because files are part of the core Python language, some
of this chapter’s material is a review of file basics covered in books
like
Learning
Python
, Fourth Edition, and we’ll defer to such
resources for more background details on some file-related concepts. For
example, iteration, context managers, and the file object’s support for
Unicode encodings are demonstrated along the way, but these topics are not
repeated in full here. This chapter’s goal is to tell enough of the file
story to get you started writing useful scripts.
External files
are at the heart of much of what we do with system
utilities. For instance, a testing system may read its inputs from one
file, store program results in another file, and check expected results by
loading yet another file. Even user interface and Internet-oriented
programs may load binary images and audio clips from files on the
underlying computer. It’s a core programming concept.
In Python, the built-inopen
function is the primary tool scripts use to access the files
on the underlying computer system. Since this function is an inherent part
of the Python language, you may already be familiar with its basic
workings. When called, theopen
function returns a new
file object
that is connected to the
external file; the file object has methods that transfer data to and from
the file and perform a variety of file-related operations. Theopen
function also provides a
portable
interface to the underlying filesystem—it
works the same way on every platform on which Python runs.
Other file-related modules built into Python allow us to do things
such as manipulate lower-level descriptor-based files (os
); copy, remove, and move files and
collections of files (os
andshutil
); store data and objects in files by key
(dbm
andshelve
); and access SQL databases (sqlite3
and third-party add-ons). The last two
of these categories are related to database topics, addressed in
Chapter 17
.
In this section, we’ll take a brief tutorial look at the built-in
file object and explore a handful of more advanced file-related topics. As
usual, you should consult either Python’s library manual or reference
books such as
Python Pocket
Reference
for further details and methods we don’t have
space to cover here. Remember, for quick interactive help, you can also
rundir(
file
)
on an open file object to see an attributes
list that includes methods;help(
file
)
for general help; andhelp(
file
.read)
for help on a specific method such asread
, though the file object
implementation in 3.1 provides less information forhelp
than the library manual and other
resources.
Just like the string
types we noted in
Chapter 2
, file
support in Python 3.X is a bit richer than it was in the past. As we
noted earlier, in Python 3.Xstr
strings always represent Unicode text (ASCII or wider), andbytes
andbytearray
strings represent raw binary data.
Python 3.X draws a similar and related distinction between files
containing text and binary data:
Text files
contain Unicode text. In your script, text file
content is always astr
string—
a sequence of characters
(technically, Unicode “code points”). Text files perform the
automatic line-end translations described in this chapter by default
and automatically apply Unicode encodings to file content: they
encode to and decode from raw binary bytes on transfers to and from
the file, according to a provided or default encoding name. Encoding
is trivial for ASCII text, but may be sophisticated in other
cases.
Binary files
contain raw 8-bit bytes. In your script, binary file
content is always a byte string, usually abytes
object—a sequence of small integers,
which supports moststr
operations and displays as ASCII characters whenever possible.
Binary files perform no translations of data when it is transferred
to and from files: no line-end translations or Unicode encodings are
performed.
In practice, text files are used for all truly text-related data,
and binary files store items like packed binary data, images, audio
files, executables, and so on. As a programmer you distinguish between
the two file types in the mode string argument you pass toopen
: adding a “b” (e.g.,'rb'
,'wb'
)
means the file contains binary data. For coding new file content, use
normal strings for text (e.g.,'spam'
orbytes.decode()
) and byte strings
for binary (e.g.,b'spam'
orstr.encode()
).
Unless your file scope is limited to ASCII text, the 3.X
text/binary distinction can sometimes impact your code. Text files
create and requirestr
strings, and
binary files use byte strings; because you cannot freely mix the two
string types in expressions, you must choose file mode carefully. Many
built-in tools we’ll use in this book make the choice for us; thestruct
andpickle
modules, for instance, deal in byte
strings in 3.X, and thexml
package
in Unicodestr
. You must even be
aware of the 3.X text/binary distinction when using system tools like
pipe descriptors and
sockets
, because they transfer
data as byte strings today (though their content can be decoded and
encoded as Unicode text if needed).
Moreover, because text-mode files require that content be
decodable per a Unicode encoding scheme, you must read undecodable file
content in binary mode, as byte strings (or catch Unicode exceptions intry
statements and skip the file
altogether). This may include both truly binary files as well as text
files that use encodings that are nondefault and unknown. As we’ll see
later in this chapter, becausestr
strings are always Unicode in 3.X, it’s sometimes also necessary to
select byte string mode for the names of files in directory tools such
asos.listdir
,glob.glob
, andos.walk
if they cannot be decoded (passing in
byte strings essentially suppresses decoding).
In fact, we’ll see examples where the Python 3.X distinction
betweenstr
text andbytes
binary pops up in tools beyond basic
files throughout this book—in Chapters
5
and
12
when we explore sockets; in Chapters
6
and
11
when we’ll need to ignore Unicode
errors in file and directory searches; in
Chapter 12
, where we’ll see how client-side Internet
protocol modules such as FTP and email, which run atop sockets, imply
file modes and encoding requirements; and more.
But just as for string types, although we will see some of these
concepts in action in this chapter, we’re going to take much of this
story as a given here. File and string objects are core language
material and are prerequisite to this text. As mentioned earlier,
because they are addressed by a 45-page chapter in the book
Learning
Python
, Fourth Edition, I won’t repeat their coverage
in full in this book. If you find yourself confused by the Unicode and
binary file and string concepts in the following sections, I encourage
you to refer to that text or other resources for more background
information in this
domain
.
Despite the text/binary
dichotomy in Python 3.X, files are still very
straightforward to use. For most purposes, in fact, theopen
built-in function
and its files objects are all you need to remember to process files in
your scripts. The file object returned byopen
has methods for reading data (read
,readline
,readlines
); writing data (write
,writelines
); freeing system resources
(close
); moving to arbitrary
positions in the file (seek
); forcing
data in output buffers to be transferred to disk (flush
); fetching the underlying file handle
(fileno
); and more. Since the
built-in file object is so easy to use, let’s jump right into a few
interactive examples.
To make a new
file, callopen
with
two arguments: the external
name
of the file to
be created and a
mode
stringw
(short for
write
). To
store data on the file, call the
file object’swrite
method with a string containing the data to store, and then call
theclose
method to
close the file. Filewrite
calls
return the number of characters or bytes written (which we’ll
sometimes omit in this book to save space), and as we’ll see,close
calls are often optional, unless you
need to open and read the file again during the same program or
session:
C:\temp>python
>>>file = open('data.txt', 'w')
# open output file object: creates
>>>file.write('Hello file world!\n')
# writes strings verbatim
18
>>>file.write('Bye file world.\n')
# returns number chars/bytes written
18
>>>file.close()
# closed on gc and exit too
And that’s it—you’ve just generated a brand-new text file on
your computer, regardless of the computer on which you type this
code:
C:\temp>dir data.txt /B
data.txt
C:\temp>type data.txt
Hello file world!
Bye file world.
There is nothing unusual about the new file; here, I use the DOSdir
andtype
commands to list and display the new
file, but it shows up in a file explorer GUI, too.
In theopen
function
call shown in the preceding example, the first
argument can optionally specify a complete directory path as part of
the filename string. If we pass just a simple filename without a
path, the file will appear in Python’s current working directory.
That is, it shows up in the place where the code is run. Here, the
directory
C:\temp
on my machine is implied by
the bare filename
data.txt
, so this actually
creates a file at
C:\temp\data.txt
. More
accurately, the filename is relative to the current working
directory if it does not include a complete absolute directory path.
See
Current Working Directory
(
Chapter 3
), for a refresher on this
topic.
Also note that when opening inw
mode, Python either creates the external
file if it does not yet exist or erases the file’s current contents
if it is already present on your machine (so be careful out
there—you’ll delete whatever was in the file before).
Notice that we
added an explicit\n
end-of-line character to lines written
to the file; unlike theprint
built-in function, file objectwrite
methods write exactly what they are
passed without adding any extra formatting. The string passed towrite
shows up character for
character on the external file. In text files, data written may
undergo line-end or Unicode translations which we’ll describe ahead,
but these are undone when the data is later read back.
Output files also sport
awritelines
method, which simply writes all of the strings in a list one at a
time without adding any extra formatting. For example, here is awritelines
equivalent to the twowrite
calls shown earlier:
file.writelines(['Hello file world!\n', 'Bye file world.\n'])
This call isn’t as commonly used (and can be emulated with a
simplefor
loop or other
iteration tool), but it is convenient in scripts that save output in
a list to be written later.
The fileclose
method
used earlier finalizes file contents and frees up
system resources. For instance, closing forces buffered output data
to be flushed out to disk. Normally, files are automatically closed
when the file object is garbage collected by the interpreter (that
is, when it is no longer referenced). This includes all remaining
open files when the Python session or program exits. Because of
that,close
calls are often
optional. In fact, it’s common to see file-processing code in Python
in this idiom:
open('somefile.txt', 'w').write("G'day Bruce\n") # write to temporary object
open('somefile.txt', 'r').read() # read from temporary object
Since both these expressions make a temporary file object, use
it immediately, and do not save a reference to it, the file object
is reclaimed right after data is transferred, and is automatically
closed in the process. There is usually no need for such code to
call theclose
method
explicitly.
In some contexts, though, you may wish to explicitly close
anyhow:
For one, because the Jython implementation relies on
Java’s garbage collector, you can’t always be as sure about when
files will be reclaimed as you can in standard Python. If you
run your Python code with Jython, you may need to close manually
if many files are created in a short amount of time (e.g. in a
loop), in order to avoid running out of file resources on
operating systems where this matters.
For another, some IDEs, such as Python’s standard IDLE
GUI, may hold on to your file objects longer than you expect (in
stack tracebacks of prior errors, for instance), and thus
prevent them from being garbage collected as soon as you might
expect. If you write to an output file in IDLE, be sure to
explicitly close (or flush) your file if you need to reliably
read it back during the same IDLE session. Otherwise, output
buffers might not be flushed to disk and your file may be
incomplete when read.
And while it seems very unlikely today, it’s not
impossible that this auto-close on reclaim file feature could
change in future. This is technically a feature of the file
object’s implementation, which may or may not be considered part
of the language definition over time.
For these reasons, manual close calls are not a bad idea in
nontrivial programs, even if they are technically not required.
Closing is a generally harmless but robust habit to
form.
Manual file
close method calls are easy in straight-line code, but
how do you ensure file closure when exceptions might kick your program
beyond the point where the close call is coded? First of all, make
sure you must—files close themselves when they are collected, and this
will happen eventually, even when exceptions occur.
If closure is required, though, there are two basic
alternatives: thetry
statement’sfinally
clause is the most general,
since it allows you to provide general exit actions for any type of
exceptions:
myfile = open(filename, 'w')
try:
...process myfile...
finally:
myfile.close()
In recent Python releases, though, thewith
statement
provides a more concise alternative for some specific objects and exit
actions, including closing files:
with open(filename, 'w') as myfile:
...process myfile, auto-closed on statement exit...
This statement relies on the file object’s context manager: code
automatically run both on statement entry and on statement exit
regardless of exception behavior. Because the file object’s exit code
closes the file automatically, this guarantees file closure whether an
exception occurs during the statement or not.
Thewith
statement is notably
shorter (3 lines) than thetry
/finally
alternative, but it’s also less
general—with
applies only to
objects that support the context manager protocol, whereastry
/finally
allows arbitrary exit actions for
arbitrary exception contexts. While some other object types have
context managers, too (e.g., thread locks),with
is limited in scope. In fact, if you
want to remember just one exit actions option,try
/finally
is the most inclusive. Still,with
yields less code for files
that must be closed and can serve well in such specific roles. It can
even save a line of code when no
exceptions
are expected (albeit at the
expense of further nesting and indenting file
processing
logic):
myfile = open(filename, 'w') # traditional form
...process myfile...
myfile.close()
with open(filename) as myfile: # context manager form
...process myfile...
In Python 3.1 and later, this statement can also specify
multiple (a.k.a. nested) context managers—any number of context
manager items may be separated by commas, and multiple items work the
same as nestedwith
statements. In
general terms, the 3.1 and later code:
with A() as a, B() as b:
...statements...
Runs the same as the following, which works in 3.1, 3.0, and
2.6:
with A() as a:
with B() as b:
...statements...
For example, when thewith
statement block exits in the following, both files’ exit actions are
automatically run to close the files, regardless of exception
outcomes:
with open('data') as fin, open('results', 'w') as fout:
for line in fin:
fout.write(transform(line))
Context manager–dependent code like this seems to have become
more common in recent years, but this is likely at least in part
because newcomers are accustomed to languages that require manual
close calls in all cases. In most contexts there is no need to wrap
all your Python file-processing code inwith
statements—the files object’s
auto-close-on-collection behavior often suffices, and manual close
calls are enough for many other scripts. You should use thewith
ortry
options outlined here only if you must
close, and only in the presence of potential exceptions. Since
standard C Python automatically closes files on collection, though,
neither option is required in many (and perhaps
most) scripts.
Reading data
from external files is just as easy as writing, but
there are more methods that let us load data in a variety of modes.
Input text files are opened with either a mode flag ofr
(for “read”) or no mode flag at all—it
defaults tor
if omitted, and it
commonly is. Once opened, we can read the lines of a text file with
thereadlines
method:
C:\temp>python
>>>file = open('data.txt')
# open input file object: 'r' default
>>>lines = file.readlines()
# read into line string list
>>>for line in lines:
# BUT use file line iterator! (ahead)
...print(line, end='')
# lines have a '\n' at end
...
Hello file world!
Bye file world.
Thereadlines
method loads
the entire contents of the file into memory and gives it to our
scripts as a list of line strings that we can step through in a loop.
In fact, there are many ways to read an input file:
file.read()
Returns a string
containing all the characters (or bytes) stored in
the file
file.read(N)
Returns a string containing the next N characters (or
bytes) from the file
file.readline()
Reads through
the next\n
and
returns a line string
file.readlines()
Reads the entire
file and returns a list of line strings
Let’s run these method calls to read files, lines, and
characters from a text file—theseek(0)
call is used here before each test
to rewind the file to its beginning (more on this call in a
moment):
>>>file.seek(0)
# go back to the front of file
>>>file.read()
# read entire file into string
'Hello file world!\nBye file world.\n'
>>>file.seek(0)
# read entire file into lines list
>>>file.readlines()
['Hello file world!\n', 'Bye file world.\n']
>>>file.seek(0)
>>>file.readline()
# read one line at a time
'Hello file world!\n'
>>>file.readline()
'Bye file world.\n'
>>>file.readline()
# empty string at end-of-file
''
>>>file.seek(0)
# read N (or remaining) chars/bytes
>>>file.read(1), file.read(8)
# empty string at end-of-file
('H', 'ello fil')
All of these input methods let us be specific about how much to
fetch. Here are a few rules of thumb about which to choose:
read()
andreadlines()
load the
entire
file
into memory all at once. That makes them handy for
grabbing a file’s contents with as little code as possible. It
also makes them generally fast, but costly in terms of memory for
huge files—loading a multigigabyte file into memory is not
generally a good thing to do (and might not be possible at all on
a given computer).
On the other hand, because thereadline()
andread(N)
calls fetch just
part
of the file
(the next line or N-character-or-byte
block), they are safer for potentially big files but a bit less
convenient and sometimes slower. Both return an empty string when
they reach end-of-file. If speed matters and your files aren’t
huge,read
orreadlines
may be a generally better
choice.
See also the discussion of the newer file iterators in the
next section. As we’ll see, iterators combine the convenience ofreadlines()
with the space
efficiency ofreadline()
and
are the preferred way to read text files by lines today.
Theseek(0)
call used
repeatedly here means “go back to the start of the file.” In our
example, it is an alternative to reopening the file each time. In
files, all read and write operations take place at the current
position; files normally start at offset 0 when opened and advance as
data is transferred. Theseek
call
simply lets us move to a new position for the next transfer operation.
More on this method later when we explore random access
files.