Authors: Mark Lutz

Tags: #COMPUTERS / Programming Languages / Python

Programming Python (21 page)

BOOK: Programming Python

[8] Notice that input raises an exception to signal end-of-file, but
file read methods simply return an empty string for this condition.
Because input also strips the end-of-line character at the end of
lines, an empty string result means an empty line, so an exception is
necessary to specify the end-of-file condition. File read methods
retain the end-of-line character and denote an empty line as "\n"
instead of "". This is one way in which reading sys.stdin directly
differs from input. The latter also accepts a prompt string that is
automatically printed before input is accepted.
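As a rough sketch of the difference this note describes, the following
compares the two end-of-file signals. It assumes io.StringIO as a
stand-in for a real input stream, and the helper names are made up for
illustration only.

```python
import io
import sys

def lines_via_input(text):
    """Collect lines with input(), stopping on the EOFError it raises."""
    saved = sys.stdin
    sys.stdin = io.StringIO(text)      # assumption: stand-in for real stdin
    lines = []
    try:
        while True:
            try:
                lines.append(input())  # input strips the trailing '\n'
            except EOFError:           # end-of-file arrives as an exception
                break
    finally:
        sys.stdin = saved              # always restore the real stream
    return lines

def lines_via_readline(text):
    """Collect lines with readline(), stopping on the empty string."""
    stream = io.StringIO(text)
    lines = []
    while True:
        line = stream.readline()       # readline keeps the trailing '\n'
        if line == '':                 # end-of-file arrives as ''
            break
        lines.append(line)
    return lines

sample = 'spam\n\neggs\n'
print(lines_via_input(sample))         # the empty line shows up as ''
print(lines_via_readline(sample))      # the empty line shows up as '\n'
```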

Chapter 4. File and Directory Tools

“Erase Your Hard Drive in Five Easy Steps!”

This chapter continues our look at system interfaces in Python by
focusing on file and directory-related tools. As you’ll see, it’s easy to
process files and directory trees with Python’s built-in and standard
library support. Because files are part of the core Python language, some
of this chapter’s material is a review of file basics covered in books
like Learning Python, Fourth Edition, and we’ll defer to such
resources for more background details on some file-related concepts. For
example, iteration, context managers, and the file object’s support for
Unicode encodings are demonstrated along the way, but these topics are not
repeated in full here. This chapter’s goal is to tell enough of the file
story to get you started writing useful scripts.

File Tools

External files are at the heart of much of what we do with system
utilities. For instance, a testing system may read its inputs from one
file, store program results in another file, and check expected results by
loading yet another file. Even user interface and Internet-oriented
programs may load binary images and audio clips from files on the
underlying computer. It’s a core programming concept.

In Python, the built-in open function is the primary tool scripts use
to access the files on the underlying computer system. Since this
function is an inherent part of the Python language, you may already
be familiar with its basic workings. When called, the open function
returns a new file object that is connected to the external file; the
file object has methods that transfer data to and from the file and
perform a variety of file-related operations. The open function also
provides a portable interface to the underlying filesystem: it works
the same way on every platform on which Python runs.

Other file-related modules built into Python allow us to do things
such as manipulate lower-level descriptor-based files (os); copy,
remove, and move files and collections of files (os and shutil); store
data and objects in files by key (dbm and shelve); and access SQL
databases (sqlite3 and third-party add-ons). The last two of these
categories are related to database topics, addressed in Chapter 17.
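To make the copy, move, and remove roles of os and shutil a bit more
concrete, here is a minimal sketch. It works in a scratch directory
made with the tempfile module (an assumption for the demo, so no real
files are touched), and the function and file names are hypothetical.

```python
import os
import shutil
import tempfile

def copy_move_remove(workdir):
    """Copy, move, and remove a small file inside workdir; return the
    sorted directory listing observed after each step."""
    original = os.path.join(workdir, 'data.txt')
    with open(original, 'w') as f:
        f.write('Hello file world!\n')

    copied = os.path.join(workdir, 'copy.txt')
    moved = os.path.join(workdir, 'moved.txt')
    listings = []
    shutil.copy(original, copied)                 # copy a file
    listings.append(sorted(os.listdir(workdir)))
    shutil.move(copied, moved)                    # move (rename) the copy
    listings.append(sorted(os.listdir(workdir)))
    os.remove(moved)                              # delete the moved file
    listings.append(sorted(os.listdir(workdir)))
    return listings

with tempfile.TemporaryDirectory() as tmp:
    for listing in copy_move_remove(tmp):
        print(listing)
```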

In this section, we’ll take a brief tutorial look at the built-in
file object and explore a handful of more advanced file-related topics. As
usual, you should consult either Python’s library manual or reference
books such as Python Pocket Reference for further details and methods
we don’t have space to cover here. Remember, for quick interactive
help, you can also run dir(file) on an open file object to see an
attributes list that includes methods; help(file) for general help;
and help(file.read) for help on a specific method such as read, though
the file object implementation in 3.1 provides less information for
help than the library manual and other resources.

The File Object Model in Python 3.X

Just like the string types we noted in Chapter 2, file support in
Python 3.X is a bit richer than it was in the past. As we noted
earlier, in Python 3.X str strings always represent Unicode text
(ASCII or wider), and bytes and bytearray strings represent raw binary
data. Python 3.X draws a similar and related distinction between files
containing text and binary data:

Text files contain Unicode text. In your script, text file content is
always a str string, a sequence of characters (technically, Unicode
“code points”). Text files perform the automatic line-end translations
described in this chapter by default and automatically apply Unicode
encodings to file content: they encode to and decode from raw binary
bytes on transfers to and from the file, according to a provided or
default encoding name. Encoding is trivial for ASCII text, but may be
sophisticated in other cases.

Binary files contain raw 8-bit bytes. In your script, binary file
content is always a byte string, usually a bytes object: a sequence of
small integers, which supports most str operations and displays as
ASCII characters whenever possible. Binary files perform no
translations of data when it is transferred to and from files: no
line-end translations or Unicode encodings are performed.
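The two file flavors described above can be seen side by side in a
small sketch like the following. The UTF-8 encoding, the newline=''
setting (used here only to switch off line-end translation so the
bytes are the same on every platform), and the scratch file name are
all assumptions made for the demo.

```python
import os
import tempfile

def text_versus_binary(path):
    """Write one non-ASCII string in text mode, then read it back in
    both text mode and binary mode."""
    text = 'sp\xe4m\n'                        # a-umlaut in the middle
    with open(path, 'w', encoding='utf-8', newline='') as f:
        f.write(text)                         # str is encoded on output
    with open(path, 'r', encoding='utf-8', newline='') as f:
        as_text = f.read()                    # bytes are decoded to str
    with open(path, 'rb') as f:
        as_bytes = f.read()                   # raw bytes, nothing decoded
    return as_text, as_bytes

with tempfile.TemporaryDirectory() as tmp:
    as_text, as_bytes = text_versus_binary(os.path.join(tmp, 'uni.txt'))
    print(repr(as_text), len(as_text))        # 5 characters
    print(repr(as_bytes), len(as_bytes))      # 6 bytes: one char took two
    print(as_bytes[0])                        # indexing bytes gives an int
```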

In practice, text files are used for all truly text-related data,
and binary files store items like packed binary data, images, audio
files, executables, and so on. As a programmer you distinguish between
the two file types in the mode string argument you pass to open:
adding a “b” (e.g., 'rb', 'wb') means the file contains binary data.
For coding new file content, use normal strings for text (e.g., 'spam'
or bytes.decode()) and byte strings for binary (e.g., b'spam' or
str.encode()).

Unless your file scope is limited to ASCII text, the 3.X
text/binary distinction can sometimes impact your code. Text files
create and require str strings, and binary files use byte strings;
because you cannot freely mix the two string types in expressions, you
must choose file mode carefully. Many built-in tools we’ll use in this
book make the choice for us; the struct and pickle modules, for
instance, deal in byte strings in 3.X, and the xml package in Unicode
str. You must even be aware of the 3.X text/binary distinction when
using system tools like pipe descriptors and sockets, because they
transfer data as byte strings today (though their content can be
decoded and encoded as Unicode text if needed).
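A short sketch may help make these points tangible: mixing the two
string types fails, struct and pickle hand back byte strings, and
encode and decode convert between the types. The helper name and the
sample values are illustrative assumptions.

```python
import pickle
import struct

def mixing_fails():
    """Return True if adding a str and a bytes raises TypeError."""
    text, data = 'spam', b'spam'
    try:
        text + data                      # the two types do not mix
    except TypeError:
        return True
    return False

packed = struct.pack('>i', 1)            # one big-endian 4-byte integer
pickled = pickle.dumps([1, 2, 3])        # a serialized list object

print(mixing_fails())
print(packed, type(pickled).__name__)    # both results are bytes
print('spam'.encode('ascii'))            # str to bytes
print(b'spam'.decode('ascii'))           # bytes to str
```

Because both results are byte strings, files that store them would
need to be opened in a binary mode such as 'wb' or 'rb'.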

Moreover, because text-mode files require that content be
decodable per a Unicode encoding scheme, you must read undecodable file
content in binary mode, as byte strings (or catch Unicode exceptions in
try statements and skip the file altogether). This may include both
truly binary files as well as text files that use encodings that are
nondefault and unknown. As we’ll see later in this chapter, because
str strings are always Unicode in 3.X, it’s sometimes also necessary
to select byte string mode for the names of files in directory tools
such as os.listdir, glob.glob, and os.walk if they cannot be decoded
(passing in byte strings essentially suppresses decoding).
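One way to cope with undecodable content is sketched below: try text
mode first, and fall back to binary mode if decoding fails. The
function name, the UTF-8 default, and the scratch files are
assumptions for illustration; a real script might skip the file
instead, as the text suggests.

```python
import os
import tempfile

def read_text_or_bytes(path, encoding='utf-8'):
    """Try text mode first; fall back to binary mode when the content
    cannot be decoded with the given encoding."""
    try:
        with open(path, 'r', encoding=encoding) as f:
            return f.read()                   # a str on success
    except UnicodeDecodeError:
        with open(path, 'rb') as f:
            return f.read()                   # bytes on failure

with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, 'good.txt')
    bad = os.path.join(tmp, 'bad.bin')
    with open(good, 'wb') as f:
        f.write(b'spam\n')
    with open(bad, 'wb') as f:
        f.write(b'\xff\xfe\x00spam')          # not valid UTF-8
    print(repr(read_text_or_bytes(good)))
    print(repr(read_text_or_bytes(bad)))
    # Passing a bytes path makes os.listdir return bytes names, too.
    print(sorted(os.listdir(os.fsencode(tmp))))
```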

In fact, we’ll see examples where the Python 3.X distinction
between str text and bytes binary pops up in tools beyond basic files
throughout this book: in Chapters 5 and 12 when we explore sockets; in
Chapters 6 and 11 when we’ll need to ignore Unicode errors in file and
directory searches; in Chapter 12, where we’ll see how client-side
Internet protocol modules such as FTP and email, which run atop
sockets, imply file modes and encoding requirements; and more.

But just as for string types, although we will see some of these
concepts in action in this chapter, we’re going to take much of this
story as a given here. File and string objects are core language
material and are prerequisite to this text. As mentioned earlier,
because they are addressed by a 45-page chapter in the book Learning
Python, Fourth Edition, I won’t repeat their coverage in full in this
book. If you find yourself confused by the Unicode and binary file and
string concepts in the following sections, I encourage you to refer to
that text or other resources for more background information in this
domain.

Using Built-in File Objects

Despite the text/binary dichotomy in Python 3.X, files are still very
straightforward to use. For most purposes, in fact, the open built-in
function and its file objects are all you need to remember to process
files in your scripts. The file object returned by open has methods
for reading data (read, readline, readlines); writing data (write,
writelines); freeing system resources (close); moving to arbitrary
positions in the file (seek); forcing data in output buffers to be
transferred to disk (flush); fetching the underlying file handle
(fileno); and more. Since the built-in file object is so easy to use,
let’s jump right into a few interactive examples.

Output files

To make a new file, call open with two arguments: the external name of
the file to be created and a mode string w (short for write). To store
data on the file, call the file object’s write method with a string
containing the data to store, and then call the close method to close
the file. File write calls return the number of characters or bytes
written (which we’ll sometimes omit in this book to save space), and
as we’ll see, close calls are often optional, unless you need to open
and read the file again during the same program or session:

C:\temp> python
>>> file = open('data.txt', 'w')              # open output file object: creates
>>> file.write('Hello file world!\n')         # writes strings verbatim
18
>>> file.write('Bye   file world.\n')         # returns number chars/bytes written
18
>>> file.close()                              # closed on gc and exit too

And that’s it—you’ve just generated a brand-new text file on
your computer, regardless of the computer on which you type this
code:

C:\temp> dir data.txt /B
data.txt
C:\temp> type data.txt
Hello file world!
Bye   file world.

There is nothing unusual about the new file; here, I use the DOS dir
and type commands to list and display the new file, but it shows up in
a file explorer GUI, too.

Opening

In the open function call shown in the preceding example, the first
argument can optionally specify a complete directory path as part of
the filename string. If we pass just a simple filename without a
path, the file will appear in Python’s current working directory.
That is, it shows up in the place where the code is run. Here, the
directory C:\temp on my machine is implied by the bare filename
data.txt, so this actually creates a file at C:\temp\data.txt. More
accurately, the filename is relative to the current working directory
if it does not include a complete absolute directory path. See Current
Working Directory (Chapter 3) for a refresher on this topic.

Also note that when opening in w mode, Python either creates the
external file if it does not yet exist or erases the file’s current
contents if it is already present on your machine (so be careful out
there: you’ll delete whatever was in the file before).
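Both opening behaviors can be checked with a small sketch like this
one. It assumes a scratch directory from the tempfile module so that
nothing of value is erased, and the function name is hypothetical.

```python
import os
import tempfile

def write_twice(dirname):
    """Open the same file in 'w' mode twice; return its final content
    and the absolute path that a bare 'data.txt' would resolve to."""
    path = os.path.join(dirname, 'data.txt')
    f = open(path, 'w')
    f.write('first version, fairly long\n')
    f.close()
    f = open(path, 'w')                  # 'w' erases the prior content
    f.write('second\n')
    f.close()
    f = open(path)
    content = f.read()
    f.close()
    # A bare filename is taken relative to the current working directory.
    resolved = os.path.abspath('data.txt')
    return content, resolved

with tempfile.TemporaryDirectory() as tmp:
    content, resolved = write_twice(tmp)
    print(repr(content))                 # only the second write survives
    print(resolved == os.path.join(os.getcwd(), 'data.txt'))
```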

Writing

Notice that we added an explicit \n end-of-line character to lines
written to the file; unlike the print built-in function, file object
write methods write exactly what they are passed without adding any
extra formatting. The string passed to write shows up character for
character on the external file. In text files, data written may
undergo line-end or Unicode translations which we’ll describe ahead,
but these are undone when the data is later read back.

Output files also sport a writelines method, which simply writes all
of the strings in a list one at a time without adding any extra
formatting. For example, here is a writelines equivalent to the two
write calls shown earlier:

file.writelines(['Hello file world!\n', 'Bye   file world.\n'])

This call isn’t as commonly used (and can be emulated with a simple
for loop or other iteration tool), but it is convenient in scripts
that save output in a list to be written later.
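The equivalence between writelines and an explicit loop can be
sketched as follows, assuming io.StringIO as an in-memory stand-in
for a real text file so the results are easy to compare.

```python
import io

lines = ['Hello file world!\n', 'Bye   file world.\n']

with_writelines = io.StringIO()      # assumption: in-memory text "file"
with_writelines.writelines(lines)    # no line ends are added for us

with_loop = io.StringIO()
for line in lines:                   # the equivalent explicit loop
    with_loop.write(line)

print(with_writelines.getvalue() == with_loop.getvalue())
```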

Closing

The file close method used earlier finalizes file contents and frees
up system resources. For instance, closing forces buffered output data
to be flushed out to disk. Normally, files are automatically closed
when the file object is garbage collected by the interpreter (that
is, when it is no longer referenced). This includes all remaining
open files when the Python session or program exits. Because of
that, close calls are often optional. In fact, it’s common to see
file-processing code in Python in this idiom:

open('somefile.txt', 'w').write("G'day Bruce\n")       # write to temporary object
open('somefile.txt', 'r').read()                       # read from temporary object

Since both these expressions make a temporary file object, use
it immediately, and do not save a reference to it, the file object
is reclaimed right after data is transferred, and is automatically
closed in the process. There is usually no need for such code to
call the close method explicitly.

In some contexts, though, you may wish to explicitly close
anyhow:

For one, because the Jython implementation relies on Java’s garbage
collector, you can’t always be as sure about when files will be
reclaimed as you can in standard Python. If you run your Python code
with Jython, you may need to close manually if many files are created
in a short amount of time (e.g., in a loop), in order to avoid running
out of file resources on operating systems where this matters.

For another, some IDEs, such as Python’s standard IDLE GUI, may hold
on to your file objects longer than you expect (in stack tracebacks of
prior errors, for instance), and thus prevent them from being garbage
collected as soon as you might expect. If you write to an output file
in IDLE, be sure to explicitly close (or flush) your file if you need
to reliably read it back during the same IDLE session. Otherwise,
output buffers might not be flushed to disk and your file may be
incomplete when read.

And while it seems very unlikely today, it’s not impossible that this
auto-close on reclaim file feature could change in the future. This is
technically a feature of the file object’s implementation, which may
or may not be considered part of the language definition over time.

For these reasons, manual close calls are not a bad idea in
nontrivial programs, even if they are technically not required.
Closing is a generally harmless but robust habit to
form.
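For the read-it-back case mentioned above, a minimal sketch of the
flush-then-read pattern might look like this. The scratch directory
and the function name are assumptions for the demo.

```python
import os
import tempfile

def flush_then_read(path):
    """Write without closing, flush, and read the file back at once."""
    out = open(path, 'w')
    out.write('Hello file world!\n')
    out.flush()                  # hand buffered output to the system
    reader = open(path)
    seen = reader.read()         # flushed data is visible to a new reader
    reader.close()
    out.close()                  # explicit close; harmless after the flush
    return seen, out.closed

with tempfile.TemporaryDirectory() as tmp:
    print(flush_then_read(os.path.join(tmp, 'data.txt')))
```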

Ensuring file closure: Exception handlers and context managers

Manual file close method calls are easy in straight-line code, but
how do you ensure file closure when exceptions might kick your program
beyond the point where the close call is coded? First of all, make
sure you must—files close themselves when they are collected, and this
will happen eventually, even when exceptions occur.

If closure is required, though, there are two basic alternatives: the
try statement’s finally clause is the most general, since it allows
you to provide general exit actions for any type of exception:

myfile = open(filename, 'w')
try:
    ...process myfile...
finally:
    myfile.close()

In recent Python releases, though, the with statement provides a more
concise alternative for some specific objects and exit actions,
including closing files:

with open(filename, 'w') as myfile:
    ...process myfile, auto-closed on statement exit...

This statement relies on the file object’s context manager: code
automatically run both on statement entry and on statement exit
regardless of exception behavior. Because the file object’s exit code
closes the file automatically, this guarantees file closure whether an
exception occurs during the statement or not.
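The closure guarantee can be observed directly with a sketch along
these lines. The simulated ValueError, the function names, and the
scratch file are hypothetical; the point is only that the file
reports itself closed after an exception in either form.

```python
import os
import tempfile

def closed_after_with(path):
    """Raise inside a with block; report whether the file was closed."""
    try:
        with open(path, 'w') as myfile:
            myfile.write('partial output\n')
            raise ValueError('simulated failure')   # hypothetical error
    except ValueError:
        pass
    return myfile.closed         # the name is still bound after the block

def closed_after_finally(path):
    """The same guarantee spelled out with try/finally."""
    myfile = open(path, 'w')
    try:
        try:
            myfile.write('partial output\n')
            raise ValueError('simulated failure')
        finally:
            myfile.close()       # runs whether or not an exception occurs
    except ValueError:
        pass
    return myfile.closed

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'out.txt')
    print(closed_after_with(target), closed_after_finally(target))
```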

The with statement is notably shorter (by three lines) than the
try/finally alternative, but it’s also less general: with applies only
to objects that support the context manager protocol, whereas
try/finally allows arbitrary exit actions for arbitrary exception
contexts. While some other object types have context managers, too
(e.g., thread locks), with is limited in scope. In fact, if you want
to remember just one exit actions option, try/finally is the most
inclusive. Still, with yields less code for files that must be closed
and can serve well in such specific roles. It can even save a line of
code when no exceptions are expected (albeit at the expense of further
nesting and indenting file processing logic):

myfile = open(filename, 'w')               # traditional form
...process myfile...
myfile.close()

with open(filename, 'w') as myfile:        # context manager form
    ...process myfile...

In Python 3.1 and later, this statement can also specify multiple
(a.k.a. nested) context managers: any number of context manager items
may be separated by commas, and multiple items work the same as nested
with statements. In general terms, the 3.1 and later code:

with A() as a, B() as b:
    ...statements...

Runs the same as the following, which works in 3.1, 3.0, and
2.6:

with A() as a:
    with B() as b:
        ...statements...

For example, when the with statement block exits in the following,
both files’ exit actions are automatically run to close the files,
regardless of exception outcomes:

with open('data') as fin, open('results', 'w') as fout:
    for line in fin:
        fout.write(transform(line))
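Since transform is left undefined in the snippet above, here is one
self-contained way to try it out. The uppercase conversion, the
convert wrapper, and the scratch file names are assumptions chosen
only to make the code runnable.

```python
import os
import tempfile

def transform(line):
    """Hypothetical per-line conversion: uppercase the text."""
    return line.upper()

def convert(src, dst):
    """Copy src to dst through transform; both files close on exit."""
    with open(src) as fin, open(dst, 'w') as fout:
        for line in fin:
            fout.write(transform(line))
    return fin.closed and fout.closed

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'data')
    dst = os.path.join(tmp, 'results')
    with open(src, 'w') as f:
        f.write('Hello file world!\nBye   file world.\n')
    print(convert(src, dst))             # True: both files were closed
    with open(dst) as f:
        print(f.read(), end='')
```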

Context manager–dependent code like this seems to have become
more common in recent years, but this is likely at least in part
because newcomers are accustomed to languages that require manual
close calls in all cases. In most contexts there is no need to wrap
all your Python file-processing code in with statements: the file
object’s auto-close-on-collection behavior often suffices, and manual
close calls are enough for many other scripts. You should use the with
or try options outlined here only if you must close, and only in the
presence of potential exceptions. Since standard CPython automatically
closes files on collection, though, neither option is required in many
(and perhaps most) scripts.

Input files

Reading data from external files is just as easy as writing, but
there are more methods that let us load data in a variety of modes.
Input text files are opened with either a mode flag of r (for “read”)
or no mode flag at all: it defaults to r if omitted, and it commonly
is. Once opened, we can read the lines of a text file with the
readlines method:

C:\temp> python
>>> file = open('data.txt')                   # open input file object: 'r' default
>>> lines = file.readlines()                  # read into line string list
>>> for line in lines:                        # BUT use file line iterator! (ahead)
...     print(line, end='')                   # lines have a '\n' at end
...
Hello file world!
Bye   file world.

The readlines method loads the entire contents of the file into memory
and gives it to our scripts as a list of line strings that we can step
through in a loop. In fact, there are many ways to read an input file:

file.read(): Returns a string containing all the characters (or
bytes) stored in the file

file.read(N): Returns a string containing the next N characters (or
bytes) from the file

file.readline(): Reads through the next \n and returns a line string

file.readlines(): Reads the entire file and returns a list of line
strings

Let’s run these method calls to read files, lines, and characters
from a text file. The seek(0) call is used here before each test to
rewind the file to its beginning (more on this call in a moment):

>>> file.seek(0)                              # go back to the front of file
>>> file.read()                               # read entire file into string
'Hello file world!\nBye   file world.\n'

>>> file.seek(0)                              # read entire file into lines list
>>> file.readlines()
['Hello file world!\n', 'Bye   file world.\n']

>>> file.seek(0)
>>> file.readline()                           # read one line at a time
'Hello file world!\n'
>>> file.readline()
'Bye   file world.\n'
>>> file.readline()                           # empty string at end-of-file
''

>>> file.seek(0)                              # read N (or remaining) chars/bytes
>>> file.read(1), file.read(8)                # empty string at end-of-file
('H', 'ello fil')
All of these input methods let us be specific about how much to
fetch. Here are a few rules of thumb about which to choose:

read() and readlines() load the entire file into memory all at once.
That makes them handy for grabbing a file’s contents with as little
code as possible. It also makes them generally fast, but costly in
terms of memory for huge files: loading a multigigabyte file into
memory is not generally a good thing to do (and might not be possible
at all on a given computer).

On the other hand, because the readline() and read(N) calls fetch just
part of the file (the next line or N-character-or-byte block), they
are safer for potentially big files but a bit less convenient and
sometimes slower. Both return an empty string when they reach
end-of-file. If speed matters and your files aren’t huge, read or
readlines may be a generally better choice.

See also the discussion of the newer file iterators in the next
section. As we’ll see, iterators combine the convenience of
readlines() with the space efficiency of readline() and are the
preferred way to read text files by lines today.
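The partial-read options in these rules of thumb might be coded
roughly as follows. The sketch assumes io.StringIO as a stand-in for
an open text file, an arbitrary block size of 8, and hypothetical
function names; the second function previews the line iterator the
text mentions.

```python
import io

def count_by_chunks(stream, size=8):
    """Read fixed-size blocks until read(N) returns an empty string."""
    total = 0
    while True:
        chunk = stream.read(size)
        if not chunk:                 # '' (or b'') signals end-of-file
            break
        total += len(chunk)
    return total

def count_by_lines(stream):
    """Let the file object hand out one line at a time."""
    return sum(1 for line in stream)

text = 'Hello file world!\nBye   file world.\n'
print(count_by_chunks(io.StringIO(text)))     # characters read in blocks
print(count_by_lines(io.StringIO(text)))      # lines read one by one
```

Neither loop holds more than one block or one line in memory at a
time, which is what makes them safer for potentially big files.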

The seek(0) call used repeatedly here means “go back to the start of
the file.” In our example, it is an alternative to reopening the file
each time. In files, all read and write operations take place at the
current position; files normally start at offset 0 when opened and
advance as data is transferred. The seek call simply lets us move to a
new position for the next transfer operation. More on this method
later when we explore random access files.
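As a small preview, the current-position idea can be sketched like
this. It opens the file in binary mode so that offsets are plain byte
counts, and it uses the file object's tell method to report the
position; the scratch file and function name are assumptions.

```python
import os
import tempfile

def seek_demo(path):
    """Read, rewind, and jump within a small binary file."""
    with open(path, 'wb') as f:
        f.write(b'Hello file world!\n')
    f = open(path, 'rb')
    first = f.read(5)            # reading advances the position
    position = f.tell()          # current offset from the file's start
    f.seek(0)                    # rewind instead of reopening
    again = f.read(5)
    f.seek(6)                    # jump straight to byte offset 6
    word = f.read(4)
    f.close()
    return first, position, again, word

with tempfile.TemporaryDirectory() as tmp:
    print(seek_demo(os.path.join(tmp, 'data.bin')))
```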
