Before we leave
our file tools survey, it’s time for something that
performs a more tangible task and illustrates some of what we’ve learned
so far. Unlike some shell-tool languages, Python doesn’t have an
implicit file-scanning loop procedure, but it’s simple to write a
general one that we can reuse for all time. The module in
Example 4-1
defines a general
file-scanning routine, which simply applies a passed-in Python function
to each line in an external file.
Example 4-1. PP4E\System\Filetools\scanfile.py
def scanner(name, function):
    file = open(name, 'r')              # create a file object
    while True:
        line = file.readline()          # call file methods
        if not line: break              # until end-of-file
        function(line)                  # call a function object
    file.close()
The scanner function doesn't care what line-processing function is passed in,
and that accounts for most of its generality—it is happy to apply any
single-argument function that exists now or in the future to all of the lines
in a text file. If we code this module and put it in a directory on the module
search path, we can use it any time we need to step through a file line by line.
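For instance, a quick sketch of the module's reuse (the scanner body is repeated from Example 4-1 so the snippet stands alone, and the test filename and its contents are invented here):

```python
# scanner repeated from Example 4-1 so this sketch runs standalone
def scanner(name, function):
    file = open(name, 'r')
    while True:
        line = file.readline()
        if not line: break
        function(line)
    file.close()

f = open('test.txt', 'w')               # a throwaway two-line file
f.write('spam\neggs\n')
f.close()

collected = []
scanner('test.txt', collected.append)   # any one-argument function works
print(collected)                        # ['spam\n', 'eggs\n']
```

Because the function is just an object passed by reference, a bound method like a list's append works as well as a def or a lambda.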
Example 4-2
is a client script that
does simple line translations.
Example 4-2. PP4E\System\Filetools\commands.py
#!/usr/local/bin/python
from sys import argv
from scanfile import scanner

class UnknownCommand(Exception): pass

def processLine(line):                       # define a function
    if line[0] == '*':                       # applied to each line
        print("Ms.", line[1:-1])
    elif line[0] == '+':
        print("Mr.", line[1:-1])             # strip first and last char: \n
    else:
        raise UnknownCommand(line)           # raise an exception

filename = 'data.txt'
if len(argv) == 2: filename = argv[1]        # allow filename cmd arg
scanner(filename, processLine)               # start the scanner
The text file
hillbillies.txt
contains the
following lines:
*Granny
+Jethro
*Elly May
+"Uncle Jed"
and our commands script could be run as follows:
C:\...\PP4E\System\Filetools>python commands.py hillbillies.txt
Ms. Granny
Mr. Jethro
Ms. Elly May
Mr. "Uncle Jed"
This works, but there are a variety of coding alternatives for both files,
some of which may be better than those listed above. For instance, we could
also code the command processor of Example 4-2 in the following way;
especially if the number of command options starts to become large, such a
data-driven approach may be more concise and easier to maintain than a large
if statement with essentially redundant actions (if you ever have to change
the way output lines print, you'll have to change it in only one place with
this form):
commands = {'*': 'Ms.', '+': 'Mr.'}     # data is easier to expand than code?

def processLine(line):
    try:
        print(commands[line[0]], line[1:-1])
    except KeyError:
        raise UnknownCommand(line)
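A quick sketch of the table's advantage: adding a command means adding a dictionary entry, not a new branch (the '?' prefix and 'Dr.' title here are hypothetical extensions, not part of the original example):

```python
# hypothetical extension: a third '?' command added as data only
commands = {'*': 'Ms.', '+': 'Mr.', '?': 'Dr.'}

class UnknownCommand(Exception): pass

def processLine(line):
    try:
        print(commands[line[0]], line[1:-1])
    except KeyError:
        raise UnknownCommand(line)

processLine('?Quinn\n')                 # prints: Dr. Quinn
```

Unknown prefixes still raise the exception, exactly as in the if-statement version, because they miss in the dictionary instead of falling through to an else clause.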
The scanner could similarly be improved. As a rule of thumb, we can also
usually speed things up by shifting processing from Python code to built-in
tools. For instance, if we're concerned with speed, we can probably make our
file scanner faster by using the file's line iterator to step through the
file instead of the manual readline loop in Example 4-1 (though you'd have
to time this with your Python to be sure):
def scanner(name, function):
    for line in open(name, 'r'):        # scan line by line
        function(line)                  # call a function object
And we can work more magic in Example 4-1 with iteration tools like the map
built-in function, the list comprehension expression, and the generator
expression. Here are three minimalist's versions; the for loop is replaced
by map or a comprehension, and we let Python close the file for us when it
is garbage collected or the script exits (these all build a temporary list
of results along the way to run through their iterations, but this overhead
is likely trivial for all but the largest of files):
def scanner(name, function):
    list(map(function, open(name, 'r')))

def scanner(name, function):
    [function(line) for line in open(name, 'r')]

def scanner(name, function):
    list(function(line) for line in open(name, 'r'))
The preceding
works as planned, but what if we also want to
change
a file while scanning it?
Example 4-3
shows two approaches:
one uses explicit files, and the other uses the standard input/output
streams to allow for redirection on the command line.
Example 4-3. PP4E\System\Filetools\filters.py
import sys

def filter_files(name, function):        # filter file through function
    input = open(name, 'r')              # create file objects
    output = open(name + '.out', 'w')    # explicit output file too
    for line in input:
        output.write(function(line))     # write the modified line
    input.close()
    output.close()                       # output has a '.out' suffix

def filter_stream(function):             # no explicit files
    while True:                          # use standard streams
        line = sys.stdin.readline()      # or: input()
        if not line: break
        print(function(line), end='')    # or: sys.stdout.write()

if __name__ == '__main__':
    filter_stream(lambda line: line)     # copy stdin to stdout if run
Notice that the newer context managers feature discussed earlier could save
us a few lines here in the file-based filter of Example 4-3, and also
guarantee immediate file closures if the processing function fails with an
exception:
def filter_files(name, function):
    with open(name, 'r') as input, open(name + '.out', 'w') as output:
        for line in input:
            output.write(function(line))    # write the modified line
And again, file object
line iterators
could simplify the stream-based filter’s code in this
example as well:
def filter_stream(function):
    for line in sys.stdin:               # read by lines automatically
        print(function(line), end='')
Since the standard streams are preopened for us, they're often easier to
use. When run standalone, it simply parrots stdin to stdout:
C:\...\PP4E\System\Filetools>filters.py < hillbillies.txt
*Granny
+Jethro
*Elly May
+"Uncle Jed"
But this module is also useful when imported as a library
(clients provide the line-processing function):
>>> from filters import filter_files
>>> filter_files('hillbillies.txt', str.upper)
>>> print(open('hillbillies.txt.out').read())
*GRANNY
+JETHRO
*ELLY MAY
+"UNCLE JED"
We’ll see files in action often in the remainder of this book,
especially in the more complete and functional system examples of
Chapter 6
. First though, we turn to
tools for processing our files’
home.
[9] For instance, to process pipes, described in Chapter 5. The Python
os.pipe call returns two file descriptors, which can be processed with os
module file tools or wrapped in a file object with os.fdopen. When used with
descriptor-based file tools in os, pipes deal in byte strings, not text.
Some device files may require lower-level control as well.
[10] For related tools, see also the shutil module in Python's standard
library; it has higher-level tools for copying and removing files and more.
We'll also write directory compare, copy, and search tools of our own in
Chapter 6, after we've had a chance to study the directory tools presented
later in this chapter.
One of the more common tasks in the shell utilities domain is applying an
operation to a set of files in a directory—a "folder" in Windows-speak. By
running a script on a batch of files, we can automate (that is, script)
tasks we might otherwise have to run repeatedly by hand.
For instance, suppose you need to search all of your Python files in
a development directory for a global variable name (perhaps you’ve
forgotten where it is used). There are many platform-specific ways to do
this (e.g., the find and grep commands in Unix), but Python scripts that
accomplish such tasks will work on every platform where Python
works—Windows, Unix, Linux, Macintosh, and just about any other platform
commonly used today. If you simply copy your script to any machine you
wish to use it on, it will work regardless of which other tools are
available there; all you need is Python. Moreover, coding such tasks in
Python also allows you to perform arbitrary actions along the
way—replacements, deletions, and whatever else you can code in the Python
language.
The most common way to go about writing such tools is to first grab a list
of the names of the files you wish to process, and then step through that
list with a Python for loop or other iteration tool, processing each file in
turn. The trick we need to learn here, then, is how to get such a directory
list within our scripts. For scanning directories there are at least three
options: running shell listing commands with os.popen, matching filename
patterns with glob.glob, and getting directory listings with os.listdir.
They vary in interface, result format, and portability.
How did you go
about getting directory file listings before you heard
of Python? If you’re new to shell tools programming, the answer may be
“Well, I started a Windows file explorer and clicked on things,” but
I’m thinking here in terms of less GUI-oriented command-line
mechanisms.
On Unix, directory listings are usually obtained by typing ls in a shell; on
Windows, they can be generated with a dir command typed in an MS-DOS console
box. Because Python scripts may use os.popen to run any command line that we
can type in a shell, they are the most general way to grab a directory
listing inside a Python program. We met os.popen in the prior chapters; it
runs a shell command string and gives us a file object from which we can
read the command's output. To illustrate, let's first assume the following
directory structures—I have both the usual dir and a Unix-like ls command
from Cygwin on my Windows laptop:
c:\temp>dir /B
parts
PP3E
random.bin
spam.txt
temp.bin
temp.txt
c:\temp>c:\cygwin\bin\ls
PP3E parts random.bin spam.txt temp.bin temp.txt
c:\temp>c:\cygwin\bin\ls parts
part0001 part0002 part0003 part0004
The parts and PP3E names are nested subdirectories in C:\temp here (the
latter is a copy of the prior edition's examples tree, which I used
occasionally in this text). Now, as we've seen, scripts can grab a listing
of file and directory names at this level by simply spawning the appropriate
platform-specific command line and reading its output (the text normally
thrown up on the console window):
C:\temp>python
>>> import os
>>> os.popen('dir /B').readlines()
['parts\n', 'PP3E\n', 'random.bin\n', 'spam.txt\n', 'temp.bin\n', 'temp.txt\n']
Lines read from a shell command come back with a trailing end-of-line
character, but it's easy enough to slice it off; the os.popen result also
gives us a line iterator just like normal files:
>>> for line in os.popen('dir /B'):
...     print(line[:-1])
...
parts
PP3E
random.bin
spam.txt
temp.bin
temp.txt
>>> lines = [line[:-1] for line in os.popen('dir /B')]
>>> lines
['parts', 'PP3E', 'random.bin', 'spam.txt', 'temp.bin', 'temp.txt']
For pipe objects, the effect of iterators may be even more useful than
simply avoiding loading the entire result into memory all at once: readlines
will always block the caller until the spawned program is completely
finished, whereas the iterator might not.
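As a hedged sketch of the line-by-line style (using a spawned Python process as a portable stand-in for a shell listing command, since dir and ls vary by platform):

```python
import os
import sys

# iterate the pipe's lines as they arrive, instead of calling readlines;
# the child here just prints two lines, but a slow producer would make the
# blocking difference visible
cmd = '"%s" -c "print(100); print(200)"' % sys.executable
collected = []
for line in os.popen(cmd):
    collected.append(line.rstrip('\n'))
print(collected)                       # ['100', '200']
```

Each pass of the for loop can proceed as soon as a line is available from the child, while readlines cannot return until the child's output stream is closed.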
The dir and ls commands let us be specific about filename patterns to be
matched and directory names to be listed by using name patterns; again,
we're just running shell commands here, so anything you can type at a shell
prompt goes:
>>> os.popen('dir *.bin /B').readlines()
['random.bin\n', 'temp.bin\n']
>>> os.popen(r'c:\cygwin\bin\ls *.bin').readlines()
['random.bin\n', 'temp.bin\n']
>>> list(os.popen(r'dir parts /B'))
['part0001\n', 'part0002\n', 'part0003\n', 'part0004\n']
>>> [fname for fname in os.popen(r'c:\cygwin\bin\ls parts')]
['part0001\n', 'part0002\n', 'part0003\n', 'part0004\n']
These calls use general tools and work as advertised. As I noted earlier,
though, the downsides of os.popen are that it requires using a
platform-specific shell command and it incurs a performance hit to start up
an independent program. In fact, different listing tools may sometimes
produce different results:
>>> list(os.popen(r'dir parts\part* /B'))
['part0001\n', 'part0002\n', 'part0003\n', 'part0004\n']
>>>
>>> list(os.popen(r'c:\cygwin\bin\ls parts/part*'))
['parts/part0001\n', 'parts/part0002\n', 'parts/part0003\n', 'parts/part0004\n']
The next two alternative techniques do better on both
counts.
The term globbing comes from the * wildcard character in filename patterns;
per computing folklore, a * matches a "glob" of characters. In less poetic
terms, globbing simply means collecting the names of all entries in a
directory—files and subdirectories—whose names match a given filename
pattern. In Unix shells, globbing expands filename patterns within a command
line into all matching filenames before the command is ever run. In Python,
we can do something similar by calling the glob.glob built-in—a tool that
accepts a filename pattern to expand, and returns a list (not a generator)
of matching file names:
>>> import glob
>>> glob.glob('*')
['parts', 'PP3E', 'random.bin', 'spam.txt', 'temp.bin', 'temp.txt']
>>> glob.glob('*.bin')
['random.bin', 'temp.bin']
>>> glob.glob('parts')
['parts']
>>> glob.glob('parts/*')
['parts\\part0001', 'parts\\part0002', 'parts\\part0003', 'parts\\part0004']
>>> glob.glob('parts\part*')
['parts\\part0001', 'parts\\part0002', 'parts\\part0003', 'parts\\part0004']
The glob call accepts the usual filename pattern syntax used in shells: ?
means any one character, * means any number of characters, and [] is a
character selection set.[11] The pattern should include a directory path if
you wish to glob in something other than the current working directory, and
the module accepts either Unix or DOS-style directory separators (/ or \).
This call is implemented without spawning a shell command (it uses
os.listdir, described in the next section) and so is likely to be faster and
more portable and uniform across all Python platforms than the os.popen
schemes shown earlier.
Technically speaking, glob is a bit more powerful than described so far. In
fact, using it to list files in one directory is just one use of its
pattern-matching skills. For instance, it can also be used to collect
matching names across multiple directories, simply because each level in a
passed-in directory path can be a pattern too:
>>> for path in glob.glob(r'PP3E\Examples\PP3E\*\s*.py'): print(path)
...
PP3E\Examples\PP3E\Lang\summer-alt.py
PP3E\Examples\PP3E\Lang\summer.py
PP3E\Examples\PP3E\PyTools\search_all.py
Here, we get back filenames from two different directories that match the
s*.py pattern; because the directory name preceding this is a * wildcard,
Python collects all possible ways to reach the base filenames. Using
os.popen to spawn shell commands achieves the same effect, but only if the
underlying shell or listing command does, too, and with possibly different
result formats across tools and platforms.
The os module's listdir call provides yet another way to collect filenames
in a Python list. It takes a simple directory name string, not a filename
pattern, and returns a list containing the names of all entries in that
directory—both simple files and nested directories—for use in the calling
script:
>>> import os
>>> os.listdir('.')
['parts', 'PP3E', 'random.bin', 'spam.txt', 'temp.bin', 'temp.txt']
>>>
>>> os.listdir(os.curdir)
['parts', 'PP3E', 'random.bin', 'spam.txt', 'temp.bin', 'temp.txt']
>>>
>>> os.listdir('parts')
['part0001', 'part0002', 'part0003', 'part0004']
This, too, is done without resorting to shell commands and so is both fast
and portable to all major Python platforms. The result is not in any
particular order across platforms (but can be sorted with the list sort
method or the sorted built-in function); returns base filenames without
their directory path prefixes; does not include the names "." or ".." if
present; and includes names of both files and directories at the listed
level.
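A small sketch of coping with these properties (the scratch directory and filenames here are invented; sorting and joining are the usual first steps):

```python
import os
import tempfile

# build a scratch directory holding two files and one subdirectory
tmp = tempfile.mkdtemp()
for name in ('b.txt', 'a.txt'):
    open(os.path.join(tmp, name), 'w').close()
os.mkdir(os.path.join(tmp, 'sub'))

names = sorted(os.listdir(tmp))                  # impose a predictable order
paths = [os.path.join(tmp, n) for n in names]    # expand to full paths
print(names)                                     # ['a.txt', 'b.txt', 'sub']
```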
To compare all three listing techniques, let's run them here side by side on
an explicit directory. They differ in some ways but are mostly just
variations on a theme for this task—os.popen returns end-of-lines and may
sort filenames on some platforms, glob.glob accepts a pattern and returns
filenames with directory prefixes, and os.listdir takes a simple directory
name and returns names without directory prefixes:
>>> os.popen('dir /b parts').readlines()
['part0001\n', 'part0002\n', 'part0003\n', 'part0004\n']
>>> glob.glob(r'parts\*')
['parts\\part0001', 'parts\\part0002', 'parts\\part0003', 'parts\\part0004']
>>> os.listdir('parts')
['part0001', 'part0002', 'part0003', 'part0004']
Of these three, glob and listdir are generally better options if you care
about script portability and result uniformity, and listdir seems fastest in
recent Python releases (but gauge its performance yourself—implementations
may change over time).
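A rough way to gauge this yourself is the standard library's timeit module (a sketch only; the absolute numbers, and possibly even the winner, vary by platform and Python release):

```python
import timeit

# time 1,000 scans of the current directory with each tool
t_listdir = timeit.timeit("os.listdir('.')", setup="import os", number=1000)
t_glob = timeit.timeit("glob.glob('*')", setup="import glob", number=1000)
print('listdir: %.4fs  glob: %.4fs' % (t_listdir, t_glob))
```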
In the last example, I pointed out that glob returns names with directory
paths, whereas listdir gives raw base filenames. For convenient processing,
scripts often need to split glob results into base files or expand listdir
results into full paths. Such translations are easy if we let the os.path
module do all the work for us. For example, a script that intends to copy
all files elsewhere will typically need to first split off the base
filenames from glob results so that it can add different directory names on
the front:
>>> dirname = r'C:\temp\parts'
>>>
>>> import glob, os
>>> for file in glob.glob(dirname + '/*'):
...     head, tail = os.path.split(file)
...     print(head, tail, '=>', ('C:\\Other\\' + tail))
...
C:\temp\parts part0001 => C:\Other\part0001
C:\temp\parts part0002 => C:\Other\part0002
C:\temp\parts part0003 => C:\Other\part0003
C:\temp\parts part0004 => C:\Other\part0004
Here, the names after the => represent names that files might be moved to.
Conversely, a script that means to process all files in a different
directory than the one it runs in will probably need to prepend listdir
results with the target directory name before passing filenames on to other
tools:
>>> import os
>>> for file in os.listdir(dirname):
...     print(dirname, file, '=>', os.path.join(dirname, file))
...
C:\temp\parts part0001 => C:\temp\parts\part0001
C:\temp\parts part0002 => C:\temp\parts\part0002
C:\temp\parts part0003 => C:\temp\parts\part0003
C:\temp\parts part0004 => C:\temp\parts\part0004
When you begin writing realistic directory processing tools of
the sort we’ll develop in
Chapter 6
,
you’ll find these calls to be almost
habit.