You may have noticed that almost all of the techniques in this section
so far return the names of files in only a single directory (globbing
with more involved patterns is the only exception). That’s fine for
many tasks, but what if you want to apply an operation to every file in
every directory and subdirectory in an entire directory tree?
For instance, suppose again that we need to find every occurrence
of a global name in our Python scripts. This time, though, our scripts
are arranged into a module package: a directory
with nested subdirectories, which may have subdirectories of their own.
We could rerun our hypothetical single-directory searcher manually in
every directory in the tree, but that’s tedious, error prone, and just
plain not fun.
Luckily, in Python it’s almost as easy to process a directory tree
as it is to inspect a single directory. We can either write a recursive
routine to traverse the tree, or use a tree-walker utility built into
the os module. Such tools can be used
to search, copy, compare, and otherwise process arbitrary directory
trees on any platform that Python runs on (and that’s just about
everywhere).
To make it easy to apply an operation to all files in a complete
directory tree, Python comes with a utility that scans trees for us and
runs code we provide at every directory along the way: the os.walk
function is called with a directory root name and automatically walks
the entire tree at root and below.
Operationally, os.walk is a generator function: at each directory in
the tree, it yields a three-item tuple containing the name of the
current directory, a list of the subdirectories in it, and a list of
the files in it, in that order. Because it’s a generator, its walk is
usually run by a for loop (or other iteration tool); on each iteration,
the walker advances to the next subdirectory, and the loop runs its
code for the next level of the tree (for instance, opening and
searching all the files at that level).
That description might sound complex the first time you hear it,
but os.walk is fairly straightforward once you get the hang of it. In
the following, for example, the loop body’s code is run for each
directory in the tree rooted at the current working directory (.).
Along the way, the loop simply prints the directory name and all the
files at the current level after prepending the directory name. It’s
simpler in Python than in English (I removed the PP3E subdirectory for
this test to keep the output short):
>>> import os
>>> for (dirname, subshere, fileshere) in os.walk('.'):
...     print('[' + dirname + ']')
...     for fname in fileshere:
...         print(os.path.join(dirname, fname))    # handle one file
...
[.]
.\random.bin
.\spam.txt
.\temp.bin
.\temp.txt
[.\parts]
.\parts\part0001
.\parts\part0002
.\parts\part0003
.\parts\part0004
In other words, we’ve coded our own custom and easily changed
recursive directory listing tool in Python. Because this may be
something we would like to tweak and reuse elsewhere, let’s make it
permanently available in a module file, as shown in Example 4-4, now
that we’ve worked out the details interactively.
Example 4-4. PP4E\System\Filetools\lister_walk.py
"list file tree with os.walk"

import sys, os

def lister(root):                                           # for a root dir
    for (thisdir, subshere, fileshere) in os.walk(root):    # generate dirs in tree
        print('[' + thisdir + ']')
        for fname in fileshere:                             # print files in this dir
            path = os.path.join(thisdir, fname)             # add dir name prefix
            print(path)

if __name__ == '__main__':
    lister(sys.argv[1])                                     # dir name in cmdline
When packaged this way, the code can also be run from a shell
command line. Here it is being launched with the root directory to be
listed passed in as a command-line argument:
C:\...\PP4E\System\Filetools>python lister_walk.py C:\temp\test
[C:\temp\test]
C:\temp\test\random.bin
C:\temp\test\spam.txt
C:\temp\test\temp.bin
C:\temp\test\temp.txt
[C:\temp\test\parts]
C:\temp\test\parts\part0001
C:\temp\test\parts\part0002
C:\temp\test\parts\part0003
C:\temp\test\parts\part0004
Here’s a more involved example of os.walk in action. Suppose you have a
directory tree of files and you want to find all Python source files
within it that reference the mimetypes module we’ll study in Chapter 6.
The following is one (albeit hardcoded and overly specific) way to
accomplish this task:
>>> import os
>>> matches = []
>>> for (dirname, dirshere, fileshere) in os.walk(r'C:\temp\PP3E\Examples'):
...     for filename in fileshere:
...         if filename.endswith('.py'):
...             pathname = os.path.join(dirname, filename)
...             if 'mimetypes' in open(pathname).read():
...                 matches.append(pathname)
...
>>> for name in matches: print(name)
...
C:\temp\PP3E\Examples\PP3E\Internet\Email\mailtools\mailParser.py
C:\temp\PP3E\Examples\PP3E\Internet\Email\mailtools\mailSender.py
C:\temp\PP3E\Examples\PP3E\Internet\Ftp\mirror\downloadflat.py
C:\temp\PP3E\Examples\PP3E\Internet\Ftp\mirror\downloadflat_modular.py
C:\temp\PP3E\Examples\PP3E\Internet\Ftp\mirror\ftptools.py
C:\temp\PP3E\Examples\PP3E\Internet\Ftp\mirror\uploadflat.py
C:\temp\PP3E\Examples\PP3E\System\Media\playfile.py
This code loops through all the files at each level, looking for
files that have .py at the end of their names and contain the search
string. When a match is found, its full name is appended to the results
list object; alternatively, we could simply build a list of all .py
files and search each in a for loop after the walk. Since we’re going
to code a much more general solution to this type of problem in
Chapter 6, though, we’ll let this stand for now.
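The alternative just mentioned, collecting the .py files first and searching them afterward, might be sketched as follows. The function names here are hypothetical, and the errors='replace' argument is an assumption added so that one undecodable file doesn’t abort the whole search; this is only an illustration, not the more general tool coded later in the book:

```python
import os

def find_py_files(root):
    """Walk phase: collect the paths of all .py files under root."""
    paths = []
    for (dirname, dirshere, fileshere) in os.walk(root):
        for filename in fileshere:
            if filename.endswith('.py'):
                paths.append(os.path.join(dirname, filename))
    return paths

def files_containing(paths, text):
    """Search phase: return the paths whose content includes text."""
    matches = []
    for pathname in paths:
        # assumption: replace undecodable bytes instead of raising an error
        with open(pathname, errors='replace') as file:
            if text in file.read():
                matches.append(pathname)
    return matches
```

Splitting the task this way lets the same file list be searched for several strings without walking the tree again, and the with statement closes each file as soon as it has been read.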
If you want to see what’s really going on in the os.walk generator,
call its __next__ method (or equivalently, pass it to the next built-in
function) manually a few times, just as the for loop does
automatically; each time, you advance to the next subdirectory in the
tree:
>>> gen = os.walk(r'C:\temp\test')
>>> gen.__next__()
('C:\\temp\\test', ['parts'], ['random.bin', 'spam.txt', 'temp.bin', 'temp.txt'])
>>> gen.__next__()
('C:\\temp\\test\\parts', [], ['part0001', 'part0002', 'part0003', 'part0004'])
>>> gen.__next__()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
The library manual documents os.walk further than we will here. For
instance, it supports bottom-up instead of top-down walks with its
optional topdown=False argument, and callers may prune tree branches by
deleting names in the subdirectories lists of the yielded tuples.
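As a rough illustration of the topdown argument, the following sketch simply records the order in which directories are visited. The walk_order name is hypothetical; only os.walk and its topdown parameter come from the library:

```python
import os

def walk_order(root, topdown=True):
    """Return directory names in the order that os.walk visits them."""
    return [dirname
            for (dirname, subshere, fileshere) in os.walk(root, topdown=topdown)]
```

With topdown=False, each directory shows up only after all of its subdirectories, which is the order you would likely want for tasks such as removing a tree from the leaves upward. Pruning by changing the subdirectories list has an effect only in the default top-down mode.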
Internally, the os.walk call generates filename lists at each level
with the os.listdir call we met earlier (or its relative os.scandir in
Python 3.5 and later), which collects both file and directory names in
no particular order and returns them without their directory paths;
os.walk segregates this list into subdirectories and files
(technically, nondirectories) before yielding a result. Also note that
walk uses the very same subdirectories list it yields to callers in
order to later descend into subdirectories. Because lists are mutable
objects that can be changed in place, if your code modifies the yielded
subdirectory names list, it will impact what walk does next. For
example, deleting directory names will prune traversal branches, and
sorting the list will order the walk.
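Here is a minimal sketch of both tricks, pruning and ordering, in one generator. The lister_pruned name and its skip parameter are hypothetical; the key detail is that the subdirectories list must be changed in place (slice assignment here), because rebinding the loop variable to a new list would leave the walker’s own list untouched:

```python
import os

def lister_pruned(root, skip=('__pycache__',)):
    """Yield file paths under root in sorted order, skipping some subdirs."""
    for (thisdir, subshere, fileshere) in os.walk(root):
        # in-place change: this is the same list os.walk descends into next
        subshere[:] = sorted(name for name in subshere if name not in skip)
        for fname in sorted(fileshere):
            yield os.path.join(thisdir, fname)
```

A caller might run it as for path in lister_pruned('.'): print(path) to get a repeatable listing regardless of the order in which the underlying platform reports names.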
The os.walk tool does the work of tree traversals for us; we simply
provide loop code with task-specific logic. However, it’s sometimes
more flexible and hardly any more work to do the walking ourselves.
The following script recodes the directory listing script with a
manual recursive traversal function (a function that calls itself to
repeat its actions). The mylister function in Example 4-5 is almost the
same as lister in Example 4-4 but calls os.listdir to generate file
paths manually and calls itself recursively to descend into
subdirectories.
Example 4-5. PP4E\System\Filetools\lister_recur.py
# list files in dir tree by recursion

import sys, os

def mylister(currdir):
    print('[' + currdir + ']')
    for file in os.listdir(currdir):              # list files here
        path = os.path.join(currdir, file)        # add dir path back
        if not os.path.isdir(path):
            print(path)
        else:
            mylister(path)                        # recur into subdirs

if __name__ == '__main__':
    mylister(sys.argv[1])                         # dir name in cmdline
As usual, this file can be both imported and called or run as a
script, though the fact that its result is printed text makes it less
useful as an imported component unless its output stream is captured
by another program.
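If you do want to reuse a printing function like this one as a component, one option is to capture its output in the same process. The sketch below repeats the mylister logic so that it stands alone and wraps it in a hypothetical lister_lines helper; it assumes Python 3.4 or later, where contextlib.redirect_stdout is available:

```python
import io, os
from contextlib import redirect_stdout

def mylister(currdir):                        # same logic as Example 4-5
    print('[' + currdir + ']')
    for file in os.listdir(currdir):
        path = os.path.join(currdir, file)
        if not os.path.isdir(path):
            print(path)
        else:
            mylister(path)

def lister_lines(root):
    """Run mylister and return its printed lines as a list of strings."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):             # route print calls to buffer
        mylister(root)
    return buffer.getvalue().splitlines()
```

Recoding the lister to return or yield its paths would be cleaner in new code; capturing the stream is mostly useful when the printing function can’t be changed.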
When run as a script, this file’s output is equivalent to that of
Example 4-4, but not identical: unlike the os.walk version, our
recursive walker here doesn’t order the walk to visit files before
stepping into subdirectories. It could by looping through the filenames
list twice (selecting files first), but as coded, the order is
dependent on os.listdir results. For most use cases, the walk order
would be irrelevant:
C:\...\PP4E\System\Filetools>python lister_recur.py C:\temp\test
[C:\temp\test]
[C:\temp\test\parts]
C:\temp\test\parts\part0001
C:\temp\test\parts\part0002
C:\temp\test\parts\part0003
C:\temp\test\parts\part0004
C:\temp\test\random.bin
C:\temp\test\spam.txt
C:\temp\test\temp.bin
C:\temp\test\temp.txt
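For the cases where order does matter, the two-pass variation mentioned above might look like the following sketch. The mylister_files_first name is hypothetical, and the names within each pass still arrive in whatever order os.listdir reports them:

```python
import os

def mylister_files_first(currdir):
    print('[' + currdir + ']')
    paths = [os.path.join(currdir, name) for name in os.listdir(currdir)]
    for path in paths:                        # first pass: files only
        if not os.path.isdir(path):
            print(path)
    for path in paths:                        # second pass: recur into subdirs
        if os.path.isdir(path):
            mylister_files_first(path)
```

This prints each directory’s files before descending, much as the os.walk version does, at the cost of testing each path with os.path.isdir twice.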
We’ll make better use of most of this section’s techniques in
later examples in Chapter 6 and in this book at large. For example,
scripts for copying and comparing directory trees use the tree-walker
techniques introduced here. Watch for these tools in action along the
way. We’ll also code a find utility in Chapter 6 that combines the tree
traversal of os.walk with the filename pattern expansion of glob.glob.
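As a preview of the idea only, and not the Chapter 6 code itself, a tree-wide pattern search can be sketched in a few lines. This version swaps in the standard library’s fnmatch module to match names against glob-style patterns rather than calling glob.glob in each directory; the find name and its parameters are assumptions made for this sketch:

```python
import fnmatch, os

def find(pattern, startdir=os.curdir):
    """Generate paths under startdir whose base names match pattern."""
    for (thisdir, subshere, fileshere) in os.walk(startdir):
        for name in subshere + fileshere:          # match dirs and files alike
            if fnmatch.fnmatch(name, pattern):
                yield os.path.join(thisdir, name)
```

A call such as list(find('*.py', r'C:\temp\test')) would collect every matching path in the tree; because find is a generator, callers can also stop early without scanning the rest.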
Because all normal strings are Unicode in Python 3.X, the directory
and file names generated by os.listdir, os.walk, and glob.glob so far
in this chapter are technically Unicode strings. This can have some
ramifications if your directories contain unusual names that might not
decode properly.
Technically, because filenames may contain arbitrary text, os.listdir
works in two modes in 3.X: given a bytes argument, this function will
return filenames as encoded byte strings; given a normal str string
argument, it instead returns filenames as Unicode strings, decoded per
the filesystem’s encoding scheme:
C:\...\PP4E\System\Filetools>python
>>> import os
>>> os.listdir('.')[:4]
['bigext-tree.py', 'bigpy-dir.py', 'bigpy-path.py', 'bigpy-tree.py']
>>> os.listdir(b'.')[:4]
[b'bigext-tree.py', b'bigpy-dir.py', b'bigpy-path.py', b'bigpy-tree.py']
The byte string version can be used if undecodable file names may
be present. Because os.walk and glob.glob both work by calling
os.listdir internally, they inherit this behavior by proxy. The os.walk
tree walker, for example, calls os.listdir at each directory level;
passing byte string arguments suppresses decoding and returns byte
string results:
>>> for (dir, subs, files) in os.walk('..'): print(dir)
...
..
..\Environment
..\Filetools
..\Processes
>>> for (dir, subs, files) in os.walk(b'..'): print(dir)
...
b'..'
b'..\\Environment'
b'..\\Filetools'
b'..\\Processes'
The glob.glob tool similarly calls os.listdir internally before
applying name patterns, and so also returns undecoded byte string names
for byte string arguments:
>>> import glob
>>> glob.glob(r'.\*')[:3]
['.\\bigext-out.txt', '.\\bigext-tree.py', '.\\bigpy-dir.py']
>>>
>>> glob.glob(br'.\*')[:3]
[b'.\\bigext-out.txt', b'.\\bigext-tree.py', b'.\\bigpy-dir.py']
Given a normal string name (as a command-line argument, for
example), you can force the issue by converting to byte strings with
manual encoding to suppress decoding:
>>> name = '.'
>>> os.listdir(name.encode())[:4]
[b'bigext-out.txt', b'bigext-tree.py', b'bigpy-dir.py', b'bigpy-path.py']
The upshot is that if your directories may contain names which
cannot be decoded according to the underlying platform’s Unicode
encoding scheme, you may need to pass byte strings to these tools to
avoid Unicode encoding errors. You’ll get byte strings back, which may
be less readable if printed, but you’ll avoid errors while traversing
directories and files.
This might be especially useful on systems that use simple
encodings such as ASCII or Latin-1, but may contain files with
arbitrarily encoded names from cross-machine copies, the Web, and so on.
Depending upon context, exception handlers may be used to suppress some
types of encoding errors as well.
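To make this concrete, here is one way a walker might traverse with byte strings yet still show readable names. The safe_names function is a hypothetical sketch; it assumes os.fsencode (available in Python 3.2 and later) and uses errors='replace' so that undecodable bytes become placeholder characters instead of raising an exception:

```python
import os, sys

def safe_names(root):
    """Yield (raw, shown) pairs: bytes paths plus printable renderings."""
    encoding = sys.getfilesystemencoding()
    for (thisdir, subshere, fileshere) in os.walk(os.fsencode(root)):
        for fname in fileshere:
            raw = os.path.join(thisdir, fname)              # still bytes
            shown = raw.decode(encoding, errors='replace')  # for display only
            yield raw, shown
```

The raw bytes value is the one to pass to open or other file tools, since the decoded rendering may no longer identify the file exactly.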
We’ll see an example of how this can matter in the first section
of Chapter 6, where an undecodable directory name generates an error if
printed during a full disk scan (although that specific error seems
more related to printing than to decoding in general).
Note that the basic open built-in function allows the name of the file
being opened to be passed as either Unicode str or raw bytes, too,
though this is used only to name the file initially; the additional
mode argument determines whether the file’s content is handled in text
or binary modes. Passing a byte string filename allows you to name
files with arbitrarily encoded names.
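The following sketch illustrates that the type of the name and the mode of the file are independent choices. The roundtrip_bytes_name function and the data.txt name are made up for this example; the same bytes filename is used to open the file once in text mode and once in binary mode:

```python
import os

def roundtrip_bytes_name(dirpath, text='spam'):
    """Write text to a file named by a byte string, then read it back raw."""
    name = os.path.join(os.fsencode(dirpath), b'data.txt')   # bytes filename
    with open(name, 'w', encoding='utf-8') as file:           # text-mode content
        file.write(text)
    with open(name, 'rb') as file:                            # binary-mode content
        return file.read()
```

Called with an existing directory, this returns the encoded content as a bytes object, b'spam' for the default text.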
In fact, it’s important to keep in mind that there are two different
Unicode concepts related to files: the encoding of file content and the
encoding of file name. Python provides your platform’s defaults for
these settings in two different attributes; on Windows 7:
>>> import sys
>>> sys.getdefaultencoding()            # file content encoding, platform default
'utf-8'
>>> sys.getfilesystemencoding()         # file name encoding, platform scheme
'mbcs'
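These settings allow you to be explicit when needed: the content
encoding is used when data is read and written to the file, and the
name encoding is used when dealing with names prior to transferring
data. Strictly speaking, sys.getdefaultencoding reports the default for
str and bytes conversions, while the default that open applies to text
content comes from locale.getpreferredencoding; the two can differ,
which is one more reason to be explicit. In addition, using bytes for
file name tools may work around incompatibilities with the underlying
file system’s scheme, and opening files in binary mode can suppress
Unicode decoding errors for content.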
As we’ve seen, though, opening text files in binary mode may also
mean that the raw and still-encoded text will not match search strings
as expected: search strings must also be byte strings encoded per a
specific and possibly incompatible encoding scheme. In fact, this
approach essentially mimics the behavior of text files in Python 2.X,
and underscores why elevating Unicode in 3.X is generally desirable:
such text files sometimes may appear to work even though they probably
shouldn’t. On the other hand, opening text in binary mode to suppress
Unicode content decoding and avoid decoding errors might still be
useful if you do not wish to skip undecodable files and content is
largely irrelevant.
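The mismatch is easy to demonstrate with a pair of small search functions. Both names below are hypothetical, and the default encoding of 'utf-8' is just an assumption for the sketch: the text-mode version decodes the content and compares str to str, while the binary-mode version must first encode the search key in a scheme that matches the file’s bytes:

```python
def contains_text(pathname, key, encoding='utf-8'):
    """Text mode: decode the content, then compare str to str."""
    with open(pathname, encoding=encoding) as file:
        return key in file.read()

def contains_bytes(pathname, key, encoding='utf-8'):
    """Binary mode: no decoding, so the key itself must be encoded to match."""
    with open(pathname, 'rb') as file:
        return key.encode(encoding) in file.read()
```

For a file saved as UTF-16, for instance, a binary search for a UTF-8 encoded key fails even though the word is present, because each character occupies two bytes in the file; encoding the key the same way the file was written makes the search succeed.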
As a rule of thumb, you should try to always provide an encoding
name for text content if it might be outside the platform default, and
you should rely on the default Unicode API for file names in most
cases. Again, see Python’s manuals for more on the Unicode file name
story than we have space to cover fully here, and see Learning Python,
Fourth Edition, for more on Unicode in general.
In Chapter 6, we’re going to put the tools we met in this chapter to
realistic use. For example, we’ll apply file and directory tools to
implement file splitters, testing systems, directory copies and
compares, and a variety of utilities based on tree walking. We’ll find
that Python’s directory tools we met here have an enabling quality that
allows us to automate a large set of real-world tasks. First, though,
Chapter 5 concludes our basic tool survey, by exploring another system
topic that tends to weave its way into a wide variety of application
domains: parallel processing in Python.