So how does this script placate CD backup paranoia? To double-check my CD writer's work, I run a command such as the following. I can also use a command like this to find out what has been changed since the last backup. Again, since the CD is "G:" on my machine when plugged in, I provide a path rooted there; use a root such as /dev/cdrom or /mnt/cdrom on Linux:
C:\...\PP4E\System\Filetools>python diffall.py Examples g:\PP3E\Examples > diff0226

C:\...\PP4E\System\Filetools>more diff0226
...output omitted...
The CD spins, the script compares, and a summary of differences appears at the end of the report. For an example of a full difference report, see the diff*.txt files in the book's examples distribution package. And to be really sure, I run the following global comparison command to verify the entire book development tree backed up to a memory stick (which works just like a CD in terms of the filesystem):
C:\...\PP4E\System\Filetools>diffall.py F:\writing-backups\feb-26-10\dev C:\Users\mark\Stuff\Books\4E\PP4E\dev > diff3.txt

C:\...\PP4E\System\Filetools>more diff3.txt
--------------------
Comparing F:\writing-backups\feb-26-10\dev to C:\Users\mark\Stuff\Books\4E\PP4E\dev
Directory lists are identical
Comparing contents
ch00.doc DIFFERS
ch01.doc matches
ch02.doc DIFFERS
ch03.doc matches
ch04.doc DIFFERS
ch05.doc matches
ch06.doc DIFFERS
...more output omitted...
--------------------
Comparing F:\writing-backups\feb-26-10\dev\Examples\PP4E\System\Filetools to C:\…
Files unique to C:\Users\mark\Stuff\Books\4E\PP4E\dev\Examples\PP4E\System\Filetools
... copytemp
... cpall.py
... diff2.txt
... diff3.txt
... diffall.py
... diffs.txt
... dirdiff.py
... dirdiff.pyc
Comparing contents
bigext-tree.py matches
bigpy-dir.py matches
...more output omitted...
========================================
Diffs found: 7
- files differ at F:\writing-backups\feb-26-10\dev\ch00.doc -
C:\Users\mark\Stuff\Books\4E\PP4E\dev\ch00.doc
- files differ at F:\writing-backups\feb-26-10\dev\ch02.doc -
C:\Users\mark\Stuff\Books\4E\PP4E\dev\ch02.doc
- files differ at F:\writing-backups\feb-26-10\dev\ch04.doc -
C:\Users\mark\Stuff\Books\4E\PP4E\dev\ch04.doc
- files differ at F:\writing-backups\feb-26-10\dev\ch06.doc -
C:\Users\mark\Stuff\Books\4E\PP4E\dev\ch06.doc
- files differ at F:\writing-backups\feb-26-10\dev\TOC.txt -
C:\Users\mark\Stuff\Books\4E\PP4E\dev\TOC.txt
- unique files at F:\writing-backups\feb-26-10\dev\Examples\PP4E\System\Filetools -
C:\Users\mark\Stuff\Books\4E\PP4E\dev\Examples\PP4E\System\Filetools
- files differ at F:\writing-backups\feb-26-10\dev\Examples\PP4E\Tools\visitor.py -
C:\Users\mark\Stuff\Books\4E\PP4E\dev\Examples\PP4E\Tools\visitor.py
This particular run indicates that I’ve added a few examples and
changed some chapter files since the last backup; if run immediately
after a backup, nothing should show up on diffall's radar except for any files that cannot
be copied in general. This global comparison can take a few minutes. It
performs byte-for-byte comparisons of all chapter files and screenshots,
the examples tree, and more, but it’s an accurate and complete
verification. Given that this book development tree contained many
files, a more manual verification procedure without Python’s help would
be utterly impossible.
After writing this script, I also started using it to verify full
automated backups of my laptops onto an external hard-drive device. To
do so, I run the cpall copy script we wrote in the preceding section of this chapter, and then the comparison script developed here to check results and get a list of files that didn't copy correctly. The last time I did this, the
procedure copied and compared 225,000 files and 15,000 directories in 20
GB of space—not the sort of task that lends itself to manual
labor!
Here are the magic incantations on my Windows laptop. f:\
is a partition on my external hard drive,
and you shouldn’t be surprised if each of these commands runs for half
an hour or more on currently common hardware. A drag-and-drop copy takes
at least as long (assuming it works at all!):
C:\...\System\Filetools>cpall.py c:\ f:\ > f:\copy-log.txt
C:\...\System\Filetools>diffall.py f:\ c:\ > f:\diff-log.txt
Finally, it’s worth
noting that this script still only
detects
differences in the tree but does not give
any further details about individual file differences. In fact, it
simply loads and compares the binary contents of corresponding files
with string comparisons. It’s a simple yes/no result.
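In rough terms, the per-file test boils down to something like the following standalone sketch, not the script's literal code (which, as noted ahead, reads with a configurable block size), but the net effect is the same yes/no answer:

def files_match(path1, path2):
    bytes1 = open(path1, 'rb').read()      # load binary content
    bytes2 = open(path2, 'rb').read()
    return bytes1 == bytes2                # simple yes/no string comparison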
If and when I need more details about how two reported files
actually differ, I either edit the files or run the file-comparison
command on the host platform (e.g., fc on Windows/DOS, diff or cmp on Unix and Linux). That's not a portable solution for this last step;
but for my purposes, just finding the differences in a 1,400-file tree
was much more critical than reporting which lines differ in files
flagged in the report.
Of course, since we can always run shell commands in Python, this
last step could be automated by spawning a diff or fc command with os.popen as differences
are encountered (or after the traversal, by scanning the report
summary). The output of these system calls could be displayed verbatim,
or parsed for relevant parts.
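For instance, a hedged sketch of that approach, using two of the paths from the report above; it runs the platform's native comparison tool and echoes its report verbatim:

import os, sys

def showdiff(path1, path2):
    tool = 'fc' if sys.platform.startswith('win') else 'diff'       # native tool per platform
    for line in os.popen('%s "%s" "%s"' % (tool, path1, path2)):     # spawn tool, read its output
        print(line, end='')                                          # display verbatim (or parse)

showdiff(r'F:\writing-backups\feb-26-10\dev\ch00.doc',
         r'C:\Users\mark\Stuff\Books\4E\PP4E\dev\ch00.doc')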
We also might try to do a bit better here by opening true text
files in text mode to ignore line-terminator differences caused by
transferring across platforms, but it’s not clear that such differences
should be ignored (what if the caller wants to know whether line-end
markers have been changed?). For example, after downloading a website
with an FTP script we'll meet in Chapter 13, the diffall script detected a discrepancy between
the local copy of a file and the one at the remote server. To probe
further, I simply ran some interactive Python code:
>>> a = open('lp2e-updates.html', 'rb').read()
>>> b = open(r'C:\Mark\WEBSITE\public_html\lp2e-updates.html', 'rb').read()
>>> a == b
False
This verifies that there really is a binary difference in the
downloaded and local versions of the file; to see whether it’s because a
Unix or DOS line end snuck into the file, try again in text mode so that
line ends are all mapped to the standard \n character:
>>> a = open('lp2e-updates.html', 'r').read()
>>> b = open(r'C:\Mark\WEBSITE\public_html\lp2e-updates.html', 'r').read()
>>> a == b
True
Sure enough; now, to find where the difference is, the following
code checks character by character until the first mismatch is found (in
binary mode, so we retain the difference):
>>> a = open('lp2e-updates.html', 'rb').read()
>>> b = open(r'C:\Mark\WEBSITE\public_html\lp2e-updates.html', 'rb').read()
>>> for (i, (ac, bc)) in enumerate(zip(a, b)):
...     if ac != bc:
...         print(i, repr(ac), repr(bc))
...         break
...
37966 '\r' '\n'
This means that at byte offset 37,966, there is a \r in the downloaded file, but a \n in the local copy. This line has a DOS line
end in one and a Unix line end in the other. To see more, print text
around the mismatch:
>>> for (i, (ac, bc)) in enumerate(zip(a, b)):
...     if ac != bc:
...         print(i, repr(ac), repr(bc))
...         print(repr(a[i-20:i+20]))
...         print(repr(b[i-20:i+20]))
...         break
...
37966 '\r' '\n'
're>\r\ndef min(*args):\r\n tmp = list(arg'
're>\r\ndef min(*args):\n tmp = list(args'
Apparently, I wound up with a Unix line end at one point in the
local copy and a DOS line end in the version I downloaded—the combined
effect of the text mode used by the download script itself (which translated \n to \r\n) and years of edits on both Linux and Windows PDAs and laptops (I probably coded this change on Linux and copied it to my local Windows copy in binary mode). Code such as this could be integrated into the diffall script to make it more intelligent about text files and difference reporting.
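For example, one rough sketch of such an extension: given two files already flagged as differing in binary mode, a helper could classify whether only their line terminators differ by retrying the comparison in text mode (this is not part of the script as shown, just an illustration):

def diff_kind(path1, path2):
    if open(path1, 'rb').read() == open(path2, 'rb').read():
        return 'identical'                     # same bytes: no difference at all
    if open(path1, 'r').read() == open(path2, 'r').read():
        return 'line terminators only'         # text mode maps line ends to \n
    return 'content differs'

print(diff_kind('lp2e-updates.html',
                r'C:\Mark\WEBSITE\public_html\lp2e-updates.html'))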
Because Python excels at processing files and strings, it’s even
possible to go one step further and code a Python equivalent of the fc and diff commands. In fact, much of the work has already been done; the standard library module difflib could make this task simple. See the
Python library manual for details and usage examples.
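For instance, a quick sketch along these lines (the input filenames here are placeholders) prints a unified diff of two text files:

import difflib

lines1 = open('old.txt').readlines()                 # hypothetical input files
lines2 = open('new.txt').readlines()
diff = difflib.unified_diff(lines1, lines2, fromfile='old.txt', tofile='new.txt')
for line in diff:
    print(line, end='')                              # diff lines keep their \n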
We could also be smarter by avoiding the load and compare steps
for files that differ in size, and we might use a smaller block size to
reduce the script’s memory requirements. For most trees, such
optimizations are unnecessary; reading multimegabyte files into strings
is very fast in Python, and garbage collection reclaims the space as you
go.
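If they ever prove worthwhile, a rough sketch of both ideas might look like the following; it skips the reads entirely when file sizes differ and compares in modest fixed-size blocks (this is not integrated into the script used in this chapter):

import os

def same_contents(path1, path2, blocksize=64 * 1024):
    if os.path.getsize(path1) != os.path.getsize(path2):
        return False                               # different sizes: cannot match
    with open(path1, 'rb') as file1, open(path2, 'rb') as file2:
        while True:
            block1 = file1.read(blocksize)
            block2 = file2.read(blocksize)
            if block1 != block2:
                return False                       # mismatch in this block
            if not block1:
                return True                        # both files exhausted: match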
Since such extensions are beyond both this script’s scope and this
chapter’s size limits, though, they will have to await the attention of
a curious reader (this book doesn’t have formal exercises, but that
almost sounds like one, doesn’t it?). For now, let’s move on to explore
ways to code one more common directory task:
search.
Engineers
love to change things. As I was writing this book, I found
it almost
irresistible
to move and rename
directories, variables, and shared modules in the book examples tree
whenever I thought I’d stumbled onto a more coherent structure. That was
fine early on, but as the tree became more intertwined, this became a
maintenance nightmare. Things such as program directory paths and module
names were hardcoded all over the place—in package import statements,
program startup calls, text notes, configuration files, and more.
One way to repair these references, of course, is to edit every file
in the directory by hand, searching each for information that has changed.
That’s so tedious as to be utterly impossible in this book’s examples
tree, though; the examples of the prior edition contained 186 directories
and 1,429 files! Clearly, I needed a way to automate updates after
changes. There are a variety of solutions to such goals—from shell
commands, to find operations, to custom tree walkers, to general-purpose
frameworks. In this and the next section, we’ll explore each option in
turn, just as I did while refining solutions to this real-world
dilemma.
If you work on Unix-like systems, you probably already know that
there is a standard way to search files for strings on such
platforms—the command-line program grep and its relatives list all lines in one or more files containing a string or string pattern.[22]
Given that shells expand (i.e., “glob”) filename patterns
automatically, a command such as the following will search a single
directory's Python files for a string named on the command line (this uses the grep command installed with
the Cygwin Unix-like system for Windows that I described in the prior
chapter):
C:\...\PP4E\System\Filetools>c:\cygwin\bin\grep.exe walk *.py
bigext-tree.py:for (thisDir, subsHere, filesHere) in os.walk(dirname):
bigpy-path.py: for (thisDir, subsHere, filesHere) in os.walk(srcdir):
bigpy-tree.py:for (thisDir, subsHere, filesHere) in os.walk(dirname):
As we’ve seen, we can often accomplish the same within a Python
script by running such a shell command with os.system or os.popen. And if we search its results manually, we can also achieve similar results with the Python glob module we met in Chapter 4; it expands a filename
pattern into a list of matching filename strings much like a
shell:
C:\...\PP4E\System\Filetools>python
>>> import os
>>> for line in os.popen(r'c:\cygwin\bin\grep.exe walk *.py'):
...     print(line, end='')
...
bigext-tree.py:for (thisDir, subsHere, filesHere) in os.walk(dirname):
bigpy-path.py: for (thisDir, subsHere, filesHere) in os.walk(srcdir):
bigpy-tree.py:for (thisDir, subsHere, filesHere) in os.walk(dirname):
>>> from glob import glob
>>> for filename in glob('*.py'):
...     if 'walk' in open(filename).read():
...         print(filename)
...
bigext-tree.py
bigpy-path.py
bigpy-tree.py
Unfortunately, these tools are generally limited to a single directory. glob can visit multiple directories given the right sort of pattern string, but it's not a general directory walker of the sort I need to maintain a large examples tree. On Unix-like systems, a find shell command can go the extra mile to traverse an entire directory tree. For instance, the following Unix command line would pinpoint lines and files at and below the current directory that mention the string popen:
find . -name "*.py" -print -exec fgrep popen {} \;
If you happen to have a Unix-like find command on every machine you will ever use, this is one way to process directories. But if you don't happen to have a Unix find on all your computers, not to worry—it's easy to code a portable one in Python. Python itself used to have a find module in its standard library, which I used frequently in the past. Although that module was removed between the second and third editions of this book, the newer os.walk makes writing your own simple. Rather than lamenting the demise of a module, I decided to spend 10 minutes coding a custom equivalent. Example 6-13 implements a find utility in Python, which collects all matching filenames in a directory tree. Unlike glob.glob, its find.find automatically matches through an entire tree. And unlike the tree walk structure of os.walk, we can treat find.find results as a simple linear group.
Example 6-13. PP4E\Tools\find.py
#!/usr/bin/python
"""
################################################################################
Return all files matching a filename pattern at and below a root directory;

custom version of the now deprecated find module in the standard library:
import as "PP4E.Tools.find"; like original, but uses os.walk loop, has no
support for pruning subdirs, and is runnable as a top-level script;

find() is a generator that uses the os.walk() generator to yield just
matching filenames: use findlist() to force results list generation;
################################################################################
"""

import fnmatch, os

def find(pattern, startdir=os.curdir):
    for (thisDir, subsHere, filesHere) in os.walk(startdir):
        for name in subsHere + filesHere:
            if fnmatch.fnmatch(name, pattern):
                fullpath = os.path.join(thisDir, name)
                yield fullpath

def findlist(pattern, startdir=os.curdir, dosort=False):
    matches = list(find(pattern, startdir))
    if dosort: matches.sort()
    return matches

if __name__ == '__main__':
    import sys
    namepattern, startdir = sys.argv[1], sys.argv[2]
    for name in find(namepattern, startdir): print(name)
There's not much to this file—it's largely just a minor extension to os.walk—but calling its find function provides the same utility as both the deprecated find standard library module and the Unix utility of the same name. It's also much more portable, and noticeably easier than repeating all of this file's code every time you need to perform a find-type search. Because this file is instrumented to be both a script and a library, it can be run as a command-line tool or called from other programs.
For instance, to process every Python file in the directory tree
rooted one level up from the current working directory, I simply run the
following command line from a system console window. Run this yourself
to watch its progress; the script's standard output is piped into the more command to page it here, but it
can be piped into any processing program that reads its input from the
standard input stream:
C:\...\PP4E\Tools>python find.py *.py .. | more
..\LaunchBrowser.py
..\Launcher.py
..\__init__.py
..\Preview\attachgui.py
..\Preview\customizegui.py
...more lines omitted...
For more control, run the following sort of Python code from a
script or interactive prompt. In this mode, you can apply any operation
to the found files that the Python language provides:
C:\...\PP4E\System\Filetools>python
>>> from PP4E.Tools import find            # or just import find if in cwd
>>> for filename in find.find('*.py', '..'):
...     if 'walk' in open(filename).read():
...         print(filename)
...
..\Launcher.py
..\System\Filetools\bigext-tree.py
..\System\Filetools\bigpy-path.py
..\System\Filetools\bigpy-tree.py
..\Tools\cleanpyc.py
..\Tools\find.py
..\Tools\visitor.py
Notice how this avoids having to recode the nested loop structure required for os.walk every time you want a list of matching file names; for many use cases, this seems conceptually simpler. Also note that because this finder is a generator function, your script doesn't have to wait until all matching files have been found and collected; os.walk yields results as it goes, and find.find yields matching files among that set.
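As a small illustration of that point, the following stops the walk at the first match instead of building a full result list (the pattern and start directory are just examples):

>>> from PP4E.Tools import find
>>> for filename in find.find('*.py', '..'):
...     print('first match:', filename)
...     break                              # stop the tree walk early; no full list is built
...
...output omitted...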
Here's a more complex example of our find module at work: the following system command line lists all Python files in directory C:\temp\PP3E whose names begin with the letter q or x. Note how find returns full directory paths that begin with the start directory specification:
C:\...\PP4E\Tools>find.py [qx]*.py C:\temp\PP3E
C:\temp\PP3E\Examples\PP3E\Database\SQLscripts\querydb.py
C:\temp\PP3E\Examples\PP3E\Gui\Tools\queuetest-gui-class.py
C:\temp\PP3E\Examples\PP3E\Gui\Tools\queuetest-gui.py
C:\temp\PP3E\Examples\PP3E\Gui\Tour\quitter.py
C:\temp\PP3E\Examples\PP3E\Internet\Other\Grail\Question.py
C:\temp\PP3E\Examples\PP3E\Internet\Other\XML\xmlrpc.py
C:\temp\PP3E\Examples\PP3E\System\Threads\queuetest.py
And here’s some Python code that does the same find but also
extracts base names and file sizes for each file found:
C:\...\PP4E\Tools>python
>>> import os
>>> from find import find
>>> for name in find('[qx]*.py', r'C:\temp\PP3E'):
...     print(os.path.basename(name), os.path.getsize(name))
...
querydb.py 635
queuetest-gui-class.py 1152
queuetest-gui.py 963
quitter.py 801
Question.py 817
xmlrpc.py 705
queuetest.py 1273
To achieve such code economy, the find module calls os.walk to walk the tree and simply yields matching filenames along the way. New here, though, is the fnmatch module—yet another Python standard library module that performs Unix-like pattern matching against filenames. This module supports common operators in name pattern strings: * to match any number of characters, ? to match any single character, and [...] and [!...] to match any character inside the bracket pairs or not; other characters match themselves. Unlike the re module, fnmatch supports only common Unix shell matching operators, not full-blown regular expression patterns; we'll see why this distinction matters in Chapter 19.
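Here is a brief illustration of those operators at the interactive prompt; the filenames tested are just examples:

>>> import fnmatch
>>> fnmatch.fnmatch('diffall.py', '*.py')         # * matches any number of characters
True
>>> fnmatch.fnmatch('diff3.txt', 'diff?.txt')     # ? matches any single character
True
>>> fnmatch.fnmatch('querydb.py', '[qx]*.py')     # [...] matches one character listed
True
>>> fnmatch.fnmatch('visitor.py', '[!qx]*.py')    # [!...] matches one character not listed
True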
Interestingly, Python's glob.glob function also uses the fnmatch module to match names: it combines os.listdir and fnmatch to match in directories in much the same way our find.find combines os.walk and fnmatch to match in trees (though os.walk ultimately uses os.listdir as well). One ramification of all this is that you can pass byte strings for both pattern and start-directory to find.find if you need to suppress Unicode filename decoding, just as you can for os.walk and glob.glob; you'll receive byte strings for filenames in the result. See Chapter 4 for more details on Unicode filenames.
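For instance, a sketch like the following passes byte strings for both the pattern and the start directory (the values are only illustrative); the yielded filenames come back as byte strings, undecoded:

>>> from PP4E.Tools import find
>>> matches = list(find.find(b'*.py', b'..'))      # bytes in: no Unicode decoding
>>> all(isinstance(name, bytes) for name in matches)
True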
By comparison, find.find with just "*" for its name pattern is also roughly equivalent to platform-specific directory tree listing shell commands such as dir /B /S on DOS and Windows. Since all files match "*", this just exhaustively generates all the file names in a tree with a single traversal. Because we can usually run such shell commands in a Python script with os.popen, the following do the same work, but the first is inherently nonportable and must start up a separate program along the way:
>>> import os
>>> for line in os.popen('dir /B /S'): print(line, end='')

>>> from PP4E.Tools.find import find
>>> for name in find(pattern='*', startdir='.'): print(name)
Watch for this utility to show up in action later in this
chapter and book, including an arguably strong showing in the next
section and a cameo appearance in the Grep dialog of Chapter 11's PyEdit text editor GUI, where it will serve a central role in a threaded external files search tool. The standard library's find module
may be gone, but it need not be
forgotten.
In fact, you must pass a bytes pattern string for a bytes filename to fnmatch (or pass both as str), because the re pattern matching module it uses does not allow the string types of subject and pattern to be mixed. This rule is inherited by our find.find for directory and pattern. See Chapter 19 for more on re.
Curiously, the fnmatch module in Python 3.1 also converts a bytes pattern string to and from Unicode str in order to perform internal text processing, using the Latin-1 encoding. This suffices for many contexts, but may not be entirely sound for some encodings which do not map to Latin-1 cleanly. sys.getfilesystemencoding might be a better encoding choice in such contexts, as this reflects the underlying file system's constraints (as we learned in Chapter 4, sys.getdefaultencoding reflects file content, not names).
In the absence of bytes, os.walk assumes filenames follow the platform's convention and does not ignore decoding errors triggered by os.listdir. In the "grep" utility of Chapter 11's PyEdit, this picture is further clouded by the fact that a str pattern string from a GUI would have to be encoded to bytes using a potentially inappropriate encoding for some files present. See fnmatch.py and os.py in Python's library and the Python library manual for more details. Unicode can be a very subtle affair.