One fine point before we move on: notice the seemingly superfluous
exception handling in Example 6-4’s tryprint function. When I first tried
to scan an entire drive as shown in the preceding section, this script
died on a Unicode encoding error while trying to print a directory name
of a saved web page. Adding the exception handler skips the error
entirely.
This demonstrates a subtle but pragmatically important issue:
Python 3.X’s Unicode orientation extends to filenames, even if they are
just printed. As we learned in Chapter 4, because filenames may contain
arbitrary text, os.listdir returns filenames in two different ways: we
get back decoded Unicode strings when we pass in a normal str argument,
and still-encoded byte strings when we send a bytes:
>>> import os
>>> os.listdir('.')[:4]
['bigext-tree.py', 'bigpy-dir.py', 'bigpy-path.py', 'bigpy-tree.py']
>>> os.listdir(b'.')[:4]
[b'bigext-tree.py', b'bigpy-dir.py', b'bigpy-path.py', b'bigpy-tree.py']
Both os.walk (used in the Example 6-4 script) and glob.glob inherit this
behavior for the directory and file names they return, because they work
by calling os.listdir internally at each directory level. For all these
calls, passing in a byte string argument suppresses Unicode decoding of
file and directory names. Passing a normal string assumes that filenames
are decodable per the file system’s Unicode scheme.
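The str-versus-bytes rule is easy to confirm in a scratch directory. The following is only a minimal sketch: the temporary directory and the file name demo.txt are made up for illustration, and os.fsencode is assumed to be available (it was added in Python 3.2, so it postdates the 3.1 sessions shown in this section):

```python
import os
import tempfile

# Hypothetical scratch directory and file name, used only for this demo
with tempfile.TemporaryDirectory() as tmpdir:
    open(os.path.join(tmpdir, 'demo.txt'), 'w').close()

    as_text = os.listdir(tmpdir)                  # str in, str names out
    as_bytes = os.listdir(os.fsencode(tmpdir))    # bytes in, bytes names out

    # os.walk follows the same rule for the directory names it yields
    walked = [type(d) for (d, subs, files) in os.walk(os.fsencode(tmpdir))]

print(as_text, as_bytes, walked)
```

The type of the argument alone selects the type of the results; no extra flag is involved.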
The reason this potentially mattered to this section’s example is
that running the tree search version over an entire hard drive
eventually reached an undecodable filename (an old saved web page with
an odd name), which generated an exception when the print function tried
to display it. Here’s a simplified recreation of the error, run in a
shell window (Command Prompt) on Windows:
>>> root = r'C:\py3000'
>>> for (dir, subs, files) in os.walk(root): print(dir)
...
C:\py3000
C:\py3000\FutureProofPython - PythonInfo Wiki_files
C:\py3000\Oakwinter_com Code » Porting setuptools to py3k_files
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python31\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2019' in position
45: character maps to <undefined>
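The failure can be reproduced without a Windows console by encoding to code page 437 directly, since that is the codec named in the traceback. This sketch only assumes the two characters visible in the directory names above:

```python
# The » character (U+00BB) has a slot in code page 437, so the third
# directory name printed; the ’ character (U+2019) has none, so the
# fourth name raised an exception.
ok = '\u00bb'.encode('cp437')

try:
    '\u2019'.encode('cp437')
    failed = False
    reason = None
except UnicodeEncodeError as exc:
    failed = True
    reason = exc.reason          # the text shown at the end of the traceback

print(ok, failed, reason)
```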
One way out of this dilemma is to use bytes strings for the directory
root name: this suppresses filename decoding in the os.listdir calls run
by os.walk, and effectively limits the scope of later printing to raw
bytes. Since printing does not have to deal with encodings, it works
without error. Manually encoding to bytes prior to printing works too,
but the results are slightly different:
>>> root.encode()
b'C:\\py3000'
>>> for (dir, subs, files) in os.walk(root.encode()): print(dir)
...
b'C:\\py3000'
b'C:\\py3000\\FutureProofPython - PythonInfo Wiki_files'
b'C:\\py3000\\Oakwinter_com Code \xbb Porting setuptools to py3k_files'
b'C:\\py3000\\What\x92s New in Python 3_0 \x97 Python Documentation'
>>> for (dir, subs, files) in os.walk(root): print(dir.encode())
...
b'C:\\py3000'
b'C:\\py3000\\FutureProofPython - PythonInfo Wiki_files'
b'C:\\py3000\\Oakwinter_com Code \xc2\xbb Porting setuptools to py3k_files'
b'C:\\py3000\\What\xe2\x80\x99s New in Python 3_0 \xe2\x80\x94 Python Documentation'
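The two listings differ because they come from different encodings. The bytes in the first listing match Windows code page 1252 (an inference from the byte values, not something the session states), while str.encode with no argument uses UTF-8. A small sketch with the same three characters:

```python
# Sample text containing the three non-ASCII characters seen above
name = 'What\u2019s New \u2014 \u00bb'

as_cp1252 = name.encode('cp1252')    # one byte per character
as_utf8 = name.encode()              # default UTF-8: two or three bytes each

print(as_cp1252)
print(as_utf8)
```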
Unfortunately, either approach means that all the directory names
printed during the walk display as cryptic byte strings. To maintain the
better readability of normal strings, I instead opted for the exception
handler approach used in the script’s code. This avoids the issues
entirely:
>>> for (dir, subs, files) in os.walk(root):
...     try:
...         print(dir)
...     except UnicodeEncodeError:
...         print(dir.encode())         # or simply punt if encode may fail too
...
C:\py3000
C:\py3000\FutureProofPython - PythonInfo Wiki_files
C:\py3000\Oakwinter_com Code » Porting setuptools to py3k_files
b'C:\\py3000\\What\xe2\x80\x99s New in Python 3_0 \xe2\x80\x94 Python Documentation'
Oddly, though, the error seems more related to printing than to
Unicode encodings of filenames. Because the filename did not fail until
printed, it must have been decodable when its string was created
initially. That’s why wrapping up the print in a try suffices;
otherwise, the error would occur earlier.
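If the names should stay mostly readable, another option is to escape only the characters the output stream cannot encode. The following is a sketch, not the script’s own approach: safeprint is a hypothetical helper name, and a cp437 console is simulated with an in-memory stream so that the example runs anywhere:

```python
import io
import sys

def safeprint(text, file=None):
    """Print text, showing unencodable characters as backslash escapes."""
    file = file or sys.stdout
    encoding = getattr(file, 'encoding', None) or 'utf-8'
    # Round-trip through the stream's encoding; characters it cannot
    # represent become \uXXXX escapes instead of raising an exception.
    printable = text.encode(encoding, errors='backslashreplace').decode(encoding)
    print(printable, file=file)

# Stand-in for a Windows Command Prompt that uses code page 437
console = io.TextIOWrapper(io.BytesIO(), encoding='cp437', newline='\n')
safeprint('What\u2019s New \u00bb', file=console)
console.flush()
shown = console.buffer.getvalue().decode('cp437')
print(repr(shown))
```

Unlike the try/except version, this keeps the printable part of each name as normal text rather than switching the whole line to a bytes display.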
Moreover, this error does not occur if the script’s output is
redirected to a file, either at the shell level (bigext-tree.py c:\ > out),
or by the print call itself (print(dir, file=F)). In the latter case the
output file must later be read back in binary mode, as text mode triggers
the same error when printing the file’s content to the shell window (but
again, not until printed). In fact, the exact same code that fails when
run in a system shell Command Prompt on Windows works without error when
run in the IDLE GUI on the same platform: the tkinter GUI used by IDLE
handles display of characters that printing to standard output connected
to a shell terminal window does not:
>>> import os               # run in IDLE (a tkinter GUI), not system shell
>>> root = r'C:\py3000'
>>> for (dir, subs, files) in os.walk(root): print(dir)
C:\py3000
C:\py3000\FutureProofPython - PythonInfo Wiki_files
C:\py3000\Oakwinter_com Code » Porting setuptools to py3k_files
C:\py3000\What's New in Python 3_0 — Python Documentation_files
In other words, the exception occurs only when printing to a shell
window, and long after the file name string is created. This reflects an
artifact of extra translations performed by the Python printer, not of
Unicode file names in general. Because we have no room for further
exploration here, though, we’ll have to be satisfied with the fact that
our exception handler sidesteps the printing problem altogether. You
should still be aware of the implications of Unicode filename decoding,
though; on some platforms you may need to pass byte strings to os.walk
in this script to prevent decoding errors as filenames are created.[18]
Since Unicode is still relatively new in 3.1, be sure to test for
such errors on your computer and your Python. See also Python’s
manuals for more on the treatment of Unicode filenames, and the text
Learning Python for more on Unicode in general. As noted earlier, our
scripts also had to open text files in binary mode because some might
contain undecodable content too.
It might seem surprising that Unicode issues can crop up in basic
printing like this too, but such is life in the brave new Unicode world.
Many real-world scripts don’t need to care much about Unicode, of
course—including those we’ll explore in the next
section.
[18] For a related print issue, see Chapter 14’s workaround for program
aborts when printing stack tracebacks to standard output from
spawned programs. Unlike the problem described here, that issue does
not appear to be related to Unicode characters that may be
unprintable in shell windows but reflects another regression for
standard output prints in general in Python 3.1, which may or may
not be repaired by the time you read this text. See also the Python
environment variable PYTHONIOENCODING, which can override the
default encoding used for standard streams.
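As a rough illustration of the PYTHONIOENCODING setting, the sketch below runs a child Python process with the variable set to utf-8 and captures the bytes it writes to standard output. The child command is made up for the demo:

```python
import os
import subprocess
import sys

# Copy the current environment and override the standard stream encoding
env = dict(os.environ, PYTHONIOENCODING='utf-8')

# The child prints the character that failed in the Command Prompt session
result = subprocess.run(
    [sys.executable, '-c', "print('\\u2019')"],
    env=env, stdout=subprocess.PIPE, check=True)

raw = result.stdout.strip()      # drop the platform's line ending
print(raw)
```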
Like most kids,
mine spent a lot of time on the Internet when they were
growing up. As far as I could tell, it was the thing to do. Among their
generation, computer geeks and gurus seem to have been held in the same
sort of esteem that my generation once held rock stars. When kids
disappeared into their rooms, chances were good that they were hacking on
computers, not mastering guitar riffs (well, real ones, at least). It may
or may not be healthier than some of the diversions of my own misspent
youth, but that’s a topic for another kind of book.
Despite the rhetoric of techno-pundits about the Web’s potential to
empower an upcoming generation in ways unimaginable by their predecessors,
my kids seemed to spend most of their time playing games. To fetch new
ones in my house at the time, they had to download to a shared computer
which had Internet access and transfer those games to their own computers
to install. (Their own machines did not have Internet access until later,
for reasons that most parents in the crowd could probably expand upon.)
The problem with this scheme is that game files are not small. They
were usually much too big to fit on a floppy or memory stick of the time,
and burning a CD or DVD took away valuable game-playing time. If all the
machines in my house ran Linux, this would have been a nonissue. There are
standard command-line programs on Unix for chopping a file into pieces
small enough to fit on a transfer device (split), and others for putting
the pieces back together to re-create the original file (cat). Because we
had all sorts of different machines in the house, though, we needed a
more portable solution.[19]
Since all the computers in my house ran Python, a simple portable Python
script came to the rescue. The Python program in Example 6-5 distributes
a single file’s contents among a set of part files and stores those part
files in a directory.
Example 6-5. PP4E\System\Filetools\split.py
#!/usr/bin/python
"""
################################################################################
split a file into a set of parts; join.py puts them back together;
this is a customizable version of the standard Unix split command-line
utility; because it is written in Python, it also works on Windows and
can be easily modified; because it exports a function, its logic can
also be imported and reused in other applications;
################################################################################
"""

import sys, os
kilobytes = 1024
megabytes = kilobytes * 1000
chunksize = int(1.4 * megabytes)                   # default: roughly a floppy

def split(fromfile, todir, chunksize=chunksize):
    if not os.path.exists(todir):                  # caller handles errors
        os.mkdir(todir)                            # make dir, read/write parts
    else:
        for fname in os.listdir(todir):            # delete any existing files
            os.remove(os.path.join(todir, fname))
    partnum = 0
    input = open(fromfile, 'rb')                   # binary: no decode, endline
    while True:                                    # eof=empty string from read
        chunk = input.read(chunksize)              # get next part <= chunksize
        if not chunk: break
        partnum += 1
        filename = os.path.join(todir, ('part%04d' % partnum))
        fileobj = open(filename, 'wb')
        fileobj.write(chunk)
        fileobj.close()                            # or simply open().write()
    input.close()
    assert partnum <= 9999                         # join sort fails if 5 digits
    return partnum

if __name__ == '__main__':
    if len(sys.argv) == 2 and sys.argv[1] == '-help':
        print('Use: split.py [file-to-split target-dir [chunksize]]')
    else:
        if len(sys.argv) < 3:
            interactive = True
            fromfile = input('File to be split? ')           # input if clicked
            todir = input('Directory to store part files? ')
        else:
            interactive = False
            fromfile, todir = sys.argv[1:3]                  # args in cmdline
            if len(sys.argv) == 4: chunksize = int(sys.argv[3])
        absfrom, absto = map(os.path.abspath, [fromfile, todir])
        print('Splitting', absfrom, 'to', absto, 'by', chunksize)

        try:
            parts = split(fromfile, todir, chunksize)
        except:
            print('Error during split:')
            print(sys.exc_info()[0], sys.exc_info()[1])
        else:
            print('Split finished:', parts, 'parts are in', absto)
        if interactive: input('Press Enter key')             # pause if clicked
By default, this script splits the input file into chunks that are
roughly the size of a floppy disk, perfect for moving big files between
the electronically isolated machines of the time. Most importantly,
because this is all portable Python code, this script will run on just
about any machine, even ones without their own file splitter. All it
requires is an installed Python. Here it is at work splitting a Python
3.1 self-installer executable located in the current working directory
on Windows (I’ve omitted a few dir output lines to save space here; use
ls -l on Unix):
C:\temp>cd C:\temp
C:\temp>dir python-3.1.msi
...more...
06/27/2009 04:53 PM 13,814,272 python-3.1.msi
1 File(s) 13,814,272 bytes
0 Dir(s) 188,826,189,824 bytes free
C:\temp>python C:\...\PP4E\System\Filetools\split.py -help
Use: split.py [file-to-split target-dir [chunksize]]
C:\temp>python C:\...\PP4E\System\Filetools\split.py python-3.1.msi pysplit
Splitting C:\temp\python-3.1.msi to C:\temp\pysplit by 1433600
Split finished: 10 parts are in C:\temp\pysplit
C:\temp>dir pysplit
...more...
02/21/2010 11:13 AM <DIR> .
02/21/2010 11:13 AM <DIR> ..
02/21/2010 11:13 AM 1,433,600 part0001
02/21/2010 11:13 AM 1,433,600 part0002
02/21/2010 11:13 AM 1,433,600 part0003
02/21/2010 11:13 AM 1,433,600 part0004
02/21/2010 11:13 AM 1,433,600 part0005
02/21/2010 11:13 AM 1,433,600 part0006
02/21/2010 11:13 AM 1,433,600 part0007
02/21/2010 11:13 AM 1,433,600 part0008
02/21/2010 11:13 AM 1,433,600 part0009
02/21/2010 11:13 AM 911,872 part0010
10 File(s) 13,814,272 bytes
2 Dir(s) 188,812,328,960 bytes free
Each of these generated part files represents one binary chunk of
the file python-3.1.msi, a chunk small enough to fit comfortably on a
floppy disk of the time. In fact, if you add the sizes of the generated
part files given by the dir command (ls -l on Unix), you’ll come up with
exactly the same number of bytes as the original file’s size: nine parts
of 1,433,600 bytes plus one of 911,872 bytes total 13,814,272 bytes.
Before we see how to put these files back together again, here are a few
points to ponder as you study this script’s code:
This script is designed to input its parameters in either
interactive or command-line mode; it checks the number of
command-line arguments to find out the mode in which it is being
used. In command-line mode, you list the file to be split and the
output directory on the command line, and you can optionally
override the default part file size with a third command-line
argument.
In interactive mode, the script asks for a filename and
output directory at the console window with input and pauses for a
key press at the end before exiting. This mode is nice when the
program file is started by clicking on its icon; on Windows,
parameters are typed into a pop-up DOS box that doesn’t
automatically disappear. The script also shows the absolute paths
of its parameters (by running them through os.path.abspath)
because they may not be obvious in interactive mode.
This code is careful to open both input and output files in
binary mode (rb, wb), because it needs to portably handle
things like executables and audio files, not just text. In
Chapter 4, we learned that on Windows, text-mode files
automatically map \r\n end-of-line sequences to \n on input and
map \n to \r\n on output. For true binary data, we really don’t
want any \r characters in the data to go away when read, and we
don’t want any superfluous \r characters to be added on output.
Binary-mode files suppress this \r mapping when the script is run
on Windows and so avoid data corruption.
In Python 3.X, binary mode also means that file data is bytes
objects in our script, not decoded str text, though we
don’t need to do anything special: this script’s file processing
code runs the same on Python 3.X as it did on 2.X. In fact, binary
mode is required in 3.X for this program, because the target
file’s data may not be encoded text at all; text mode requires
that file content be decodable in 3.X, and that might fail
both for truly binary data and for text files obtained from other
platforms. On output, binary mode accepts bytes and suppresses
Unicode encoding and line-end translations.
This script also goes out of its way to manually close its
files. As we also saw in Chapter 4, we can often get by with a
single line: open(partname, 'wb').write(chunk). This shorter form
relies on the fact that the current Python implementation
automatically closes files for you when file objects are reclaimed
(i.e., when they are garbage collected, because there are no more
references to the file object). In this one-liner, the file object
would be reclaimed immediately, because the open result is
temporary in an expression and is never referenced by a
longer-lived name. Similarly, the input file is reclaimed when the
split function exits.
However, it’s not impossible that this automatic-close
behavior may go away in the future. Moreover, the Jython
Java-based Python implementation does not reclaim unreferenced
objects as immediately as the standard Python. You should close
manually if you care about the Java port, your script may
potentially create many files in a short amount of time, and it
may run on a machine that has a limit on the number of open files
per program. Because the split function in this module is intended
to be a general-purpose tool, it accommodates such worst-case
scenarios. Also see Chapter 4’s mention of the file context
manager and the with statement; this provides an alternative way
to guarantee file closes.
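As one illustration of that last point, the splitting loop can be written with the with statement so that each file is closed when its block exits, even if an exception occurs. This is a sketch rather than the book’s listing: split_with is a hypothetical name, the sample data and chunk size are made up, and unlike split it does not delete files already present in the target directory:

```python
import os
import tempfile

def split_with(fromfile, todir, chunksize=1024):
    """Hypothetical variant of split() that closes files via 'with'."""
    os.makedirs(todir, exist_ok=True)
    partnum = 0
    with open(fromfile, 'rb') as source:             # closed on block exit
        while True:
            chunk = source.read(chunksize)
            if not chunk:
                break
            partnum += 1
            partname = os.path.join(todir, 'part%04d' % partnum)
            with open(partname, 'wb') as part:       # closed after each part
                part.write(chunk)
    return partnum

# Demo on throwaway binary data that includes \r and \n bytes
with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, 'data.bin')
    with open(original, 'wb') as f:
        f.write(b'\r\n\x00\xff' * 1000)              # 4,000 bytes in all
    partsdir = os.path.join(tmp, 'parts')
    count = split_with(original, partsdir, chunksize=1024)
    names = sorted(os.listdir(partsdir))
    sizes = [os.path.getsize(os.path.join(partsdir, n)) for n in names]

print(count, sizes, sum(sizes))
```

Because both files are opened in binary mode, the part sizes add up to exactly the size of the input, just as in the directory listing shown earlier.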
Back to moving big files around the house: after downloading a big game
program file, you can run the previous splitter script by clicking on its
name in Windows Explorer and typing filenames. After a split, simply copy
each part file onto its own floppy (or other more modern medium), walk
the files to the destination machine, and re-create the split output
directory on the target computer by copying the part files. Finally, the
script in Example 6-6 is clicked or otherwise run to put the parts back
together.
Example 6-6. PP4E\System\Filetools\join.py
#!/usr/bin/python
"""
################################################################################
join all part files in a dir created by split.py, to re-create file.
This is roughly like a 'cat fromdir/* > tofile' command on unix, but is
more portable and configurable, and exports the join operation as a
reusable function. Relies on sort order of filenames: must be same
length. Could extend split/join to pop up Tkinter file selectors.
################################################################################
"""

import os, sys
readsize = 1024

def join(fromdir, tofile):
    output = open(tofile, 'wb')
    parts = os.listdir(fromdir)
    parts.sort()
    for filename in parts:
        filepath = os.path.join(fromdir, filename)
        fileobj = open(filepath, 'rb')
        while True:
            filebytes = fileobj.read(readsize)
            if not filebytes: break
            output.write(filebytes)
        fileobj.close()
    output.close()

if __name__ == '__main__':
    if len(sys.argv) == 2 and sys.argv[1] == '-help':
        print('Use: join.py [from-dir-name to-file-name]')
    else:
        if len(sys.argv) != 3:
            interactive = True
            fromdir = input('Directory containing part files? ')
            tofile = input('Name of file to be recreated? ')
        else:
            interactive = False
            fromdir, tofile = sys.argv[1:]
        absfrom, absto = map(os.path.abspath, [fromdir, tofile])
        print('Joining', absfrom, 'to make', absto)

        try:
            join(fromdir, tofile)
        except:
            print('Error joining files:')
            print(sys.exc_info()[0], sys.exc_info()[1])
        else:
            print('Join complete: see', absto)
        if interactive: input('Press Enter key')    # pause if clicked
Here is a join in progress on Windows, combining the split files
we made a moment ago; after running the join script, you still may need
to run something like zip, gzip, or tar to unpack an archive file unless
it’s shipped as an executable, but at least the original downloaded file
is set to go[20]:
C:\temp>python C:\...\PP4E\System\Filetools\join.py -help
Use: join.py [from-dir-name to-file-name]
C:\temp>python C:\...\PP4E\System\Filetools\join.py pysplit mypy31.msi
Joining C:\temp\pysplit to make C:\temp\mypy31.msi
Join complete: see C:\temp\mypy31.msi
C:\temp>dir *.msi
...more...
02/21/2010 11:21 AM 13,814,272 mypy31.msi
06/27/2009 04:53 PM 13,814,272 python-3.1.msi
2 File(s) 27,628,544 bytes
0 Dir(s) 188,798,611,456 bytes free
C:\temp>fc /b mypy31.msi python-3.1.msi
Comparing files mypy31.msi and PYTHON-3.1.MSI
FC: no differences encountered
The join script simply uses os.listdir to collect all the part files in
a directory created by split, and sorts the filename list to put the
parts back together in the correct order. We get back an exact
byte-for-byte copy of the original file (proved by the DOS fc command in
the preceding session; use cmp on Unix).
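A portable alternative to fc and cmp is the standard library’s filecmp module. The sketch below uses made-up file names and contents; passing shallow=False asks filecmp.cmp to compare file contents rather than rely on os.stat signatures alone:

```python
import filecmp
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-ins for the original and the re-created file
    original = os.path.join(tmp, 'original.bin')
    rejoined = os.path.join(tmp, 'rejoined.bin')
    damaged = os.path.join(tmp, 'damaged.bin')

    payload = bytes(range(256)) * 100
    for path, data in [(original, payload),
                       (rejoined, payload),
                       (damaged, payload[:-1] + b'!')]:
        with open(path, 'wb') as f:
            f.write(data)

    same = filecmp.cmp(original, rejoined, shallow=False)
    different = filecmp.cmp(original, damaged, shallow=False)

print(same, different)
```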
Some of this process is still manual, of course (I never did
figure out how to script the “walk the floppies to your bedroom” step),
but the split and join scripts make it both quick and simple to
move big files around. Because this script is also portable Python code,
it runs on any platform to which we cared to move split files. For
instance, my home computers ran both Windows and Linux at the time;
since this script runs on either platform, the gamers were covered.
Before we move on, here are a couple of implementation details worth
underscoring in the join script’s code:
First of all, notice that this script deals with files in
binary mode but also reads each part file in blocks of 1 KB each.
In fact, the readsize setting here (the size of each block read
from an input part file) has no relation to chunksize in split.py
(the total size of each output part file). As we learned in
Chapter 4, this script could instead read each part file all at
once: output.write(open(filepath, 'rb').read()). The downside to
this scheme is that it really does load all of a file into memory
at once. For example, reading a 1.4 MB part file into memory all
at once with the file object read method generates a 1.4 MB byte
string in memory to hold the file’s bytes. Since split allows
users to specify even larger chunk sizes, the join script plans
for the worst and reads in terms of limited-size blocks. To be
completely robust, the split script could read its input data in
smaller chunks too, but this hasn’t become a concern in practice
(recall that as your program runs, Python automatically reclaims
strings that are no longer referenced, so this isn’t as wasteful
as it might seem).
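The same block-at-a-time copy is available from the standard library as shutil.copyfileobj, which reads and writes in pieces of a given length. The sketch below is not how join.py is written: join_streams is a hypothetical helper, and in-memory streams stand in for the part files and the output file:

```python
import io
import shutil

def join_streams(parts, output, readsize=1024):
    """Copy each part into output in blocks of at most readsize bytes."""
    for part in parts:
        shutil.copyfileobj(part, output, readsize)

# In-memory stand-ins for two part files and the re-created file
parts = [io.BytesIO(b'a' * 3000), io.BytesIO(b'b' * 500)]
output = io.BytesIO()
join_streams(parts, output)
result = output.getvalue()

print(len(result))
```

Either way, memory use is bounded by the block size rather than by the size of the largest part file.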
If you study this script’s code closely, you may also notice
that the join scheme it uses relies completely on the sort order
of filenames in the parts directory. Because it simply calls the
list sort method on the filenames list returned by os.listdir, it
implicitly requires that filenames have the same length and format
when created by split. To satisfy this requirement, the splitter
uses zero-padding notation in a string formatting expression
('part%04d') to make sure that filenames all have the same number
of digits at the end (four). When sorted, the leading zero
characters in small numbers guarantee that part files are ordered
for joining correctly.
Alternatively, we could strip off digits in filenames,
convert them with int, and sort numerically, by using the list
sort method’s key argument, but that would still imply that all
filenames must start with the same type of substring, and so
doesn’t quite remove the file-naming dependency between the split
and join scripts. Because these scripts are designed to be two
steps of the same process, though, some dependencies between them
seem reasonable.
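To make the trade-off concrete, here is a small sketch of both orderings. The five-digit name part10000 is hypothetical, since split.py’s assert prevents it from ever being created, and the key function assumes every name begins with the prefix 'part':

```python
# Part names as os.listdir might return them, in arbitrary order
names = ['part10000', 'part0002', 'part9999', 'part0001']

# Plain string sort: 'part10000' lands before 'part9999'
lexical = sorted(names)

# Numeric sort: strip the assumed 'part' prefix and compare as integers
numeric = sorted(names, key=lambda name: int(name[len('part'):]))

print(lexical)
print(numeric)
```

The numeric version tolerates a varying number of digits, but it still depends on the shared prefix, which is the residual coupling between the two scripts that the text describes.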