Before we leave
our file tools survey, it’s time for something that
performs a more tangible task and illustrates some of what we’ve learned
so far. Unlike some shell-tool languages, Python doesn’t have an
implicit file-scanning loop procedure, but it’s simple to write a
general one that we can reuse for all time. The module in
Example 4-1
defines a general
file-scanning routine, which simply applies a passed-in Python function
to each line in an external file.
Example 4-1. PP4E\System\Filetools\scanfile.py
def scanner(name, function):
    file = open(name, 'r')              # create a file object
    while True:
        line = file.readline()          # call file methods
        if not line: break              # until end-of-file
        function(line)                  # call a function object
    file.close()
The scanner function doesn't care what line-processing function is passed in,
and that accounts for most of its generality—it is happy to apply any
single-argument function that exists now or in the future to all of the lines
in a text file. If we code this module and put it in a directory on the module
search path, we can use it any time we need to step through a file line by line.
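For instance, a quick sketch of the module's reuse (the scanner body is repeated from Example 4-1 so the snippet stands alone, and the test filename and its contents are invented here):

```python
# scanner repeated from Example 4-1 so this sketch runs standalone
def scanner(name, function):
    file = open(name, 'r')
    while True:
        line = file.readline()
        if not line: break
        function(line)
    file.close()

f = open('test.txt', 'w')               # a throwaway two-line file
f.write('spam\neggs\n')
f.close()

collected = []
scanner('test.txt', collected.append)   # any one-argument function works
print(collected)                        # ['spam\n', 'eggs\n']
```

Because the function is just an object passed by reference, a bound method like a list's append works as well as a def or a lambda.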
Example 4-2
is a client script that
does simple line translations.
Example 4-2. PP4E\System\Filetools\commands.py
#!/usr/local/bin/python
from sys import argv
from scanfile import scanner

class UnknownCommand(Exception): pass

def processLine(line):                       # define a function
    if line[0] == '*':                       # applied to each line
        print("Ms.", line[1:-1])
    elif line[0] == '+':
        print("Mr.", line[1:-1])             # strip first and last char: \n
    else:
        raise UnknownCommand(line)           # raise an exception

filename = 'data.txt'
if len(argv) == 2: filename = argv[1]        # allow filename cmd arg
scanner(filename, processLine)               # start the scanner
The text file
hillbillies.txt
contains the
following lines:
*Granny
+Jethro
*Elly May
+"Uncle Jed"
and our commands script could be run as follows:
C:\...\PP4E\System\Filetools>python commands.py hillbillies.txt
Ms. Granny
Mr. Jethro
Ms. Elly May
Mr. "Uncle Jed"
This works, but there are a variety of coding alternatives for both files,
some of which may be better than those listed above. For instance, we could
also code the command processor of Example 4-2 in the following way;
especially if the number of command options starts to become large, such a
data-driven approach may be more concise and easier to maintain than a large
if statement with essentially redundant actions (if you ever have to change
the way output lines print, you'll have to change it in only one place with
this form):
commands = {'*': 'Ms.', '+': 'Mr.'}     # data is easier to expand than code?

def processLine(line):
    try:
        print(commands[line[0]], line[1:-1])
    except KeyError:
        raise UnknownCommand(line)
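A quick sketch of the table's advantage: adding a command means adding a dictionary entry, not a new branch (the '?' prefix and 'Dr.' title here are hypothetical extensions, not part of the original example):

```python
# hypothetical extension: a third '?' command added as data only
commands = {'*': 'Ms.', '+': 'Mr.', '?': 'Dr.'}

class UnknownCommand(Exception): pass

def processLine(line):
    try:
        print(commands[line[0]], line[1:-1])
    except KeyError:
        raise UnknownCommand(line)

processLine('?Quinn\n')                 # prints: Dr. Quinn
```

Unknown prefixes still raise the exception, exactly as in the if-statement version, because they miss in the dictionary instead of falling through to an else clause.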
The scanner could similarly be improved. As a rule of thumb, we can also
usually speed things up by shifting processing from Python code to built-in
tools. For instance, if we're concerned with speed, we can probably make our
file scanner faster by using the file's line iterator to step through the
file instead of the manual readline loop in Example 4-1 (though you'd have
to time this with your Python to be sure):
def scanner(name, function):
    for line in open(name, 'r'):        # scan line by line
        function(line)                  # call a function object
And we can work more magic in Example 4-1 with iteration tools like the map
built-in function, the list comprehension expression, and the generator
expression. Here are three minimalist's versions; the for loop is replaced
by map or a comprehension, and we let Python close the file for us when it
is garbage collected or the script exits (these all build a temporary list
of results along the way to run through their iterations, but this overhead
is likely trivial for all but the largest of files):
def scanner(name, function):
    list(map(function, open(name, 'r')))

def scanner(name, function):
    [function(line) for line in open(name, 'r')]

def scanner(name, function):
    list(function(line) for line in open(name, 'r'))
The preceding
works as planned, but what if we also want to
change
a file while scanning it?
Example 4-3
shows two approaches:
one uses explicit files, and the other uses the standard input/output
streams to allow for redirection on the command line.
Example 4-3. PP4E\System\Filetools\filters.py
import sys

def filter_files(name, function):        # filter file through function
    input = open(name, 'r')              # create file objects
    output = open(name + '.out', 'w')    # explicit output file too
    for line in input:
        output.write(function(line))     # write the modified line
    input.close()
    output.close()                       # output has a '.out' suffix

def filter_stream(function):             # no explicit files
    while True:                          # use standard streams
        line = sys.stdin.readline()      # or: input()
        if not line: break
        print(function(line), end='')    # or: sys.stdout.write()

if __name__ == '__main__':
    filter_stream(lambda line: line)     # copy stdin to stdout if run
Notice that the newer context managers feature discussed earlier could save
us a few lines here in the file-based filter of Example 4-3, and also
guarantee immediate file closures if the processing function fails with an
exception:
def filter_files(name, function):
    with open(name, 'r') as input, open(name + '.out', 'w') as output:
        for line in input:
            output.write(function(line))    # write the modified line
And again, file object
line iterators
could simplify the stream-based filter’s code in this
example as well:
def filter_stream(function):
    for line in sys.stdin:               # read by lines automatically
        print(function(line), end='')
Since the standard streams are preopened for us, they're often easier to
use. When run standalone, it simply parrots stdin to stdout:
C:\...\PP4E\System\Filetools>filters.py < hillbillies.txt
*Granny
+Jethro
*Elly May
+"Uncle Jed"
But this module is also useful when imported as a library
(clients provide the line-processing function):
>>> from filters import filter_files
>>> filter_files('hillbillies.txt', str.upper)
>>> print(open('hillbillies.txt.out').read())
*GRANNY
+JETHRO
*ELLY MAY
+"UNCLE JED"
We’ll see files in action often in the remainder of this book,
especially in the more complete and functional system examples of
Chapter 6
. First though, we turn to
tools for processing our files’
home.
[9] For instance, to process pipes, described in Chapter 5. The Python
os.pipe call returns two file descriptors, which can be processed with os
module file tools or wrapped in a file object with os.fdopen. When used with
descriptor-based file tools in os, pipes deal in byte strings, not text.
Some device files may require lower-level control as well.
[10] For related tools, see also the shutil module in Python's standard
library; it has higher-level tools for copying and removing files and more.
We'll also write directory compare, copy, and search tools of our own in
Chapter 6, after we've had a chance to study the directory tools presented
later in this chapter.
One of the more common tasks in the shell utilities domain is applying an
operation to a set of files in a directory—a "folder" in Windows-speak. By
running a script on a batch of files, we can automate (that is, script)
tasks we might otherwise have to run repeatedly by hand.
For instance, suppose you need to search all of your Python files in
a development directory for a global variable name (perhaps you’ve
forgotten where it is used). There are many platform-specific ways to do
this (e.g., the find and grep commands in Unix), but Python scripts that
accomplish such tasks will work on every platform where Python
works—Windows, Unix, Linux, Macintosh, and just about any other platform
commonly used today. If you simply copy your script to any machine you
wish to use it on, it will work regardless of which other tools are
available there; all you need is Python. Moreover, coding such tasks in
Python also allows you to perform arbitrary actions along the
way—replacements, deletions, and whatever else you can code in the Python
language.
The most common way to go about writing such tools is to first grab a list
of the names of the files you wish to process, and then step through that
list with a Python for loop or other iteration tool, processing each file in
turn. The trick we need to learn here, then, is how to get such a directory
list within our scripts. For scanning directories there are at least three
options: running shell listing commands with os.popen, matching filename
patterns with glob.glob, and getting directory listings with os.listdir.
They vary in interface, result format, and portability.
How did you go
about getting directory file listings before you heard
of Python? If you’re new to shell tools programming, the answer may be
“Well, I started a Windows file explorer and clicked on things,” but
I’m thinking here in terms of less GUI-oriented command-line
mechanisms.
On Unix, directory listings are usually obtained by typing ls in a shell; on
Windows, they can be generated with a dir command typed in an MS-DOS console
box. Because Python scripts may use os.popen to run any command line that we
can type in a shell, they are the most general way to grab a directory
listing inside a Python program. We met os.popen in the prior chapters; it
runs a shell command string and gives us a file object from which we can
read the command's output. To illustrate, let's first assume the following
directory structures—I have both the usual dir and a Unix-like ls command
from Cygwin on my Windows laptop:
c:\temp>dir /B
parts
PP3E
random.bin
spam.txt
temp.bin
temp.txt
c:\temp>c:\cygwin\bin\ls
PP3E parts random.bin spam.txt temp.bin temp.txt
c:\temp>c:\cygwin\bin\ls parts
part0001 part0002 part0003 part0004
The parts and PP3E names are nested subdirectories in C:\temp here (the
latter is a copy of the prior edition's examples tree, which I used
occasionally in this text). Now, as we've seen, scripts can grab a listing
of file and directory names at this level by simply spawning the appropriate
platform-specific command line and reading its output (the text normally
thrown up on the console window):
C:\temp>python
>>> import os
>>> os.popen('dir /B').readlines()
['parts\n', 'PP3E\n', 'random.bin\n', 'spam.txt\n', 'temp.bin\n', 'temp.txt\n']
Lines read from a shell command come back with a trailing end-of-line
character, but it's easy enough to slice it off; the os.popen result also
gives us a line iterator just like normal files:
>>> for line in os.popen('dir /B'):
...     print(line[:-1])
...
parts
PP3E
random.bin
spam.txt
temp.bin
temp.txt
>>> lines = [line[:-1] for line in os.popen('dir /B')]
>>> lines
['parts', 'PP3E', 'random.bin', 'spam.txt', 'temp.bin', 'temp.txt']
For pipe objects, the effect of iterators may be even more useful than
simply avoiding loading the entire result into memory all at once: readlines
will always block the caller until the spawned program is completely
finished, whereas the iterator might not.
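As a hedged sketch of the line-by-line style (using a spawned Python process as a portable stand-in for a shell listing command, since dir and ls vary by platform):

```python
import os
import sys

# iterate the pipe's lines as they arrive, instead of calling readlines;
# the child here just prints two lines, but a slow producer would make the
# blocking difference visible
cmd = '"%s" -c "print(100); print(200)"' % sys.executable
collected = []
for line in os.popen(cmd):
    collected.append(line.rstrip('\n'))
print(collected)                       # ['100', '200']
```

Each pass of the for loop can proceed as soon as a line is available from the child, while readlines cannot return until the child's output stream is closed.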
The dir and ls commands let us be specific about filename patterns to be
matched and directory names to be listed by using name patterns; again,
we're just running shell commands here, so anything you can type at a shell
prompt goes:
>>> os.popen('dir *.bin /B').readlines()
['random.bin\n', 'temp.bin\n']
>>> os.popen(r'c:\cygwin\bin\ls *.bin').readlines()
['random.bin\n', 'temp.bin\n']
>>> list(os.popen(r'dir parts /B'))
['part0001\n', 'part0002\n', 'part0003\n', 'part0004\n']
>>> [fname for fname in os.popen(r'c:\cygwin\bin\ls parts')]
['part0001\n', 'part0002\n', 'part0003\n', 'part0004\n']
These calls use general tools and work as advertised. As I noted earlier,
though, the downsides of os.popen are that it requires using a
platform-specific shell command and it incurs a performance hit to start up
an independent program. In fact, different listing tools may sometimes
produce different results:
>>> list(os.popen(r'dir parts\part* /B'))
['part0001\n', 'part0002\n', 'part0003\n', 'part0004\n']
>>>
>>> list(os.popen(r'c:\cygwin\bin\ls parts/part*'))
['parts/part0001\n', 'parts/part0002\n', 'parts/part0003\n', 'parts/part0004\n']
The next two alternative techniques do better on both
counts.
The term globbing comes from the * wildcard character in filename patterns;
per computing folklore, a * matches a "glob" of characters. In less poetic
terms, globbing simply means collecting the names of all entries in a
directory—files and subdirectories—whose names match a given filename
pattern. In Unix shells, globbing expands filename patterns within a command
line into all matching filenames before the command is ever run. In Python,
we can do something similar by calling the glob.glob built-in—a tool that
accepts a filename pattern to expand, and returns a list (not a generator)
of matching file names:
>>> import glob
>>> glob.glob('*')
['parts', 'PP3E', 'random.bin', 'spam.txt', 'temp.bin', 'temp.txt']
>>> glob.glob('*.bin')
['random.bin', 'temp.bin']
>>> glob.glob('parts')
['parts']
>>> glob.glob('parts/*')
['parts\\part0001', 'parts\\part0002', 'parts\\part0003', 'parts\\part0004']
>>> glob.glob('parts\part*')
['parts\\part0001', 'parts\\part0002', 'parts\\part0003', 'parts\\part0004']
The glob call accepts the usual filename pattern syntax used in shells: ?
means any one character, * means any number of characters, and [] is a
character selection set.[11] The pattern should include a directory path if
you wish to glob in something other than the current working directory, and
the module accepts either Unix or DOS-style directory separators (/ or \).
This call is implemented without spawning a shell command (it uses
os.listdir, described in the next section) and so is likely to be faster and
more portable and uniform across all Python platforms than the os.popen
schemes shown earlier.
Technically speaking, glob is a bit more powerful than described so far. In
fact, using it to list files in one directory is just one use of its
pattern-matching skills. For instance, it can also be used to collect
matching names across multiple directories, simply because each level in a
passed-in directory path can be a pattern too:
>>> for path in glob.glob(r'PP3E\Examples\PP3E\*\s*.py'): print(path)
...
PP3E\Examples\PP3E\Lang\summer-alt.py
PP3E\Examples\PP3E\Lang\summer.py
PP3E\Examples\PP3E\PyTools\search_all.py
Here, we get back filenames from two different directories that match the
s*.py pattern; because the directory name preceding this is a * wildcard,
Python collects all possible ways to reach the base filenames. Using
os.popen to spawn shell commands achieves the same effect, but only if the
underlying shell or listing command does, too, and with possibly different
result formats across tools and platforms.
The os module's listdir call provides yet another way to collect filenames
in a Python list. It takes a simple directory name string, not a filename
pattern, and returns a list containing the names of all entries in that
directory—both simple files and nested directories—for use in the calling
script:
>>> import os
>>> os.listdir('.')
['parts', 'PP3E', 'random.bin', 'spam.txt', 'temp.bin', 'temp.txt']
>>>
>>> os.listdir(os.curdir)
['parts', 'PP3E', 'random.bin', 'spam.txt', 'temp.bin', 'temp.txt']
>>>
>>> os.listdir('parts')
['part0001', 'part0002', 'part0003', 'part0004']
This, too, is done without resorting to shell commands and so is both fast
and portable to all major Python platforms. The result is not in any
particular order across platforms (but can be sorted with the list sort
method or the sorted built-in function); returns base filenames without
their directory path prefixes; does not include the names "." or ".." if
present; and includes names of both files and directories at the listed
level.
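A small sketch of coping with these properties (the scratch directory and filenames here are invented; sorting and joining are the usual first steps):

```python
import os
import tempfile

# build a scratch directory holding two files and one subdirectory
tmp = tempfile.mkdtemp()
for name in ('b.txt', 'a.txt'):
    open(os.path.join(tmp, name), 'w').close()
os.mkdir(os.path.join(tmp, 'sub'))

names = sorted(os.listdir(tmp))                  # impose a predictable order
paths = [os.path.join(tmp, n) for n in names]    # expand to full paths
print(names)                                     # ['a.txt', 'b.txt', 'sub']
```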
To compare all three listing techniques, let's run them here side by side on
an explicit directory. They differ in some ways but are mostly just
variations on a theme for this task—os.popen returns end-of-lines and may
sort filenames on some platforms, glob.glob accepts a pattern and returns
filenames with directory prefixes, and os.listdir takes a simple directory
name and returns names without directory prefixes:
>>> os.popen('dir /b parts').readlines()
['part0001\n', 'part0002\n', 'part0003\n', 'part0004\n']
>>> glob.glob(r'parts\*')
['parts\\part0001', 'parts\\part0002', 'parts\\part0003', 'parts\\part0004']
>>> os.listdir('parts')
['part0001', 'part0002', 'part0003', 'part0004']
Of these three, glob and listdir are generally better options if you care
about script portability and result uniformity, and listdir seems fastest in
recent Python releases (but gauge its performance yourself—implementations
may change over time).
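A rough way to gauge this yourself is the standard library's timeit module (a sketch only; the absolute numbers, and possibly even the winner, vary by platform and Python release):

```python
import timeit

# time 1,000 scans of the current directory with each tool
t_listdir = timeit.timeit("os.listdir('.')", setup="import os", number=1000)
t_glob = timeit.timeit("glob.glob('*')", setup="import glob", number=1000)
print('listdir: %.4fs  glob: %.4fs' % (t_listdir, t_glob))
```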
In the last example, I pointed out that glob returns names with directory
paths, whereas listdir gives raw base filenames. For convenient processing,
scripts often need to split glob results into base files or expand listdir
results into full paths. Such translations are easy if we let the os.path
module do all the work for us. For example, a script that intends to copy
all files elsewhere will typically need to first split off the base
filenames from glob results so that it can add different directory names on
the front:
>>> dirname = r'C:\temp\parts'
>>>
>>> import glob, os
>>> for file in glob.glob(dirname + '/*'):
...     head, tail = os.path.split(file)
...     print(head, tail, '=>', ('C:\\Other\\' + tail))
...
C:\temp\parts part0001 => C:\Other\part0001
C:\temp\parts part0002 => C:\Other\part0002
C:\temp\parts part0003 => C:\Other\part0003
C:\temp\parts part0004 => C:\Other\part0004
Here, the names after the => represent names that files might be moved to.
Conversely, a script that means to process all files in a different
directory than the one it runs in will probably need to prepend listdir
results with the target directory name before passing filenames on to other
tools:
>>> import os
>>> for file in os.listdir(dirname):
...     print(dirname, file, '=>', os.path.join(dirname, file))
...
C:\temp\parts part0001 => C:\temp\parts\part0001
C:\temp\parts part0002 => C:\temp\parts\part0002
C:\temp\parts part0003 => C:\temp\parts\part0003
C:\temp\parts part0004 => C:\temp\parts\part0004
When you begin writing realistic directory processing tools of
the sort we’ll develop in
Chapter 6
,
you’ll find these calls to be almost
habit.