There’s one last
step on our path to software maintenance nirvana: we must
recode the reply page script itself to import data that was factored out
to the common module and import the reusable form mock-up module’s
tools. While we’re at it, we move code into functions (in case we ever
put things in this file that we’d like to import in another script), and
all HTML code to triple-quoted string blocks. The result is
Example 15-23. Changing HTML is
generally easier when it has been isolated in single strings like this,
instead of being sprinkled throughout a program.
Example 15-23. PP4E\Internet\Web\cgi-bin\languages2reply.py
#!/usr/bin/python
"""
Same, but for easier maintenance, use HTML template strings, get the
Language table and input key from common module file, and get reusable
form field mockup utilities module for testing.
"""
import cgi, sys
from formMockup import FieldMockup # input field simulator
from languages2common import hellos, inputkey # get common table, name
debugme = False
hdrhtml = """Content-type: text/html\n
<TITLE>Languages</TITLE>
<H1>Syntax</H1><HR>"""

langhtml = """
<H3>%s</H3><P><PRE>
%s
</PRE></P><BR>"""

def showHello(form):                            # HTML for one language
    choice = form[inputkey].value               # escape lang name too
    try:
        print(langhtml % (cgi.escape(choice),
                          cgi.escape(hellos[choice])))
    except KeyError:
        print(langhtml % (cgi.escape(choice),
                          "Sorry--I don't know that language"))

def main():
    if debugme:
        form = {inputkey: FieldMockup(sys.argv[1])}   # name on cmd line
    else:
        form = cgi.FieldStorage()                     # parse real inputs
    print(hdrhtml)
    if not inputkey in form or form[inputkey].value == 'All':
        for lang in hellos.keys():
            mock = {inputkey: FieldMockup(lang)}      # not dict(n=v) here!
            showHello(mock)
    else:
        showHello(form)
    print('<HR>')

if __name__ == '__main__': main()
When global debugme is set to True, the script can be tested
offline from a simple command line as before:
C:\...\PP4E\Internet\Web\cgi-bin> python languages2reply.py Python
Content-type: text/html

<TITLE>Languages</TITLE>
<H1>Syntax</H1><HR>

<H3>Python</H3><P><PRE>
print('Hello World')
</PRE></P><BR>
<HR>
When run online using either the page in Figure 15-25 or an explicitly
typed URL with query parameters, we get the same reply pages we saw for
the original version of this example (we won’t repeat them here again).
This transformation changed the program’s architecture, not its user
interface. Architecturally, though, both the input and reply pages are
now created by Python CGI scripts, not static HTML files.
Most of the code changes in this version of the reply script are
straightforward. If you test-drive these pages, the only differences
you’ll find are the URLs at the top of your browser (they’re different
files, after all), extra blank lines in the generated HTML (ignored by
the browser), and a potentially different ordering of language names in
the main page’s pull-down selection list.
Again, this selection list ordering difference arises because this
version relies on the order of the Python dictionary’s keys list, not on
a hardcoded list in an HTML file. Dictionaries, you’ll recall,
arbitrarily order entries for fast fetches; if you want the selection
list to be more predictable, simply sort the keys list before iterating
over it using the list sort method or the sorted function introduced in
Python 2.4:
for lang in sorted(hellos):                 # dict iterator instead of .keys()
    mock = {inputkey: FieldMockup(lang)}
Faking Inputs with Shell Variables
If you’re familiar with shells, you might also be able to test
CGI scripts from the command line on some platforms by setting the
same environment variables that HTTP servers set, and then launching
your script. For example, we might be able to pretend to be a web
server by storing input parameters in the QUERY_STRING environment
variable, using the same syntax we employ at the end of a URL string
after the ?:
$ setenv QUERY_STRING "name=Mel&job=trainer,+writer"
$ python tutor5.py
Content-type: text/html
a<b
Sorry--I don't know that language
The original version in Example 15-18 doesn’t escape the language name,
such that the embedded <b> is interpreted as an HTML tag (which makes
the rest of the page render in bold font!). As you can probably tell by
now, text escapes are pervasive in CGI scripting: even text that you may
think is safe must generally be escaped before being inserted into the
HTML code in the reply stream.
In fact, because the Web is a text-based medium that combines
multiple language syntaxes, multiple formatting rules may apply: one for
URLs and another for HTML. We met HTML escapes earlier in this chapter;
URLs, and combinations of HTML and URLs, merit a few additional
words.
Notice that in the prior section, although it’s wrong to embed an
unescaped < in the HTML code reply, it’s perfectly all right to include
it literally in the URL string used to trigger the reply. In fact, HTML
and URLs define completely different characters as special. For
instance, although & must be escaped as &amp; inside HTML code, we have
to use other escaping schemes to code a literal & within a URL string
(where it normally separates parameters). To pass a language name like
a&b to our script, we have to type the following URL:
http://localhost/cgi-bin/languages2reply.py?language=a%26b
Here, %26 represents &: the & is replaced with a % followed by the
hexadecimal value (0x26) of its ASCII code value (38). Similarly, as we
suggested at the end of Chapter 13, to name C++ as a query parameter in
an explicit URL, + must be escaped as %2b:
http://localhost/cgi-bin/languages2reply.py?language=C%2b%2b
Sending C++ unescaped will not work, because + is special in URL
syntax: it represents a space. By URL standards, most nonalphanumeric
characters are supposed to be translated to such escape sequences, and
spaces are replaced by + signs. Technically, this convention is known
as the application/x-www-form-urlencoded query string format, and it’s
part of the magic behind those bizarre URLs you often see at the top of
your browser as you surf the Web.
If you’re like me, you probably don’t have the hexadecimal value
of the ASCII code for & committed to memory (though Python’s
hex(ord(c)) can help). Luckily, Python provides tools that
automatically implement URL escapes, just as cgi.escape does for HTML
escapes. The main thing to keep in mind is that HTML code and URL
strings are written with entirely different syntax, and so employ
distinct escaping conventions. Web users don’t generally care, unless
they need to type complex URLs explicitly, because browsers handle most
escape code details internally. But if you write scripts that must
generate HTML or URLs, you need to be careful to escape characters that
are reserved in either syntax.
Because HTML and URLs have different syntaxes, Python provides two
distinct sets of tools for escaping their text. In the standard Python
library:
cgi.escape escapes text to be embedded in HTML.
urllib.parse.quote and quote_plus escape text to be embedded in URLs.
The urllib.parse module also has tools for undoing URL escapes (unquote,
unquote_plus), but HTML escapes are undone during HTML parsing at large
(e.g., by Python’s html.parser module). To illustrate the two escape
conventions and tools, let’s apply each tool set to a few simple
examples.
Somewhat inexplicably, Python 3.2 developers have opted to move
and rename the cgi.escape function used throughout this book to
html.escape, to make use of its longstanding original name deprecated,
and to alter its quoting behavior slightly. This is despite the fact
that this function has been around for ages and is used in almost every
Python CGI-based web script: a glaring case of a small group’s notion
of aesthetics trouncing widespread practice in 3.X and breaking working
code in the process. You may need to use the new html.escape name in a
future Python version; that is, unless Python users complain loudly
enough (yes, hint!).
As we saw earlier, cgi.escape translates code for inclusion within
HTML. We normally call this utility from a CGI script, but it’s just as
easy to explore its behavior interactively:
>>> import cgi
>>> cgi.escape('a < b > c & d "spam"', 1)
'a &lt; b &gt; c &amp; d &quot;spam&quot;'
>>> s = cgi.escape("1<2 <b>hello</b>")
>>> s
'1&lt;2 &lt;b&gt;hello&lt;/b&gt;'
Python’s cgi module automatically converts characters that are special
in HTML syntax according to the HTML convention. It translates <, >,
and &, and with an extra true argument, ", into escape sequences of the
form &X;, where the X is a mnemonic that denotes the original
character. For instance, &lt; stands for the “less than” operator (<)
and &amp; denotes a literal ampersand (&).
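On current Python releases the cgi.escape call shown above is no longer available (it was removed in Python 3.8), so the same experiment has to be run with html.escape. The following is a minimal sketch rather than code from this book’s examples; note that html.escape translates quotes by default, whereas cgi.escape needed the extra true argument to do so.

```python
import html

text = 'a < b > c & d "spam"'

# html.escape translates quote characters by default (quote=True)
with_quotes = html.escape(text)

# quote=False leaves quote characters alone, like cgi.escape's default
without_quotes = html.escape(text, quote=False)

print(with_quotes)
print(without_quotes)
```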
There is no unescaping tool in the CGI module, because HTML escape code
sequences are recognized within the context of an HTML parser, like the
one used by your web browser when a page is downloaded. Python comes
with a full HTML parser, too, in the form of the standard module
html.parser. We won’t go into details on the HTML parsing tools here
(they’re covered in Chapter 19 in conjunction with text processing),
but to illustrate how escape codes are eventually undone, here is the
HTML parser module at work reading back the preceding output:
>>> import cgi, html.parser
>>> s = cgi.escape("1<2 <b>hello</b>")
>>> s
'1&lt;2 &lt;b&gt;hello&lt;/b&gt;'
>>>
>>> html.parser.HTMLParser().unescape(s)
'1<2 <b>hello</b>'
This uses a utility method on the HTML parser class to unquote. In
Chapter 19, we’ll see that using this class for more substantial work
involves subclassing to override methods run as callbacks during the
parse upon detection of tags, data, entities, and more. For more on
full-blown HTML parsing, watch for the rest of this story in
Chapter 19.
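The HTMLParser.unescape method used above was removed in Python 3.9; the module-level html.unescape function (available since Python 3.4) does the same job. The round trip can be sketched as follows, assuming a current Python 3.X:

```python
import html

original = "1<2 <b>hello</b>"
escaped = html.escape(original)      # what a script would place in its reply
restored = html.unescape(escaped)    # what an HTML parser recovers from it

print(escaped)
print(restored)
```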
By contrast, URLs reserve other characters as special and must
adhere to different escape conventions. As a result, we use different
Python library tools to escape URLs for transmission. Python’s
urllib.parse module provides two tools that do the translation work for
us: quote, which implements the standard %XX hexadecimal URL escape
code sequences for most nonalphanumeric characters, and quote_plus,
which additionally translates spaces to + signs. The urllib.parse
module also provides functions for unescaping quoted characters in a
URL string: unquote undoes %XX escapes, and unquote_plus also changes
plus signs back to spaces. Here is the module at work, at the
interactive prompt:
>>> import urllib.parse
>>> urllib.parse.quote("a & b #! c")
'a%20%26%20b%20%23%21%20c'
>>> urllib.parse.quote_plus("C:\stuff\spam.txt")
'C%3A%5Cstuff%5Cspam.txt'
>>> x = urllib.parse.quote_plus("a & b #! c")
>>> x
'a+%26+b+%23%21+c'
>>> urllib.parse.unquote_plus(x)
'a & b #! c'
URL escape sequences embed the hexadecimal values of nonsafe
characters following a % sign (this is usually their ASCII codes). In
urllib.parse, nonsafe characters are usually taken to include
everything except letters, digits, and a handful of safe special
characters (any in '_.-'), but the two tools differ on forward slashes,
and you can extend the set of safe characters by passing an extra
string argument to the quote calls to customize the translations:
>>> urllib.parse.quote_plus("uploads/index.txt")
'uploads%2Findex.txt'
>>> urllib.parse.quote("uploads/index.txt")
'uploads/index.txt'
>>>
>>> urllib.parse.quote_plus("uploads/index.txt", '/')
'uploads/index.txt'
>>> urllib.parse.quote("uploads/index.txt", '/')
'uploads/index.txt'
>>> urllib.parse.quote("uploads/index.txt", '')
'uploads%2Findex.txt'
>>>
>>> urllib.parse.quote_plus("uploads\index.txt")
'uploads%5Cindex.txt'
>>> urllib.parse.quote("uploads\index.txt")
'uploads%5Cindex.txt'
>>> urllib.parse.quote_plus("uploads\index.txt", '\\')
'uploads\\index.txt'
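When a URL carries query parameters, the quoting calls above are usually applied to each name and value rather than to the whole string. The standard library’s urllib.parse.urlencode does this for a dictionary or a list of pairs, applying quote_plus to each part. The sketch below is illustrative only: reply_url is a hypothetical helper name, "Visual Basic" is a made-up input, and the base address is the local server URL used earlier in this section.

```python
from urllib.parse import urlencode

BASE = "http://localhost/cgi-bin/languages2reply.py"

def reply_url(language):
    """Hypothetical helper: build a reply-page URL with an escaped query."""
    return BASE + "?" + urlencode({"language": language})

for name in ("a&b", "C++", "Visual Basic"):
    print(reply_url(name))
```

urlencode emits uppercase hex digits (%2B rather than %2b); both spellings decode to the same character.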
Note that Python’s cgi module also translates URL escape sequences back
to their original characters and changes + signs to spaces during the
process of extracting input information. Internally, cgi.FieldStorage
automatically calls urllib.parse tools which unquote if needed to
parse and unescape parameters passed at the end of URLs. The upshot is
that CGI scripts get back the original, unescaped URL strings, and don’t
need to unquote values on their own. As we’ve seen, CGI scripts don’t
even need to know that inputs came from a URL at all.
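The unquoting described here can be observed directly with urllib.parse.parse_qs, which decodes a raw query string into a dictionary of value lists. This is a small sketch using query strings that appear earlier in this section; it calls the urllib.parse tool on its own and does not go through the cgi module.

```python
from urllib.parse import parse_qs

# query strings as they would appear after the ? in a URL
fields = parse_qs("language=C%2b%2b")
person = parse_qs("name=Mel&job=trainer,+writer")

print(fields)    # %2b escapes come back as literal + characters
print(person)    # the + in the job value comes back as a space
```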
We’ve seen how to escape text inserted into both HTML and URLs.
But what do we do for URLs inside HTML? That is, how do we escape when
we generate and embed text inside a URL, which is itself embedded inside
generated HTML code? Some of our earlier examples used hardcoded URLs
with appended input parameters inside hyperlink tags; the file
languages2.py, for instance, prints HTML that includes a URL:
Because the URL here is embedded in HTML, it must at least be
escaped according to HTML conventions (e.g., any < characters must
become &lt;), and any spaces should be translated to + signs per URL
conventions. A cgi.escape(url) call followed by the string
url.replace(" ", "+") would take us this far, and would probably
suffice for most cases.
That approach is not quite enough in general, though, because HTML
escaping conventions are not the same as URL conventions. To robustly
escape URLs embedded in HTML code, you should instead call
urllib.parse.quote_plus on the URL string, or at least most of its
components, before adding it to the HTML text. The escaped result also
satisfies HTML escape conventions, because urllib.parse translates more
characters than cgi.escape, and the % in URL escapes is not special to
HTML.
But there is one more astonishingly subtle (and thankfully rare)
wrinkle: you may also have to be careful with & characters in URL
strings that are embedded in HTML code (e.g., within hyperlink tags).
The & symbol is both a query parameter separator in URLs (?a=1&b=2)
and the start of escape codes in HTML (&lt;). Consequently, there is a
potential for collision if a query parameter name happens to be the
same as an HTML escape sequence code. The query parameter name amp,
for instance, that shows up as &amp=1 in parameters two and beyond on
the URL may be treated as an HTML escape by some HTML parsers, and
translated to &=1.
Even if parts of the URL string are URL-escaped, when more than
one parameter is separated by a &, the & separator might also have to
be escaped as &amp; according to HTML conventions. To see why, consider
the following HTML hyperlink tag with query parameter names name, job,
amp, sect, and lt:
<A HREF="file.py?name=a&job=b&amp=c&sect=d&lt=e">hello</A>
When rendered in most browsers tested, including Internet
Explorer on Windows 7, this URL link winds up looking incorrectly like
this (the S character in the first of these is really a non-ASCII
section marker):
file.py?name=a&job=b&=cS=d<=e         result in IE
file.py?name=a&job=b&=c%A7=d%3C=e     result in Chrome (0x3C is <)
The first two parameters are retained as expected (name=a, job=b),
because name is not preceded with an & and &job is not recognized as a
valid HTML character escape code. However, the &amp, &sect, and &lt
parts are interpreted as special characters because they do name valid
HTML escape codes, even without a trailing semicolon.
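The same misreading can be reproduced without a browser by using html.unescape, the standard library’s entity decoder in current Python releases, which follows HTML5 rules and so accepts some escape names without a trailing semicolon. This is only a sketch of the effect on the URL text: it treats the string as ordinary text, says nothing about how a particular browser handles attribute values today, and may differ from the older parser method tried later in this section.

```python
import html

url = "file.py?name=a&job=b&amp=c&sect=d&lt=e"

# &amp, &sect, and &lt name HTML escapes even without a semicolon;
# &job does not, so it is left alone
decoded = html.unescape(url)

print(decoded)
```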
To see this for yourself, open the example package’s
test-escapes.html file in your browser, and highlight or select its
link; the query names may be taken as HTML escapes. This text appears
to parse correctly in Python’s own HTML parser module described earlier
(unless the parts in question also end in a semicolon); that might
help for replies fetched manually with urllib.request, but not when
rendered in browsers:
>>> from html.parser import HTMLParser
>>> html = open('test-escapes.html').read()
>>> HTMLParser().unescape(html)
'\nhello\n'
What to do then? To make this work as expected in all cases, the &
separators should generally be escaped if your parameter names may
clash with an HTML escape code:
<A HREF="file.py?name=a&amp;job=b&amp;amp=c&amp;sect=d&amp;lt=e">hello</A>
Browsers render this fully escaped link as expected (open
test-escapes2.html to test), and Python’s HTML parser does the right
thing as well:
file.py?name=a&job=b&amp=c&sect=d&lt=e     result in both IE and Chrome
>>> h = '<A HREF="file.py?name=a&amp;job=b&amp;amp=c&amp;sect=d&amp;lt=e">hello</A>'
>>> HTMLParser().unescape(h)
'<A HREF="file.py?name=a&job=b&amp=c&sect=d&lt=e">hello</A>'
Because of this conflict between HTML and URL syntax, most
server tools (including Python’s urllib.parse query-parameter parsing
tools employed by Python’s cgi module) also allow a semicolon to be
used as a separator instead of &. The following link, for example,
works the same as the fully escaped URL, but does not require an extra
HTML escaping step (at least not for the ;):
file.py?name=a;job=b;amp=c;sect=d;lt=e
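Support for the semicolon form varies by tool and version. In Python’s own urllib.parse, releases from 3.9.2 onward (and security updates to some earlier release lines) split on a single separator only, & by default, and accept the semicolon through an explicit separator argument. The sketch below assumes such a release:

```python
from urllib.parse import parse_qs

query = "name=a;job=b;amp=c;sect=d;lt=e"

# ask for ; as the parameter separator explicitly
params = parse_qs(query, separator=";")

print(params)
```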
Python’s html.parser unescape tool allows the semicolons to pass
unchanged, too, simply because they are not significant in HTML code.
To fully test all three of these link forms for yourself at once, place
them in an HTML file, open the file in your browser using its
http://localhost/badlink.html URL, and view the links when followed.
The HTML file in Example 15-24 will suffice.
Example 15-24. PP4E\Internet\Web\badlink.html
When these links are clicked, they invoke the simple CGI script
in Example 15-25. This script displays the inputs sent from the client
on the standard error stream to avoid any additional translations (for
our locally running web server in Example 15-1, this routes the printed
text to the server’s console window).
Example 15-25. PP4E\Internet\Web\cgi-bin\badlink.py
import cgi, sys
form = cgi.FieldStorage()    # print all inputs to stderr; stdout=reply page
for name in form.keys():
    print('[%s:%s]' % (name, form[name].value), end=' ', file=sys.stderr)
Following is the (edited for space) output we get in our local
Python-coded web server’s console window for following each of the
three links in the HTML page in turn using Internet Explorer. The
second and third yield the correct parameters set on the server as a
result of the HTML escaping or URL conventions employed, but the
accidental HTML escapes cause serious issues for the first unescaped
link: the client’s HTML parser translates these in unintended ways
(results are similar under Chrome, but the first link displays the
non-ASCII section mark character with a different escape sequence):
mark-VAIO - - [16/Jun/2010 10:43:24] b'[:c\xa7=d<=e] [job:b] [name:a] '
mark-VAIO - - [16/Jun/2010 10:43:24] CGI script exited OK
mark-VAIO - - [16/Jun/2010 10:43:27] b'[amp:c] [job:b] [lt:e] [name:a] [sect:d]'
mark-VAIO - - [16/Jun/2010 10:43:27] CGI script exited OK
mark-VAIO - - [16/Jun/2010 10:43:30] b'[amp:c] [job:b] [lt:e] [name:a] [sect:d]'
mark-VAIO - - [16/Jun/2010 10:43:30] CGI script exited OK
The moral of this story is that unless you can be sure that the
names of all but the leftmost URL query parameters embedded in HTML
are not the same as the name of any HTML character escape code like
amp, you should generally either use a semicolon as a separator, if
supported by your tools, or run the entire URL through cgi.escape after
escaping its parameter names and values with urllib.parse.quote_plus:
>>> link = 'file.py?name=a&job=b&amp=c&sect=d&lt=e'
# escape for HTML
>>> import cgi
>>> cgi.escape(link)
'file.py?name=a&amp;job=b&amp;amp=c&amp;sect=d&amp;lt=e'
# escape for URL
>>> import urllib.parse
>>> elink = urllib.parse.quote_plus(link)
>>> elink
'file.py%3Fname%3Da%26job%3Db%26amp%3Dc%26sect%3Dd%26lt%3De'
# URL satisfies HTML too: same
>>> cgi.escape(elink)
'file.py%3Fname%3Da%26job%3Db%26amp%3Dc%26sect%3Dd%26lt%3De'
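Note that quote_plus applied to a whole link, as in the session above, also escapes the ? and = characters, so in practice the URL escapes are usually applied to parameter names and values individually before the HTML escape is applied to the assembled URL. A minimal sketch of that two-step recipe follows, using html.escape (the current name for cgi.escape); the html_link helper is a hypothetical name introduced here for illustration, not part of this book’s examples.

```python
import html
from urllib.parse import urlencode

def html_link(script, params, text):
    """Hypothetical helper: hyperlink tag with URL- then HTML-escaped parts."""
    query = urlencode(params)                 # quote_plus on each name and value
    url = script + "?" + query                # & separators still unescaped here
    return '<A HREF="%s">%s</A>' % (html.escape(url), html.escape(text))

pairs = [("name", "a"), ("job", "b"), ("amp", "c"), ("sect", "d"), ("lt", "e")]
link = html_link("file.py", pairs, "hello")
tricky = html_link("file.py", [("name", "a&b"), ("lang", "C++")], "x<y")

print(link)
print(tricky)
```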
Having said that, I should add that some examples in this book
do not escape & URL separators embedded within HTML simply because
their URL parameter names are known not to conflict with HTML escapes.
In fact, this concern is likely to be rare in practice, since your
program usually controls the set of parameter names it expects. This is
not, however, the most general solution, especially if parameter names
may be driven by a dynamic database; when in doubt, escape much and
often.
“How I Learned to Stop Worrying and Love the Web”
Lest the HTML and URL formatting rules sound too clumsy (and
send you screaming into the night!), note that the HTML and URL
escaping conventions are imposed by the Internet itself, not by
Python. (As you’ve learned by now, Python has a different mechanism
for escaping special characters in string constants with
backslashes.) These rules stem from the fact that the Web is based
on the notion of shipping formatted text strings around the planet,
and are almost surely influenced by the tendency of different
interest groups to develop very different notations.
You can take heart, though, in the fact that you often don’t
need to think in such cryptic terms; when you do, Python automates
the translation process with library tools. Just keep in mind that
any script that generates HTML or URLs dynamically probably needs to
call Python’s escaping tools to be robust. We’ll see both the HTML
and the URL escape tool sets employed frequently in later examples
in this chapter and the next. Moreover, web development frameworks
and tools such as Zope and others aim to get rid of some of the
low-level complexities that CGI scripters face. And as usual in
programming, there is no substitute for brains; amazing technologies
like the Internet come at an inevitable cost in complexity.