Although more limited in scope, Python’s html.parser standard library module also supports HTML-specific parsing, useful in “screen scraping” roles to extract information from web pages. Among other things, this parser can be used to process Web replies fetched with the urllib.request module we met in the Internet part of this book, to extract plain text from HTML email messages, and more.
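For instance, fetching a page’s markup might look like the following minimal sketch (a hedged example, assuming network access and a site that serves UTF-8 encoded text); the decoded result is a string ready to be fed to the parser class developed below:
>>> from urllib.request import urlopen
>>> text = urlopen('http://www.python.org').read().decode('utf8')    # HTML reply text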
The html.parser module has an API reminiscent of the XML SAX model of the prior section: it provides a parser which we subclass to intercept tags and their data during a parse. Unlike SAX, we don’t provide a handler class, but extend the parser class directly. Here’s a quick interactive example to demonstrate the basics (I copied all of this section’s code into file htmlparser.py in the examples package if you wish to experiment with it yourself):
>>> from html.parser import HTMLParser
>>> class ParsePage(HTMLParser):
...     def handle_starttag(self, tag, attrs):
...         print('Tag start:', tag, attrs)
...     def handle_endtag(self, tag):
...         print('tag end:  ', tag)
...     def handle_data(self, data):
...         print('data......', data.rstrip())
...
Now, create a web page’s HTML text string; we hardcode one here, but it might also be loaded from a file, or fetched from a website with urllib.request:
>>> page = """
... <html>
... <h1>Spam!</h1>
... <p>Click this <a href='http://www.python.org'>python</a> link</p>
... </html>
... """
Finally, kick off the parse by feeding text to a parser
instance—tags in the HTML text trigger class method callbacks, with tag
names and attribute sequences passed in as arguments:
>>> parser = ParsePage()
>>> parser.feed(page)
data......
Tag start: html []
data......
Tag start: h1 []
data...... Spam!
tag end:   h1
data......
Tag start: p []
data...... Click this
Tag start: a [('href', 'http://www.python.org')]
data...... python
tag end:   a
data......  link
tag end:   p
data......
tag end:   html
As you can see, the parser’s methods receive callbacks for events
during the parse. Much like SAX XML parsing, your parser class will need
to keep track of its state in attributes as it goes if it wishes to do
something more specific than print tag names, attributes, and content.
Watching for specific tags’ content, though, might be as simple as
checking names and setting state flags. Moreover, building object trees
to reflect the page’s structure during the parse would be
straightforward.
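To illustrate, here is a minimal sketch of the state-flag scheme, run against the page string above; the LinkGrabber class and its attribute names are invented for this example, not part of the module’s API:
>>> class LinkGrabber(HTMLParser):
...     def __init__(self):
...         super().__init__()
...         self.inlink = False                 # state flag: inside an <a> tag?
...         self.links = []                     # hrefs collected during the parse
...     def handle_starttag(self, tag, attrs):
...         if tag == 'a':
...             self.inlink = True
...             self.links.append(dict(attrs).get('href'))
...     def handle_endtag(self, tag):
...         if tag == 'a':
...             self.inlink = False
...     def handle_data(self, data):
...         if self.inlink:
...             print('link text:', data)
...
>>> grabber = LinkGrabber()
>>> grabber.feed(page)
link text: python
>>> grabber.links
['http://www.python.org']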
Here’s another HTML parsing example: in Chapter 15, we used a simple method exported by this module to unquote HTML escape sequences (a.k.a. entities) in strings embedded in an HTML reply page:
>>> import cgi, html.parser
>>> s = cgi.escape("1<2 <b>hello</b>")
>>> s
'1&lt;2 &lt;b&gt;hello&lt;/b&gt;'
>>>
>>> html.parser.HTMLParser().unescape(s)
'1<2 <b>hello</b>'
This works for undoing HTML escapes, but that’s all. When we saw
this solution, I implied that there was a more general approach; now
that you know about the method callback model of the HTML parser
class, the more idiomatic way to handle entities during a parse should
make sense—simply catch entity callbacks in a parser subclass, and
translate as needed:
>>> class Parse(html.parser.HTMLParser):
...     def handle_data(self, data):
...         print(data, end='')
...     def handle_entityref(self, name):
...         map = dict(lt='<', gt='>')
...         print(map[name], end='')
...
>>> p = Parse()
>>> p.feed(s); print()
1<2 <b>hello</b>
Better still, we can use Python’s related html.entities module to avoid hardcoding entity-to-character mappings for HTML entities. This module defines many more entity names than the simple dictionary in the prior example and includes all those you’ll likely encounter when parsing HTML text in the wild:
>>> s
'1&lt;2 &lt;b&gt;hello&lt;/b&gt;'
>>>
>>> from html.entities import entitydefs
>>> class Parse(html.parser.HTMLParser):
...     def handle_data(self, data):
...         print(data, end='')
...     def handle_entityref(self, name):
...         print(entitydefs[name], end='')
...
>>> P = Parse()
>>> P.feed(s); print()
1<2 <b>hello</b>
Strictly speaking, the html.entities module is able to map entity name to Unicode code point and vice versa; its table used here simply converts code point integers to characters with chr. See this module’s documentation, as well as its source code in the Python standard library, for more details.
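For instance, a quick interactive sketch of the tables this module exports (the values here reflect the standard HTML entity definitions):
>>> from html.entities import name2codepoint, codepoint2name
>>> name2codepoint['lt']                 # entity name to Unicode code point
60
>>> chr(name2codepoint['lt'])            # and on to the character itself
'<'
>>> codepoint2name[62]                   # the reverse mapping
'gt'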
Now that you understand the basic principles of the HTML parser class in Python’s standard library, the plain text extraction module used by Chapter 14’s PyMailGUI (Example 14-8) will also probably make significantly more sense (this was an unavoidable forward reference which we’re finally able to close).
Rather than repeating its code here, I’ll simply refer you back
to that example, as well as its self-test and test input files, for
another example of HTML parsing in Python to study on your own. It’s
essentially a minor elaboration on the examples here, which detects
more types of tags in its parser callback methods.
Because of space concerns, we have to cut short our treatment of HTML parsing here; as usual, knowing that it exists is enough to get started. For more details on the API, consult the Python library manual. And for additional HTML support, check the Web for the 3.X status of third-party HTML parser packages like those mentioned in Chapter 14.
If you have a background in parsing theory, you may know that neither regular expressions nor string splitting is powerful enough to handle more complex language grammars. Roughly, regular expressions don’t have the stack “memory” required by true language grammars, and so cannot support arbitrary nesting of language constructs—nested if statements in a programming language, for instance. In fact, this is why the XML and HTML parsers of the prior section are required at all: both are languages of potentially arbitrary nesting, which are beyond the scope of regular expressions in general.
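To make the limitation concrete, here is a short sketch with contrived strings: a nongreedy pattern pairs tags correctly in flat text, but simply stops at the first closing tag when the same construct is nested, because patterns cannot count nesting levels:
>>> import re
>>> flat = '<b>one</b> <b>two</b>'
>>> nested = '<b>outer <b>inner</b> trailer</b>'
>>> re.findall('<b>(.*?)</b>', flat)          # flat text: pairs match up
['one', 'two']
>>> re.findall('<b>(.*?)</b>', nested)        # nested text: pairing breaks down
['outer <b>inner']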
From a theoretical perspective, regular expressions are really intended to handle just the first stage of parsing—separating text into components, otherwise known as lexical analysis. Though patterns can often be used to extract data from text, true language parsing requires more. There are a number of ways to fill this gap with Python:
In most applications, the Python language itself can replace custom languages and parsers—user-entered code can be passed to Python for evaluation with tools such as eval and exec. By augmenting the system with custom modules, user code in this scenario has access to both the full Python language and any application-specific extensions required. In a sense, such systems embed Python in Python (a minimal sketch of this technique appears after this list). Since this is a common Python role, we’ll revisit this approach later in this chapter.
For some sophisticated language analysis tasks, though, a full-blown parser may still be required. Such parsers can always be written by hand, but since Python is built for integrating C tools, we can write integrations to traditional parser generator systems such as yacc and bison, tools that create parsers from language grammar definitions. Better yet, we could use an integration that already exists—interfaces to such common parser generators are freely available in the open source domain (run a web search for up-to-date details and links).
In addition, a number of Python-specific parsing systems are available on the Web. Among them: PLY is an implementation of lex and yacc parsing tools in and for Python; the kwParsing system is a parser generator written in Python; PyParsing is a pure-Python class library that makes it easy to build recursive-descent parsers quickly; and the SPARK toolkit is a lightweight system that employs the Earley algorithm to work around technical problems with LALR parser generation (if you don’t know what that means, you probably don’t need to care).
Of special interest to this chapter, YAPPS (Yet Another Python Parser System) is a parser generator written in Python. It uses supplied grammar rules to generate human-readable Python code that implements a recursive descent parser; that is, it’s Python code that generates Python code. The parsers generated by YAPPS look much like (and were inspired by) the handcoded custom expression parsers shown in the next section. YAPPS creates LL(1) parsers, which are not as powerful as LALR parsers but are sufficient for many language tasks. For more on YAPPS, see http://theory.stanford.edu/~amitp/Yapps or search the Web at large.
Even more
demanding language analysis tasks require techniques
developed in artificial intelligence research, such as semantic
analysis and machine learning. For instance, the Natural Language
Toolkit, or NLTK,
is an open source suite of Python libraries and
programs for symbolic and statistical natural language processing.
It applies linguistic techniques to textual data, and it can be used
in the development of natural language recognition software and
systems. For much more on this subject, be sure to also see the O’Reilly book Natural Language Processing with Python, which explores, among other things, ways to use NLTK in Python. Not every system’s users will pose questions in a natural language, of course, but there are many applications which can make good use of such utility.
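To make the first of these options concrete, here is a minimal sketch of embedding Python in Python; the env namespace and user-code strings are invented for illustration: the application publishes its extensions in a dictionary and runs user code against it with exec and eval:
>>> import math
>>> env = {'sqrt': math.sqrt, 'pi': math.pi}     # application-specific extensions
>>> exec("twopi = pi * 2", env)                  # run user statements in the namespace
>>> env['twopi']                                 # results are available to the application
6.283185307179586
>>> eval("sqrt(2) * twopi > 8", env)             # eval handles user expressions
True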
Though widely useful, parser generator systems and natural language analysis toolkits are too complex for us to cover in any sort of useful detail in this text. Consult http://python.org/ or search the Web for more information on language analysis tools available for use in Python programs. For the purposes of this chapter, let’s move on to explore a more basic and manual approach that illustrates concepts underlying the domain—recursive descent parsing.
Lesson 2: Don’t Reinvent the Wheel (Usually)
Speaking of parser generators, to use some of these tools in
Python programs, you’ll need an extension module that integrates them.
The first step in such scenarios should always be to see whether the
extension already exists in the public domain. Especially for common
tools like these, chances are that someone else has already implemented
an integration that you can use off-the-shelf instead of writing one
from scratch.
Of course, not everyone can donate all their extension modules to
the public domain, but there’s a growing library of available components
that you can pick up for free and a community of experts to query. Visit
the
PyPI site at
http://www.python.org
for links to Python software resources, or search the Web at large. With
at least one million Python users out there as I write this book, much
can be found in the prior-art
department
.
Unless, of course, the wheel does not work:
We’ve also seen a handful of cases in this book where standard libraries were either not adequate or were broken altogether. For instance, the Python 3.1 email package issues we explored in Chapter 13 required us to code workarounds of our own. In such cases, you may still ultimately need to code your own infrastructure support. The “not invented here” syndrome can still claim victory when software dependencies break down.
Still, you’re generally better off trying to use the standard support provided by Python in most cases, even if doing so requires manually coded fixes. In the email package example, fixing its problems seems much easier than coding an email parser and generator from scratch—a task far too large to have even attempted in this book. Python’s batteries included approach to development can be amazingly productive—even when some of those batteries require a charge.