Although toolkits
abound in this domain, Python’s status as a general-purpose
programming language also makes it a reasonable vehicle for writing
hand-coded parsers for custom language analysis tasks. For
instance
,
recursive descent parsing
is a fairly
well-known technique for analyzing language-based information. Though not
as powerful as some language tools, recursive descent parses are
sufficient for a wide variety of language-related goals.
To illustrate, this section develops a custom parser for a simple
grammar—it parses and evaluates arithmetic expression strings. Though
language analysis is the main topic here, this example also demonstrates
the utility of Python as a general-purpose programming language. Although
Python is often used as a frontend or rapid development language in
tactical modes, it’s also often useful for the kinds of strategic work you
may have formerly done in a systems development language such as C or
C++.
The grammar
that our parser will recognize can be described as
follows:
goal ->END [number, variable, ( ]
goal ->END [set]
assign -> 'set'[set]
expr ->[number, variable, ( ]
expr-tail -> ^ [END, ) ]
expr-tail -> '+'[+]
expr-tail -> '-'[-]
factor ->[number, variable, ( ]
factor-tail -> ^ [+, -, END, ) ]
factor-tail -> '*'[*]
factor-tail -> '/'[/]
term ->[number]
term ->[variable]
term -> '('')' [(]
tokens: (, ), num, var, -, +, /, *, set, end
This is a fairly typical grammar for a simple expression language,
and it allows for arbitrary expression nesting (some example expressions
appear at the end of thetestparser
module
listing in
Example 19-15
).
Strings to be parsed are either an expression or an assignment to a
variable name (set
). Expressions
involve numbers, variables, and the operators+
,−
,*
, and/
. Becausefactor
is nested inexpr
in the grammar,*
and/
have higher precedence (i.e., they bind tighter) than+
and−
.
Expressions can be enclosed in parentheses to override precedence, and
all operators are left associative—that is, they group on the left
(e.g.,1-2-3
is treated the same as(1-2)-3
).
Tokens
are just the most primitive components
of the expression language. Each grammar rule listed earlier is followed
in square brackets by a list of tokens used to select it. In recursive
descent parsing, we determine the set of tokens that can possibly start
a rule’s substring, and we use that information to predict which rule
will work ahead of time. For rules that iterate (the-tail
rules), we use the set of possibly
following tokens to know when to stop. Typically, tokens are recognized
by a string processor (a
scanner
), and a
higher-level processor (a
parser
) uses the token
stream to predict and step through grammar rules and substrings.
The system is
structured as two modules, holding two classes:
The scanner handles low-level character-by-character
analysis.
The parser embeds a scanner and handles higher-level grammar
analysis.
The parser is also responsible for computing the expression’s
value and testing the system. In this version, the parser evaluates the
expression while it is being parsed. To use the system, we create a
parser with an input string and call itsparse
method. We can also callparse
again later with a new expression
string.
There’s a deliberate division of labor here. The scanner extracts
tokens from the string, but it knows nothing about the grammar. The
parser handles the grammar, but it is naive about the string itself.
This modular structure keeps the code relatively simple. And it’s
another example of the object-oriented programming (OOP) composition
relationship at work: parsers embed and delegate to scanners.
The module in
Example 19-13
implements the
lexical analysis task—detecting the expression’s basic tokens by
scanning the text string left to right on demand. Notice that this is
all straightforward logic; such analysis can sometimes be performed with
regular expressions instead (described earlier), but the pattern needed
to detect and extract tokens in this example would be too complex and
fragile for my tastes. If your tastes vary, try recoding this module
withre
.
Example 19-13. PP4E\Lang\Parser\scanner.py
"""
###############################################################################
the scanner (lexical analyser)
###############################################################################
"""
import string
class SyntaxError(Exception): pass # local errors
class LexicalError(Exception): pass # used to be strings
class Scanner:
def __init__(self, text):
self.next = 0
self.text = text + '\0'
def newtext(self, text):
Scanner.__init__(self, text)
def showerror(self):
print('=> ', self.text)
print('=> ', (' ' * self.start) + '^')
def match(self, token):
if self.token != token:
raise SyntaxError(token)
else:
value = self.value
if self.token != '\0':
self.scan() # next token/value
return value # return prior value
def scan(self):
self.value = None
ix = self.next
while self.text[ix] in string.whitespace:
ix += 1
self.start = ix
if self.text[ix] in ['(', ')', '-', '+', '/', '*', '\0']:
self.token = self.text[ix]
ix += 1
elif self.text[ix] in string.digits:
str = ''
while self.text[ix] in string.digits:
str += self.text[ix]
ix += 1
if self.text[ix] == '.':
str += '.'
ix += 1
while self.text[ix] in string.digits:
str += self.text[ix]
ix += 1
self.token = 'num'
self.value = float(str)
else:
self.token = 'num'
self.value = int(str) # subsumes long() in 3.x
elif self.text[ix] in string.ascii_letters:
str = ''
while self.text[ix] in (string.digits + string.ascii_letters):
str += self.text[ix]
ix += 1
if str.lower() == 'set':
self.token = 'set'
else:
self.token = 'var'
self.value = str
else:
raise LexicalError()
self.next = ix
The parser module’s class creates and embeds a scanner for its
lexical chores and handles interpretation of the expression grammar’s
rules and evaluation of the expression’s result, as shown in
Example 19-14
.
Example 19-14. PP4E\Lang\Parser\parser1.py
"""
################################################################################
the parser (syntax analyser, evaluates during parse)
################################################################################
"""
class UndefinedError(Exception): pass
from scanner import Scanner, LexicalError, SyntaxError
class Parser:
def __init__(self, text=''):
self.lex = Scanner(text) # embed a scanner
self.vars = {'pi': 3.14159} # add a variable
def parse(self, *text):
if text: # main entry-point
self.lex.newtext(text[0]) # reuse this parser?
try:
self.lex.scan() # get first token
self.Goal() # parse a sentence
except SyntaxError:
print('Syntax Error at column:', self.lex.start)
self.lex.showerror()
except LexicalError:
print('Lexical Error at column:', self.lex.start)
self.lex.showerror()
except UndefinedError as E:
name = E.args[0]
print("'%s' is undefined at column:" % name, self.lex.start)
self.lex.showerror()
def Goal(self):
if self.lex.token in ['num', 'var', '(']:
val = self.Expr()
self.lex.match('\0') # expression?
print(val)
elif self.lex.token == 'set': # set command?
self.Assign()
self.lex.match('\0')
else:
raise SyntaxError()
def Assign(self):
self.lex.match('set')
var = self.lex.match('var')
val = self.Expr()
self.vars[var] = val # assign name in dict
def Expr(self):
left = self.Factor()
while True:
if self.lex.token in ['\0', ')']:
return left
elif self.lex.token == '+':
self.lex.scan()
left = left + self.Factor()
elif self.lex.token == '-':
self.lex.scan()
left = left - self.Factor()
else:
raise SyntaxError()
def Factor(self):
left = self.Term()
while True:
if self.lex.token in ['+', '-', '\0', ')']:
return left
elif self.lex.token == '*':
self.lex.scan()
left = left * self.Term()
elif self.lex.token == '/':
self.lex.scan()
left = left / self.Term()
else:
raise SyntaxError()
def Term(self):
if self.lex.token == 'num':
val = self.lex.match('num') # numbers
return val
elif self.lex.token == 'var':
if self.lex.value in self.vars.keys(): # keys(): EIBTI!
val = self.vars[self.lex.value] # look up name's value
self.lex.scan()
return val
else:
raise UndefinedError(self.lex.value)
elif self.lex.token == '(':
self.lex.scan()
val = self.Expr() # sub-expression
self.lex.match(')')
return val
else:
raise SyntaxError()
if __name__ == '__main__':
import testparser # self-test code
testparser.test(Parser, 'parser1') # test local Parser
If you study this code closely, you’ll notice that the parser
keeps a dictionary (self.vars
) to
manage variable names: they’re stored in the dictionary on aset
command and are fetched from it when they
appear in an expression. Tokens are represented as strings, with an
optional associated value (a numeric value for numbers and a string for
variable names).
The parser uses iteration (while
loops) rather than recursion for theexpr-tail
andfactor-tail
rules. Other than this
optimization, the rules of the grammar map directly onto parser methods:
tokens become calls to the scanner, and nested rule references become
calls to other methods.
When the file
parser1.py
is run as a
top-level program, its self-test code is executed, which in turn simply
runs a canned test in the module shown in
Example 19-15
. Notice how the
scanner converts numbers to strings withint
; this ensures that all integer math
invoked by the parser supports unlimited precision, simply because it
uses Python integers which always provide the extra precision if needed
(the separate Python 2.Xlong
type
and syntax is no more).
Also notice that mixed integer/floating-point operations cast up
to floating point since Python operators are used to do the actual
calculations along the way. The expression language’s/
division operator also inherits Python 3.X’s
true division model which retains remainders and returns floating point
results regardless of operand types. We could simply run a//
in our evaluation logic to retain prior
behavior (or allow for both/
and//
in our grammar), but we’ll follow
Python 3.X’s lead here.
Example 19-15. PP4E\Lang\Parser\testparser.py
"""
############################################################################
parser test code
############################################################################
"""
def test(ParserClass, msg):
print(msg, ParserClass)
x = ParserClass('4 / 2 + 3') # allow different Parsers
x.parse()
x.parse('3 + 4 / 2') # like eval('3 + 4 / 2')...
x.parse('(3 + 4) / 2') # 3.X: / is now true div
x.parse('4 / (2 + 3)') # // is not supported (yet)
x.parse('4.0 / (2 + 3)')
x.parse('4 / (2.0 + 3)')
x.parse('4.0 / 2 * 3')
x.parse('(4.0 / 2) * 3')
x.parse('4.0 / (2 * 3)')
x.parse('(((3))) + 1')
y = ParserClass()
y.parse('set a 4 / 2 + 1')
y.parse('a * 3')
y.parse('set b 12 / a')
y.parse('b')
z = ParserClass()
z.parse('set a 99')
z.parse('set a a + 1')
z.parse('a')
z = ParserClass()
z.parse('pi')
z.parse('2 * pi')
z.parse('1.234 + 2.1')
def interact(ParserClass): # command-line entry
print(ParserClass)
x = ParserClass()
while True:
cmd = input('Enter=> ')
if cmd == 'stop':
break
x.parse(cmd)
Correlate the following results to print call statements in the
self-test module:
C:\...\PP4E\Lang\Parser>python parser1.py
parser1
5.0
5.0
3.5
0.8
0.8
0.8
6.0
6.0
0.666666666667
4
9.0
4.0
100
3.14159
6.28318
3.334
As usual, we can also test and use the system interactively to
work through more of its utility:
C:\...\PP4E\Lang\Parser>python
>>>import parser1
>>>x = parser1.Parser()
>>>x.parse('1 + 2')
3
Error cases are trapped and reported in a fairly friendly fashion
(assuming users think in zero-based terms):
>>>x.parse('1 + a')
'a' is undefined at column: 4
=> 1 + a
=> ^
>>>x.parse('1+a+2')
'a' is undefined at column: 2
=> 1+a+2
=> ^
>>>x.parse('1 * 2 $')
Lexical Error at column: 6
=> 1 * 2 $
=> ^
>>>x.parse('1 * - 1')
Syntax Error at column: 4
=> 1 * - 1
=> ^
>>>x.parse('1 * (9')
Syntax Error at column: 6
=> 1 * (9
=> ^
>>>x.parse('1 + 2 / 3')
# 3.X division change
1.66666666667
>>>x.parse('1 + 2 // 3')
Syntax Error at column: 7
=> 1 + 2 // 3
=> ^
Lesson 3: Divide and Conquer
As the custom parser system demonstrates, modular program design
is almost always a major win. By using Python’s program structuring
tools (functions, modules, classes, and so on), big tasks can be
broken down into small, manageable parts that can be coded and tested
independently.
For instance, the scanner can be tested without the parser by
making an instance with an input string and calling itsscan
ormatch
methods repeatedly. We can even test
it like this interactively, from Python’s command line. When we
separate programs into logical components, they become easier to
understand and modify. Imagine what the parser would look like if the
scanner’s logic was embedded rather than called.
Pathologically big numbers are handled well, too, because Python’s
built-in objects and operators are used along the way:
>>>x.parse('888888888888888888888888888888888888888888888.9999999')
8.88888888889e+44
>>>x.parse('99999999999999999999999999999999999999999 + 2')
100000000000000000000000000000000000000001
>>>x.parse('999999999999999999999999999999.88888888888 + 1.1')
1e+30
In addition, there is an interactive loop interface in thetestparser
module if you want to use
the parser as a simple command-line calculator (or if you get tired of
typing parser method calls). Pass theParser
class, sotestparser
can make one of its own:
>>>import testparser
>>>testparser.interact(parser1.Parser)
Enter=>4 * 3 + 5
17
Enter=>5 + 4 * 3
17
Enter=>(5 + 4) * 3
27
Enter=>set a 99
Enter=>set b 66
Enter=>a + b
165
Enter=># + 1
Lexical Error at column: 0
=> # + 1
=> ^
Enter=>a * b + c
'c' is undefined at column: 8
=> a * b + c
=> ^
Enter=>a * b * + c
Syntax Error at column: 8
=> a * b * + c
=> ^
Enter=>a
99
Enter=>a * a * a
970299
Enter=>stop
>>>