Writing a parser can be tedious and, more importantly, a very difficult exercise.
Whenever I have to write a parser in Python (and it happens quite often) I use ply, a lex/yacc library for Python. If you don’t know what lex/yacc is, it will be the subject of articles in the near future.
In this article I will explain how to parse an output file of ply: parsetab.py. The goal is to retrieve the grammar stored in the _lr_productions list and to rewrite it in LaTeX using the grammar environment, producing a .tex file that pretty-prints the grammar.
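To give you an idea of the input, here is roughly what the relevant part of a parsetab.py file looks like. The exact tuple contents vary with the ply version and the grammar, so treat this excerpt as illustrative only:

_lr_productions = [
    ("S' -> statement", "S'", 1, None, None, 0),
    ("statement -> expression", "statement", 1, 'p_statement', 'calc.py', 23),
    ("expression -> expression PLUS term", "expression", 3, 'p_expression', 'calc.py', 31),
]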
Installing ply
Please run in your virtual environment:
pip install ply
Lexing
Tokens
The first step is the lexing. We need to define a class, let’s call it MyLexer, and declare its tokens.
import ply.lex as lex


class MyLexer:
    """
    A lexing class for parsetab.py files.
    """

    tokens = [
        "GRAMMARRULE",  # The grammar rule
        "LPAREN",       # Left parenthesis
        "RPAREN",       # Right parenthesis
    ]
States
You might think these are very few tokens for such a file. That’s because we are going to define a new state called data that will be exclusive:
    states = [
        ("data", "exclusive"),
    ]
An exclusive state completely overrides the behavior of the lexer: while it is active, only the rules defined for that state apply.
Regular expressions for tokens
The next step is to define our token rules, starting with the tokens that will be ignored.
Ignored tokens
Some of the tokens we need to ignore are the following:
    t_ignore = " \n"  # Ignores spaces and \n, unless they are matched by a more specific rule
    t_ignore_COMMENT = r"\#.*"  # Ignores the Python comments (useful for the beginning of the file)
    t_ignore_LRMETHOD = r"_lr_method.*"  # Ignores the `_lr_method = ...` line
In the initial state, you need to define ignore rules for everything else that appears in the input file, as sketched below.
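Depending on your ply version, parsetab.py contains a few other assignments (for instance a `_tabversion` line, a `_lr_signature` line, and the action/goto tables); they can be skipped the same way. The rule names and regular expressions below are assumptions to adapt to your own file:

    t_ignore_TABVERSION = r"_tabversion.*"    # assumed: skips the `_tabversion = ...` line
    t_ignore_SIGNATURE = r"_lr_signature.*"   # assumed: skips the `_lr_signature = ...` line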
Data tokens
Let’s define the tokens of the data state. Note that their names must carry the state name (t_data_...), otherwise ply will not attach them to that state. The first one needs to match exactly the grammar rule string; you can define it as follows (if I’m correct):
    t_data_GRAMMARRULE = r"(\"|\')[A-Za-z][a-zA-Z0-9_\']*\ ->\ .+?(?=,)"
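To get a feeling for what this regular expression captures, here is a quick check with the standard re module; the sample line is made up and only mimics a _lr_productions entry:

import re

pattern = r"(\"|\')[A-Za-z][a-zA-Z0-9_\']*\ ->\ .+?(?=,)"
# Hypothetical line, shaped like an entry of _lr_productions:
line = """  ("expression -> expression PLUS term","expression",3,'p_expression','calc.py',31),"""
print(re.search(pattern, line).group(0))
# prints: "expression -> expression PLUS term"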
We also need to define these two tokens:
    t_data_LPAREN = r"\("
    t_data_RPAREN = r"\)"
State switching
Since we have two states (INITIAL and data), we need to be able to switch from one to the other:
    @staticmethod
    def t_begin_data(t):
        r"_lr_productions\ =\ \["
        # We just reached the _lr_productions list: switch to the data state.
        t.lexer.push_state("data")

    @staticmethod
    def t_data_end(t):
        r"\]"
        # End of the list: go back to the INITIAL state.
        t.lexer.pop_state()
Error handling
Errors happen a lot while tokenizing: your regular expressions aren’t perfect on the first try, you might not have thought of some corner cases, and so on. To deal with errors (very basically):
    @staticmethod
    def t_error(t):
        print("Illegal character in INITIAL state '%s'" % t.value[0])
        t.lexer.skip(1)

    @staticmethod
    def t_data_error(t):
        print("Illegal character in data state '%s'" % t.value[0])
        t.lexer.skip(1)
In production, you might want to raise a custom exception instead of just printing the error.
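Here is a minimal sketch of what that could look like; LexingError is a made-up exception name, not something ply provides:

class LexingError(Exception):
    """Raised when the lexer meets an unexpected character (made-up exception)."""


class MyLexer:
    # ... tokens, states and rules as above ...

    @staticmethod
    def t_error(t):
        # t.lexpos is the position of the offending character in the input.
        raise LexingError("Illegal character %r at position %d" % (t.value[0], t.lexpos))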
Initialisation
    def __init__(self, **kwargs):
        # Build the lexer from the rules defined in this class.
        self.lexer = lex.lex(module=self, **kwargs)
Tokenisation functions
For testing purposes, it is useful to have helper functions such as:
    def tokenize(self, data):
        """
        Tokenize input data to stdout for testing purposes.
        """
        self.lexer.input(data)
        while True:
            tok = self.lexer.token()
            if not tok:
                break
            print(tok)

    def tokenize_file(self, filepath):
        """
        Tokenize input file to stdout for testing purposes.

        :param filepath: Input file to tokenize.
        """
        with open(filepath, "r") as content:
            data = content.read()
        return self.tokenize(data)
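With that in place you can already exercise the lexer on its own (the path is just an example):

lexer = MyLexer()
lexer.tokenize_file("/tmp/parsetab.py")  # prints the recognised LexTokens (plus error messages for skipped characters)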
Parsing
Defining the class
To work, the parser needs access to the lexer’s token list.
import ply.yacc as yacc

from lexer import MyLexer


class MyParser:
    """
    Parses parsetab.py files.
    """

    def __init__(self, the_lexer=None):
        if the_lexer is None:
            the_lexer = MyLexer()
        self._lexer = the_lexer
        # yacc needs to see the token list, so expose it on the instance.
        self.tokens = self._lexer.tokens
        self._parser = yacc.yacc(module=self)
Grammar rules
Once this is done, the grammar rules need to be defined:
    @staticmethod
    def p_relevantdata(p):
        """RelevantData : DataLines"""
        p[0] = p[1]

    @staticmethod
    def p_datalines(p):
        """DataLines : DataLine
                     | DataLines DataLine"""
        if len(p) == 2:
            p[0] = [p[1]]
        else:
            p[0] = p[1] + [p[2]]

    @staticmethod
    def p_grammarrule(p):
        """DataLine : LPAREN GRAMMARRULE RPAREN"""
        # Let's discard the parentheses, we don't need those
        p[0] = p[2]
Error handling
Here’s a quick error handling function. In production, it should raise appropriate errors.
    def p_error(self, p):
        """
        Error handling.
        """
        if p:
            print("Syntax error at token ", p)
            # Skip the offending token and tell the parser it can resume.
            self._parser.errok()
        else:
            print("Syntax error at EOF")
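As with the lexer, a production version would rather raise a dedicated exception; ParsingError below is a made-up name, shown only as a sketch:

class ParsingError(Exception):
    """Raised when the parser meets an unexpected token (made-up exception)."""


class MyParser:
    # ... __init__ and grammar rules as above ...

    def p_error(self, p):
        if p:
            raise ParsingError("Syntax error at token %s" % p)
        raise ParsingError("Syntax error at EOF")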
Using the parser
We need some functions to use the parser:
    def parse(self, data):
        """
        Parses input data.
        """
        return self._parser.parse(data)

    def parse_file(self, filepath):
        """
        Parses data stored in a file.
        """
        with open(filepath, "r") as content:
            data = content.read()
        return self.parse(data)
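At this stage the parser simply returns the list of rule strings matched by GRAMMARRULE, so a first test could look like this (path and output are illustrative):

parser = MyParser()
rules = parser.parse_file("/tmp/parsetab.py")
for rule in rules:
    print(rule)  # e.g. "statement -> expression" (the surrounding quotes are still part of the string)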
AST and rewriting
So yeah \o/, but that flat list is not really useful on its own. We need to create an AST to represent the data so that we can rewrite it into LaTeX code.
Basic Node class
First things first: we need a Node class. Here’s a really basic one:
class Node:
    """
    Really, really basic Node class.
    Takes as parameter either a list of children or a single child.
    """

    # Class attribute, so that subclasses can simply override it.
    pretty = "Basic Node"

    def __init__(self, children=None):
        if not children:
            self.children = []
        elif hasattr(children, "__len__"):
            self.children = children
        else:
            self.children = [children]

    def __str__(self):
        return "Node"

    def __repr__(self):
        return self.pretty
Creating the required nodes and writing rewrite()
Here’s the interesting part: we need to think about how we want to implement the rewrite() functions in every node we define. They need to produce valid LaTeX (and, if possible, human-readable LaTeX). Here’s how I implemented it (you need to know how the grammar environment of the LaTeX syntax package works):
class RelevantDataNode(Node):
    """
    A node that contains all grammar rules.
    """

    pretty = "Relevant Data"

    def rewrite(self):
        code = "\\documentclass{article}\n"
        code += "\\usepackage{syntax}\n"
        code += "\\begin{document}\n"
        code += "\\begin{grammar}\n"
        previous_left = "S"
        for child in self.children:
            if child.left_member() == previous_left:
                # Same left-hand side as the previous rule: only add an alternative.
                code += child.rewrite_end()
            else:
                previous_left = child.left_member()
                code += child.rewrite_full()
        code += "\\end{grammar}\n\\end{document}\n"
        return code
class GrammarRuleNode(Node):
    """
    A node that contains a grammar rule.
    """

    pretty = "Grammar Rule"

    def left_member(self):
        return self.children.strip("'").strip('"').split(" ")[0]

    def rewrite_full(self):
        data = self.children.strip("'").strip('"').split(" ")
        if data[0].isupper() and data[0] != "S'":
            code = '\n\n "%s" ::=' % data[0]
        else:
            code = '\n\n <%s> ::= ' % data[0]
        for child in data[2:]:
            if child.isupper():
                code += ' "%s"' % child
            else:
                code += " <%s>" % child
        code += "\n"
        return code

    def rewrite_end(self):
        data = self.children.strip("'").strip('"').split(" ")
        code = " \\alt"
        for child in data[2:]:
            if child.isupper():
                code += ' "%s"' % child
            else:
                code += " <%s>" % child
        code += "\n"
        return code
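As a sanity check, here is what a single rule rewrites to; the rule string is made up and the exact spacing may differ:

node = GrammarRuleNode("'expression -> expression PLUS term'")
print(node.rewrite_full())
# <expression> ::=  <expression> "PLUS" <term>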
Back to the parser!
At the beginning of the parser file, we need to import the AST module:
import ast as AST
Then, we need to redefine some p_ functions in order to build the AST:
    @staticmethod
    def p_relevantdata(p):
        """RelevantData : DataLines"""
        p[0] = AST.RelevantDataNode(p[1])

    @staticmethod
    def p_grammarrule(p):
        """DataLine : LPAREN GRAMMARRULE RPAREN"""
        # Let's discard the parentheses, we don't need those
        p[0] = AST.GrammarRuleNode(p[2])
Let’s make it work!
In your IPython session, run:
import myyacc

parser = myyacc.MyParser()
ast = parser.parse_file("/tmp/parsetab.py")
code = ast.rewrite()
with open("/tmp/grammar.tex", "w") as grammar:
    grammar.write(code)
And then compile your grammar.tex file and see if you have the correct output!
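For instance, with pdflatex:

pdflatex /tmp/grammar.tex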