Let's Write a Parser in Python With Ply

2020-05-11 Python

Writing a parser can be tedious and, more importantly, a very difficult exercise.

Whenever I have to write a parser in Python (and it happens quite often), I use ply, a lex/yacc library for Python. If you don’t know what lex/yacc is, it will be the subject of future articles.

In this article I will explain how to parse an output file of ply, parsetab.py. The goal is to retrieve the grammar stored in the _lr_productions list and to rewrite it as LaTeX using the grammar environment, producing a .tex file that pretty-prints the grammar.
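To give you an idea of what we are dealing with, here is roughly what the interesting part of a parsetab.py file looks like. The exact contents depend on the version of ply and on your grammar; this excerpt is made up for illustration:

# parsetab.py (illustrative excerpt)
_lr_method = 'LALR'

_lr_productions = [
  ("S' -> expression", "S'", 1, None, None, None),
  ('expression -> expression PLUS term', 'expression', 3, 'p_expression_plus', 'calc.py', 42),
  ('expression -> term', 'expression', 1, 'p_expression_term', 'calc.py', 47),
  ('term -> NUMBER', 'term', 1, 'p_term_number', 'calc.py', 52),
]

Each tuple starts with the production written as a string; that string is all we need to rebuild the grammar.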

Installing ply

Please run in your virtual environment:

pip install ply

Lexing

Tokens

The first step is the lexing. We need to define a class, let’s call it MyLexer, and to define the tokens.

import ply.lex as lex

class MyLexer:
    """
    A lexing class for parsetab.py files.
    """
    tokens = [
        "GRAMMARRULE", # The grammar rule
        "LPAREN", # Left parenthesis
        "RPAREN", # Right parenthesis
    ]

States

You might think that these are very few tokens for such a file. That’s because we are going to define a new state, called data, which will be exclusive:

states = [
    ("data", "exclusive"),
]

An exclusive state completely overrides the default behavior of the lexer: while it is active, only the rules defined for that state apply.

Regular expressions for tokens

The next step is to define the regular expressions for our tokens, starting with the tokens that will be ignored.

Ignored tokens

Some of the tokens we need to ignore are the following:

t_ignore = " \n"                    # Ignores spaces and newlines, except where they are matched by the rules below
t_ignore_COMMENT = r"\#.*"          # Ignores Python comments (useful for the beginning of the file)
t_ignore_LRMETHOD = r"_lr_method.*" # Ignores the `_lr_method = …` line

In the INITIAL state, you need to define ignore tokens for everything else that appears in the input file.
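Depending on your version of ply and on your grammar, the rest of the file contains lines such as _tabversion = …, _lr_signature = …, the (very long) action and goto tables, and a few small loops that rebuild them. Here is a sketch of the kind of extra ignore rules you may need; the token names are mine, and the regular expressions should be adjusted to what your parsetab.py actually contains:

t_ignore_TABVERSION = r"_tabversion\ =\ .*"   # the _tabversion = … line
t_ignore_SIGNATURE = r"_lr_signature\ =\ .*"  # the grammar signature string
t_ignore_TABLE = r"_lr_(action|goto).*"       # the action/goto tables
t_ignore_REBUILD = r"(for|if|del)\ .*"        # the small loops that rebuild those tables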

Data tokens

Let’s define the tokens for the data state. The first one needs to match exactly a grammar rule. You can define it as follows (if I’m correct):

t_data_GRAMMARRULE = r"(\"|\')[A-Za-z][a-zA-Z0-9_\']*\ ->\ .+?(?=,)"
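If you want to convince yourself that this pattern does what we want, you can try it on a (made-up) production line from a parsetab.py file. Note that the match keeps the surrounding quotes; we will strip them later:

import re

pattern = r"(\"|\')[A-Za-z][a-zA-Z0-9_\']*\ ->\ .+?(?=,)"
line = "  ('expression -> expression PLUS term','expression',3,'p_expression_plus','calc.py',42),"
print(re.search(pattern, line).group(0))
# prints: 'expression -> expression PLUS term'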

We also need to define the parenthesis tokens:

t_data_LPAREN = r"\("
t_data_RPAREN = r"\)"

State switching

Since we have two states (INITIAL and data), we need to be able to switch from one to the other:

@staticmethod
def t_begin_data(t):
    r"_lr_productions\ =\ \["
    # Enter the data state when the `_lr_productions = [` line is reached
    t.lexer.push_state("data")

@staticmethod
def t_data_end(t):
    r"\]"
    # Leave the data state at the closing bracket
    t.lexer.pop_state()

Error handling

Errors happen a lot while tokenizing, because your regular expressions aren’t perfect on the first try, because you might not have thought of some corner cases, and so on. To deal (very basically) with errors:

@staticmethod
def t_error(t):
    print("Illegal character in INITIAL state '%s'" % t.value[0])
    t.lexer.skip(1)

@staticmethod
def t_data_error(t):
    print("Illegal character in data state '%s'" % t.value[0])
    t.lexer.skip(1)

In production, you might want to raise a custom-defined error instead of printing the error.
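For instance, you could define your own exception (LexingError is just a name I made up) at the top of lexer.py and raise it from the error rules:

class LexingError(Exception):
    """
    Raised when the lexer hits a character it cannot tokenize.
    """

and, inside MyLexer:

@staticmethod
def t_error(t):
    # Stop immediately instead of skipping the offending character
    raise LexingError("Illegal character %r at position %d" % (t.value[0], t.lexpos))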

Initialisation

Finally, the lexer itself is built in the constructor:

def __init__(self, **kwargs):
    self.lexer = lex.lex(module=self, **kwargs)

Tokenisation functions

For testing purposes, it is useful to have test functions such as:

def tokenize(self, data):
    """
    Tokenize input data to stdout for testing purposes.
    """
    self.lexer.input(data)
    while True:
        tok = self.lexer.token()
        if not tok:
            break
        print(tok)


def tokenize_file(self, filepath):
    """
    Tokenize input file to stdout for testing purposes.
    :param filepath: Input file to tokenize.
    """
    with open(filepath, "r") as content:
        data = content.read()
    return self.tokenize(data)
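You can already check that the lexer behaves as expected (the path below is just an example):

from lexer import MyLexer

lexer = MyLexer()
lexer.tokenize_file("/tmp/parsetab.py")

This prints one LexToken(...) line per recognised token. With the rules defined above, everything that is not matched inside the data state ends up in t_data_error and is skipped one character at a time.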

Parsing

Defining the class

To work, the parser needs to have access to the lexer’s token list.

import ply.yacc as yacc
from lexer import MyLexer

class MyParser:
    """
    Parses parsetab.py files
    """
    def __init__(self, the_lexer=None):
        if the_lexer is None:
            the_lexer = MyLexer()
        self._lexer = the_lexer
        self.tokens = self._lexer.tokens
        self._parser = yacc.yacc(module=self)

Grammar rules

Once this is done, the grammar rules need to be defined:

@staticmethod
def p_relevantdata(p):
    """RelevantData : DataLines"""
    p[0] = p[1]

@staticmethod
def p_datalines(p):
    """DataLines : DataLine
                 | DataLines DataLine"""
    if len(p) == 2:
        p[0] = [p[1]]
    else:
        p[0] = p[1] + [p[2]]

@staticmethod
def p_grammarrule(p):
    """DataLine : LPAREN GRAMMARRULE RPAREN"""
    # Let's discard the parenthesis, we don't need those
    p[0] = p[2]

Error handling

Here’s a quick error handling function. In production, it should raise appropriate errors.

def p_error(self, p):
    """
    Error handling.
    """
    if p:
        print("Syntax error at token ", p)
        # Not a staticmethod: we need self to reach the underlying parser
        self._parser.errok()
    else:
        print("Syntax error at EOF")

Using the parser

We need some functions to use the parser:

def parse(self, data):
    """
    Parses input data.
    """
    return self._parser.parse(data)

def parse_file(self, filepath):
    """
    Parses data stored in file.
    """
    with open(filepath, "r") as content:
        data = content.read()
        result = self._parser.parse(data)
    return result
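With these rules, parsing a parsetab.py file returns a plain Python list with one grammar-rule string per production. For the made-up grammar used earlier, it would look something like this (each element still carries the quotes from the parsetab.py source; they are stripped later, in the AST):

[
    '"S\' -> expression"',
    "'expression -> expression PLUS term'",
    "'expression -> term'",
    "'term -> NUMBER'",
]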

AST and rewriting

So yeah \o/ but a flat list of strings is not really useful on its own. We need to create an AST to represent the data, so that we can rewrite it into LaTeX code.

Basic Node class

First things first: we need a Node class. Here’s a really basic one:

class Node:
    """
    Really really basic Node Class

    Takes into parameter either a list of children
    or a single child
    """

    pretty = "Basic Node"  # overridden in the subclasses below

    def __init__(self, children=None):
        if not children:
            self.children = []
        elif hasattr(children, "__len__"):
            self.children = children
        else:
            self.children = [children]

    def __str__(self):
        return "Node"

    def __repr__(self):
        return self.pretty
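One detail worth noting: a string has a __len__, so if you pass a string as children it is stored as-is, not wrapped in a list and not split into characters. The GrammarRuleNode below relies on this behavior:

Node([Node(), Node()]).children       # a list of two nodes
Node(Node()).children                 # a single node, wrapped: [<Node>]
Node("expression -> term").children   # the string itself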

Creating the required nodes and writing rewrite()

Here’s the interesting part: we need to think about how we want to implement the rewrite() methods on the nodes we define. They need to produce valid LaTeX (and, if possible, human-readable LaTeX). You need to know how the grammar environment of the LaTeX syntax package works.
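As a reminder, and as a preview of the kind of file we want to generate, here is a minimal example for a toy expression grammar: nonterminals go between angle brackets, terminal tokens between double quotes, and \alt adds an alternative to the current rule.

\documentclass{article}
\usepackage{syntax}
\begin{document}
\begin{grammar}

    <expression> ::= <expression> "PLUS" <term>
        \alt <term>

    <term> ::= "NUMBER"

\end{grammar}
\end{document}

With that in mind, here is how I implemented the nodes: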

class RelevantDataNode(Node):
    """
    A node that contains all grammar rules.
    """
    pretty = "Relevant Data"

    def rewrite(self):
        code = "\\documentclass{article}\n"
        code += "\\usepackage{syntax}\n"
        code += "\\begin{document}\n"
        code += "\\begin{grammar}\n"
        previous_left = "S"
        for child in self.children:
            if child.left_member() == previous_left:
                code += child.rewrite_end()
            else:
                previous_left = child.left_member()
                code += child.rewrite_full()
        code += "\\end{grammar}\n\\end{document}\n"
        return code


class GrammarRuleNode(Node):
    """
    A node that contains a grammar rule.
    """
    pretty = "Grammar Rule"

    def left_member(self):
        return self.children.strip("'").strip('"').split(" ")[0]

    def rewrite_full(self):
        data = self.children.strip("'").strip('"').split(" ")
        if data[0].isupper() and data[0] != "S'":
            code = '\n\n    "%s" ::=' % data[0]
        else:
            code = '\n\n    <%s> ::= ' % data[0]
        for child in data[2:]:
            if child.isupper():
                code += ' "%s"' % child
            else:
                code += " <%s>" % child
        code += "\n"
        return code

    def rewrite_end(self):
        data = self.children.strip("'").strip('"').split(" ")
        code = "        \\alt"
        for child in data[2:]:
            if child.isupper():
                code += ' "%s"' % child
            else:
                code += " <%s>" % child
        code += "\n"
        return code

Back to the parser!

At the beginning of the parser file, we need to import our AST module (the file where the nodes above are defined, here called ast.py):

import ast as AST

Then, we need to redefine some p_ functions in order to create the AST:

@staticmethod
def p_relevantdata(p):
    """RelevantData : DataLines"""
    p[0] = AST.RelevantDataNode(p[1])

@staticmethod
def p_grammarrule(p):
    """DataLine : LPAREN GRAMMARRULE RPAREN"""
    # Let's discard the parenthesis, we don't need those
    p[0] = AST.GrammarRuleNode(p[2])

Let’s make it work!

In your IPython session, run:

import myyacc

parser = myyacc.MyParser()
ast = parser.parse_file("/tmp/parsetab.py")
code = ast.rewrite()
with open("/tmp/grammar.tex", "w") as grammar:
    grammar.write(code)

And then compile your grammar.tex file and see if you have the correct output!
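For example, with pdflatex:

pdflatex /tmp/grammar.tex

and open the resulting grammar.pdf to admire your pretty-printed grammar.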