Parsing Text from the ground up by building a Tokenizer, Parser, AST generator and walker in C++

I'd previously read a text file line by line, and used regex to split it up into words to do frequency analysis with Ruby. Nowadays every major programming language provides libraries for parsing text files, but how do we get from reading in a stream of characters to more abstract concepts like words, lines, or sentences?

This happens in something like two parts, the first is Lexical analysis with the lexing phase, where we read a stream of characters from a file, and chunk it up into tokens. These tokens can be whatever we want, words, objects, concepts etc. In programming, open and close brackets are special tokens for example.

Once we have a stream of tokens, we can then parse them, where the order of the tokens can be used to understand something, for example, a left bracket token and a right bracket token group tokens within the two from those without. We can turn these tokens into everything from English sentences to programming languages, HTML web pages or say sheet music.

Character Class

In college as an assignment we were tasked to build a lexical parsing system from the ground up that can be configured for arbitrary rules of tokenization by providing an input specification file (that has to be parsed also!) like this:

$DIGIT [0-9]
$CHAR [a-zA-Z]
$UPPER [^a-z] IN $CHAR
These labels $DIGIT or $UPPER are examples of character classes for identifying individual characters coming in on a stream.

The first step is to parse this somewhat human readable file with special code like [a-Z] or [^0] IN $DIGIT into a list of all possible characters, so we need an understanding of the special characters and expansion rules. Our goal output in this case with C++ is a HashMap<String, HashSet>:

$UPPER(26) : [D, E, F, G, A, B, C, L, M, N, O, H, I, J, K, U, T, W, V, Q, P, S, R, Y, X, Z]
$DIGIT(10) : [3, 2, 1, 0, 7, 6, 5, 4, 9, 8]
$CHAR(52) : [D, E, F, G, A, B, C, L, M, N, O, H, I, J, K, U, T, W, V, Q, P, S, R, Y, X, Z, f, g, d, e, b, c, a, n, o, l, m, j, k, h, i, w, v, u, t, s, r, q, p, z, y, x]
$LOWER(26) : [f, g, d, e, b, c, a, n, o, l, m, j, k, h, i, w, v, u, t, s, r, q, p, z, y, x]
$NON-ZERO(9) : [3, 2, 1, 7, 6, 5, 4, 9, 8]
With a HashSet there's not order, but we get O(1) IO operations when checked using the next stage, which is building a DFA (Deterministic Finite Automaton) table.

Identifier Parsing

Now we have a specification for the character class, ie. what type of character coming in ($DIGIT for example), we then define how to identify tokens as combinations of these, for example a state is two upper case characters, and an ID is a letter and a number:
A parsing system then converts things like that, for example ($DIGIT)+ \. ($DIGIT)+ into:
We provide a set of logic for these expressions, the parsing phase:

<expr>  = <term> '|' <expr>  |  <term> | <term> <expr>
<term>  = <base> <count>
<count> = '*' | '+' | <EPS>
<base>  = <char> |  '\' <char>   |  '(' <expr> ')'  
From this, we can understand our specification file on a character by character basis:

Trying to Recursively Parse '$LOWER($LOWER|$DIGIT)*'...
Finished Recursive Parse.

Now that we can parse these rules, we need to place these rules into some sort of system for reading actual input text files. One way is using a DFA table.

A DFA Table can just be thought of as a state transition table, where based on your current position in the table (your state), based on the next character in the sequence, it tells you where to move in the table. Eventually ending up on an identifier of the type of token you just lexed.

From all of this we can now write scripts the program can parse and run, for example providing an input file:

begin matches = find '[A-Za-z]*ment[A-Za-z]*' in "file1.txt" inters find '(A|a)[A-Za-z]*' in "file2.txt";
n_matches = #matches;
print (n_matches);
replace '[A-Za-z]*ment' with "" in "file1.txt" >! "file3.txt";

Which can be parsed with the following input specification file:

$LETTER [a-zA-Z] $DIGIT [0-9]
$ASCII !#$%&()*+,-./:;<=>?@[\\]^_`{|}~
$REGEX '(_|\(|\)|\||-|$LETTER|$DIGIT|$ASCII|[|]|{|}|\+|\*)*'
$FIND find
$BEGIN begin
$END end
$REPLACE replace
$WITH with
$RECURSIVE_REPLACE recursivereplace
$IN in
$INTERS inters
$PRINT print
$UNION union|\|
$MAXFREQSTRING maxfreqstring

Source code on Github

Popular posts from this blog

Building a PID hover controller for Kerbal Space Program with kOS and IPython Notebook

Learning TensorFlow #1 - Using Computer Vision to turn a Chessboard image into chess tiles

Learning TensorFlow #2 - Predicting chess pieces from images using a single-layer classifier