Parsing Text from the ground up by building a Tokenizer, Parser, AST generator and walker in C++
I'd previously read a text file line by line and used regex to split it into words for frequency analysis in Ruby. Nowadays every major programming language provides libraries for parsing text files, but how do we get from reading in a stream of characters to more abstract concepts like words, lines, or sentences? This happens in roughly two phases. The first is lexical analysis (lexing), where we read a stream of characters from a file and chunk it up into tokens. These tokens can be whatever we want: words, objects, concepts, and so on. In programming languages, for example, open and close brackets are special tokens. The second phase is parsing: once we have a stream of tokens, their order can be used to work out structure. For example, a left bracket token and a right bracket token separate the tokens between them from those outside. We can turn these tokens into everything from English sentences to programming languages, HTML web pages or say she