Trying out flex and bison to process QIFs

by mark | 6 Nov 2022, 9:51 p.m.

I have gotten hacked off enough with QIFs to try writing my own parsing library for them. That was a bigger job than I thought. What I ended up doing was figuring out how to use flex and bison to convert QIFs to an abstract tree, which I can then create QIF objects from, and then I can do some operations on these objects before saving them to disk. For my sins I decided to use C++ for this and I didn't fancy dealing with boost::spirit as I haven't handled C++ for a while.

I am going to start off with some conceptual stuff first, then move on to talking about more gnarly details. This might take a few posts.

What is Flex?

Flex is the Fast Lexical Analyser. Immediately begs two questions: what is a Lexical Analyser, and why is this particular one Fast? A lexical analyser is, very simply, a program that consumes an input stream (like a text file) and recognises lexemes defined by regular expressions within it. This sounds like nerdy nonsense, because it is. If you have a string of text like this, you can say there are two kinds of lexemes, words and spaces. Words are [a-zA-Z] and spaces are literal spaces. A lexical analyser is a function that can be called repeatedly; it will return a bunch of lexemes (WORD SPACE WORD SPACE WORD SPACE .... WORD EndOfFile) and the text that generated them ('a', ' ', 'string', ' ', ...). What Flex does is produce a C++ object which will consume a std::istream and return the lexemes one at a time to the programmer.

There was, a long time ago, a program called lex, written by one of the guys who went on to found Google (really). Flex is lex with some extra features and a really cunning internal algorithm that means it generates fast lexers.

Regular expressions by themselves can't express most structured text out there, but the lexemes produced by flex are the first step in processing structured text. Very simple programs can be produced in flex though; for example, the venerable unix word counter wc has several implementations visible here from the flex maintainers. I like the following as an example. There are White Space (i.e. spaces and tabs) and Non White Space (excludes newline) character sets defined, which are then used to define what a word is - optional intial whitespace followed by actual characters. What happens when each of the lexemes is seen is coded up; basically character and word counters are increased. A cunning thing is done to track multiplie sequential newlines.

At the end of the input file the counters are printed and the program terminates. This shouldn't be hard to follow. (The stuff with yy at the start is flex specific; yyleng is how long the matched lexeme is).

ws    [ \t]
nonws [^ \t\n]
word  {ws}*{nonws}+

%option main noyywrap
%%
	int cc = 0, wc = 0, lc = 0;

{word}{ws}*	    cc += yyleng; ++wc;
{word}{ws}*\n	cc += yyleng; ++wc; ++lc;

{ws}+	    	cc += yyleng;

\n+	        	cc += yyleng; lc += yyleng;

<<EOF>>		{
		printf( "%8d %8d %8d\n", lc, wc, cc );
		yyterminate();
		}

Copy the above to a file called wc.l and type make wc. Pipe a text file to its standard input and out pops some counts. It's mad. (Also the make tool knows how to make executables from complete .l files which is madder). The make tool is likely to call lex, not flex, but it is likely that your lex is actually flex as flex is open source.

If you capture the c file that lex produces and inspect it you can... well you can't see what it is doing as it is designed for compilers, not humans. It produces a state machine which rapidly goes through the input until it matches a regex. It keeps going until it can't add any more characters, at which point it stops and returns the lexeme. You need to call it again to get the next lexeme.

The example above is C but it can produce C++ lexing objects. More on that later. Flex is a useful tool if you want to do something with regular expressions and it can be made incredibly fast with some care (e.g. - an optimisation is to know that maybe 10 words is the usual max for a line, so enumerate those 10 cases as switching from lexing to the actions is relatively expensive).

What is bison?

Flex is a program that creates lexers - a function that returns lexemes. Bison is a program that creates parsers - programs which consume lexemes and produce parse trees.

Eh?

Essentially any sensible structured format can be expressed as a tree. For a simple structure like QIF you have the whole file as a root node, with individual records as children of the root, with individual fields as children of the records. For more complex text files, like computer source code, you have very deep parse trees for dealing with nested loops and so on. You could try it with XML subsets, or maybe JSON.

Bison has an interesting history. Way back in the 1970s there was a program called yacc ("Yet Another Compiler Compiler"). It let people describe a programming languange and how to interpret the various structures that result. It is relatively straightforward to produce an interactive calculator in it, for example, and it's a (biggish) step from that to being able to compile a set of calculator operations into a runnable program. As yacc's name implies, it ended up being used everywhere to generate parsers for structured inputs and it has an incestuous relationship with lex. It was not open source, so the GNU guys went for a three way pun and started one called bison instead. No matter which ungulate they're named after, parser generators are one of those programs that no-one has heard of them except for huge nerds but they've been used to make uncountable billions of executables over the past decades.

The power of bison is that you can input the formal grammatical structure of your input in terms of lexemes ('terminal symbols') and groups thereof ('nonterminal symbols'). You can literally say, for example:

file: lists;
lists: list | lists list;
list: HEADER records;
records: record | records record;
record: data SEPERATOR;
data: DATUM | data DATUM;

in the bison file to express the simple listy structure that you find in QIFs. (QIFs are actually a bit more complex... but later). But here we only need three things from the input itself: we need to know what a HEADER is (in QIFs, these are rows beginning with !, like !Account); what a SEPERATOR is (these are the ^ rows), and the actual data rows (everything else - note we don't, at this stage, try cutting up data rows into their leading character and the payload yet).

Of course it is not that simple. We need to tell bison to build a tree. I'll save that for later, but basically we need to create child tree nodes for the individual DATUM elements and attach these to an overall record note. We then need another layer to collect the records, and yet another layer to hold the header info alongside the records. Finally a last layer is used to gather all the lists in the data into one. Four layers of tree nodes. The tree can then be walked to produce an answer.

Now Flex is kind of straightforward but bison is much more complicated as it can handle context-free grammars, which are intrinsically much more complex than regular expressions. So there are no nice short examples like the word counter. They do work similarly, in that they both produce C++ code which produce objects which consume inputs for you.

Bison and Flex can work together but they are seperate programs. So you need to put some plumbing in to get them to talk to each other. Bison normally works as a 'pull parser' i.e. it asks flex for the next lexeme one at a time. You can make it work as a 'push parser' i.e. flex tells bison it has a new symbol for it but I've not explored this. In any event Bison keeps a stack of symbols in its memory. Pushing onto the stack is called 'shifting'. When it can match a grammar rule with the top entries in its stack it reduces the stack to one symbol. It doesn't like it if it can't decide whether it should reduce the stack, or push another symbol onto it in the hopes of reducing another rule later. It'l do the latter and complain about 'shift/reduce conflicts'.

Plumbing

This is a list of the problems / issues you'll have getting flex and bison talking together in c++ land. All overcomeable but it's hard to find documentation. I'll write about them eventually.

How to tell flex what symbols bison expects it to return
How to get bison to produce useful error messages
How to get flex to keep track of its location in the input file, so bison can use them for syntax error messages
How to handle Windows and Unix files with their different line endings cleanly
How to handle missing final newlines
How to build the parse tree
How to extract the parse tree from the bison call

I bet you can't wait.

No comments

Back to all articles