Blog postings

An irritating Django feature · by mark | 17 Nov 2022, 7:34 p.m.

I have made some backend changes to the payslip thingy. (You won't notice - it just makes it easier to allow for within-year parameter changes, like the movements on primary threshold and contribution rates we have seen with NI this year).

I had it all set up and tested nicely. Then it failed pushing to heroku with psycopg2.errors.UndefinedTable errors. Oh no!

What had happened: I had used data migrations to create and fill the new DB tables. I then included a thing in the form that would pull the latest year from the tables as the default. this was the problem. Pulling a number from a table will fail if the table doesn't exist, which it won't at that point in the __init__() cycle, as the migrations haven't yet run. Urgh. So I had to do an ugly hotfix to replace a table lookup with a bare number to get it to deploy. Now it exists I can put in the code I wanted to. But I don't want to right now.

Lesson learned: test by migrating the app to zero and restarting the web server before deploy. This will throw any init-needs-something-to-exist-which-hasn't-yet-been-init'd errors before you deploy. It's basically a race condition.

Flex: inheriting from its base class · by mark | 12 Nov 2022, 1:45 p.m. (updated 13 Nov 2022, 8:16 a.m.)

As mentioned earlier I am building a qif parsing thing so I can force various financial institutions' data into a semi coherent whole. There is some documentation available on getting Flex to behave with C++ but it is scant and mostly in contradictory stack overflow posts. So here are some pointers.

Flex Options

The declarations section of flex lets you specify %options. If you are building a C++ lexing class you already have a class name in mind for your scanner. I landed on the following layout:

%option c++ noyywrap outfile="Scanner.cpp" noyylineno batch yyclass="qif::Scanner" prefix="qif"

What this does, one at a time:

c++: Says this is a C++ scanner. The classic C interface reaches back to the 1970s which I am not interested in.
noyywrap: flex offers the ability to read multiple input files one at a time through a yywrap() call. This is C++; you create objects one at a time, one for each input stream; you don't want flex to wrap inputs too.
outfile: This one just sets what flex should call its generated scanner source, instead of the legacy 1970s default it will use otherwise
noyylineno: You do want line numbers. However you get better features using bison locations instead of flex positioning. So we disable flex's line counts here.
batch: flex can do interactive scanners if you are going to be responding to a command line. But I am not. So I put it into batch mode for a marginal performance gain.
yyclass: I want the output class to be called something specific.
prefix: I want the base class from which we inherit to be called qifFlexLexer, not yyFlexLexer, in case I define multiple scanners in the same project

Flex Includes

You need to produce, yourself, two include files. This is because the flex system header #include <FlexLexer.h> is designed to be included multiple times to build multiple classes. This leads to very difficult situations if you don't take great care around where and when you include stuff. So I've figured it out.

First file you make is called Scanner.hpp (or whatever, aligned with outfile). This is very simple:

#pragma once 
#define yyFlexLexer qifFlexLexer
#include <FlexLexer.h> 
#undef yyFlexLexer
#include "Scanner-Internal.hpp"

That's it. The other internal file referred to in the header is more complicated:

#pragma once
#include "Parser.hpp"

namespace qif {
    class Scanner : public qifFlexLexer {
    public:
        Scanner(std::istream& arg_yyin, std::ostream& arg_yyout)
            : qifFlexLexer(arg_yyin, arg_yyout) {}
        Scanner(std::istream* arg_yyin = nullptr, std::ostream* arg_yyout = nullptr)
            : qifFlexLexer(arg_yyin, arg_yyout) {}
        int lex(Parser::semantic_type *yylval, Parser::location_type *yyloc); 
    };
}

Of course this depends on a bison created file so actually compilling this will be a bit of a pain. You can create a stub bison file to produce the bison header so it compiles for now, or you can use the base lex definition for now (i.e. don't specify int lex() in the class). The important things about the internal scanner header are:

This inherits from qifFlexLexer, and the FlexLexer file itself can be included again as it doesn't have any include guards

The lex function is set up to be useful, in that you can use it to communicate with Bison, both in terms of lexemes and locations. There is a nasty chicken and egg situation as you need the parser to generate parser include file, but you can't parse without first lexing...

The two files have the following purpose. The main one is included whenever you need to refer to the scanner. This will ensure the system wide FlexLexer.h file is brought in exactly once per scanner (with the right defines) per translation unit so everything is OK. The internal one is needed so that you can include it in your flex file directly, as this already includes the system wide header and you can't get around it.

If you keep my copy above you'll need to add the following into the prologue of your .l file:

#undef YY_DECL
#define YY_DECL int qif::Scanner::lex(qif::Parser::semantic_type* const yylval, qif::Parser::location_type* yyloc)

YYDECL is by default a simple int function. Need it to match what we have declared. Of course the default 1970s definition is still there so you need to undefine it first.

That is the first of several flex related headaches. There will be others. I'll talk about newlines next time. They need a little bit of care.

Sunrise is getting later · by mark | 12 Nov 2022, 1:01 p.m.

Halfway between the equinox and the solstice (ish). Sunrise is now late enough at these high latitudes. Took advantage this morning and got out. Still way warm for this time of year!

Pegwell Bay at dawn

Trying out flex and bison to process QIFs · by mark | 6 Nov 2022, 9:51 p.m.

I have gotten hacked off enough with QIFs to try writing my own parsing library for them. That was a bigger job than I thought. What I ended up doing was figuring out how to use flex and bison to convert QIFs to an abstract tree, which I can then create QIF objects from, and then I can do some operations on these objects before saving them to disk. For my sins I decided to use C++ for this and I didn't fancy dealing with boost::spirit as I haven't handled C++ for a while.

I am going to start off with some conceptual stuff first, then move on to talking about more gnarly details. This might take a few posts.

What is Flex?

Flex is the Fast Lexical Analyser. Immediately begs two questions: what is a Lexical Analyser, and why is this particular one Fast? A lexical analyser is, very simply, a program that consumes an input stream (like a text file) and recognises lexemes defined by regular expressions within it. This sounds like nerdy nonsense, because it is. If you have a string of text like this, you can say there are two kinds of lexemes, words and spaces. Words are [a-zA-Z] and spaces are literal spaces. A lexical analyser is a function that can be called repeatedly; it will return a bunch of lexemes (WORD SPACE WORD SPACE WORD SPACE .... WORD EndOfFile) and the text that generated them ('a', ' ', 'string', ' ', ...). What Flex does is produce a C++ object which will consume a std::istream and return the lexemes one at a time to the programmer.

There was, a long time ago, a program called lex, written by one of the guys who went on to found Google (really). Flex is lex with some extra features and a really cunning internal algorithm that means it generates fast lexers.

Regular expressions by themselves can't express most structured text out there, but the lexemes produced by flex are the first step in processing structured text. Very simple programs can be produced in flex though; for example, the venerable unix word counter wc has several implementations visible here from the flex maintainers. I like the following as an example. There are White Space (i.e. spaces and tabs) and Non White Space (excludes newline) character sets defined, which are then used to define what a word is - optional intial whitespace followed by actual characters. What happens when each of the lexemes is seen is coded up; basically character and word counters are increased. A cunning thing is done to track multiplie sequential newlines.

At the end of the input file the counters are printed and the program terminates. This shouldn't be hard to follow. (The stuff with yy at the start is flex specific; yyleng is how long the matched lexeme is).

ws    [ \t]
nonws [^ \t\n]
word  {ws}*{nonws}+

%option main noyywrap
%%
	int cc = 0, wc = 0, lc = 0;

{word}{ws}*	    cc += yyleng; ++wc;
{word}{ws}*\n	cc += yyleng; ++wc; ++lc;

{ws}+	    	cc += yyleng;

\n+	        	cc += yyleng; lc += yyleng;

<<EOF>>		{
		printf( "%8d %8d %8d\n", lc, wc, cc );
		yyterminate();
		}

Copy the above to a file called wc.l and type make wc. Pipe a text file to its standard input and out pops some counts. It's mad. (Also the make tool knows how to make executables from complete .l files which is madder). The make tool is likely to call lex, not flex, but it is likely that your lex is actually flex as flex is open source.

If you capture the c file that lex produces and inspect it you can... well you can't see what it is doing as it is designed for compilers, not humans. It produces a state machine which rapidly goes through the input until it matches a regex. It keeps going until it can't add any more characters, at which point it stops and returns the lexeme. You need to call it again to get the next lexeme.

The example above is C but it can produce C++ lexing objects. More on that later. Flex is a useful tool if you want to do something with regular expressions and it can be made incredibly fast with some care (e.g. - an optimisation is to know that maybe 10 words is the usual max for a line, so enumerate those 10 cases as switching from lexing to the actions is relatively expensive).

What is bison?

Flex is a program that creates lexers - a function that returns lexemes. Bison is a program that creates parsers - programs which consume lexemes and produce parse trees.

Eh?

Essentially any sensible structured format can be expressed as a tree. For a simple structure like QIF you have the whole file as a root node, with individual records as children of the root, with individual fields as children of the records. For more complex text files, like computer source code, you have very deep parse trees for dealing with nested loops and so on. You could try it with XML subsets, or maybe JSON.

Bison has an interesting history. Way back in the 1970s there was a program called yacc ("Yet Another Compiler Compiler"). It let people describe a programming languange and how to interpret the various structures that result. It is relatively straightforward to produce an interactive calculator in it, for example, and it's a (biggish) step from that to being able to compile a set of calculator operations into a runnable program. As yacc's name implies, it ended up being used everywhere to generate parsers for structured inputs and it has an incestuous relationship with lex. It was not open source, so the GNU guys went for a three way pun and started one called bison instead. No matter which ungulate they're named after, parser generators are one of those programs that no-one has heard of them except for huge nerds but they've been used to make uncountable billions of executables over the past decades.

The power of bison is that you can input the formal grammatical structure of your input in terms of lexemes ('terminal symbols') and groups thereof ('nonterminal symbols'). You can literally say, for example:

file: lists;
lists: list | lists list;
list: HEADER records;
records: record | records record;
record: data SEPERATOR;
data: DATUM | data DATUM;

in the bison file to express the simple listy structure that you find in QIFs. (QIFs are actually a bit more complex... but later). But here we only need three things from the input itself: we need to know what a HEADER is (in QIFs, these are rows beginning with !, like !Account); what a SEPERATOR is (these are the ^ rows), and the actual data rows (everything else - note we don't, at this stage, try cutting up data rows into their leading character and the payload yet).

Of course it is not that simple. We need to tell bison to build a tree. I'll save that for later, but basically we need to create child tree nodes for the individual DATUM elements and attach these to an overall record note. We then need another layer to collect the records, and yet another layer to hold the header info alongside the records. Finally a last layer is used to gather all the lists in the data into one. Four layers of tree nodes. The tree can then be walked to produce an answer.

Now Flex is kind of straightforward but bison is much more complicated as it can handle context-free grammars, which are intrinsically much more complex than regular expressions. So there are no nice short examples like the word counter. They do work similarly, in that they both produce C++ code which produce objects which consume inputs for you.

Bison and Flex can work together but they are seperate programs. So you need to put some plumbing in to get them to talk to each other. Bison normally works as a 'pull parser' i.e. it asks flex for the next lexeme one at a time. You can make it work as a 'push parser' i.e. flex tells bison it has a new symbol for it but I've not explored this. In any event Bison keeps a stack of symbols in its memory. Pushing onto the stack is called 'shifting'. When it can match a grammar rule with the top entries in its stack it reduces the stack to one symbol. It doesn't like it if it can't decide whether it should reduce the stack, or push another symbol onto it in the hopes of reducing another rule later. It'l do the latter and complain about 'shift/reduce conflicts'.

Plumbing

This is a list of the problems / issues you'll have getting flex and bison talking together in c++ land. All overcomeable but it's hard to find documentation. I'll write about them eventually.

How to tell flex what symbols bison expects it to return
How to get bison to produce useful error messages
How to get flex to keep track of its location in the input file, so bison can use them for syntax error messages
How to handle Windows and Unix files with their different line endings cleanly
How to handle missing final newlines
How to build the parse tree
How to extract the parse tree from the bison call

I bet you can't wait.

Some stuff from Mull · by mark | 23 Oct 2022, 9:05 p.m.

I'm on an island on the west coast of Scotland called Mull. I mean Great Britain is itself an island but it has island of its own. And also islands off islands which I will get to later.

To get to Mul you have to get a ferry from the mainland. The one from Oban is the main one but busy. There are a couple of smaller ones as well, from Kilchoan to Tobermory and from elsewhere to Fishnish. Lots of single track roads. You can just drive up to the wee ferries but the one from Oban needs booking in advance. Shops and so forth are scarce, stock up when you can. Views are fantastic.

Anyway. Pictures.

Calgary Bay

This is Calgary Bay, yes, like that one in Canada, although this came first. It is much quicker to get to the Canadian Calgary in Alberta as you can fly there from Heathrow. It is very pretty this one. Shown are two Spanish tourists for scale. They found it cold.

Sheep

Here is the road to Dervaig from Tobermory. These ovine creatures are everywhere and do not obey proper passing place etiquette. Don't even move should you beep at them.

Tobermory

This is Tobermory. Famous view from a children's TV programme. It's actually really nice, compared to other wee island towns like Portree which has nothing. Excellent chocolate shop. Petrol station. Everything you could want.

Loch Na Keal

Loch Na Keal, with Ulva off to the right and the islet of Eorsa in the foreground. Very nice clouds and cliffs. Reminds me of the Faroes.

Random waxcaps

Here are some random waxcaps on the coast. I can't tell if they are Golden Waxcaps or something else. I left them alone.