Flex and Line Endings

by mark | 5 Mar 2023, 11:52 a.m.

Updated by mark | 5 Mar 2023, 11:53 a.m.

I said some time ago I'd write about how I got line endings working. This is that post.

There are three problems with Flex newlines:

Windows
Line counts
Posix line definition

Windows

Windows issues come from the fact that Flex is primarily a Unix tool, so it expects lines to end with \n. But if you read Windows files you get \r\n. You can possibly get \r or even \n\r if you read files from old Macs or old BBC Micros, somehow. The Flex issue is then telling it how to handle Windows line endings. It is actually straightforward. You have a rule

(\n|\r\n?)

which will catch Unix line endings, Windows line endings or old Mac line endings. I last used a BBC Micro in 2001 as part of a sixth year project (they had a pH probe that was connected to a BBC Micro), so I am not interested in reading those files. But the rule could be adapted. In practice Unix and Windows line endings are all you need. The only other convention in common use is the mainframe one, but as a normal person you are unlikely to see one of those files, because the people that own them go to great lengths to hide EBCDIC from you.
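To make the behaviour of the pattern concrete, here is a minimal hand-written sketch of what (\n|\r\n?) matches. It is plain C++ rather than Flex, and the function name match_newline is made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hand-written equivalent of the Flex pattern (\n|\r\n?).
// Returns the length of the line ending that starts at s[pos], or 0 if
// there is no line ending at that position.
std::size_t match_newline(const std::string& s, std::size_t pos) {
    assert(pos <= s.size());
    if (pos == s.size()) return 0;
    if (s[pos] == '\n') return 1;                  // Unix: \n
    if (s[pos] == '\r') {
        bool lf_follows = pos + 1 < s.size() && s[pos + 1] == '\n';
        return lf_follows ? 2 : 1;                 // Windows: \r\n, old Mac: \r
    }
    return 0;
}
```

Note that a BBC-style \n\r would be seen as two separate line endings (a \n followed by a lone \r), which is why the rule would need adapting for those files.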

An alternative is just to consume \r with an empty action and let the trailing \n drive any actions. But I have a small handful of Mac files I need to deal with, and those have no trailing \n. Finally, you can have two versions of your program for the different line endings. QIFs are just plain ASCII so the control codes only appear where they should.
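A rough sketch of why the "discard \r" alternative does not cope with old Mac files. This is plain C++ imitating a scanner with two rules, not real Flex output, and count_lines_ignoring_cr is a made-up name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Imitates a scanner with two rules: "\r" with an empty action, and "\n"
// driving the line count. Returns the number of lines ended by \n.
std::size_t count_lines_ignoring_cr(const std::string& input) {
    std::size_t lines = 0;
    for (char c : input) {
        if (c == '\r') continue;   // empty action: the character is discarded
        if (c == '\n') ++lines;    // only \n counts as a line ending
    }
    assert(lines <= input.size()); // sanity check: at most one ending per char
    return lines;
}
```

Unix and Windows input give the same count, but a file that uses a lone \r as its line ending is seen as having no line endings at all.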

Line counts

This is probably the nastiest part of the whole thing as you really need to update the column count as well. If you are not insane you are using Bison's complete symbols which provide a location type. So "all" you have to do is update this with every token. And if you are using complete symbols you define your own parsing context type that is passed to the lexer. This means you can access the context (i.e., the embedded location object) in every action. Phew!

Your context object will look like this:

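#pragma once
#include <memory>        // std::unique_ptr
#include "location.hh"   // generated by Bison
#include "node.hh"       // hypothetical header defining your tree node type

namespace my_ns {

    class Context // again more like a struct
    {
    public:
        Context() : done(false) { loc.initialize(); }
        bool done;                  // set to true at EOF
        std::unique_ptr<Node> node; // each lexeme is actually a tree node (Node is a placeholder name)
        location loc;               // bison provided location class
    };

}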

and your lexer call declaration for flex will be

#define YY_DECL my_ns::Parser::symbol_type my_ns::Scanner::lex(my_ns::Context *context)

You will need a custom user action that ensures consumed characters are reflected in the location; this goes immediately after the preceding YY_DECL definition:

#define YY_USER_ACTION context->loc.step(); context->loc.columns(yyleng);

and finally your new line reader updates the line counter when it is touched:

(\n|\r\n?) { context->loc.lines(); return my_ns::Parser::make_ENDL(context->loc); }

Phew. The YY_USER_ACTION gets performed for every match so that, even if you discard characters, the location is correct. You have to be very careful with newline matching: a pattern that might or might not match a newline isn't good enough. Your patterns must either always contain newlines or never contain them, otherwise the line count will be wrong.
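To see how the step/columns/lines calls combine, here is a small simulation. MiniLocation is a simplified stand-in I have assumed for illustration, not Bison's generated location class, and consume() plays the role of YY_USER_ACTION plus the newline rule.

```cpp
#include <cassert>
#include <string>

// Simplified stand-in for a Bison-style location (NOT the generated class):
// a begin/end pair of 1-based line and column numbers.
struct MiniPosition { int line = 1; int column = 1; };

struct MiniLocation {
    MiniPosition begin, end;
    void step() { begin = end; }                 // next token starts where the last ended
    void columns(int n = 1) { end.column += n; } // advance past n characters
    void lines(int n = 1) {                      // advance n lines, back to column 1
        assert(n >= 0);
        if (n > 0) { end.line += n; end.column = 1; }
    }
};

// What happens for one matched lexeme.
void consume(MiniLocation& loc, const std::string& lexeme) {
    loc.step();
    loc.columns(static_cast<int>(lexeme.size())); // the YY_USER_ACTION part
    bool is_newline = lexeme == "\n" || lexeme == "\r" || lexeme == "\r\n";
    if (is_newline) loc.lines();                  // the newline rule's own action
}
```

Feeding it "ab", then "\r\n", then "c" leaves the last token spanning line 2, columns 1 to 2: the columns added for the two-character Windows ending are thrown away when lines() resets the column.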

Posix

Posix declares that every line ends with a newline. In practice, many last lines end with an EOF not a newline. What do? Flex lets you match an EOF and you can use a push parser approach so that a raw EOF is always turned into a newline and an EOF. You can track whether the last thing was a newline in the context. But in reality it is better handled in the grammar. If there are only a small number (possibly one or two) tokens that a legally constructed file, modulo terminal newline, can end on then you can use the Bison pseudotoken YYEOF to say "EOF acceptable". For example, QIF files always end in a record separator ^ but this is not always a well formed POSIX line. So the bison rule is

separator_row: SEPARATOR ENDL | SEPARATOR YYEOF;

where SEPARATOR is just the ^ character. This means a file can end in any number (including zero) of newlines (including Windows newlines) as long as the last record includes a record separator at the end. Injecting optional newlines via flex into the token stream can be done but it is very irritating and ugly. Bison does the right thing with end of files (i.e. it'll reduce as much as it can), so you can generally forget about the end of file marker except if you need to catch optional end of lines.
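For comparison, the lexer-side approach (turning a raw EOF into a newline followed by an EOF) amounts to something like the sketch below. It works on an already-built token list rather than inside Flex, and the Tok enum and normalise_tail are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <vector>

enum class Tok { SEPARATOR, ENDL, OTHER, END_OF_FILE };  // hypothetical token kinds

// Ensure a non-empty token stream ends "... ENDL END_OF_FILE" by injecting
// an ENDL when the last real token is not already one.
std::vector<Tok> normalise_tail(std::vector<Tok> toks) {
    assert(!toks.empty() && toks.back() == Tok::END_OF_FILE);
    toks.pop_back();                               // drop the EOF for now
    if (!toks.empty() && toks.back() != Tok::ENDL)
        toks.push_back(Tok::ENDL);                 // the missing final newline
    toks.push_back(Tok::END_OF_FILE);              // put the EOF back
    return toks;
}
```

Inside a real scanner the same idea needs a "last token was a newline" flag in the context and a way to return two tokens for one match, which is where the ugliness comes from.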

The context class above contains a bit that can be used to tell Bison that the EOF flex returned is an actual EOF if you are doing something interesting like an interpreter (where you use EOFs to capture line endings, making Bison reduce the input and execute it). It is easy to set this. It is already initialised to false, so to make it true:

<<EOF>> { context->done = true; return my_ns::Parser::make_YYEOF(context->loc); }

What happens in interpreters is that newlines are used to make a YYEOF which causes bison to reduce the input (this almost always means evaluate the syntax tree) and return from yyparse(). The thing that called yyparse() will examine the context to see if flex actually reached the end of input or not. If it is just a newline and not a "real" EOF the calling routine loops.

This is what command line interpreters really do - basically loop over a fancy tree building and reduction routine. The context can hold symbol tables etc which allows things to persist between bison calls. Neat!
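The shape of that calling loop, as a minimal sketch. parse_one is a stand-in for a single parser run (a real program would invoke the Bison parser there), and ReplContext is a hypothetical, pared-down version of the context class.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

struct ReplContext {        // hypothetical, pared-down context
    bool done = false;      // set to true at the real end of input
    int evaluated = 0;      // state that persists between parser runs
};

// Stand-in for one parser run: consume one line, "evaluate" it, return.
void parse_one(std::istream& in, ReplContext& ctx) {
    assert(!ctx.done);                  // must not be called after a real EOF
    std::string line;
    if (!std::getline(in, line)) { ctx.done = true; return; }
    if (!line.empty()) ++ctx.evaluated; // pretend evaluation of the tree
    if (in.eof()) ctx.done = true;      // last line had no trailing newline
}

// The calling routine: loop until the lexer side reports a real EOF.
int run(std::istream& in) {
    ReplContext ctx;
    while (!ctx.done) parse_one(in, ctx);
    return ctx.evaluated;
}
```

Because the context outlives each call, anything stored in it (here just a counter) survives from one line of input to the next.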
