Jump to content


Check out our Community Blogs





- - - - -

New project: implementing a simple language

Posted by DarkLordCthulhu, 05 December 2014 · 8539 views

lexical analysis syntactic analysis straight line implementation c project
I have now started on my first large-scale programming project since the good old days of 2011. I'm writing an interpreter for Boolean straight line programs. Straight line programs are a model of computation equivalent to finite automata and Boolean logic circuits. A straight line program is basically a simple program that has no flow control. Every instruction is either input, output, or computation. An instruction in a straight line program has a number, which increments by 1 for each instruction, followed by a command - either READ, OUTPUT, or a computation command - and then the operands of the command, which are either input variables or numbers of previous instructions. For example, if the commands are AND, OR, and NOT, a program for XOR would look like this:
( 1 READ x )
( 2 READ y )
( 3 OR 1 2 )
( 4 AND 1 2)
( 5 NOT 4 )
( 6 AND 3 5 )
( 7 OUTPUT 6 )
I chose to do Boolean straight lines because that's the kind that was described in the book I was reading, which gives this project as an exercise. There are other forms of straight line programs, for example, programs where the commands are arithmetic operators. Any straight line programming language with a sufficient set of operators is equivalent to a finite automaton. Of course the straight line model is not Turing-Complete; that would require branching or other flow control constructs.

As you can see, this provides a fairly simple language which is relatively easy to write a compiler or interpreter for. It's an ideal place to start, since I won't have to be worrying a lot about the fine details of the language and I can focus on the discipline of parsing. Here and possibly in future blog posts I will chronicle what I've done toward this end. I hope my writings will serve as a sort of amateur tutorial on the discipline of compiler design, written as a narrative of my own efforts to learn this discipline through practice.

Writing a compiler or interpreter requires a lot of initial planning before you start coding. This begins with specifying the syntax, semantics, and pragmatics or the language. Syntax refers to how the code is structured, e.g. right and left curly brackets have to match, etc. Semantics refers to what the code actually means and what it does; type-checking is generally considered a semantic issue. Pragmatics is mainly an optimization issue: what is the best way to implement a programming language?

Context-free grammars are generally sufficient for expressing the syntax of a programming language. The standard method is Backus-Naur form (BNF), which is a variation on this theme. I'm not going to launch into an explanation of formal grammars and BNF here, because that would make this blog entry much longer than I'd like it to be. I'd like to keep this on the topic of my project. As the first step in designing my interpreter, I came up with the BNF specification for the language of Boolean straight line programs:
<program> ::= <command> <program> | <command>
<command> ::= <input> | <output> | <computation>
<input> ::= ( <number> READ <variable> )
<output> ::= ( <number> OUTPUT <number> )
<computation> ::= ( <number> NOT <number> ) | ( <number> AND <number> <number> ) | ( <number> OR <number> <number> )
<variable> ::= u | v | w | x | y | z
<number> ::= <nonzero_numeral><sub_number> | <nonzero_numeral>
<nonzero_numeral> ::= 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
<sub_number> ::= <sub_number><numeral> | <numeral>
<numeral> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
I have also determined what the tokens and lexemes for the language are (note that there are several possibilities for the choice of tokens and lexemes):
Tokens:         Lexemes:
 
<lparen>        (
<rparen>        )
<number>        1, 2, 3, 4, etc.
<command>       READ, OUTPUT, AND, OR, NOT
The next step after specifying the syntax is to figure out how to implement it in the code. I chose C as the language to write the interpreter in, simply because that is the language I'm most comfortable with. For the lexical analysis, a finite automaton is generally sufficient. I planned an automaton that would use a vector of bit flags for the state. The first few bits would denote the position within a lexeme, while the higher-order bits would be flags determining whether the scanner is inside a READ, OUTPUT, number, etc.

I came up with what I think was a rather clever trick for implementing token-lexeme pairs. I would use a union type where the tokens are the fields of the union and the lexemes are the values. Like so:
enum lparen { L };
enum rparen { R };
enum command { READ, OUTPUT, AND, OR, NOT };
 
union token {
        enum lparen lp;
        enum rparen rp;
        enum command cmd;
        int num;
};
Once the tokens are scanned, they can be arranged one-by-one into a doubly linked list. A doubly linked list will allow easy access to either end of the list, so that tokens can be accessed in the same order in which they were scanned.

After the syntactic analysis, the token-lexeme pairs will be arranged into an abstract syntax tree by the parser. I have not put much thought into how to implement this part. I'm also not entirely sure how the semantic analysis should be carried out, whether I should find some way to implement an attribute grammar.

Overall, this is a very exciting project that will take a lot of creativity to complete. I will be employing some of the most advanced knowledge and techniques I currently know.

  • 2



so - i'm guessing your not using a parser generator as you wish to learn it yourself. In that case i'd recommend the book (http://www.cs.prince...appel/modern/c/) if you've not already read it.

 

Also in regard to your token union; I'm not sure you need such a clever trick. The enum's will decompose into an int value - therefore you're unioning 4 single int's The only advantage being that you don't need to explicitly cast in your code. But you would need an indicator to which field to use (since L,R & READ have the same bit representation.) 

 

I think you'd need somthing like:

struct token {
    unsigned char type;
    union data { ... }
}
 
...
if( token->type == NUMBER ) {
   token->data.num;
}

 

Ofcause you could remove the enum:

 

struct token {
    unsigned char type;
    unsigned int data;
}
 
....
 
if (token->type == COMMAND) {
    if( ((enum command)token->data) == OUTPUT) {
         ...
    }
}
 
    • 0