This is part five of my tutorial series Creating a Simple Compiler. Today we're going to get into a bit about state machines, and how they're going to make our job compiling stuff easier.
Creating a Language
First we've got to create our own language. Because this is just a tutorial series, we're going to start out with something really basic, and then move on to more complicated stuff in later tutorials. We'll call this language Eusil. (Extremely USeless Instructional Language.) We'll give it a few basic features and then see what we can do from there.
Rules:
-> All programs must start with enter and end with exit. They must be on their own line.
-> Comments begin with # and continue to the end of the line.
-> Whitespace is not significant except in string constants.
-> All statements must end in a semicolon.
-> There is only one data type, the signed integer. Its keyword is int.
-> Variables declarations will begin with var followed by the datatype. A variable name may only consist of alphabetic characters. Names are case-sensitive.
-> Only basic arithmetic operations are allowed, i.e. + - * /. No binary operations.
-> Other keywords: print and prompt, which are equivalent to castrated versions of printf() and scanf() respectively.
A sample program:
Trying to tackle this all at once in one huge function is going to turn into a huge mess that'll make your head explode. Not exactly desirable. If, however, we break this into smaller pieces, it'll be much easier to code and fix. Plus we can change the order of execution and how they call each other as well. How are we going to do this? We can use a state machine as a model.Code:enter # this is just a sample program of our Eusil language. var int i; i = 0; i = i + 5; i = i / 2; # # Note that prompt takes one string and one variable. It's equivalent to # printf("%s", <prompt-string>); # scanf("%d", &<variable>); # prompt("Please enter a value.", i); i = i + 1; print "Your value plus one is: "; print i; print "\n"; exit
Basic State Machines
Do you remember the context-free grammars we talked about earlier? (If not go back and read the other three parts.) A state machine can be derived from context-free grammars. Think of it as a flowchart:
At each step we get to make decisions on where to go next, and all together these essentially describe the life of a programmer. We can do the same thing with our language, except it's a bit more complicated. Why don't we start out with the big picture, and work our way down to the details.Code:----------------------> _ Die / / WakeUp --> Code -----> Tired --> Sleep ^ ^ \_ Hungry | | | \_ Eat | | |____________| | |___________________________________/
First we have:
Great and obvious. Let's expand that a bit.Code:Program
Ok...now we're getting somewhere. Each program is made up of one or more statements. So we have...Code:enter --> code --> exit
Each statement is made up of one or more expressions. Expand that...Code:enter --> (statement)+ --> exit
Now it gets ugly. We can have a few kinds of expressions in Eusil:Code:enter --> ((expression)+)+ --> exit
1) Comment
2) Arithmetic operation
3) Assignment
4) Variable declaration
5) Function call
If we try to expand that:
Yikes. Imagine what it'd look like in C++.Code:enter | _____________expression___________ / | \ | _| _ | |__Comment____/ /|\ \___Call______| | / | \ | |__Arith.____ / | \_Var dec. ___/ \_____Assmt_/ | | exit
How are we going to do this in our Eusil compiler? We can create functions for each step, and then call them as necesary from the other ones. If you do it right, it'll match your flowchart exactly. Let's get started by converting our flowchart into some functions:
If we follow our flow chart, we get something like:Code:#ifndef __EUSIL_LANGUAGE_H__ #define __EUSIL_LANGUAGE_H__ int eusil_program(); int eusil_enter(); int eusil_statement_list(); int eusil_statement(); int eusil_expression_list(); int eusil_expression(); int eusil_comment(); int eusil_call(); int eusil_arithmetic(); int eusil_assignment(); int eusil_declaration(); int eusil_exit(); #endif
You get the idea. A few other things we need are:Code:int eusil_program() { eusil_enter(); eusil_statement_list(); eusil_exit(); return 0; } int eusil_enter() { /* Figure out where the "enter" statement in our program is */ } int eusil_statement_list() { while( blah ) { ... eusil_statement(); ... } return 0; } int eusil_statement() { ... eusil_expression_list(); ... return 0; } int eusil_expression_list() { while( blah ) { ... eusil_expression(); ... } } int eusil_expression() { /* Figure out the appropriate expression type and call those functions */ } . . .
-> A way to keep track of what variables have been declared.
-> A way to parse integers, comments, and identifiers (variable names and keywords).
-> A way to differentiate between those three.
We need to define some rules for our basic types.
-> An integer consists of one or more decimal digits 0-9, optionally preceded by a - to indicate a negative number. The context-free grammar form would look something like this:
(Note that ? means "zero or one" and + means "one or more.")Code:INTEGER := (-)?([0-9]+)
A comment is defined as a # followed by anything, up until the end of the line:
. means a single character, and * means "zero or more." So ".*" means "zero or more of any character." If we want to match a literal period or asterisk, we use \. or \* . The problem with this definition is that it allows newlines to be matched as well, so we want to specify everything except a newline. We can do this by using character classes, just like in scanf() :Code:COMMENT := #(.*)\n
Better.Code:COMMENT := #([^\n]*)\n
Identifiers--i.e. variable names and keywords--consist of only alphabetic characters. That's an easy one:
Arithmetic operators:Code:IDENTIFIER := [A-Za-z]+
Note that we need to escape * and \ to force them to be interpreted as literal instead of using their special functions. We also have to put - at the very end or else we're going to accidentally interpret it as a range, like I showed you above in the identifier expression.Code:OPERATOR := [\*\\+-]
Assignment operator:
We're going to need to write functions to identify and load each of these. For example, we could do:Code:ASSIGN := =
What if we don't know what's going to come next? Hehehe...you'll find out next time.Code:int get_identifier(char *buf) { return scanf("%[A-Za-z]", buf); }![]()
sudo rm -rf /
Thanks you! I have been waiting for this part for a while, it ought to keep me going while I wait for the next part!+rep
Last edited by spyder; 08-08-2010 at 12:41 PM.
I C!(and Python, and C++, and ...)
There are currently 1 users browsing this thread. (0 members and 1 guests)
Bookmarks