Hello, and welcome to part two of my tutorial series, Creating a Simple Compiler. In this part we're going to fully flesh out our CFG for Nano, and get started writing a few parsing utilities.
Creating A Simple Compiler - Part 1
A Full Grammar For Nano
In the interest of time, I've written out the entire CFG for the Nano language. Take a few minutes to study it and figure out what's going on; try doing a few resolutions yourself if it helps. Pay particular attention to the rules for EXPRESSION, ASSIGNMENT, and STATEMENT, as they're some of the more important ones.
So now that we've got our language defined, what're we going to do? We'll create a function for each of our rules. But where do we start? There's no starting state. Hm...let's add a few rules.Code:<DIGIT> ::= '0'..'9' <NUMBER> ::= <DIGIT>+ <ANYCHAR> ::= '\x00' .. '\xFF' <COMMENT> ::= "/*" <ANYCHAR>* "*/" <LETTER> ::= 'A'..'Z' | 'a'..'z' <IDCHAR> ::= <LETTER> | '_' <VARIABLE> ::= <IDCHAR>+ (<DIGIT> | <IDCHAR>)* <BIOP> ::= '+' | '-' | '*' | '/' | '%' <OPERATION> ::= (<VARIABLE> | <NUMBER>) <BIOP> (<VARIABLE> | <NUMBER>) <EXPRESSION>::= <OPERATION> (<BIOP> <OPERATION>)* <ASSIGNMENT>::= <VARIABLE> '=' <EXPRESSION> <VARLIST> ::= <VARIABLE> (',' <VARIABLE>)* <DECLARATION>::= "var" <VARLIST> <STATEMENT> ::= ( <ASSIGNMENT> | <DECLARATION> | ("output" <VARLIST>) | ("input" <VARLIST>) ) ';'
So our program consists of a bunch of statements followed by EOF. Perfect.Code:<EOF> := $ <START> := <STATEMENT>* <EOF>
Creating Parsing Functions
Let's create our functions to parse a Nano program. (We're assuming that we have a function called match() that consumes the next character in the input stream; if it matches the argument then it returns, otherwise it spits out an error message and exits the program.
If we divide our tasks up like this, all we need to do is change our functions to match the CFG we generate. Relatively simple. But we need some utilities to help us. First we need match(), as you probably noticed in start(). We're also going to need to be able to determine the type of a character, so let's make a chartype() function as well. (We're going to need a few more, just hang on for a bit.)Code:#define TOKEN_EOF -1 /*other #define character classes here*/ void start(ifstream& input) { while(!input.eof()) statement(); match(input, TOKEN_EOF); } void statement(ifstream& input) { int statement_type; /*somehow figure out if we're dealing with an assignment, a declaration, an output statement, or an input statement. */ switch(statement_type) { case STATEMENT_TYPE_ASSIGNMENT: assignment(); break; case STATEMENT_TYPE_DECLARATION: declaration(); break; case STATEMENT_TYPE_INPUT: doinput(); break; case STATEMENT_TYPE_OUTPUT: dooutput(); break; default: cerr << "FATAL ERROR: Unexpected statement type." << endl; exit(-1); break; } } void doinput(ifstream& input) { vector<VARIABLE_T> variables; varlist(input, variables); /*iterate through all the variables and generate the code to output their values*/ }
match()
We're going to have to have two overloads of this function, because we need two versions of this: one to match a specific character, and one to match a range of characters. Let's make the one to match a single character first:
Not too difficult, right? Let's try matching a character in a set of characters:Code:void match(ifstream& input, int expected) { int ch = input.peek(); if(ch != expected) { cerr << "ERROR: Expected \'" << (char)expected << "\' but found \'" << (char)ch << "\'" << endl; exit(-1); } }
In case you don't know what strchr() doesCode:void match(ifstream& input, const char *charclass) { int ch = input.peek(); if(strchr(charclass, ch) == NULL) { cerr << "ERROR: Expected a character in \'" << charclass << "\' but found \'" << (char)ch << "\'" << endl; exit(-1); } }
chartype()
Hm...how are we going to go about this? My way would be to have chartype() return an integer (an enum, really, but I hate using enums for some reason) describing the type of character it found. We'll have a few classes of characters, namely:
This isn't too hard:Code:#define CHARTYPE_EOF -1 #define CHARTYPE_UNKNOWN 0 /* everything else */ #define CHARTYPE_WHITESPACE 1 /* '\x00' .. '\x1F' */ #define CHARTYPE_DIGIT 2 /* 0..9 */ #define CHARTYPE_LETTER 3 /* A..Z a..z */ #define CHARTYPE_OPERATOR 4 /* + - * / % = */ #define CHARTYPE_LPAREN 5 /* ( */ #define CHARTYPE_RPAREN 6 /* ) */ #define CHARTYPE_USCORE 7 /* _ */
Slightly annoying, but it gets the job done. We'll add character classes and cases as this series goes on and our language gets more complicated.Code:#include <cctype> #include <cstring> int chartype(int ch) { if( ')' == (char)ch ) return CHARTYPE_RPAREN; else if( '(' == (char)ch ) return CHARTYPE_LPAREN; else if( strchr("+-*/%=", ch) ) return CHARTYPE_OPERATOR; else if( isalpha(ch) ) return CHARTYPE_LETTER; else if( isdigit(ch) ) return CHARTYPE_DIGIT; else if( isspace(ch) || iscntrl(ch) ) return CHARTYPE_WHITESPACE; else if( (char)ch == '_' ) return CHARTYPE_USCORE; else if( ch == -1 ) return CHARTYPE_EOF; else return CHARTYPE_UNKNOWN; }
All right, I think this is long enough. I'll stop here for now. Next time I'm going to show you how we're going to use match() and chartype(), and then create a few more utility functions before we plough on.
Stay tuned!
Last edited by dargueta; 12-22-2009 at 05:38 PM. Reason: Forgot underscore
sudo rm -rf /
Ahh... the wonderful world of parsing+rep
Wonderful indeed. The great thing about it is that it looks complicated, but if you know how to divvy it up the right way, it's really not that bad.
sudo rm -rf /
Excellent! +rep!
Wow I changed my sig!
There are currently 1 users browsing this thread. (0 members and 1 guests)
Bookmarks