Do you want to write your own compiler, but don't know how? Let me help you! I'm not exactly a master at this, but I'd like to share my meager knowledge with you.
For this tutorial series, I'm going to assume you're capable of doing quite a bit more than a Hello World program. I'll explain stuff as needed, but I'm going to assume you know how to use printf, scanf, iostream and all that fun stuff. Oh, and as a warning, I have an incredibly bad habit of mixing C with C++. I'm going to try to keep to C++, but if I slip up, just slap me and I'll fix it.
For this tutorial series we're going to write a very simple compiler for a language I shall call Nano, because it's small. Our compiler will generate a simple assembly language program from our Nano code. Neither Nano nor the assembly language output is meant to be terribly useful, but you'll be able to apply the concepts you learn to make a better compiler that is useful.
Let's get started!
First we have to define the syntax for our language. Syntax is the set of rules that define what you can do and what you can't. If you don't have rules, then writing a compiler is going to be a real pain in the butt, if not impossible. So we need to have rules.
1) All statements will end in a semicolon.
2) Whitespace is not significant, i.e. a+b is the same as a + b.
3) Comments begin with /* and end with */ and they can extend across multiple lines.
4) The only keywords are var, input, and output. Their function is as follows:
a) var declares a new variable. All variables must be declared before they're used.
b) input reads a variable from the console.
c) output displays the value of a variable on the console.
5) All variables are integers. (This will make our life easier.)
6) The only operations allowed are +, -, *, /, %, and =. Parentheses may be used to force operation ordering.
7) Variable names may only consist of uppercase and lowercase letters, underscores (_), and numbers. A variable name may not start with a number.
(As you can see, this is a very restricted language. As we progress I'm going to amend the syntax to add if-else constructs and all that fun stuff you're used to.)
Just so you can see an example of what a Nano program might look like:
We need some way of expressing the rules of our syntax in a more systematic format that'll make writing our compiler easier. Allow me to introduce you to context-free grammar. A context-free grammar is a set of production rules, or a list of tokens and their possible evaluations. There are two parts to every production rule: the variable, and its resolutions. Variables can resolve to other variables, or they can resolve to a single terminal--a token that has no further resolutions. Let me show you what I mean. Here's a very simple grammar:Code:var a, b, c; /*this is a comment*/ a = 5; b = 6; c = (a * b) + (a / b);
The goal of a context-free grammar is to produce a set of terminals. We always begin with START. Now we have "The", which is a terminal, a dead end. But that's not the end of the rule. We have a variable--two actually--that we need to resolve. So we go on to PERSON.Code:<START> ::= "The" <PERSON> <VERB> "the man" <PERSON> ::= "dog " | "cat " <VERB> ::= "ate " | "bit "
Here we have a choice between "dog " and "cat ." I like dogs, so I'm going to pick "dog." Now we have "The dog " VERB "the man." We need to resolve VERB. So we look up VERB, and find we have two possibilities: "ate " and "bit ." Just for kicks I'll pick "ate ." So now we have "The dog ate the man." No variables left, so we have a valid token!
Other possibilities include "The cat ate the man," "The dog bit the man," and so on.
So how do we apply this to expressing the syntax of a language? Let's start with something simple. Why don't we define what an integer can be? First, what is an integer? It's a series of digits, from 0 to 9 inclusive. (We're only going to deal with decimal for now.) So why don't we make a CFG rule for a number:
Note: The trailing + means "one or more." We'll use a * to indicate "zero or more," and ? to indicate "zero or one." To indicate a range we will use .., so a .. c expands to a b c. I'm lazy.Code:<DIGIT> ::= 0 .. 9 <NUMBER> ::= <DIGIT>+
Simple enough, right? Let's create a few more:
The last rule is a little complicated. The first part says we must have one or more IDCHARs at the beginning of an identifier, followed by any combination of IDCHARs and digits. So a few example resolutions would be foo, __sprintf, a_b_1578249367890, and so on.Code:<LETTER> ::= 'A' .. 'Z' | 'a' .. 'z' <IDCHAR> := <LETTER> | '_' <VARIABLE> ::= <IDCHAR>+ (<DIGIT> | <IDCHAR>)*
Why don't we try something harder? How about addition?
Do you see how everything builds on the basic definitions of a variable and digit? We can extend this to do variable declarations, all possible arithmetic operations, and more.Code:<ADDITION> ::= (<VARIABLE> '+' <NUMBER>) | \ (<NUMBER> '+' <NUMBER>) | \ (<VARIABLE> '+' <VARIABLE>) | \ (<NUMBER> '+' <VARIABLE>)
Well, I think that's enough for today. I'll present you with a full CFG for Nano next time, and then we'll begin with constructing basic parsing tools. Hope you enjoyed my tutorial!


LinkBack URL
About LinkBacks






Reply With Quote






< Holy cow! Two images!



Bookmarks