This is part four of my tutorial series Creating a Simple Compiler. Today we're going to talk about more parsing techniques, starting with some simple arithmetic parsing, and then getting on to operator precedence.
Previous Tutorial Sections
Part 1 | Part 2 | Part 3
Polish Notation
A parsing tree is a sort of tree used for parsing expressions. Take the expression a + b * c. Because of operator precedence, we have to evaluate b * c before we evaluate a + <result>. Imagine a stack that can hold an arbitrary number of symbols, namely identifiers and operators. If we push an expression symbol by symbol onto the stack as we encounter them (with one little exception), we get what's called Polish notation when we pop the symbols off. The little exception is we have to push the operator after its operands. So b * c is pushed onto the stack as b, c, *. Let's try this with a + b * c:
If we now pop symbols off the tree in order, we get * b c + a. Wonderful...but how do we use this? It's relatively simple: for those of you who know assembly language, you'll realize that there's almost a one-to-one correspondence with instructions and their operands.Code:* <-- top of stack c b + a <-- bottom of stack
Obviously the actual format depends on the restrictions of the target language you're using. I'm going with Intel x86, so this is completely illegal, but I'm using it to illustrate a point. If you're curious, in proper Intel assembly language (NASM format) this would look something like:Code:MUL b, c, TEMP1 ADD a, TEMP1, TEMP2
(For those of you who are anal-retentive Intel programmers, I'm ignoring the possibility of EAX overflow into EDX for now. If you don't know what this means, don't worry about it.)Code:mov eax, [b] mul [c] add eax, [a]
Clearly we have a problem, though: what if we have a * b + c? Our parser is going to produce + b c * a, which is incorrect. It should be * a b + c. Hmm...we could run the expression through a preparser that'd insert parentheses in the proper places, or we could use a parsing tree.
Parsing Trees
A parsing tree is, well, a tree. Its nodes can contain operands or operators, and it's used to force precedence. A simple parsing tree for a + b would look like this:
And for our troublesome expression:Code:+ / \ a b
Each internal node in the tree is an operator, and each of the leaves are operands. The real question now is: how do we force * to have a higher precedence over + ? Deep waters, Watson.Code:+ / \ * c / \ a b
Parsing Step-by-Step
For this we're going to have to get our identifier and the next identifier as well. That is, we read in a and then * first; when we hit * we realize that we're dealing with multiplication. So we create a partial tree:
Not very exciting. Then we hit b +. We know that + has a lower precedence than * so it must go above the multiplication in the tree:Code:* / a
Clearly we need at least one more operand to complete the tree. We're going to read in c and then the end-of-statement marker (in this case), so we know that our final tree looks like this:Code:+ / \ * / \ a b
Dealing With ParenthesesCode:+ / \ * c / \ a b
Just to strengthen your grasp of this, let's try something harder: a * ( b + c ) + 100. Hm...we haven't done parentheses before. Never fear, it's not as difficult as it seems.
First we read in a *. So far so good:
Then we hit the fun part. Next we read in ( b. We see the opening parenthesis, and immediately know that we must force precedence. The way we do it is by inserting some special symbol that doesn't do anything, but merely forces the addition down a level in the tree (i.e. forces higher precedence). Just for fun I'm going to use the dollar sign, since it isn't an operator in our language.Code:* / a
To resume normal operation, we can just read in one more token (the plus sign) and use our original parsing function to generate the next part of the tree:Code:* / \ a $
After that, we read in c ). The closing parenthesis is just a signal to move back up in the parsing tree to the level above the $. Now we have:Code:* / \ a $ | + / b
Lastly we get + 100 from our input stream, and parse it as usual:Code:* / \ a $ | + / \ b c
I would show you the basic code for this, but this tutorial is getting really long and I want to finish with a few key concepts, so I'll save that for next time. If you're curious, here's what the Polish notation for our ugly little expression looks like (ignoring $ as a null operation)Code:+ / \ * 100 / \ a $ | + / \ b c
Ew. This is going to be a pain in the butt to turn into assembly code. Never fear, there's an easier way. I didn't teach you about parsing trees for nothing.Code:+ * a + b c 100
Generating Polish and Reverse-Polish From Parsing Trees
If you've dealt with binary trees before, you may notice that we can generate Polish notation by traversing the tree in preorder. If you're a little rusty or not all that familiar with binary trees, read the articles I've linked to above.
Anyway, we want to generate our assembly instructions by drilling down to the bottom of the tree (i.e. the + b c) and working our way back up, generating instructions whenever we hit an operator. For example:
Code:TREE CODE + add b, c, TEMP1 / \ * 100 / \ a $ | + / \ b cCode:TREE CODE + add b, c, TEMP1 / \ mul a, TEMP1, TEMP2 * 100 / \ a $ | + / \ b cThat's it!Code:TREE CODE + add b, c, TEMP1 / \ mul a, TEMP1, TEMP2 * 100 add TEMP2, 100, TEMP3 / \ a $ | + / \ b c
This tutorial has gone on long enough, so I'm going to dedicate a significant portion of the next tutorial showing you how to do all this stuff with real C/C++ code, and then move on to...something. Dunno.
Stay tuned!
sudo rm -rf /
Well done! +rep
Gotta love parsing trees as well![]()
thanks
I really like this series.![]()
Root Beer == System Administrator's Beer
Download the new operating system programming kit! (some assembly required)
good job man thanks
PLease continue this tutorial??
nice
+rep
I will eventually, but school's killing me right now.![]()
sudo rm -rf /
There are currently 1 users browsing this thread. (0 members and 1 guests)
Bookmarks