Jump to content


Check out our Community Blogs

Register and join over 40,000 other developers!


Recent Status Updates

View All Updates

Photo
- - - - -

Creating A Simple Compiler: Part 6


  • This topic is locked This topic is locked
12 replies to this topic

#1 dargueta

dargueta

    I chown trolls.

  • Moderator
  • 4854 posts
  • Programming Language:C, Java, C++, PHP, Python, JavaScript, Perl, Assembly, Bash, Others
  • Learning:Objective-C

Posted 31 August 2010 - 11:17 PM

Welcome back to the sixth part of my series, Creating A Simple Compiler. In this part we're going to learn what to do when we don't know what kind of token we're going to get, and how to deal with it. If you don't remember what the syntax rules of Eusil are, take a look at the previous tutorials in this series:

http://forum.codecal...r-part-1-a.html
http://forum.codecal...r-part-2-a.html
http://forum.codecal...r-part-3-a.html
http://forum.codecal...r-part-4-a.html
http://forum.codecal...r-part-5-a.html

So we're at the very beginning of our program, and we have no idea what we're going to get as our first token. We can usually tell by looking at the first character of whatever it is that comes next. For example:

int c = fgetc(finput);
chartype_t type = eusil_typeof(c);

switch(type)
{
    case CHARTYPE_INT:
        /* do int-related stuff here */
        break;

    case CHARTYPE_HASH:
        /* this signals the beginning of a comment.
        ignore the rest of the line and continue. */
        break;

    case CHARTYPE_WHITESPACE:
        /* ignore whitespace */
        break;
    .
    .
    .

    default:
        /* no idea what we just got. throw an error. */
}

The problem is, different rules apply to the same token type. Obviously we can't assign anything to an integer, so 5 = i should cause an error, but i = 5 shouldn't. Remember the state machine stuff I mentioned earlier? We can split these two cases into two separate states, and that way each state will only have one rule to worry about. We can do something like this:

get_statement --> get_lhs --> get_operator --> get_rhs

An integer constant is always illegal in the get_lhs state, but not in get_rhs. An identifier, i.e. a variable (later we'll add functions and such) is legal in get_lhs and get_rhs but not get_operator.

With this in mind, we're going to need to define some character classes so we can make decisions on what to do where. This is about all we need:

IDENTIFIER: 'A'-'Z' and 'a'-'z'
OPERATOR: + - * / % = (no bitwise or conditional operations for now)
NUMBER: '0'-'9'
HASH: '#'
QUOTE: "
SEMICOLON: ;
WHITESPACE: ' ' '\b' '\f' '\t' '\v' '\n'

Whitespace and comments are always legal when we're starting get_statement. Why? Whitespace doesn't matter, just like in C, and comments are essentially statements themselves. So we'll ignore comments and whitespace at the beginning of get_statement():

void get_statement(...arguments...)
{
    skip_whitespace_and_comments(input_file, &line, &col);
    .
    .
    .
}

To avoid a lot of functions with lots of arguments, I just put it all in a struct and pass that around.
typedef struct
{
    FILE *filedesc;
    unsigned int line;
    unsigned int col;
    const char *filename;

    (Other stuff will be added here later.)
} compiler_data_t;

Let's write skip_whitespace_and_comments() since it's relatively straightforward. Because the first thing we look at might not be a comment, we choose to skip whitespace first, then comments.

A relatively simple function that skips whitespace:
/* don't forget this! */
#include <stdbool.h>
#include <ctype.h>

bool skip_whitespace(compiler_data_t *cd)
{
    int c;
    bool read_data = false;

    /* gag if we have null pointers or hit EOF */
    if( !cd || feof(cd->filedesc))
        return false;

    do
    {
        c = fgetc(cd->filedesc);

        if( c == EOF )
            return read_data;

        /* check to see if we hit whitespace */
        if( isspace(c) )
        {
            /* c is a whitespace character. update line and column so that if
            we hit an error later, the line and column that it occurs on will be
            accurate. */

            read_data = true;

            if( c == '\n' )
            {
                ++*line;
                *col = 1;
            }
            else
                ++*col;
        }
        else
        {
            /* not whitespace. back up and return. */
            ungetc(c, cd->filedesc);
            return read_data;
        }
    } while( !feof(cd->filedesc) );
}

This returns true if any whitespace characters were read, false otherwise. Note that this is simple but inefficient; it's better to read an entire line first and then look in there. If you implement your own full compiler, then you might want to change that. Right now I'm focusing on getting the concepts down, so I'll keep it simple rather than efficient.

Anyway, skipping comments is much easier. All we need to do is check if the first character in the stream is '#'. If so, we skip the rest of the line. If not, we put the character back into the stream and do nothing. Because I'm lazy and I think you'd learn better by doing some work yourself, coding this function is left as an exercise for the reader.

Our skip_comments_and_whitespace() function will then use something like this:
void skip_comments_and_whitespace(compiler_data_t *cd)
{
    /* gag on null pointer */
    if( !cd )
        return;

    while( !feof(cd->filedesc) )
    {
        /* we don't care about the return value of skip_whitespace here because
        if we have a comment with no whitespace in front of it, the function
        will exit prematurely. */

        skip_whitespace(cd);

        if( !skip_comments(cd) )
            break;
    }
}
That's it. Pretty simple, no? If we put this at the beginning of get_statement(), then we can be assured that whatever we get next is something other than whitespace or a comment, which is what we want. I think I mentioned that earlier.

What do we do after that, then? Well, since we don't have pointers, functions, directives, nor other fancy stuff like that, pretty much the only thing we can have at the beginning of the statement is an identifier of some sort. So we expect anything within the IDENTIFIER class, and if we hit anything else then we barf.

void get_statement(compiler_data_t *cd)
{
    char identifier[64];
    int rval;

    if( !cd )
        return;

    skip_whitespace_and_comments(cd);

    /* for now all we expect is an identifier. this will change later once we
    add other operators and features. */

    /* yes, I know an unsized scanf is dangerous. do I care right now? no. */
    rval = fscanf(fd, "%[A-Za-z]", identifier);
    if( rval != 1 )
    {
        /* if we get here then that means that we either hit EOF prematurely
        or we have garbage we don't expect. */
        if( rval == EOF )
        {
            printf("ERROR: Unexpected end of file on line %u.\n", cd->line);
            exit(-1);
        }
        else
        {
            /* we didn't hit EOF, we've just got some garbage data. */
            printf("ERROR: On line %u, column %u I expected an identifier and "
                   "got something else. Probably a typo.\n", cd->line, cd->col);
            exit(-1);
        }
    }

    /* ok, we're happy. we got an identifier. since fscanf didn't update our
    column variable we're going to have to do it ourselves. */
    cd->col += strlen(identifier);

    /* now that we've gotten that straightened out, we need to figure out what
    we're going to do with that identifier we just read. */
}

So what do we do with that identifier? We're going to have to somehow remember what variables were declared, what are keywords, and so on. We can do this with the help of a symbol table. It's basically a map of identifiers to structs containing the relevant data, e.g whether it's an integer or a character, function or variable, and so on. For Eusil, our symbol table entries will be rather minimalistic.

typedef enum
{
    IDC_UNKNOWN = -1,           /* only returned from functions on an error. */
    IDC_KEYWORD,                /* "begin", "end", "var", "int", and so on. */
    IDC_VARIABLE,               /* user-declared variables */
    IDC_STRING_CONSTANT         /* for "print" and "prompt" later. */
} idclass_t;

typedef struct
{
    const char *id_name;        /* id name */
    const char *file;           /* file it was declared in */
    unsigned int line;          /* line it was declared on */
    idclass_t id_class;         /* keyword, variable, function, etc. */
    symtab_entry_t *type;       /* pointer to the data type, e.g. int or char */
} symtab_entry_t;

typedef struct
{
    bool sorted;
    unsigned int symbol_count;
    symtab_entry_t *symbol_list;
} symtab_t;

/* this will be used when searching and sorting. */
int cmp_symtab_entries(const void *a, const void *b)
{
    return strcmp(((symtab_entry_t *)a)->id_name, ((symtab_entry_t *)b)->id_name);
}
(We'll need to add a symtab_t variable to our compiler_data_t struct.)

When the compiler starts up, we'll preload it with the keywords in our language so the user can't declare a variable called "int." Otherwise the compiler won't be able to tell the difference between the variable and the datatype, and subsequently will choke at some unknown point.

To look up a symbol in the symbol table, we need to make sure it's sorted first using qsort(), and then we can use bsearch() to find it quickly. Obviously this'll get very time-consuming once we have a large symbol table, so we'll tack declared variables on the end and only sort before a lookup if .sorted is false. This is another one I'm going to make you write. :) All you need to do in your lookup function is:

- Check to see if the table is sorted. If not, sort it with qsort().
- Use bsearch() to find the entry.
- Return the entry.

qsort() and bsearch() are declared in stdlib.h, so make sure you include that.

Once we've got a functioning symbol table, we can look up anything in there and make sure it's of the expected type. Wait...aren't only variables legal on the left-hand-side of an operation? No. Remember, keywords can begin statements:

begin;

var int blah;
print "Hello, World!";
prompt "Number:", blah;
print blah;

end;

That's a perfectly legal program that consisted entirely of keywords at the beginning of a statement, i.e. the LHS. In part 7 we're going to decide exactly what identifiers can come at the beginning of a statement, and then start incorporating our symbol table into get_statement().

Happy coding!
  • 2

sudo rm -rf / && echo $'Sanitize your inputs!'


#2 Hamed

Hamed

    CC Devotee

  • Senior Member
  • PipPipPipPipPipPip
  • 413 posts

Posted 21 September 2010 - 05:44 AM

Can you give us files? I can not understand how to code!
  • 0

#3 dargueta

dargueta

    I chown trolls.

  • Moderator
  • 4854 posts
  • Programming Language:C, Java, C++, PHP, Python, JavaScript, Perl, Assembly, Bash, Others
  • Learning:Objective-C

Posted 21 September 2010 - 07:14 AM

Everything I've written is up there. What part don't you understand? I'll give you some tips and edit the tutorial as necessary.
  • 0

sudo rm -rf / && echo $'Sanitize your inputs!'


#4 Hamed

Hamed

    CC Devotee

  • Senior Member
  • PipPipPipPipPipPip
  • 413 posts

Posted 21 September 2010 - 07:27 AM

I don't know how to start.
and also how to run?
If you can please give me source files.
  • 0

#5 dargueta

dargueta

    I chown trolls.

  • Moderator
  • 4854 posts
  • Programming Language:C, Java, C++, PHP, Python, JavaScript, Perl, Assembly, Bash, Others
  • Learning:Objective-C

Posted 21 September 2010 - 07:54 AM

Starting: Read parts 1-6.
Running: It's not done yet, that's why.
There are no source files yet; I've put everything up there.
  • 0

sudo rm -rf / && echo $'Sanitize your inputs!'


#6 H2010

H2010

    CC Newcomer

  • Just Joined
  • PipPip
  • 21 posts

Posted 11 November 2010 - 02:43 PM

THANK YOU dargueta
  • 0

#7 se7en

se7en

    CC Regular

  • Member
  • PipPipPip
  • 32 posts

Posted 03 March 2011 - 06:34 AM

can you please continue this series?
  • 0

#8 dargueta

dargueta

    I chown trolls.

  • Moderator
  • 4854 posts
  • Programming Language:C, Java, C++, PHP, Python, JavaScript, Perl, Assembly, Bash, Others
  • Learning:Objective-C

Posted 03 March 2011 - 01:22 PM

I'm currently completely swamped with work (notice that I've stopped my Intro to Intel Assembly Language series as well), but will resume once I have my normal life back.
  • 0

sudo rm -rf / && echo $'Sanitize your inputs!'


#9 ZekeDragon

ZekeDragon

    CC Leader

  • Retired Mod
  • PipPipPipPipPipPipPip
  • 1263 posts

Posted 03 March 2011 - 08:11 PM

printf()? fgets()? typedef struct and FILE pointers?! I thought you were writing in C++. :)

I just finished reading all of these back again, great series. I'm actually working on a relevant project, this was very helpful. Can't wait for you to resume! +rep
  • 0
If you enjoy reading this discussion and are thinking about commenting, why not click here to register and start participating in under a minute?

#10 dargueta

dargueta

    I chown trolls.

  • Moderator
  • 4854 posts
  • Programming Language:C, Java, C++, PHP, Python, JavaScript, Perl, Assembly, Bash, Others
  • Learning:Objective-C

Posted 03 March 2011 - 09:12 PM

Fail. :D
  • 0

sudo rm -rf / && echo $'Sanitize your inputs!'


#11 se7en

se7en

    CC Regular

  • Member
  • PipPipPip
  • 32 posts

Posted 04 March 2011 - 12:02 PM

ok dude, when ever you are ready.
  • 0

#12 manishpatel0404

manishpatel0404

    CC Lurker

  • Just Joined
  • Pip
  • 1 posts

Posted 15 August 2015 - 05:04 AM

Hi. Great tutorial series. I have thoroughly enjoyed the first 6 parts.

 

I also noticed that its been 4 years since you wrote part 6. Is part 7 coming any time soon?


  • 0