Jump to content

Compiler Developement - How does it compile to an executable

- - - - -

  • Please log in to reply
17 replies to this topic

#1
liamzebedee

liamzebedee

    Programmer

  • Members
  • PipPipPipPip
  • 129 posts
I've been wondering for quite some time now. The C++ compiler is obviously very complex, I just want to know a less technical explanation. How does it produce the executable? Does it convert the code into assembly, then use a seperate compiler to compile that?
The reason for this is for a little project to amuse myself, I'm going to create an extremely small programming language like BrainFuck. I looked at BrainFuck however, and saw it was only an interpreter, and I thought, how about to understand more about compilers, I create one?
Does anyone have any examples/code snippets that may help?
KTHXBYE

#2
WingedPanther

WingedPanther

    A spammer's worst nightmare

  • Moderators
  • 16,831 posts
  • Location:Upstate, South Carolina
  • Programming Language:C, C++, PL/SQL, Delphi/Object Pascal, Pascal, Transact-SQL, Others
  • Learning:Java, C#, PHP, JavaScript, Lisp, Fortran, Haskell, Others
It really depends on the compiler, but originally it converted the C++ code into C code, converted that to assembly, and converted that into machine language.
Programming is a branch of mathematics.
My CodeCall Blog | My Personal Blog

#3
dargueta

dargueta

    Writes binary right handed and hex left handed

  • Moderators
  • 4,719 posts
  • Programming Language:C, Java, C++, PHP, Python, Perl, Assembly, Bash, Others
  • Learning:JavaScript
Well, I wrote a compiler for Brainf*** that compiles it into NASM-format assembly language, which can then be compiled into a native executable.

#include <cstdlib>
#include <fstream>
#include <iostream>
#include <stack>

// Yes, I know this is bad practice and largely frowned upon. However, this is
// a really small program and I'm feeling really lazy right now. Don't judge. :)
using namespace std;

// Global file objects so we can easily clean them up on an abnormal exit.
static ifstream inpfile;
static ofstream asmfile;

static const char *header_text =    "bits 64\n"
                                    "extern  putchar\n"
                                    "extern  getc\n"
                                    "extern  memset\n"
                                    "\n"
                                    "global  main\n"
                                    "\n"
                                    "section .bss\n"
                                    "memory: resb 65536\n"
                                    "\n"
                                    "section .text\n"
                                    "main:\n"
                                    "    ; initialize memory\n"
                                    "    mov     rdi, memory\n"
                                    "    xor     rsi, rsi\n"
                                    "    mov     rdx, 65536\n"
                                    "    call    memset\n"
                                    "\n"
                                    "    ; set data pointer to 0\n"
                                    "    xor     rdx, rdx\n"
                                    "    ; BEGIN USER CODE\n";

void cleanup()
{
    if( inpfile.is_open() )
        inpfile.close();
        
    if( asmfile.is_open() )
        asmfile.close();
}

int main(int argc, char **argv)
{
    // Register our cleanup function. This reduces exit code duplication for
    // when we're aborting the program for some reason or another.
    atexit( cleanup );

    // Make sure we have enough arguments - we need an input and output file.
    if( argc < 3 )
    {
        cout << "Usage:" << endl
             << argv[0] << " <input-program> <output-asm>" << endl;

        return -1;
    }
    
    // Open input program
    inpfile.open(argv[1]);
    if( inpfile.fail() )
    {
        cerr << "Failed to open input file for reading: " << argv[1] << endl;
        return -1;
    }
    
    // Open assembly output
    asmfile.open(argv[2]);
    if( asmfile.fail() )
    {
        cerr << "Failed to open output file for writing: " << argv[2] << endl;
        return -1;
    }
    
    // Write out assembly header.
    asmfile << "; Generated from " << argv[1] << endl << header_text;
    
    // This is used for the call stack.
    stack<streampos> call_stack;
    
    // Start assembling.
    while( inpfile.good() )
    {
        // Read character from input file
        int c = inpfile.get();
        
        // Check operation and output the necessary instruction.
        switch( c )
        {
            case '>':
                // Increment data pointer. WARNING: This DOESN'T do any boundary
                // checking and will NOT wrap around.
                asmfile << "    inc     rdx" << endl;
                break;
                
            case '<':
                // Decrement data pointer. WARNING: This DOESN'T do any boundary
                // checking and will NOT wrap around.
                asmfile << "    dec     rdx" << endl;
                break;
                
            case '+':
                // Increment the byte at the data pointer.
                asmfile << "    inc     BYTE [memory + rdx]" << endl;
                break;
                
            case '-':
                // Decrement the byte at the data pointer.
                asmfile << "    dec     BYTE [memory + rdx]" << endl;
                break;
                
            case '.':
                // Print the character at the data pointer.
                //
                // First save RDX because putchar is likely to trash it. Move
                // the current character into RDI as per GCC's 64-bit calling
                // convention (that's how they pass the first argument) and then
                // call putchar to print it out. Finally, we restore RDX and
                // continue on our merry way.
                asmfile << "    push    rdx" << endl
                        << "    movzx   rdi, BYTE [memory + rdx]" << endl
                        << "    call    putchar" << endl
                        << "    pop     rdx" << endl;
                break;

            case ',':
                // Read a character from the terminal into the character at the
                // data pointer.
                //
                // First save RDX because getchar is likely to trash it; we call
                // getchar(), which returns a character from the console in AL.
                // Restore RDX, and stick AL at the current memory location.
                asmfile << "    push    rdx" << endl
                        << "    call    getchar" << endl
                        << "    pop     rdx" << endl
                        << "    mov     [memory + rdx], al" << endl;
                break;
                
            case '[':
                // This is the beginning of a while loop that checks the current
                // byte and skips the loop if it's 0. First we need to push the
                // address of the beginning of the loop. For simplicity's sake
                // we use the offset into the source code as our jump label.
                //
                // Push the address onto the call stack.
                call_stack.push(inpfile.tellg());
                
                // Create the jump label at the beginning, before the test code.
                // Once we hit the end of the loop, we'll jump back up here to
                // check the condition.
                // Next we compare the byte currently pointed to by the data
                // pointer to zero; if true, jump to the end of the loop.
                asmfile << "._begin_" << call_stack.top() << ":" << endl
                        << "    cmp     BYTE [memory + rdx], 0" << endl
                        << "    je      ._end_" << call_stack.top() << endl;

                break;

            case ']':
                // This is the end of the while loop. We need to make sure that
                // there's an address to return to first. If the call stack is
                // empty, then there's an error in the program code.
                if( call_stack.empty() )
                {
                    cerr << "ERROR: Call stack empty, missing a [ somewhere."
                         << endl;

                    return -1;
                }
            
                // Hard-coded jump back to the top of the loop, and we have a
                // label at the end of the loop so that the testing condition
                // at the top can jump past the loop code once the condition is
                // false.
                asmfile << "    jmp     ._begin_" << call_stack.top() << endl
                        << "._end_" << call_stack.top() << ":" << endl;

                // Now that we hit the end of the loop, we can remove the jump
                // address from the call stack.
                call_stack.pop();
                break;

            default:
                // Ignore everything that isn't one of the operators.
                continue;
        }
    }
    
    if( !call_stack.empty() )
    {
        cerr << "ERROR: Call stack not empty, missing " << call_stack.size()
             << " ]'es." << endl;
    }
    
    asmfile << "    ; END USER CODE" << endl
            << "    ret";
    
    return 0;
}

Note that I use 64-bit GCC calling conventions because I use the C/C++ standard library for console I/O. If you choose to use a different library, you're going to have to change those yourself.

Obviously, true compilers are far more complicated than this (in fact this is closer to an assembler) but some of the basic concepts are there. Compilers do translate source code into some assembly code--often with other intermediate languages between the two--then call an assembler to translate the assembly code into machine-executable code.

Edited by dargueta, 08 November 2011 - 12:59 PM.
Fixed bug

sudo rm -rf /

#4
liamzebedee

liamzebedee

    Programmer

  • Members
  • PipPipPipPip
  • 129 posts
Wow thats really nice. Could someone provide some source for an assembler? I'm wondering whether to implement my own assembler, or run an external one such as NASM.

#5
dargueta

dargueta

    Writes binary right handed and hex left handed

  • Moderators
  • 4,719 posts
  • Programming Language:C, Java, C++, PHP, Python, Perl, Assembly, Bash, Others
  • Learning:JavaScript
Assemblers are not trivial to write, especially for modern architectures like Intel and MIPS. Perhaps something like the 4004 might be a better place to start from. This is one of my favorite topics, so if you have any more in-depth questions or need help with writing it, just let me know.

Here's a link to the datasheet that has the instruction set and machine code.

Edited by dargueta, 08 November 2011 - 04:11 PM.
Fixed broken link

sudo rm -rf /

#6
liamzebedee

liamzebedee

    Programmer

  • Members
  • PipPipPipPip
  • 129 posts
When assembly is compiled, what exactly is compiled in the executable? Are they assembly instructions for that specific processor? I'm talking about when compiling to the PE format. EDIT: To be more specific, I am referring to the .text section of the PE file executable format, and what data it contains.

Another Question...
I know windows has a library called msvcp100.dll, which is the Microsoft C Runtime Library. Does this has anything to do with executables. What does it do?

#7
RhetoricalRuvim

RhetoricalRuvim

    JavaScript Programmer

  • Members
  • PipPipPipPipPipPipPipPip
  • 1,254 posts
  • Location:C:\Countries\US

dargueta said:

Well, I wrote a compiler for Brainf*** that compiles it into NASM-format assembly language, ...

Thanks for not using the f word.

* * *

I wonder if there could be an assembler that assembles Intel code for another type of processor; probably twice as not small as just an ordinary assembler.

#8
dargueta

dargueta

    Writes binary right handed and hex left handed

  • Moderators
  • 4,719 posts
  • Programming Language:C, Java, C++, PHP, Python, Perl, Assembly, Bash, Others
  • Learning:JavaScript

liamzebedee said:

When assembly is compiled, what exactly is compiled in the executable? Are they assembly instructions for that specific processor? I'm talking about when compiling to the PE format. EDIT: To be more specific, I am referring to the .text section of the PE file executable format, and what data it contains.
For the .text section, yes, it's machine code directly executable by the processor. Note: assembly language is the human-readable text with instruction names like mov, cmpxchg, and so on. Machine code is raw binary data.

liamzebedee said:

Another Question...
I know windows has a library called msvcp100.dll, which is the Microsoft C Runtime Library. Does this has anything to do with executables. What does it do?
Yes. When you write a program in C, it doesn't compile with all of the functions it needs because:
1) All Windows machines have them in common
2) Packaging those functions with every executable would make it huge.

A DLL is a Dynamically Linked Library, which means that those extra functions needed are loaded at runtime (hence the name). This makes common functions sharable across multiple programs.

* * * * *


@RhetoricalRuvim: I'm sure there is, but why would you compile assembly code from one architecture into another's machine code? Wouldn't it be easier to create an emulator?
sudo rm -rf /

#9
RhetoricalRuvim

RhetoricalRuvim

    JavaScript Programmer

  • Members
  • PipPipPipPipPipPipPipPip
  • 1,254 posts
  • Location:C:\Countries\US

dargueta said:

Note: assembly language is the human-readable text with instruction names like mov, cmpxchg, and so on. Machine code is raw binary data.

There's a cmpxchg instruction? I thought cmp and xchg were separate instructions.

dargueta said:

@RhetoricalRuvim: I'm sure there is, but why would you compile assembly code from one architecture into another's machine code? Wouldn't it be easier to create an emulator?

But it might be faster that way, depending on how much optimization is put into that, I think. I was thinking, why would browsers interpret JavaScript? Why not first compile it and then run it, to make things faster?

#10
Alexander

Alexander

    It's Science!

  • Moderators
  • 4,118 posts
  • Location:Vancouver, Eh! Cleverness: 200

RhetoricalRuvim said:

There's a cmpxchg instruction? I thought cmp and xchg were separate instructions.

This allows for an atomic operation (see Linearizability). If the compare fails (a thread modifies a value prior to swapping) the operation will fail.

Quote

I was thinking, why would browsers interpret JavaScript? Why not first compile it and then run it, to make things faster?
This is no longer true, V8 (Google), SpiderMonkey/Rhino (Mozilla), Chakra (IE9) all support compilation of Javascript, often with JIT.
Be sure to read the updated FAQ! || Health is achieved through the same 10,000 steps.
If a suggested code/method fails, informing us is less important than telling us why or what errors occurred.

#11
dargueta

dargueta

    Writes binary right handed and hex left handed

  • Moderators
  • 4,719 posts
  • Programming Language:C, Java, C++, PHP, Python, Perl, Assembly, Bash, Others
  • Learning:JavaScript
EDIT: Oops...I took too long to respond and Alexander got there before me. :D

RhetoricalRuvim said:

There's a cmpxchg instruction? I thought cmp and xchg were separate instructions.
Yep.
CMPXCHG - Compare and Exchange



RhetoricalRuvim said:

But it might be faster that way, depending on how much optimization is put into that, I think. I was thinking, why would browsers interpret JavaScript? Why not first compile it and then run it, to make things faster?
Who's to say they don't? Perl is a scripting language, but interpreters typically compile it at runtime into an internal binary format and then run that. Jumping around in the source code, looking for functions, etc. would waste a lot of time with needless I/O.

Check out JIT (Just-In-Time compiling). I think that might be what you're talking about.

Edited by dargueta, 08 November 2011 - 11:25 PM.

sudo rm -rf /

#12
liamzebedee

liamzebedee

    Programmer

  • Members
  • PipPipPipPip
  • 129 posts
I've decided to try creating an executable file (PE-32 Format). I created a data struct in C++. Can anyone evaluate if this is correct? I'm having trouble defining the relocation struct




1 user(s) are reading this topic

0 members, 1 guests, 0 anonymous users