Archive for May, 2008

Building a compiler for the .NET Framework - II : The scanner

I am going to talk about about my scanner that I have made for this project.

I decided to write my own scanner instead of using a tool such as GOLD Parser. This is because I want to learn more about how a scanner works. I also think that it is cheating using a tool. The architecture of the scanner is simple to understand and implement. I will explain more as we walk through its parts.

As I mentioned in the last blogpost, everything is object oriented. The code is based on an article that was published in MSDN Magazine in February this year. You can read the whole article at http://msdn.microsoft.com/en-us/magazine/cc136756.aspx. The source code is also available for download and study.

When we are scanning we must assume that the parts of the code, what we call tokens, follow a specified set of rules depending on what type they are. For example an integral literal must only contain integers. Real literals on the other hand can contain punctuation, ‘.’. Identifiers are a bit trickier to scan. An identifier can start with a ‘@’ or letter and later it can contain both letters and integers. All this can be achieved by using regular expressions to create a so called Finite State Automaton (or Finite State Machine, DFM).

You can construct it this way in C#.


 char ch; //Lookahead char-variable
            ch = (char)input.Peek(); //Peek the first char from stream without removing it

            while (input.Peek() > -1)
            {
                //Read lookahead
                ch = (char)input.Peek();

                if (ch == ' ' ;) //Whitespace
                {
                    //Just ignore whitespaces
                    input.Read();
                    this.Ch++;
                }
                /* Identifier and Keyword */
                else if (char.IsLetter(ch) || ch == '@' || ch == '_' ;) // Letter, '@' or '_'
                {
                    input.Read(); //Pop character from input stream
                    tokenCharLoc = this.Ch++;

                    //Initialize StringBuilder
                    StringBuilder sb = new StringBuilder();
                    sb.Append(ch);

                    //Read characters until next whitespace
                    ch = (char)input.Peek();
                    while (char.IsLetterOrDigit(ch) || ch == '_' ;)
                    {
                        if (!char.IsLetterOrDigit(ch))
                            //DEBUG: Console.WriteLine(char.GetNumericValue(ch));
                            ThrowUnexpectedCharacterException(ch);

                        sb.Append(ch);
                        input.Read();
                        this.Ch++;

                        ch = (char)input.Peek();
                    }

                    //Create a Token-object for token: Check if it is an Identifier or Keyword

                    Token.TokenType type = Token.TokenType.Identifier;

                    foreach(var str in Keywords)
                        if (string.Compare(str, sb.ToString(), !CaseSensitive) == 0)
                        {
                            type = Token.TokenType.Keyword;
                            break;
                        }

                    //Create a Token
                    Token token = new Token(sb.ToString(), type)
                    {
                        Ch = tokenCharLoc,
                        Ln = this.Ln
                    };

                    //Add token to output
                    result.Add(token);
                }

The code above demonstrates the Scan method of the Scanner class. In this excerpt the scanner is chunking the stream by checking the characters one by one and thus decide what kind of token it is and if the current character is accepted. In my scanner a token ends before whitespaces (that are ignored by the compiler). After a token is discovered it is added to a Token-object that contains a string containing the value, and an enum that tells the type of the token, e.g. identifier. The Token is then added to a collection containing the result of the scan.

I want to add that the identifier and keyword tokens are scanned by the same block of code. In the end of this block it always compares the identifier token to check if it is equal to any of the strings in an array that you have specified. If it equals then it is a keyword.

That’s all for now. In the next part I will talk about the Abstract Syntax Tree (AST) and the source language this compiler will compile. I will also make all source code for the parser available soon.

Continue reading ‘Building a compiler for the .NET Framework - II : The scanner’

Building a compiler for the .NET Framework - I : Introduction

I have been working on a compiler for the .NET Framework, i.e. a compiler that produces IL instead of ordinary machine code. I want to share my plans and thoughts about this little compiler I’m working on.
In this blogpost I’ll concentrate on explaining my project.

First I just want to explain some words that I will or may use in this series of articles.

  • assembly language - machine specific language dependend of the instruction set of the system.
  • programming language - a language constructed to make software development easier by introducing certain abstractions that makes it easier for humans to write and understand.
  • compiler - a program that compiles (translates) codes written in a high level-language into object code (assembly language). A compiler for a assembly language is usualy called assembler.
  • IL (a.k.a CIL and MSIL) is a high-level assembly language created by Microsoft that is to be executed on a virtual machine made according the Common Language Infrastructure standard. Can be executed by virtual machines such as the CLR (Windows) or the Mono Runtime (Linux). This means that IL is not bound to a specific platform, in contrast to native assembly languages. IL is Microsoft’s response to Sun’s Java Bytecode.
  • virtual machine - A program working on top of a system that simulates another system. In this case the CLR or the Java Runtime that is converting bytecode into native code (object code) and executes it. A virtual machine can therefore test the code and then execute it if it does not contain any errors or bad code.
  • front-end and back-end are terms used in compiler construction. Front-end usaully refers to the scanner and parser. That is why scanner and parser sometimes is incorrectly known as the parser. The Back-end contains the code generator and preprocessor (if any). The idea of these two is that you can switch parts, for example writing a new code generator for another system, without changeing the rest (the front-end).
  • lexical scanner, lexer or scanner is a program that creates tokens of the input code by following a set of rules. A scanner is usually constructed to work as a Defininite State Machine and use regualar expressions. We have for example the following line of code (written in C-style): int myNumber = 20 + 3; “int” would be a keyword, “myNumber” a identifier, “=” an operator, “20″ a number or integral and so on. Scanners are often generated by a scanner generator because they are performing better than those written by programmers. 
  • parser - A program that is considered the heart of the compiler because it is creating a presentation of the code that is easier to use when generating machine code. The parser is following rules that decides how the code is going to be interpreted and then creates an Abstract Syntax Tree (AST).
    The AST is then easily converted to machine code by the code generator.

I am using an object-oriented approach when constructing my compiler. That means that there will be objects that will symbolize for example tokens, the scanner, the parser and the code generator. Why? Because it is a lot easier to think that way. I cannot manage to do this the procedural way like legendary constructors like Niklaus Wirth did. Another reason is that I am a OOP programmer and I’m attached to C# and the .NET Framework.

The purpose of this project is only educational. I want to learn how a compiler works and especially the parts of it. I am not trying to build a “perfect” compiler that supports all the CLR features. I will not concentrate on the performance either because that is not important. I want the compiler to work before I think of that, and of course it is of beyond the scope of this project.

I want the back-end and front-end to be separate and independent in that way that you can reuse its components and modify them if necessary. But I will firstly concentrate on getting everything working and then try to make these changes.

I have defined a small OO language called Simple Programming Language that I will use as the source language compiled by the compiler that I will write. More details on it in one of the following blogposts.

I have decided to implement Pascal instead of my own SPL. It’s because I have had som issues with it. It is not a good thing to start implementing a compiler for a full OOP langauge as ones first real project.

Anyway, stay tuned!

Nickelback - Rockstar

Nickelback - Rockstar

Orbis Latinus

This is one of the best resources when learning a romance language. Here you’ll find everything from grammars to the history of the languages. A good site for lang geeks like me.

http://www.orbilat.com/Languages/

Nek - Para ti sería

Probably one of the best singers in Europe, perhaps the world, according to me. This song is the Spanish language-version of “Lascia che io sia”, sung by the same artist. Who am I talking about then? I’m talking about the Italian singer Nek (nome in arte di Filippo Neviani). There are a lot of videos on YouTube, both his Italian and Spanish songs. I strongly recommend this artist.  Well here’s “Para ti sería”, performed at Operación Triunfo (Spanish equivalent to Pop Idol) in Spain.

Para ti sería, I’m for you

http://www.nekweb.com/

El síndrome del acento extranjero

This is a funny clip from a Spanish televison program called “El Intermedio”.

Language resources

Here’s a good link for those who like to learn a language the fun way. The University of Texas has started some really good projects for learners of Italian, Spanish, French, Portuguese and German. At their webpages you can listen to interesting podcasts about different topics. As you can see, you find a lot of things on the universities’ webpages, especially the American. Hope you will enjoy this little tip. Cheers!

http://tltc.la.utexas.edu/tltc/projects/index.html

 

New version: PictureScreensaver 1.2

The new version of PictureScreensaver is now released. I have released the source code too.

Check the program page.


Pages

Categories