Archive for the 'Computer Science' Category

Pointers

Pointers are very common and powerfull parts of programming languages such as C and C++. If you have never encountered pointers before (you are probably from the world of Java or .NET) you may ask “Why?”. Well, if you know an object-oriented language you may notice how you can use the pointers to simulate behaviours similar to reference types. Like reference types pointers “point” to a specific object, but through its address in memory.

In C and C++ you declare and use a pointer this way:


int a = 20;
int *p = &a;

printf("%d", p);

*p = 30; //Manipulating the value (*p)

printf("%d", *p); // Will print 30 to the console.

printf("%d", p); // Will print the address which the pointer points to (p)

I will now explain this fairly simple piece of code. We declare an integer (a) and give it the value 20. Nothing strange so far. Then we declare a pointer to integer variable. int* means pointer to integer. It is being initialized with the address of the integer (a) we declared. The address operator is & (read as address of). Later we starts to manipulate the memory through the pointer. Just notice the difference between *p and p (commented in the code). As it is shown both p and a are referencing the same memory. The power of pointers! But when you are getting more experienced and you are working with more advanced stuff you have to take responsibility. That will I talk about later in this blogpost.

A native string type does not exist in C, but you can simulate one with arrays or pointers. Yes, pointers. Take a look at the following example where a pointer to a character is initated with a string.


char *str;
str = "Hello, World!";

//This is similar to:
char[] strAr = "Hello, Second World!";

As I said before it is almost like OOP with reference types. You can point to primitive types and structures, and even functions. Function pointers are similar to delegates in C#, though delegates are type-safe and function pointers not. These are declared and initiated as in this example:


int (*Operation)(int OpA, int OpB);

static int Add(int OpA, int OpB)
{
      return OpA + OpB;
}

static int Sub(int OpA, int OpB)
{
      return OpA - OpB;
}

int Main(void)
{
      Operation = Add;
      int r1 = (*Operation)(2, 3); //5

      Operation = Sub;
      int r2 = (*Operation)(3, 2); //1

      return 0;
}

With these two types of pointers you can do a lot of cool and advanced stuff. One of the many common uses of type pointers are allocation and reallocation of memory. You do like this:


int *ptr = malloc(sizeof (int));

The argument of malloc is the size of memory you want to allocate.

Because you are allocating the memory the program cannot manage it for you. That is to say that if your program is ending you will have to reallocate the memory you have used. The memory will remain allocated if you do not do so. That makes exception handling very important in your program. The There are certain functions to reallocate or free occupied memory. The most simple are free which takes a pointer as an argument. It will automatically free the memory the pointer is referencing to.


     free(ptr);

Because of the pointers, languages like C and C++ are good for system- and hardware programming. If you for instance want to write a driver for a specific device, that is manipulating the hardware, you use pointers because they are able to manipulate memory and through it the basic parts of the computer. Pointers make C nearly as powerfull as assembler (machine language), as close to the hardware you can get.

This was a short introduction to pointers. There are many of ways of using them. It takes a lot of practice to master them without messing stuff up although they are easy to declare.

I just want to add that C# does have pointers although they are not common to use. They are mostly used in P/Invoke (platform invocation) to unmanaged native code in i.e. the Windows API. These parts are abstracted away by the managed OOP classes in the .NET Framework. You can still do some pretty advanced stuff with pointers but you do not need to, or rather you are encouraged not to use them, because they considered are unsafe. They are thereby against the principles of the CLI (Common Language Infrastructure). Java does currently not have pointers (no instructions for them). Maybe they will add support in the future, but who knows.

Thanks for reading.

Hello World! (Java Bytecode)

class HelloWorldApp
{
  method public static void main (java.lang.String[])
  max_stack 2
  {
    getstatic java.io.PrintStream java.lang.System.out
    ldc "Hello World!"
    invokevirtual void java.io.PrintStream.println(java.lang.String)
    return
  }
}

Common Intermediate Language

Common Intermediate Language, simply known as CIL, IL or MSIL (among the legacy coders). This is not to be confused with C Intermediate Language which is also abbreviated as CIL. CIL is a stack-based high-level assembly language which is used as an intermediate language by virtual machines such as the CLR (Common Language Runtime) and the Mono Runtime. It is because it has a fairly simple syntax that can easily be translated into native machine code. As a part of the Common Language Infrastructure (CLI, a standard developed by Microsoft) it is the most significant piece of the .NET Framework, among others. Compilers which are written for the specific frameworks support the CLI standard and are translated to CIL. This can be done entirely or partially, like C++/CLI (which can mix it with native code). Because of that they can interop with other programs written in other CLI languages through the virtual machine. CIL is cross-platform and can be be executed on different machine architectures without being modificated or recompiled. The only thing that is needed is an implementation of the virtual machine targeting the specific processor and operating system. CIL is also, in many aspects, similar to Java bytecode which is executed by the Java Virtual Machine (JVM).

CIL is stack-based, in difference to many native machine languages that are register-based. This means that everytime you want to manipulate an object (a type or an unit of code) you put it on the stack and call an op code (language instruction) that is performing an operation, i.g. adding two values. Thereby adding the two objects (Int32 or simply 32-bit integers) on the top of the stack and leaving the sum of those. The stack should be empty at the end of runtime. See the example.

IL:

ldc.i4.3 //Loads a constant in the stack
ldc.i4.2
add

Stack:

2 //Int32
3 //Int32

After the last operation (add) this will remain on the stack:

5 //Int32

If you had only one object on the stack and performed the add operation you would get a stack underflow.
The opposite is stack overflow that is caused by objects remaining on the stack in the end of the program. These errors are often caused by bugged compilers.

Some common op codes are:

ldstr "Kazoom" //Puts a string on the stack
ldarg.0 //Loads the method argument at the index 0 and puts it on the stack
ldc.i4.29 //Puts an integer, 29, on the stack
ldloca.s lcal //Loads a local (name lcal) an places it on top of the stack
call //Used when calling a method
newobj //Used when instanciating an object

// Arithmic operations that pops two values from the top of the stack.
add
sub
mul
div
mod

This is the most simple part of the CIL language. Because of the object-oriented nature of CLI you are able to use some OOP approach in your programming. Many of the constructs found in the programming languages at a higher level are represented directly as IL. For instance, a class declaration looks like this:

.class public MyClass
{

}

A class can be declared as static (non-instance class) with the static attribute static.

.class public static MyStaticClass
{

}

Classes contains methods defined this way:

.method Int32 Add(Int32, Int32)
{
      //Load both arguments on the stack and add them, then return the value.
      ldarg.0
      ldarg.1
      add
      ret
}

Likewise to classes, methods can also be marked as static.

This syntax, is like the class syntax, very similar to the C style. The only difference is the IL opcodes instead of ordinary statements. Another thing you must know is that methods always ends with a return statement even though they do not seem to return any value. In reallity theres a value, void. This will the runtime engine (virtual machine) handle for you.

To make a method the entrypoint of the program you add the .entrypoint attribute inside the method body. Notice that you can only have one entrypoint in a program.

Local variables (mostly refered to as locals) are also defined as attributes inside a method.

.locals init (
      [0] string str1
      [1] Int32 int1)

Those locals named by the compilers often have a random name consisting of letters and a number.

Because of performance there is also a .maxstack attribute which sets the maximum numbers of items on the stack. This tells the runtime engine that this method needs a stack of this size. Now is the virtual machine able to manage the amount of memory.

There are a lot of other things as well. I will not dig deeper into it because it is to much at a time. If you are familiar with a high-level language or any other stack-based and object-oriented intermediate language, like the Parrot Intermediate Language you, should understand the IL too.

Finally, this is the Hello World program in IL style. You can try to assemble it with ILASM (IL Assembler) ported with .NET Framework or Mono.

.method public static void Main() cil managed
{    
      .entrypoint    
      .maxstack 8    
      ldstr "Hello world!."    
      call void [mscorlib]System.Console::WriteLine(string)    
      ret
}

Building a compiler for the .NET Framework - Report 2

I have written a lot of code since my last post (Report 1). I have written a parser (an object-oriented) that is working, everything but expression parsing. Not yet! Then I started on the code generator which ended up quite well. The only thing I really hate is that I can’t find a way to generate two types with fields of the other type. I have to find a solution soon… and I will. I’m sure. I’m also working on a new symbol table class that I want to use during code generation.

Please check out my compiler (Pascal Compiler for .NET Framework v.0.0.1).
Requires Windows XP/Vista and .NET Framework 3.5.

Download it here!

 

Notes:

The compiler…

  • supports definitions of types, functions, procedure and variables.
  • supports Primitive types such as Integer, Double, Boolean, Char and String.
  • has these Complex types: Record, Enum and Dynamic Array.
  • can only parse the expression WriteLn([stringliteral]);
  • only supports single sourcefiles.

 

Building a compiler for the .NET Framework - Report 1: Hello World

I have uploaded this screenshot of the IDE. Its the Parser. The most interesting part is the Locals view, where you can se a variable called output. It is the root of the AST (Abstract Syntax Tree) I made for my Pascal parser. If you observe it you may understand how it works and what its current contents are. It is a Hello World-program.

 http://robertsundstrom.files.wordpress.com/2008/06/pascalc-debugging-microsoft-visual-studio.png

Building a compiler for the .NET Framework - II : The scanner

I am going to talk about about my scanner that I have made for this project.

I decided to write my own scanner instead of using a tool such as GOLD Parser. This is because I want to learn more about how a scanner works. I also think that it is cheating using a tool. The architecture of the scanner is simple to understand and implement. I will explain more as we walk through its parts.

As I mentioned in the last blogpost, everything is object oriented. The code is based on an article that was published in MSDN Magazine in February this year. You can read the whole article at http://msdn.microsoft.com/en-us/magazine/cc136756.aspx. The source code is also available for download and study.

When we are scanning we must assume that the parts of the code, what we call tokens, follow a specified set of rules depending on what type they are. For example an integral literal must only contain integers. Real literals on the other hand can contain punctuation, ‘.’. Identifiers are a bit trickier to scan. An identifier can start with a ‘@’ or letter and later it can contain both letters and integers. All this can be achieved by using regular expressions to create a so called Finite State Automaton (or Finite State Machine, DFM).

You can construct it this way in C#.


 char ch; //Lookahead char-variable
            ch = (char)input.Peek(); //Peek the first char from stream without removing it

            while (input.Peek() > -1)
            {
                //Read lookahead
                ch = (char)input.Peek();

                if (ch == ' ' ;) //Whitespace
                {
                    //Just ignore whitespaces
                    input.Read();
                    this.Ch++;
                }
                /* Identifier and Keyword */
                else if (char.IsLetter(ch) || ch == '@' || ch == '_' ;) // Letter, '@' or '_'
                {
                    input.Read(); //Pop character from input stream
                    tokenCharLoc = this.Ch++;

                    //Initialize StringBuilder
                    StringBuilder sb = new StringBuilder();
                    sb.Append(ch);

                    //Read characters until next whitespace
                    ch = (char)input.Peek();
                    while (char.IsLetterOrDigit(ch) || ch == '_' ;)
                    {
                        if (!char.IsLetterOrDigit(ch))
                            //DEBUG: Console.WriteLine(char.GetNumericValue(ch));
                            ThrowUnexpectedCharacterException(ch);

                        sb.Append(ch);
                        input.Read();
                        this.Ch++;

                        ch = (char)input.Peek();
                    }

                    //Create a Token-object for token: Check if it is an Identifier or Keyword

                    Token.TokenType type = Token.TokenType.Identifier;

                    foreach(var str in Keywords)
                        if (string.Compare(str, sb.ToString(), !CaseSensitive) == 0)
                        {
                            type = Token.TokenType.Keyword;
                            break;
                        }

                    //Create a Token
                    Token token = new Token(sb.ToString(), type)
                    {
                        Ch = tokenCharLoc,
                        Ln = this.Ln
                    };

                    //Add token to output
                    result.Add(token);
                }

The code above demonstrates the Scan method of the Scanner class. In this excerpt the scanner is chunking the stream by checking the characters one by one and thus decide what kind of token it is and if the current character is accepted. In my scanner a token ends before whitespaces (that are ignored by the compiler). After a token is discovered it is added to a Token-object that contains a string containing the value, and an enum that tells the type of the token, e.g. identifier. The Token is then added to a collection containing the result of the scan.

I want to add that the identifier and keyword tokens are scanned by the same block of code. In the end of this block it always compares the identifier token to check if it is equal to any of the strings in an array that you have specified. If it equals then it is a keyword.

That’s all for now. In the next part I will talk about the Abstract Syntax Tree (AST) and the source language this compiler will compile. I will also make all source code for the parser available soon.

Continue reading ‘Building a compiler for the .NET Framework - II : The scanner’

Building a compiler for the .NET Framework - I : Introduction

I have been working on a compiler for the .NET Framework, i.e. a compiler that produces IL instead of ordinary machine code. I want to share my plans and thoughts about this little compiler I’m working on.
In this blogpost I’ll concentrate on explaining my project.

First I just want to explain some words that I will or may use in this series of articles.

  • assembly language - machine specific language dependend of the instruction set of the system.
  • programming language - a language constructed to make software development easier by introducing certain abstractions that makes it easier for humans to write and understand.
  • compiler - a program that compiles (translates) codes written in a high level-language into object code (assembly language). A compiler for a assembly language is usualy called assembler.
  • IL (a.k.a CIL and MSIL) is a high-level assembly language created by Microsoft that is to be executed on a virtual machine made according the Common Language Infrastructure standard. Can be executed by virtual machines such as the CLR (Windows) or the Mono Runtime (Linux). This means that IL is not bound to a specific platform, in contrast to native assembly languages. IL is Microsoft’s response to Sun’s Java Bytecode.
  • virtual machine - A program working on top of a system that simulates another system. In this case the CLR or the Java Runtime that is converting bytecode into native code (object code) and executes it. A virtual machine can therefore test the code and then execute it if it does not contain any errors or bad code.
  • front-end and back-end are terms used in compiler construction. Front-end usaully refers to the scanner and parser. That is why scanner and parser sometimes is incorrectly known as the parser. The Back-end contains the code generator and preprocessor (if any). The idea of these two is that you can switch parts, for example writing a new code generator for another system, without changeing the rest (the front-end).
  • lexical scanner, lexer or scanner is a program that creates tokens of the input code by following a set of rules. A scanner is usually constructed to work as a Defininite State Machine and use regualar expressions. We have for example the following line of code (written in C-style): int myNumber = 20 + 3; “int” would be a keyword, “myNumber” a identifier, “=” an operator, “20″ a number or integral and so on. Scanners are often generated by a scanner generator because they are performing better than those written by programmers. 
  • parser - A program that is considered the heart of the compiler because it is creating a presentation of the code that is easier to use when generating machine code. The parser is following rules that decides how the code is going to be interpreted and then creates an Abstract Syntax Tree (AST).
    The AST is then easily converted to machine code by the code generator.

I am using an object-oriented approach when constructing my compiler. That means that there will be objects that will symbolize for example tokens, the scanner, the parser and the code generator. Why? Because it is a lot easier to think that way. I cannot manage to do this the procedural way like legendary constructors like Niklaus Wirth did. Another reason is that I am a OOP programmer and I’m attached to C# and the .NET Framework.

The purpose of this project is only educational. I want to learn how a compiler works and especially the parts of it. I am not trying to build a “perfect” compiler that supports all the CLR features. I will not concentrate on the performance either because that is not important. I want the compiler to work before I think of that, and of course it is of beyond the scope of this project.

I want the back-end and front-end to be separate and independent in that way that you can reuse its components and modify them if necessary. But I will firstly concentrate on getting everything working and then try to make these changes.

I have defined a small OO language called Simple Programming Language that I will use as the source language compiled by the compiler that I will write. More details on it in one of the following blogposts.

I have decided to implement Pascal instead of my own SPL. It’s because I have had som issues with it. It is not a good thing to start implementing a compiler for a full OOP langauge as ones first real project.

Anyway, stay tuned!

New version: PictureScreensaver 1.2

The new version of PictureScreensaver is now released. I have released the source code too.

Check the program page.

BNF and EBNF

When you are describing a context-free grammar such as the syntax of a formal language e.g. a programming language when you are constructing a compiler, you most commonly use the so called Backus-Naur form (usually just called BNF).
BNF was invented by John Backus, who developed FORTRAN, when he needed to express the grammar of ALGOL.
It was then only recognized as Backus Normal Form but after the Danish programmer Peter Naur had made some changes to it the meaning of N was changed to Naur. BNF consists of terminals, nonterminals and some metasymbols which you write expressions, called production, that maps your grammar. Terminals are literals (like ‘a’, ‘b’, ‘c’, ‘1′, ‘3′, ‘4′, ‘!’) and nonterminals are composed by terminals and other nonterminals.
Literal = '1' | '2' | 'a' | '-'.

 

Nonterminal = {Literal}.

 

In the productions above the elements 1, 2, a, - are terminals and Literal, Nonterminal nonterminals.
The conventions used when naming terminals are that they are capitalized and nonterminals not.
With BNF you can do recursions and other things by using different metasymbols. Following are the most common ones from both standard BNF and the Extended BNF.
{}  Repeat 0 or more times
 ()  Grouping
[]  Optional element
 |   Or
 <>  Encloses nonterminals (used in EBNF)   

Here is an example using all elements and some others too.

(* a simple program in EBNF − Wikipedia *)
program = 'PROGRAM' , white space , identifier , white space ,
           'BEGIN' , white space ,
           { assignment , ";" , white space } ,
           'END.' ;
identifier = alphabetic character , { alphabetic character | digit } ;
number = [ "-" ] , digit , { digit } ;
string = '"' , { all characters − '"' } , '"' ;
assignment = identifier , ":=" , ( number | identifier | string ) ;
alphabetic character = "A" | "B" | "C" | "D" | "E" | "F" | "G"
                     | "H" | "I" | "J" | "K" | "L" | "M" | "N"
                     | "O" | "P" | "Q" | "R" | "S" | "T" | "U"
                     | "V" | "W" | "X" | "Y" | "Z" ;
digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
white space = ? white space characters ? ;
all characters = ? all visible characters ? ;

 

With this grammar you can parse text that is syntactically correct with the grammar. For example this code:

PROGRAM DEMO1
BEGIN
  A0:=3;
  B:=45;
  H:=-100023;
  C:=A;
  D123:=B34A;
  BABOON:=GIRAFFE;
  TEXT:="Hello world!";
END.

There are some good engines available on the market, both commercial and free. I have tried GOLD Parser Generator, a free grammar parser engine where you can build grammars and later port them so that you can use them in your own programming projects.

Why am I interested in this? Lexical parsers are a very interesting and also important piece in compilators. Almost every parser uses BNF. I am planning to construct my own compiler in the future so this will come in handy.

Grid<int> Class

This has been my project for the last two weeks. Here’s the .NET library bundled with documentation. Feel free to use it in your own projects.


Grid grid = new Grid(10, 10);
grid[3, 9] = 25;

Download: http://www.mediafire.com/?jg1ox2xbngw

Next Page »


Pages

Categories