I am going to talk about the scanner that I have written for this project.
I decided to write my own scanner instead of using a tool such as GOLD Parser, because I want to learn more about how a scanner works. I also feel that using a tool would be cheating. The architecture of the scanner is simple to understand and implement. I will explain more as we walk through its parts.
As I mentioned in the last blog post, everything is object-oriented. The code is based on an article that was published in MSDN Magazine in February this year. You can read the whole article at http://msdn.microsoft.com/en-us/magazine/cc136756.aspx. Its source code is also available for download and study.
When scanning, we must assume that the parts of the code, which we call tokens, follow a specified set of rules depending on their type. For example, an integer literal may contain only digits. Real literals, on the other hand, may also contain a decimal point, ‘.’. Identifiers are a bit trickier to scan: an identifier can start with ‘@’, ‘_’ or a letter, and the rest of it can contain letters, digits and underscores. Rules like these can be written as regular expressions, and every regular expression corresponds to a so-called Finite State Automaton (or Finite State Machine, FSM).
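To make the rules above concrete, here is a minimal sketch that writes them as regular expressions in C#. The class name TokenPatterns, the method Classify and the exact patterns are my own assumptions for illustration; they are not taken from the scanner or from the MSDN article, and the real literal rule in particular is only one possible reading of "can contain a decimal point".

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the scanner described in this post.
public static class TokenPatterns
{
    // \A and \z anchor the pattern so that the whole string must match.
    // Integer literal: one or more digits.
    public static readonly Regex Integer = new Regex(@"\A[0-9]+\z");

    // Real literal (assumed form): digits, a decimal point, digits.
    public static readonly Regex Real = new Regex(@"\A[0-9]+\.[0-9]+\z");

    // Identifier: starts with '@', '_' or a letter, followed by
    // letters, decimal digits or underscores.
    public static readonly Regex Identifier =
        new Regex(@"\A[@_\p{L}][_\p{L}\p{Nd}]*\z");

    // Returns the name of the first rule that matches the whole text.
    public static string Classify(string text)
    {
        if (Integer.IsMatch(text)) return "Integer";
        if (Real.IsMatch(text)) return "Real";
        if (Identifier.IsMatch(text)) return "Identifier";
        return "Unknown";
    }
}
```

With these assumed patterns, TokenPatterns.Classify("42") returns "Integer", "3.14" gives "Real", "@name" gives "Identifier", and "9lives" gives "Unknown" because an identifier may not start with a digit.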
A hand-written scanner that follows these rules can be constructed this way in C#:
char ch; // Lookahead character
while (input.Peek() > -1)
{
    // Peek at the next character without removing it from the stream
    ch = (char)input.Peek();
    if (ch == ' ')
    {
        // Whitespace: consume it and ignore it
        input.Read();
        this.Ch++;
    }
    /* Identifier and Keyword */
    else if (char.IsLetter(ch) || ch == '@' || ch == '_')
    {
        input.Read(); // Pop the first character from the input stream
        tokenCharLoc = this.Ch++; // Remember the column where the token starts

        StringBuilder sb = new StringBuilder();
        sb.Append(ch);

        // Keep reading for as long as the next character can be part of an identifier
        ch = (char)input.Peek();
        while (char.IsLetterOrDigit(ch) || ch == '_')
        {
            sb.Append(ch);
            input.Read();
            this.Ch++;
            ch = (char)input.Peek();
        }

        // Decide whether the token is an Identifier or a Keyword
        Token.TokenType type = Token.TokenType.Identifier;
        foreach (var str in Keywords)
        {
            if (string.Compare(str, sb.ToString(), !CaseSensitive) == 0)
            {
                type = Token.TokenType.Keyword;
                break;
            }
        }

        // Create the Token and record where it was found
        Token token = new Token(sb.ToString(), type)
        {
            Ch = tokenCharLoc,
            Ln = this.Ln
        };

        // Add the token to the output
        result.Add(token);
    }
    // ... the branches for the other token types are not shown in this excerpt ...
}
The code above is an excerpt from the Scan method of the Scanner class. The scanner walks through the stream one character at a time and uses each character to decide what kind of token is starting and whether the character is accepted. In my scanner a token ends before whitespace (which the compiler ignores). Once a token has been recognized, it is stored in a Token object that holds the token's text as a string and an enum value that tells its type, e.g. identifier. The Token is then added to the collection that holds the result of the scan.
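For readers who want to picture the Token object, here is a rough sketch of what such a class could look like. Only the names Token, TokenType, Identifier, Keyword, Ch and Ln appear in the excerpt; the property names Value and Type, and the idea that nothing else is stored, are assumptions on my part.

```csharp
using System;

// A guess at the shape of the Token class, based only on how the excerpt uses it.
public class Token
{
    // The excerpt uses Identifier and Keyword; the real enum has more members.
    public enum TokenType { Identifier, Keyword }

    public string Value { get; private set; }   // The text of the token (assumed name)
    public TokenType Type { get; private set; } // The kind of token (assumed name)
    public int Ch { get; set; }                 // Column where the token starts
    public int Ln { get; set; }                 // Line where the token was found

    public Token(string value, TokenType type)
    {
        Value = value;
        Type = type;
    }
}
```

With a class like this, the object initializer from the excerpt, new Token(text, type) { Ch = ..., Ln = ... }, compiles as written.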
Note that identifier and keyword tokens are scanned by the same block of code. At the end of this block, the scanned text is always compared with each string in the Keywords array that you have specified, ignoring case unless CaseSensitive is set. If it equals one of them, the token is a keyword; otherwise it is an identifier.
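If the list of keywords grows, the linear search could be replaced by a set lookup. The following is only a sketch of that alternative, not what my scanner does: the class KeywordLookup and its Build method are hypothetical names, and it uses ordinal comparison, which is not exactly the same as the culture-sensitive string.Compare in the excerpt.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical alternative to the foreach loop over Keywords.
public static class KeywordLookup
{
    // Builds a set whose Contains method honours the case-sensitivity setting.
    public static HashSet<string> Build(IEnumerable<string> keywords, bool caseSensitive)
    {
        StringComparer comparer = caseSensitive
            ? StringComparer.Ordinal
            : StringComparer.OrdinalIgnoreCase;
        return new HashSet<string>(keywords, comparer);
    }
}
```

The set would be built once, for example with KeywordLookup.Build(new[] { "if", "else", "while" }, false), and then each scanned identifier is checked with a single Contains call instead of a loop.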
That’s all for now. In the next part I will talk about the Abstract Syntax Tree (AST) and the source language this compiler will compile. I will also make all source code for the parser available soon.