Implementing a lexer

The lexer implements the first phase of compiling a program: it tokenizes the stream of input characters, and the parser then consumes these tokens to construct an AST. The language being tokenized is generally a context-free language. A token is a string of one or more characters that is significant as a group, and the process of forming tokens from an input stream of characters is called tokenization. Certain delimiters, such as whitespace, are used to split the input into such groups; for example, the input x + y * 16 yields the tokens x, +, y, *, and 16. Tools such as lex can automate lexical analysis, but the TOY lexer demonstrated in the following procedure is handwritten in C++.

Getting ready

We must have a basic understanding of the TOY language defined earlier. Create a file named toy.cpp as follows:

$ vim toy.cpp

This file will contain all the lexer, parser, and code generation logic.

How to do it…

While implementing a lexer, token types are defined to categorize the strings in the input stream (similar to the states of an automaton). This can be done using the enumeration (enum) type:

  1. Open the toy.cpp file as follows:
    $ vim toy.cpp
  2. Write the enum in the toy.cpp file as follows:
    enum Token_Type {
      EOF_TOKEN = 0,
      NUMERIC_TOKEN,
      IDENTIFIER_TOKEN,
      PARAN_TOKEN,
      DEF_TOKEN
    };

    The following is the term list for the preceding example:

    • EOF_TOKEN: It denotes the end of the file
    • NUMERIC_TOKEN: The current token is of numeric type
    • IDENTIFIER_TOKEN: The current token is an identifier
    • PARAN_TOKEN: The current token is a parenthesis
    • DEF_TOKEN: The current token is def, which states that whatever follows is a function definition
  3. To hold numeric values, a static variable is defined in the toy.cpp file as follows:
    static int Numeric_Val;
  4. To hold the Identifier string name, a static variable can be defined in the toy.cpp file as follows:
    static std::string Identifier_string;
  5. Now the lexer function can be defined in the toy.cpp file by using library functions such as isspace(), isalpha(), and fgetc(), as shown in the following. Here, file refers to a FILE* handle to the input source, which is assumed to be declared and opened elsewhere in toy.cpp (for example, in main()), and the <cctype>, <cstdio>, <cstdlib>, and <string> headers are assumed to be included at the top of the file:
    static int get_token() {
      static int LastChar = ' ';
      
      while(isspace(LastChar))
        LastChar = fgetc(file);
      
      if(isalpha(LastChar)) {
        Identifier_string = LastChar;
        while(isalnum((LastChar = fgetc(file))))
          Identifier_string += LastChar;
        
        if(Identifier_string == "def")
          return DEF_TOKEN;
        return IDENTIFIER_TOKEN;
      }
      
      if(isdigit(LastChar)) {
        std::string NumStr;
        do {
          NumStr += LastChar;
          LastChar = fgetc(file);
        } while(isdigit(LastChar));
        
        Numeric_Val = strtol(NumStr.c_str(), 0, 10);
        return NUMERIC_TOKEN;
      }
      
      if(LastChar == '#') {
        do LastChar = fgetc(file);
        while(LastChar != EOF && LastChar != '\n' && LastChar != '\r');
        
        if(LastChar != EOF) return get_token();
      }
      
      if(LastChar == EOF) return EOF_TOKEN;
      
      int ThisChar = LastChar;
      LastChar = fgetc(file);
      return ThisChar;
    }
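
Before moving on, the lexer can be exercised in isolation with a small test driver. The following main() is only a sketch and not part of the original recipe: it assumes that a global FILE *file (the handle read by get_token()) is declared above the lexer in toy.cpp, that the <cctype>, <cstdio>, <cstdlib>, and <string> headers are included at the top of the file, and that the input file name is passed as the first command-line argument:

int main(int argc, char *argv[]) {
  if(argc < 2) {
    fprintf(stderr, "usage: %s <input file>\n", argv[0]);
    return 1;
  }
  // Open the input source; 'file' is the global FILE* that get_token() reads from.
  file = fopen(argv[1], "r");
  if(!file) {
    fprintf(stderr, "could not open %s\n", argv[1]);
    return 1;
  }
  
  // Print every token until the end of the input is reached.
  int token;
  while((token = get_token()) != EOF_TOKEN) {
    if(token == DEF_TOKEN)
      printf("DEF_TOKEN\n");
    else if(token == IDENTIFIER_TOKEN)
      printf("IDENTIFIER_TOKEN: %s\n", Identifier_string.c_str());
    else if(token == NUMERIC_TOKEN)
      printf("NUMERIC_TOKEN: %d\n", Numeric_Val);
    else
      printf("character token: '%c'\n", token); // '(', ')', ',', '+', '*', and so on
  }
  fclose(file);
  return 0;
}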

How it works…

The example program in the TOY language defined earlier was as follows:

def foo (x , y)
x + y * 16

The lexer will get the preceding program as input. It will come across the def keyword and determine that whatever follows is a function definition, and hence return the enum value DEF_TOKEN. After this, it will come across the function name and its arguments. Then there is an expression involving two binary operators, two variables, and a numeric constant. How these are stored in data structures is demonstrated in the following recipes.
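
Running the sketch driver shown after step 5 over this program would be expected to print a token stream along the following lines. Note that, as the lexer is written, any character it does not specifically handle, including parentheses (despite the PARAN_TOKEN enumerator), is returned as its raw character value:

DEF_TOKEN
IDENTIFIER_TOKEN: foo
character token: '('
IDENTIFIER_TOKEN: x
character token: ','
IDENTIFIER_TOKEN: y
character token: ')'
IDENTIFIER_TOKEN: x
character token: '+'
IDENTIFIER_TOKEN: y
character token: '*'
NUMERIC_TOKEN: 16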

See also
