The lexer implements part of the first phase of compiling a program: it tokenizes the input stream of the program, and the parser then consumes these tokens to construct an AST. The syntax of individual tokens is typically a regular language, while the grammar the parser handles is generally context-free. A token is a string of one or more characters that is significant as a group, and the process of forming tokens from an input stream of characters is called tokenization. Certain delimiters are used to separate groups of characters into tokens. There are lexer tools, such as LEX, that automate lexical analysis; the TOY lexer demonstrated in the following procedure is instead handwritten in C++.
We must have a basic understanding of the TOY language defined in the recipe. Create a file named toy.cpp as follows:

$ vim toy.cpp

This file will contain all the lexer, parser, and code generation logic.
While implementing a lexer, types of tokens are defined to categorize streams of input strings (similar to the states of an automaton). This can be done using the enumeration (enum) type.
Define an enum in the toy.cpp file as follows:

enum Token_Type {
  EOF_TOKEN = 0,
  NUMERIC_TOKEN,
  IDENTIFIER_TOKEN,
  PARAN_TOKEN,
  DEF_TOKEN
};
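When tracing the lexer's output, it helps to map each token category back to a printable name. The following helper is not part of the recipe, just a hypothetical debugging aid; note that get_token also returns raw character codes (for example, '+' or '(') for anything outside the enum, which this helper reports generically.

```cpp
#include <string>

// Token categories from the recipe.
enum Token_Type { EOF_TOKEN = 0, NUMERIC_TOKEN, IDENTIFIER_TOKEN, PARAN_TOKEN, DEF_TOKEN };

// Hypothetical helper: printable name for a token value returned by the
// lexer. Values outside the enum are single characters passed through by
// the lexer, reported here as "CHAR_TOKEN".
static std::string token_name(int tok) {
  switch (tok) {
    case EOF_TOKEN:        return "EOF_TOKEN";
    case NUMERIC_TOKEN:    return "NUMERIC_TOKEN";
    case IDENTIFIER_TOKEN: return "IDENTIFIER_TOKEN";
    case PARAN_TOKEN:      return "PARAN_TOKEN";
    case DEF_TOKEN:        return "DEF_TOKEN";
    default:               return "CHAR_TOKEN";
  }
}
```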
Following is the term list for the preceding example:

EOF_TOKEN: It states the end of file
NUMERIC_TOKEN: The current token is of numeric type
IDENTIFIER_TOKEN: The current token is an identifier
PARAN_TOKEN: The current token is a parenthesis
DEF_TOKEN: The current token def states that whatever follows is a function definition

To hold numeric values, a static variable is defined in the toy.cpp file as follows:

static int Numeric_Val;
To hold the Identifier string name, a static variable can be defined in the toy.cpp file as follows:

static std::string Identifier_string;
Then, the function for the lexer can be defined using library functions such as isspace(), isalpha(), and fgetc() in the toy.cpp file, as shown in the following (the FILE pointer file is the input file, opened elsewhere in the program):

static int get_token() {
  static int LastChar = ' ';

  // Skip whitespace between tokens.
  while (isspace(LastChar))
    LastChar = fgetc(file);

  // Identifiers and the def keyword: [a-zA-Z][a-zA-Z0-9]*
  if (isalpha(LastChar)) {
    Identifier_string = LastChar;
    while (isalnum((LastChar = fgetc(file))))
      Identifier_string += LastChar;
    if (Identifier_string == "def")
      return DEF_TOKEN;
    return IDENTIFIER_TOKEN;
  }

  // Numeric constants: [0-9]+
  if (isdigit(LastChar)) {
    std::string NumStr;
    do {
      NumStr += LastChar;
      LastChar = fgetc(file);
    } while (isdigit(LastChar));
    Numeric_Val = strtod(NumStr.c_str(), 0);
    return NUMERIC_TOKEN;
  }

  // Comments: skip from '#' to the end of the line.
  if (LastChar == '#') {
    do
      LastChar = fgetc(file);
    while (LastChar != EOF && LastChar != '\n' && LastChar != '\r');
    if (LastChar != EOF)
      return get_token();
  }

  if (LastChar == EOF)
    return EOF_TOKEN;

  // Otherwise, return the character itself (for example, '(' or '+').
  int ThisChar = LastChar;
  LastChar = fgetc(file);
  return ThisChar;
}
The example TOY language defined earlier was as follows:
def foo (x , y) x + y * 16
The lexer will get the preceding program as input. It will come across the def keyword, determine that whatever follows is a function definition, and hence return the enum value DEF_TOKEN. After this, it will come across the function definition and its arguments. Then there is an expression that involves two binary operators, two variables, and a numeric constant. How these are stored in data structures is demonstrated in the following recipes.