3.4 Error Handling in a Scanner

As a rule, the Scanner does not print anything itself; all errors are passed on to the parser. Errors are communicated to the parser by returning a special error token called ERROR. Do not confuse this with the token called error (in lowercase), which is used by the parser and should be ignored here. There are several requirements for reporting and recovering from lexical errors:

  • When an invalid character (one that cannot begin any token) is encountered, a string containing just that character is returned as the error string. Resume scanning at the following character.
  • If a string contains an unescaped newline, that error is reported as “Unterminated string constant” and scanning is resumed at the beginning of the next line; we assume that the programmer simply forgot the closing quote.
  • When a string is too long, report the error as “String constant too long” in the error string of the ERROR token. If the string contains invalid characters (i.e. the null character), report this as “String contains null character”. In either case, scanning is resumed after the end of the string. The end of the string is defined as either
    1. the beginning of the next line, if an unescaped newline occurs after these errors are encountered, or
    2. the position just after the closing quote (") otherwise.
  • If a comment remains open when EOF is encountered, report this error with the message “EOF in comment”. Do not tokenize the comment's contents simply because the terminator is missing. Similarly for strings: if EOF is encountered before the closing quote, report the error as “EOF in string constant”.
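The string-related rules above can be sketched as a small hand-written C routine. This is only an illustration under stated assumptions, not the Scanner developed in this chapter: the names StrStatus, StrScan, str_error_message and scan_string are hypothetical, the maximum length is passed in as max_len because the text does not fix one, and a backslash is assumed to escape whatever character follows it. When several problems occur in the same string, the sketch reports the first one, which is one possible reading of the rules.

```c
#include <assert.h>
#include <stddef.h>

/* Outcome codes for one string constant (names are illustrative). */
typedef enum {
    STR_OK, STR_UNTERMINATED, STR_TOO_LONG, STR_NULL_CHAR, STR_EOF
} StrStatus;

typedef struct {
    StrStatus status;  /* first problem found, or STR_OK */
    size_t next;       /* index at which scanning resumes */
} StrScan;

/* Error string to place in the ERROR token; NULL when there is no error. */
const char *str_error_message(StrStatus s)
{
    switch (s) {
    case STR_UNTERMINATED: return "Unterminated string constant";
    case STR_TOO_LONG:     return "String constant too long";
    case STR_NULL_CHAR:    return "String contains null character";
    case STR_EOF:          return "EOF in string constant";
    default:               return NULL;
    }
}

/*
 * Scan the string constant whose opening quote is src[start], in a buffer
 * of len bytes. max_len is an assumed limit on the number of characters.
 */
StrScan scan_string(const char *src, size_t len, size_t start, size_t max_len)
{
    assert(src != NULL && start < len && src[start] == '"');
    StrScan r = { STR_OK, len };
    size_t count = 0;

    for (size_t i = start + 1; i < len; i++) {
        char c = src[i];
        int escaped = 0;
        if (c == '\\' && i + 1 < len) {   /* backslash escapes the next char */
            escaped = 1;
            c = src[++i];
        }
        if (!escaped && c == '"') {       /* closing quote ends the string */
            r.next = i + 1;
            return r;
        }
        if (!escaped && c == '\n') {      /* unescaped newline */
            if (r.status == STR_OK)
                r.status = STR_UNTERMINATED;
            r.next = i + 1;               /* beginning of the next line */
            return r;
        }
        count++;
        if (r.status == STR_OK) {
            if (c == '\0')
                r.status = STR_NULL_CHAR;
            else if (count > max_len)
                r.status = STR_TOO_LONG;
        }
    }
    if (r.status == STR_OK)
        r.status = STR_EOF;               /* input ended before closing quote */
    return r;                             /* r.next is len here */
}
```

A caller would return ERROR with str_error_message(r.status) as the error string whenever r.status is not STR_OK, and continue scanning from r.next in every case.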

We shall now see the basic steps for adding error detection to a Scanner, taking as our starting point the typical programming-language Scanner given in Section 3.3.1.

%{
#include <stdio.h>
#define VARIABLE 257
#define INTEGER 258
#define TEXT 259
#define ERROR 511
%}
comment "//".*
…   …   …
text \"({ascii})*\"
%%
{whitespace} {} 
…   …   …
{text} {mktext(); return TEXT;}
. {return ERROR;}
%%
int main(void){
  int i;
  /* yylex() returns 0 at end of input */
  while((i = yylex()) != 0)
    if(i == ERROR){
      printf("Error! %s\n", yytext);
    } else {
      printf("%d\n", i);
      /* Here the parser will take over instead of printf() */
    }
  return 0;
}

If you generate a Scanner from the above lex code, then compile and run it, it will give the following responses for a valid integer, a valid variable, a valid string, an invalid variable and an invalid integer (or variable), respectively. Each input line is followed by the Scanner's output:

123
258
asd
257
"this is good."
259
ASD
Error! A
Error! S
Error! D
456wer
258
257

The last trial shows that although, from a typical programming-language viewpoint, the input string “456wer” is neither an integer nor a variable, our Scanner has recognized it as an integer immediately followed by a variable. From the viewpoint of syntax (i.e. the Parser), this is a wrong construct, and it should be detected as such by the Parser. Moreover, when the Scanner detects an error, it normally just resumes with the next character, whereas the Parser has to recover at its own recovery point. This is why we said that Scanner-detected errors should be passed on to the Parser and reported by the Parser.
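The behaviour seen in the last two trials can be reproduced with a small hand-written routine in C. This is a sketch only: the pattern definitions of the original Scanner are elided above, so it assumes that an integer is a run of digits and a variable is a run of lowercase letters, which is merely consistent with the responses shown. The function name next_token is hypothetical; the token codes are those of the lex specification.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Token codes as in the lex specification. */
enum { VARIABLE = 257, INTEGER = 258, ERROR = 511 };

/*
 * Return the code of the token starting at src[*pos] and advance *pos past
 * it. Returns 0 at the end of the (NUL-terminated) input, as yylex() does.
 */
int next_token(const char *src, size_t *pos)
{
    assert(src != NULL && pos != NULL);

    /* Skip whitespace between tokens. */
    while (isspace((unsigned char)src[*pos]))
        (*pos)++;

    unsigned char c = (unsigned char)src[*pos];
    if (c == '\0')
        return 0;

    if (isdigit(c)) {                    /* assumed pattern: digits only */
        while (isdigit((unsigned char)src[*pos]))
            (*pos)++;
        return INTEGER;
    }
    if (islower(c)) {                    /* assumed pattern: lowercase only */
        while (islower((unsigned char)src[*pos]))
            (*pos)++;
        return VARIABLE;
    }

    /* No token can begin here: consume one character and resume after it. */
    (*pos)++;
    return ERROR;
}
```

Called repeatedly on “456wer”, the routine returns INTEGER and then VARIABLE without complaint, so only a parser that knows which token sequences are legal can reject the input. On “ASD” it returns ERROR three times, one character at a time, which is the resume-at-the-next-character recovery described above.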
