Chapter 2

Lexical Analyzer

Lexical analysis is the first phase of a compiler. The compiler spends much of its time (20–30% of compile time) in this phase, because it is the only phase that reads the source text character by character. If the lexical analyzer is implemented efficiently, the overall efficiency of the compiler improves. Lexical analyzers are also used in text processing, query processing, and pattern matching tools.

CHAPTER OUTLINE

The scanner or lexical analyzer (LA) performs the task of reading a source text as a file of characters and dividing it up into tokens. Tokens are like words in a natural language. This chapter deals with the issues to be taken care of in the design of a lexical analyzer. Here we shall discuss how to specify tokens using regular expressions, and how to design an LA using Finite Automata (FA) or the Lex tool. It is easy to design a pattern recognizer as a Nondeterministic Finite Automaton (NFA), but an NFA is slower than a Deterministic Finite Automaton (DFA). Hence NFA, NFA with ε-transitions, interconversions of NFA and DFA, and minimal DFA are also discussed. LEX, a software tool that automates the construction of lexical analyzers, is discussed as well. This tool allows people with different backgrounds to use pattern matching in their own application areas.

2.1  Introduction

Lexical analysis is the act of breaking down source text into a set of words called tokens. Each token is found by matching sequential characters to patterns. In general, programming languages are defined using context-free grammars, which subsume the regular languages. All the tokens are defined with regular grammars, and the lexical analyzer identifies strings as tokens and sends them to the syntax analyzer for parsing. A typical lexical analyzer or scanner is shown in Figure 2.1.

A lexical analyzer does not read the complete text at once and send a stream of tokens. The interaction of a lexical analyzer with a parser is shown in Figure 2.2. Only on getting a request from the parser does the lexical analyzer read the next token and send it to the parser. If the token is an identifier or a procedure name, then the lexical analyzer stores that token in the symbol table.

For example, if the source text is A + B, the LA does not read A + B at once and send the tokens id, +, id. On getting the first request from the parser, the LA reads the first string A, recognizes it as the token id, stores it in the symbol table, and sends the token id to the parser. On the next request, it reads only + and sends the operator as a token as it is. On the third request, it reads the string B, recognizes it as the token id, stores it in the symbol table, and sends the token id to the parser.

Out of all the phases, a compiler spends much time in lexical analysis. Hence, if this phase is implemented efficiently, this contributes to the overall efficiency of the compiler.

Lexical analysis and parsing can be combined in one phase but there are some advantages for doing it separately.

Figure 2.1

Figure 2.1 A Typical Lexical Analyzer

Figure 2.2

Figure 2.2 Interaction between a Lexical Analyzer and a Parser

2.2  Advantages of Separating Lexical Analysis from Syntax Analysis

There are several reasons for separating the analysis phase into lexical analysis and parsing.

  1. Simplicity: By having LA as a separate phase, the compiler design is simplified. Techniques of DFA are sufficient for lexical analysis and techniques of PDA can be used for syntax analysis. Designing them together is a complex process. By dividing that complex task into simpler subtasks, design is simplified. Separation also simplifies the syntax analyzer and allows us to use independent tools.
  2. Efficiency: Separating lexical analysis from syntax analysis increases design efficiency. Separation into different modules makes it easier to perform simplifications and optimizations unique to the different paradigms. For example, a compiler spends much time in LA. If this module is implemented efficiently, then this contributes to the overall efficiency of a compiler.
  3. Portability: Portability is enhanced. Due to input/output and character set variations, lexical analyzers are not always machine independent. We can take care of input alphabet peculiarities at this level.

The most important advantage is that separation of these phases helps in designing special tools to automate the construction of lexical analyzers and parsers.

2.3  Secondary Tasks of a Lexical Analyzer

The main task of an LA is to read the source text and break it into a stream of tokens. In addition to this, it performs the following secondary tasks.

 

  1. Filtering comment lines and white space characters.

    White space characters like tab, space, and newline characters, and comment lines, are not required during the translation of source code. Hence, a lexical analyzer strips them out of the code.

  2. Associating error messages with the source.

    A lexical analyzer reads the source code character by character and, while reading, keeps track of line numbers. If an error is detected by any other phase, the lexical analyzer helps that phase report the error diagnostics properly.

2.4  Error Recovery in Lexical Analysis

The errors typically detected during lexical analysis are as follows:

  1. Numeric literals that are too long
  2. Long identifiers (often a warning is given)
  3. Ill-formed numeric literals
  4. Input characters that are not in the source language

How to recover from errors?

There are four approaches to recover from lexical errors:

  • Delete: This is the simplest and most common approach. If there is an unknown character, simply delete it. Deleting characters until a well-formed token can be found is also called panic-mode recovery, and it is used by many compilers. However, it has certain disadvantages:
    • The meaning of the program may get changed.
    • The whole input may get deleted in the process.

    Example: “charr” can be corrected as “char” by deleting “r”

  •  Insert: Insert an extra or missing character to group into a meaningful token.

    Example: “cha” can be corrected as “char” by inserting “r”

  • Transpose: Based on certain rules, we can transpose two characters. Like “whiel” can be corrected to “while” by the transpose method.
  • Replace: In some cases, it may require replacing one character by another.

    Example: “chrr” can be corrected as “char” by replacing “r” with “a”.

Let us consider an example here.

Example:

 

if a statements;

    else statements;

Here, to recover from the error, we need to insert a blank between “if” and “a.” However, if we use the delete approach, then “ifa” is read as an identifier; since an identifier cannot be followed directly by another identifier, “statements” is deleted. Next, “else” cannot occur without a preceding “if,” so this else is also deleted. In this way we end up deleting a substantial portion of the program.

We can have different error handlers to handle the errors that occur at various stages; the error handler need not be totally generic. In lexical analysis, the above four operations are applied to characters, while in syntax analysis they are applied to tokens.

Certain desired goals for error recovery:

  •  Error recovery should be accurate.
  •  Location of error reported should be precise.
  •  Error recovery should be quick (the algorithm shouldn’t look into the whole code for error recovery).
  •  Error recovery should not lead to error cascade (i.e., you recover from one error and get trapped into other errors and so on).

2.5  Tokens, Patterns, Lexemes

When talking about lexical analysis, most often we use the following terms: lexeme, token, and pattern. It is important to understand these terms.

Token: It is a group of characters with logical meaning. Token is a logical building block of the language.

Example: id, keyword, Num etc.

Pattern: It is a rule that describes the strings of characters that can be grouped into a token. It is expressed as a regular expression. The input stream of characters is matched with patterns, and tokens are identified.

Example: The pattern/rule for id is that it should start with a letter followed by any number of letters or digits. This is given by the regular expression: [A-Za-z][A-Za-z0-9]*.

Using this pattern, the given input strings “xyz, abcd, a7b82” are recognized as token id.

Lexeme: It is the actual text/character stream that matches with the pattern and is recognized as a token.

For example, “int” is identified as token keyword. Here “int” is lexeme and keyword is token.

For example, consider the following statement.

float key = 1.2;

 

Lexeme   Token   Pattern
float    float   float
key      id      a letter followed by any number of letters or digits
=        relop   < | > | <= | >= | = | != | ==
1.2      num     any numeric constant
;        ;       ;

Design complexity of a lexical analyzer is mainly dependent on language conventions.

A popular example that illustrates the potential difficulty of LA design is that in many languages certain strings are reserved, that is, the meaning is predefined and cannot be changed by the user. If keywords are not reserved, then lexical analyzer must distinguish between a keyword and identifier. Example: Keywords are not reserved in PL/I. Thus, the rules for recognizing keywords from identifiers are quite complicated; it is understood by the following statement:

 

If then then then=else; else else=then;

Attributes for Tokens

When a token can represent many lexemes, the scanner or lexical analyzer must provide additional information about the particular lexeme that matched, for use by the subsequent phases of the compiler.

For example, the token id matches with x or y or z. But it is essential for the code generator to know which string is actually matched.

The lexical analyzer accumulates information about tokens into their associated attributes. The tokens control parsing decisions, whereas the attributes are significant in the translation of tokens. Generally, a token has only one attribute: a pointer to the symbol-table entry in which information about the token is kept. If we want the type of an id, its lexeme, and its line number, all this information can be stored in the symbol-table entry of that identifier.
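For concreteness, a token paired with its attribute can be sketched in C as below (a minimal sketch, not the book's code; the type and field names are illustrative):

/* a token paired with its single attribute: an index into
   the symbol table (-1 when the token needs no attribute) */
enum TokenType { TOK_ID, TOK_NUM, TOK_RELOP, TOK_KEYWORD };

struct Token {
    enum TokenType type;   /* drives parsing decisions          */
    int symtab_index;      /* symbol-table entry holding the    */
                           /* type, lexeme, line number, etc.   */
};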

Example 1:

Consider the statement:

distance = 0.5 * g * t * t;

 

Each time the parser invokes the getnexttoken() function, the lexical analyzer returns the next token of the statement. The sequence of tokens returned is as follows:

  • id --------- (distance)
  • =
  • const ------ (0.5)
  • *
  • id --------- (g)
  • *
  • id --------- (t)
  • *
  • id --------- (t)

Example 2:

Divide the following C code into tokens:

 

float limitedSquare(x) float x {

/* returns x-squared, but never more than 100 */

return (x<=-10.0||x>=10.0) ? 100 : x*x;

}

Which of the following is NOT one of the tokens?

  1. ?:
  2. /*
  3. never
  4. =
  5. /* returns......*/
  6. <=
  7. 100
  8. )

Solution:

  1. ?:—is not a token. Although these symbols form one C operator, they are separated in the text, and need to be recognized separately by the lexical analyzer. Notice that, although they are fairly close in this example, there could be an arbitrarily long expression between them in the legal C code.
  2. /* —is not a token. This symbol introduces the comment, but it is normal to strip out comments before tokenizing the source code.
  3. never—is not a token. The word “never” looks like an identifier, but it is within a comment. Normally, comments are stripped out before tokenizing, but we would never separate a comment into parts during lexical analysis at any event.
  4. = —is not a token. Although = might be a token by itself in some code, each use of = here is as part of a larger token.
  5. /* returns x-squared but never more than 100*/– is not a token. The comment is normally stripped out before tokenizing the source code. It would not be returned as a token to the parser, because its presence does not affect the structure of the code; neither does it affect the object code to be produced.
  6. <=—is a token. <= is the two-character “less-than-or-equals” operator, and is normally recognized as a single token.
  7. 100—is a token. “100” is a constant. It would normally be returned as the token CONST, with a lexical value indicating that the actual value is 100.
  8. )—is a token. Parentheses are always treated as tokens by themselves.

The list of tokens would be:

  • float

    ID --------(limitedSquare)

    (

    ID --------(x)

    )

    float

    ID --------(x)

    {

    return

    (

    ID --------(x)

    <=

    CONST --------(–10.0)

    ||

    ID --------(x)

    >=

    CONST --------(10.0)

    )

    ?

    CONST --------(100)

    :

    ID --------(x)

    *

    ID --------(x)

    ;

    }

2.6  Strategies for Implementing a Lexical Analyzer

Lexical analyzer can be designed in the following ways:

  1. Use a scanner generator tool like Lex/Flex to produce a lexical analyzer. The specification can be given using a regular expression, which on compilation generates a scanner capable of identifying tokens. The generator provides routines for reading and buffering the input. This is easy to implement but least efficient.
  2. Design a lexical analyzer in a high-level programming language like C and use the input and buffering techniques provided by the language. This approach is intermediate in terms of efficiency.
  3. Writing a lexical analyzer in assembly language is the most efficient method. It can explicitly manage input and buffering. This is a very complex approach to implement.

The above details provided for a lexical analyzer are in increasing order of complexity and efficiency.

Designing a lexical analyzer either by hand or automated tools mainly involves two steps:

  1. Describe rules for tokens using regular expressions.
  2. Design a recognizer for such rules, that is, for tokens. Designing a recognizer corresponds to converting regular expressions to Finite Automata. The processing can be speeded if the regular expression is represented in Deterministic Finite Automata. This involves the following steps:
    • Convert regular expression to NFA with ε
    • Convert NFA with ε to NFA without ε
    • Convert NFA to DFA

2.7  Input Buffering

In order to recognize a token, a scanner often has to look ahead several characters from the current character. For example, “char” is a keyword in C, while “chap” may be a variable name. When the character “c” is encountered, the scanner cannot decide whether it is a variable, keyword, or function name until it reads three more characters. Reading character by character from a file takes a lot of time, so specialized buffering techniques have been developed. These buffering techniques make the reading process easier and also reduce the processing overhead. There are many buffering schemes that can be used; since these techniques are somewhat dependent on system parameters, we shall discuss one scheme here.

Look ahead with 2N buffering

We use a buffer divided into two N-character halves as shown in Figure 2.3. Typically, N is the number of characters on one disk block, for example, 1024 or 4096. We read N input characters into each half of the buffer with one system read command, rather than invoking a read command for each input character. If fewer than N characters remain in the input, then a special character EOF is read into the buffer after the input characters, as in Figure 2.3. EOF marks the end of the source file and is different from any input character.

Two pointers into the input buffer, “lexeme” and “fwd,” are maintained. The string of characters enclosed between the two pointers is the current lexeme. To find the lexeme, first initialize both pointers to the first character of the input. Keep incrementing the fwd pointer until a character that does not match the pattern is found; then stop and extract the string between the “lexeme” and “fwd” pointers. This string is the required lexeme; process it and set both pointers to the next character to identify the next lexeme. With this scheme, comments and white space can be treated as patterns that yield no token.
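The two-pointer scheme can be sketched in C as follows (a minimal sketch assuming both halves are already loaded; the buffer contents and size are illustrative):

#include <stdio.h>
#include <string.h>
#include <ctype.h>

#define N 16                          /* one "disk block" for the demo  */

char buffer[2 * N] = "count = 42;";   /* pretend both halves are loaded */
int lexeme = 0, fwd = 0;              /* the two pointers, as indices   */

int main(void)
{
    char text[2 * N];
    /* advance fwd while characters keep matching the id pattern */
    while (isalnum((unsigned char)buffer[fwd]))
        fwd++;
    /* the string between the two pointers is the current lexeme */
    int len = fwd - lexeme;
    memcpy(text, &buffer[lexeme], len);
    text[len] = '\0';
    printf("lexeme: %s\n", text);     /* prints "count" */
    lexeme = fwd;                     /* both pointers move on */
    return 0;
}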

Figure 2.3

Figure 2.3 An Input Buffer in Two Halves

While the fwd pointer is being incremented, if it is about to move past the halfway mark, the right half is filled with N new input characters. Else if the fwd pointer is about to move past the right end of buffer, not only the left half is filled with N new characters but also the fwd pointer wraps around the beginning of the buffer.

Most of the time this buffering scheme works quite well, but the amount of look ahead is limited with it. This limited look ahead may make it difficult to recognize tokens in cases where the distance that the fwd pointer must travel is more than the length of the buffer. For example, in PL/I program, consider the statement

DECLARE (ARG1, ARG2, ARG3........ARGn)

Until we see what character follows the right parenthesis, we cannot determine whether DECLARE is a function name, keyword, or an array name. In all the cases, the lexeme ends at second E, but the amount of look ahead needed is proportional to the number of arguments, which in principle is unbounded.

Algorithm for moving the forward pointer “fwd”

 

If fwd is at end of first half

      reload second half;

      set fwd to point to beginning of second half;

Else if fwd is at end of second half

      load first half;

      set fwd to point to beginning of first half;

Else

      increment fwd pointer

This algorithm takes two tests each time to advance the fwd pointer. The performance can be improved by using other methods.


Sentinels

In order to optimize the number of tests to one for each advance of fwd pointer, sentinels are used with buffer. This is shown in Figure 2.4. The idea is to extend each buffer half to hold a sentinel at the end.

  • The sentinel is a special character that cannot occur in a program (e.g., EOF).
  • It indicates the need for some special action (terminate processing or fill other buffer-half).

Algorithm for incrementing the forward pointer “fwd” that uses sentinels

 

increment fwd;

If fwd points to EOF

      If fwd is at end of first half

            reload second half;

            set fwd to point to beginning of second half;

      Else if fwd is at end of second half

            reload first half;

            set fwd to point to beginning of first half;

      Else /* EOF within a half marks the end of input */

            terminate processing;

Figure 2.4

Figure 2.4 Sentinels in Input Buffering

This algorithm needs only one test per character.
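The sentinel scheme can be rendered in C roughly as follows (a sketch, not the book's code: the buffer layout and the use of '\0' as the sentinel are assumptions):

#include <stdio.h>

#define N 1024                    /* characters per buffer half       */
#define SENTINEL '\0'             /* must not occur inside the source */

char buf[2 * N + 2];              /* halves at 0..N-1 and N+1..2N,    */
                                  /* sentinels at buf[N], buf[2N+1]   */
int fwd = -1;
FILE *src;

static void reload(int start)     /* fill one half, re-plant sentinel */
{
    int n = (int)fread(&buf[start], 1, N, src);
    buf[start + n] = SENTINEL;    /* early sentinel marks real EOF    */
}

int next_char(void)
{
    fwd++;                           /* one test per character...        */
    if (buf[fwd] == SENTINEL) {      /* ...extra tests only at sentinels */
        if (fwd == N) {              /* end of first half                */
            reload(N + 1);
            fwd = N + 1;
        } else if (fwd == 2 * N + 1) {  /* end of second half: wrap      */
            reload(0);
            fwd = 0;
        } else {
            return EOF;              /* sentinel inside a half: true EOF */
        }
        if (buf[fwd] == SENTINEL)
            return EOF;              /* the reloaded half was empty      */
    }
    return buf[fwd];
}

Before scanning begins, the caller would open src and call reload(0) once to prime the first half.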

2.8  Specification of Tokens

Patterns are specified using regular expressions. Each pattern matches a set of strings; so regular expressions will serve as names for sets of strings.

Strings and Languages

In everyday usage, a language is a system of symbols used for communication. In formal terms, however, a language is simply a set of strings over some alphabet.

Symbol and Alphabet

A symbol is an abstract entity. Like a point in geometry, it cannot be formally defined.

Example: Letters, digits, or special symbols like $, @, # etc.

Alphabet: Finite collection of symbols denoted by Σ.

Example: English alphabet Σ = {a, b,......z}

Binary alphabet Σ = {0, 1}

 

String/word: A finite sequence of symbols from the alphabet.

Example: 1101, 1111, 01010101 are strings over the binary alphabet.

0a1 is not a string over the binary alphabet.

A word over an alphabet can be any finite sequence, or string, of letters. The set of all words over an alphabet Σ is usually denoted by Σ*. For any alphabet, there is only one word of length 0, the empty word, which is often denoted by e, ε, or Λ.

  • Any string of leading symbols of a string is called a prefix.
  • Any string of trailing symbols of a string is called a suffix.
  • Any substring except the string itself is a proper substring.

Example 1: String: “ramu”

Prefix: ε, r, ra, ram

Suffix: u, mu, amu

Substring: ε, r, ra, ram, ramu

2.8.1  Operations on Language

If L1 and L2 are two languages, then

  1. Union of two languages is denoted as L1 + L2 or L1 U L2

    By union we get the words from both languages.

  2. Concatenation of two languages is denoted as L1L2

    Concatenation is a process of combining a word with other words to form a new word, whose length is the sum of the lengths of the original words. If we concatenate a word with an empty word, the result is the original word.

  3. Kleene’s closure Σ* is the language consisting of all words that are concatenations of 0 or more words from the original language (including the null string ε).

Example 2:

  1. Σ = {x}; Σ* = Σ^0 ∪ Σ^1 ∪ Σ^2 ∪ Σ^3 ∪ ...

    Σ* = {ε, x, xx, xxx, ...}

  2. Σ = {a, b}; Σ* = {ε, a, b, aa, ab, bb, ba, aaa, aab, ...}
  3. Positive closure Σ+ = Σ* – {ε}

    Σ* = Σ+ ∪ {ε}

L1 = {smart, ugly} L2 = {boy, girl}

L1 ∪ L2 = {smart, ugly, boy, girl}

L1L2 = {smart boy, smart girl, ugly boy, ugly girl}

Regular Expressions

  1. Any atom (character) is a regular expression.
  2. If r and s are regular expressions, then so are r·s, r + s, r*, and (r).

The language described by a regular expression is called a regular set. The operators + and * are called regular operators.

Properties of Regular Expressions

If r and s are two regular expressions then

  •  r + s = s + r (union is commutative)
  • r(s + t) = rs + rt, (s + t)r = sr + tr (concatenation distributes over union)
  • r + (s + t) = (r + s) + t (union is associative)
  • εr = r, rε = r (ε is the identity element for concatenation)
  • r** = r* (closure is idempotent)

Regular Definitions: It is often desirable to name regular expression groupings for being able to recognize tokens at different levels.

Example: Identifiers in C are defined as a letter or underscore followed by zero or more letters, underscores, or digits.

Regular definition for id is as follows:

letter → A + B + C + ... + Z + a + b + ... + z

digit → 0 + 1 + 2 + ... + 9

id → (letter + _)(letter + _ + digit)*

Note: Be careful with recursive regular definitions! They may not be regular.
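As a cross-check of this definition, the following small C function (a sketch; the name is illustrative) accepts exactly the strings generated by the id definition above:

#include <ctype.h>

/* returns 1 iff s matches (letter + _)(letter + _ + digit)* */
int is_identifier(const char *s)
{
    if (!isalpha((unsigned char)*s) && *s != '_')
        return 0;                     /* must start with letter or _ */
    for (s++; *s; s++)
        if (!isalnum((unsigned char)*s) && *s != '_')
            return 0;                 /* then letters, digits, or _  */
    return 1;
}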

2.9  Recognition of Tokens

Lexical analysis can be performed with pattern matching through the use of regular expressions. Therefore, a lexical analyzer can be defined and represented as a DFA. Recognition of tokens implies implementing a regular expression recognizer. This entails implementation of a DFA.

Example of a lexical analyzer

Suppose that we want to build a lexical analyzer for recognizing the tokens identifier, >=, >, and integer const. The corresponding DFA that recognizes these tokens is shown below in Figure 2.5.

How much should we match?

In general, find the longest match possible.

For example, given the input 123.45, the scanner should match the token num_const(123.45) rather than num_const(123), “.”, num_const(45).

The DFA for matching such numbers is shown below in Figure 2.6.

Implementing finite automata: We can hand-code a DFA: each state corresponds to a labeled code fragment, and state transitions are represented as control transfers, as follows.

Figure 2.5

Figure 2.5 DFA That Recognizes Tokens id, integer_const, etc.

Figure 2.6

Figure 2.6 DFA That Recognizes Floating Point Numbers

int scanner()
{
  char c;

state1:                       /* initial state */
  c = getchar();
  switch (c) {
      case 'a': goto state2;
      case 'b': goto state3;
      default:  Error();
  }

state2:
  c = getchar();
  switch (c) {
      case 'a': goto state2;
      case 'b': goto state3;
      default:  return SUCCESSFUL;
  }

state3:
  c = getchar();
  switch (c) {
      case 'a': goto state2;
      default:  return SUCCESSFUL;
  }
}


A match is found when the DFA’s current state is a final state and no transition is enabled on the next input character.

Actions on finding a match:

  • If the lexeme is valid, then copy it in an appropriate place where the parser can access it.
  • Save any necessary scanner state so that scanning can subsequently resume at the right place.
  • Return a value indicating the token found.
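As an alternative to the hand-coded gotos above, the same DFA can be driven by a transition table, which is the form that generated scanners such as those produced by Lex use. A minimal sketch (the state numbering and character classes are illustrative):

#include <stdio.h>

enum { S1, S2, S3, NSTATES };      /* S1 is the start state       */
enum { CA, CB, NCLASS };           /* character classes 'a', 'b'  */

int delta[NSTATES][NCLASS] = {
    /* a   b  */
    {  S2, S3 },                   /* S1 */
    {  S2, S3 },                   /* S2 */
    {  S2, -1 },                   /* S3: no move on b            */
};
int final_state[NSTATES] = { 0, 1, 1 };   /* S2 and S3 accept     */

int classify(int c) { return c == 'a' ? CA : c == 'b' ? CB : -1; }

/* run the DFA on s; return 1 if the whole string is accepted */
int accept(const char *s)
{
    int state = S1;
    for (; *s; s++) {
        int cls = classify(*s);
        if (cls < 0 || delta[state][cls] < 0)
            return 0;              /* no transition enabled       */
        state = delta[state][cls];
    }
    return final_state[state];
}

int main(void)
{
    printf("%d\n", accept("aab"));   /* prints 1 */
    printf("%d\n", accept("ba"));    /* prints 1 */
    return 0;
}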

So the concept of a finite automaton is essential for the design of a lexical analyzer. Hence, let us discuss finite automata, their types, and their interconversions.

2.10  Finite State Machine

A finite state system is a mathematical model of a system that takes certain input and finally produces certain output. The input given to an FSM is processed through various states, called intermediate states.

A good example of a finite state system is the control mechanism of an elevator. This mechanism remembers only the current floor number; it does not remember all the previously pressed numbers. Finite state systems are useful in the design of text editors, lexical analyzers, and natural language processing. The word “automaton” is singular and “automata” is plural.

Definition: A finite automaton is formally defined as a 5-tuple (Q, Σ, δ, q0, F) where

Q – is a finite set of states which is non empty

Σ – is input alphabet

q0 – is initial state

F – is a set of final states and F ⊆ Q

δ – is a transition function or mapping function Q × Σ → Q; using this, the next state can be determined depending on the current state and input.

2.10.1  Finite Automaton Model

The finite automaton can be represented as in Figure 2.7.

  1. Input tape is a linear tape having some cells that can hold an input symbol from Σ.
  2. Finite control indicates the current state and decides the next state on receiving a particular input from the input tape. The tape reader reads the cells one by one from left to right, and at any instant only one input symbol is read.

The reading head examines the symbol read, and the head moves to the right with or without a change of state. When the entire string has been read, if the finite control is in a final state then the string is accepted; otherwise it is rejected. A finite automaton can be represented by a transition diagram, in which the vertices represent the states and the edges represent the transitions. We can model real-world problems as finite automata, which helps in understanding and analyzing their behavior.

Example 3:

Lexical analyzer behavior can be shown as FA. Consider the lexical analyzer that matches words like “the”, “this”, “that”, and “to”. This is shown in Figure 2.8.

These systems are called finite automata because the number of possible states and the number of letters in the alphabet are both finite. The system is an automaton because the change of state is totally governed by the input.

Figure 2.7

Figure 2.7 Finite Automaton

Figure 2.8

Figure 2.8 DFA that recognizes strings to, the, this, and that

2.10.2  Properties of the Transition Function “δ”

  1. δ (q, ε) = q. The states of FA are changed only by an input symbol.
  2. For all strings w and input symbols a, δ(q, aw) = δ(δ(q, a), w)

    An FA can be represented by a

    1. transition diagram
    2. transition table

2.10.3  Transition Diagram

A transition graph contains

  1. Set of states as circles

    Start state q0 with arrow

    image

    Final state by double circle

    image
  2. A finite set of transitions (edges | labels) that show how to go from some state to other.

2.10.4  Transition Table

Following is a tabular representation where rows correspond to states and columns correspond to input. The start state is given by → and the final state by *

Example: M = {{q0, q1, q2}, {a, b}, δ, q0, {q2}}

δ (q0, a) = q1 δ (q0, b) = q2

δ (q1, a) = q2 δ (q1, b) = q0

δ (q2, a) = q2 δ (q2, b) = q2

image

This table can be shown as a transition diagram as seen below.

image

2.10.5  Language Acceptance

A string w is accepted by Finite Automaton U given as

U = (Q, Σ, δ, q0, F) if δ(q0, w) = P for some P in F. That is, the string is accepted when the automaton is in a final state after reading the last input symbol.

Example:

image

Let us check if input string 1010 is accepted or not

  1. δ (q0, 1010) = δ (q2, 010) = δ (q3, 10) = δ (q1, 0) = q0


    Here q0 is the final state. Hence, the string is accepted.

  2. Check 11111

    Processing 11111 from q0 ends in state q2.

    q2 ∉ F. Hence, the string is rejected.

Example 4:

Give the language defined by the following FA.

image

If we list different strings accepted by this automaton, we get

{1, 01, 0001, 10101, 011111, ...}

If we observe, all strings that are accepted always end with 1.

L(M) = {w | w ends with 1, over Σ = {0, 1}}.

Language accepted by machine M is L(M), which is the set of strings that are ending with 1.

Example 5:

Define language accepted by the following machine.

image

L(M) = {w | w starts and ends with the same symbol}

2.10.6  Finite Automata Are of Two Types

  1. Deterministic finite automaton
  2. Nondeterministic finite automaton

In DFA there will be unique transition in any state on an input symbol, whereas in NFA there can be more than one transition on an input symbol. Hence, DFA is faster than NFA. The above figure is an example of DFA. The following figure is an example of NFA.

image

In the above figure, in state q0 on input 0 the machine can be in more than one state at once; hence it is an NFA.

2.10.7  Deterministic Finite Automaton (DFA)

A deterministic finite automaton can be defined as a quintuple

 

M = (Q, Σ, δ, q0, F)

Where Q = Non empty finite set of states

Σ = input alphabet

q0 = initial start state

F = set of final states

δ = transition function that takes two arguments, a state and an input symbol, and returns a state as output, i.e., δ: Q × Σ → Q

Example: δ(q1, a) = q1. A DFA can be used as a finite acceptor because its sole job is to accept certain input strings and reject others.

It is also called a language recognizer because it merely recognizes whether the input strings are in the language or not.

Example 5:

Design a DFA that accepts only input 101 over the set {0, 1}.

image

Here qtrap is called a trap state or dummy state, into which all unnecessary transitions are thrown.

Example 6:

Design a DFA that accepts an even number of 0’s and an even number of 1’s.

Solution: This FA will have four different possibilities while reading 0’s and 1’s as input. The possibilities could be

 

Even number of 0’s and even number of 1’s – q0

Odd number of 0’s and even number of 1’s – q1

Even number of 0’s and odd number of 1’s – q2

Odd number of 0’s and odd number of 1’s – q3

 

where the states are q0, q1, q2, and q3. Since the state q0 indicates the condition of an even number of 0’s and an even number of 1’s, this state is made the final state. The DFA is given by

image
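A short C sketch of this DFA as a transition table (state numbering follows the solution above; the test string is illustrative):

#include <stdio.h>

/* delta[state][symbol]: q0=(even,even) q1=(odd,even)
   q2=(even,odd) q3=(odd,odd); q0 is both start and final */
int delta[4][2] = {
    /* 0  1 */
    {  1, 2 },   /* q0 */
    {  0, 3 },   /* q1 */
    {  3, 0 },   /* q2 */
    {  2, 1 },   /* q3 */
};

int main(void)
{
    const char *w = "0110";          /* two 0's, two 1's */
    int state = 0;                   /* start in q0      */
    for (int i = 0; w[i]; i++)
        state = delta[state][w[i] - '0'];
    printf("%s\n", state == 0 ? "accepted" : "rejected");
    return 0;
}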

2.10.8  Nondeterministic Finite Automaton (NFA)

For a given input symbol, there can be more than one transition from a state. Such an automaton is called a Nondeterministic Finite Automaton. An NFA can be defined as a quintuple

M = (Q, Σ, δ, q0, F)

Where Q = Non empty finite set of states

Σ = input alphabet

q0 = initial start state

F = set of final states

δ = transition function that takes two arguments, a state and an input symbol, and returns a set of states as output, i.e., δ: Q × Σ → 2^Q

Acceptance of NFA

Acceptance of a string is defined as reaching a final state on processing the input string.

image

Check whether 0100 is accepted or not.

image

Since q2 is a final state, there is at least one path from the initial state to one of the final states. Hence the given string is accepted.

Note: It is easier to construct an NFA than a DFA, but the time to process a string with an NFA is more than with a DFA.

  • Every language that can be described by NFA can be described by some DFA.
  • DFA in practice has more states than NFA.
  • The equivalent DFA can have at most 2^n states, whereas the NFA has only ‘n’ states.

Example 7:

Design an NFA accepting all strings ending with 01 over Σ = {0, 1}.

We can have any string of 0’s and 1’s, but it should end with 0 followed by 1.

image

Corresponding DFA is

image

So drawing an NFA is simpler than drawing a DFA.
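The trade-off can be seen concretely by simulating the NFA directly, carrying the set of currently live states in a bit set; the following is a C sketch for the ends-with-01 NFA above (the encoding is an assumption):

#include <stdio.h>

/* live states as a bit set: bit i set means "state qi is live" */
enum { Q0 = 1 << 0, Q1 = 1 << 1, Q2 = 1 << 2 };

/* moves of the ends-with-01 NFA: q0 -0-> {q0,q1}, q0 -1-> {q0},
   q1 -1-> {q2}; q2 has no outgoing moves */
static int step(int set, int sym)
{
    int next = 0;
    if (set & Q0) next |= (sym == 0) ? (Q0 | Q1) : Q0;
    if (set & Q1) next |= (sym == 1) ? Q2 : 0;
    return next;
}

int accepts(const int *w, int n)
{
    int set = Q0;                     /* start in {q0}            */
    for (int i = 0; i < n; i++)
        set = step(set, w[i]);        /* advance every live state */
    return (set & Q2) != 0;           /* accept if q2 is live     */
}

int main(void)
{
    int w[] = {1, 0, 0, 1};           /* 1001 ends in 01 */
    printf("%d\n", accepts(w, 4));    /* prints 1        */
    return 0;
}

Each subset of live states that arises here corresponds to one state of the equivalent DFA, which is exactly the idea behind the subset construction discussed next.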

2.10.9  Equivalence of DFAs and NFAs

Let L be a set accepted by an NFA. Then there exists a DFA that accepts L.

Proof: Let M = (Q, Σ, δ, q0, F) be an NFA accepting L.

Define DFA M1 = (Q1, Σ1, δ1, q01, F1) as follows.

The states of M1 are all the subsets of the set of states of M, i.e., Q1 = 2^Q

Example: If Q = {A, B} then Q1 = {Ø, [A], [B], [AB]}

F1 is the set of all states in Q1 containing a final state of M.

Example: If F = {B} then F1 = {[B], [AB]}

An element of Q1 will be denoted by [q1, q2, ..., qi] where q1, q2, ..., qi are in Q.

Note: [q1, q2, ..., qi] is a single state of the DFA corresponding to a set of states of the NFA. q01 = [q0]

we define δ1([q1, q2, ..., qi], a) = [P1, P2, ..., Pi]

iff δ({q1, q2, ..., qi}, a) = {P1, P2, ..., Pi}

that is, δ1 applied to an element [q1, q2, ..., qi] of Q1 is computed by applying δ to each state of Q represented by [q1, q2, ..., qi]. It is easy to show by induction on the length of the input string x that

δ1(q01, x) = [q1, q2, ..., qi]

iff δ(q0, x) = {q1, q2, ..., qi}

Basis: The result is trivial for |x| = 0, since δ1(q01, ε) = q01 = [q0].

Induction: Suppose that the hypothesis is true for inputs of length m or less. Let “xa” be a string of length m+1 with a in Σ. Then

δ1(q01, xa) = δ1(δ1(q01, x), a)

By the inductive hypothesis,

δ1(q01, x) = [P1, P2, ..., Pj]

iff δ(q0, x) = {P1, P2, ..., Pj}

But by the definition of δ1,

δ1([P1, P2, ..., Pj], a) = [r1, r2, ..., rk]

Thus

δ1(q01, xa) = [r1, r2, ..., rk]

iff δ(q0, xa) = {r1, r2, ..., rk}

This establishes the inductive hypothesis.

To prove that L(M) = L(M1)

A string x is accepted by the NFA or the DFA only if processing it ends in a final state.

Suppose that for a string x the NFA reaches δ(q0, x) = P with P ∈ F. Then δ1(q01, x) is a subset state containing P and is therefore in F1. Hence, a string x is accepted by the DFA iff it is accepted by the NFA.

2.10.10  Converting NFA (MN) to DFA (MD)–Subset Construction

Let MN = (QN, ΣN, δN, q0N, FN) be the given NFA. To construct the equivalent DFA MD = (QD, ΣD, δD, q0D, FD):

  1. QD = 2^QN. If the NFA has n states, the DFA can have at most 2^n states.
  2. ΣD = ΣN
  3. q0D = [q0N]
  4. FD = the set of all states of QD that contain at least one final state of FN.
  5. δD([q1, q2, q3], a) = δN(q1, a) ∪ δN(q2, a) ∪ δN(q3, a) = {P1, P2, P3}, say.

    Add the state [P1, P2, P3] to QD if it is not already there. A C sketch of the construction follows this list.
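Subset construction can be coded compactly with bit sets. The sketch below (an illustration, limited to a few NFA states) runs the construction on the ends-with-01 NFA of Example 7 and generates only the states reachable from the start state:

#include <stdio.h>

#define NSYM 2
#define NQ   3                 /* NFA states q0..q2, as bit positions */

/* nfa[q][a] = bit set reachable from q on symbol a */
int nfa[NQ][NSYM] = {
    { 0x3, 0x1 },   /* q0: on 0 -> {q0,q1}, on 1 -> {q0} */
    { 0x0, 0x4 },   /* q1: on 1 -> {q2}                  */
    { 0x0, 0x0 },   /* q2: no moves                      */
};
int nfa_final = 0x4;           /* FN = {q2} */

int main(void)
{
    int dstates[1 << NQ], trans[1 << NQ][NSYM], ndfa = 1;
    dstates[0] = 0x1;                        /* start state = {q0}  */
    for (int i = 0; i < ndfa; i++) {         /* each new DFA state  */
        for (int a = 0; a < NSYM; a++) {
            int next = 0;
            for (int q = 0; q < NQ; q++)     /* union of NFA moves  */
                if (dstates[i] & (1 << q))
                    next |= nfa[q][a];
            int j;
            for (j = 0; j < ndfa; j++)       /* subset seen before? */
                if (dstates[j] == next)
                    break;
            if (j == ndfa)
                dstates[ndfa++] = next;      /* add a new DFA state */
            trans[i][a] = j;
        }
    }
    for (int i = 0; i < ndfa; i++)           /* * marks final states */
        printf("D%d%s: on 0 -> D%d, on 1 -> D%d\n",
               i, (dstates[i] & nfa_final) ? "*" : "",
               trans[i][0], trans[i][1]);
    return 0;
}

Run on this NFA, the sketch produces exactly three reachable states, D0 = [q0], D1 = [q0, q1], and D2* = [q0, q2], matching the simplified DFA obtained in Example 8 below.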

Example 8:

Convert the following NFA to DFA

image

Solution:

QD = 2^3 = 8 states = all subsets of {q0, q1, q2}

    = {Ø, [q0], [q1], [q2], [q0, q1], [q0, q2], [q1, q2], [q0, q1, q2]}

Σ = {0, 1}

q0D = [q0]

F = {[q2], [q0, q2], [q1, q2], [q0, q1, q2]}

δD is given by δD([q1, q2], a) = δN(q1, a) ∪ δN(q2, a)

where δN is the transition function of the NFA.

δ                0           1
Ø                Ø           Ø
→[q0]            [q0, q1]    [q0]
[q1]             Ø           [q2]
*[q2]            Ø           Ø
[q0, q1]         [q0, q1]    [q0, q2]
*[q0, q2]        [q0, q1]    [q0]
*[q1, q2]        Ø           [q2]
*[q0, q1, q2]    [q0, q1]    [q0, q2]

The states Ø, [q1], [q2], [q1, q2], and [q0, q1, q2] are not reachable from the start state and hence never occur while processing any string, so they can be ignored. Hence, the DFA can be simplified as follows:

image

To get this simplified DFA directly, construct the states of the DFA as follows:

  1. Start with the initial state. Do not add all subsets of states, as there may be unnecessary states.
  2. After finding the transitions on this initial state, include only the resultant states in the list, and repeat until no new state is added to the list. For example, if δ(q0, a) = {q0, q1}, then add this as a new state in the DFA, and then find the transitions from this state on each input symbol.
  3. Declare a DFA state final if it contains at least one final state of the NFA.

Example 9:

Convert the following NFA to DFA.

δ      0           1
→q0    {q1, q2}    {q0}
q1     {q0, q1}    Ø
*q2    {q1}        {q0, q1}

DFA is

δ
0
l
→[q0]
[q1 q2]
[q0]
*[q1 q2]
[q0 q1]
[q0 q1]
[q0 q1]
[q0 q1 q2]
[q0]
*[q0 q1 q2]
[q0 q1 q2]
[q0 q1]

 

The transition diagram of DFA is as shown below.

image

2.10.11  NFA with Epsilon (ε)-Transitions

We can extend an NFA by introducing “ε-moves,” which allow a transition on the empty string. There would be an edge labeled ε between two states, allowing a transition from one state to another even without receiving an input symbol. This is another mechanism that allows an NFA to be in multiple states at once. Constructing such an NFA is often easier, but it is no more powerful than an ordinary NFA. The NFA with ε-moves is given by M = (Q, Σ, δ, q0, F) where δ is defined as Q × (Σ ∪ {ε}) → 2^Q.

Example 10:

Design an NFA for the language L = {0^k | k is a multiple of 2 or 3}

NFA for the set of strings in which the number of 0’s is a multiple of 2:

image

NFA for the set of strings in which the number of 0’s is a multiple of 3:

image

Combining these two NFAs,

image

2.10.12  Epsilon Closure (ε-closure)

The epsilon closure or ε-closure of a state is the set of all states reachable from it on ε-transitions alone, including the state itself. It can be written as either ε(q) or ε-closure(q). In the above example,

 

ε-closure (q0) = {q0, q1, q2}

ε-closure (q1) = {q1, q2}

ε-closure (q2) = {q2}
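ε-closure can be computed by a depth-first search over the ε-edges alone. The following C sketch assumes the machine above has ε-edges q0 → q1 and q1 → q2, which is consistent with the closures listed:

#include <stdio.h>

#define NSTATES 3
int eps[NSTATES] = { 0x2, 0x4, 0x0 };   /* eps[q] = states one ε-move away */

int eps_closure(int q)
{
    int closure = 1 << q;               /* a state is in its own closure */
    int stack[NSTATES], top = 0;
    stack[top++] = q;
    while (top > 0) {
        int p = stack[--top];
        for (int r = 0; r < NSTATES; r++)
            if ((eps[p] & (1 << r)) && !(closure & (1 << r))) {
                closure |= 1 << r;      /* newly reached on ε */
                stack[top++] = r;
            }
    }
    return closure;
}

int main(void)
{
    printf("%#x\n", eps_closure(0));    /* 0x7 = {q0, q1, q2} */
    return 0;
}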

 

Let us define the extended transition function for an NFA with ε-transitions. For a regular NFA we used the induction step as follows.

Let

δ̂(q, w) = {p1, p2, ..., pk}

δ(pi, a) = Si for i = 1, 2, ..., k

Then δ̂(q, wa) = S1 ∪ S2 ∪ ... ∪ Sk

For an NFA with ε, we change δ̂(q, wa) to

δ̂(q, wa) = ε-closure(S1 ∪ S2 ∪ ... ∪ Sk)

This includes the original sets S1, S2, ..., Sk as well as any states we can reach via ε-transitions.

2.10.13  Eliminating ε-Transitions

ε-Transitions are used for convenience in some cases, but they do not increase the power of the NFA. To eliminate them, we can convert an NFA with ε into an equivalent NFA without ε by eliminating the ε-edges and replacing them with edges labeled with symbols from Σ. We can also convert an NFA with ε into an equivalent DFA; the steps are quite similar to those for converting a normal NFA to a DFA, except that we must now follow all ε-transitions and add those states to our set of states.

2.10.14  Converting NFA with ε-Transition to NFA Without ε-Transition

For each state q and each input symbol a ∊ Σ, compute δ′(q, a) = ε-closure(δ(ε-closure(q), a)). If the ε-closure of a state contains a final state, then make that state final.

Let the following be NFA with ε-transitions.

image

The transition table is

image

NFA without ε-transitions is

image

Transition diagram of NFA without ε-transitions

image

2.10.15  Converting NFA with ε-Transition to DFA

  1. Compute the ε-closure of the current state, resulting in a set of states S.
  2. δ(S, a) is computed for all a ∈ Σ by
    1. Let S = {p1, p2, ... pk}
    2. Compute R = δ(S, a) as

      R = δ(p1, a) ∪ δ(p2, a) ∪ ... ∪ δ(pk, a)

      This set is achieved by following input a, not by following any ε-transitions.

    3. Add the ε-transitions by computing ε-closure(R).
  3. Make a state an accepting state if it includes any final states in the NFA.

Note: The epsilon transition refers to moving from one state to another without reading an input symbol. These transitions can be inserted between any states.

Consider the NFA-epsilon move machine M = {Q, Σ, δ, q0, F}

Q = {q0, q1, q2}

Σ = {a, b, c} and ε moves

image

DFA construction

Step 1: Compute the ε-closure of each state.

image

Step 2: Explore the states that are valid states in the DFA using the steps above.

DFA transition table

image
image

2.10.16  Comparison Method for Testing Equivalence of Two FAs

Let M and M1 be two FAs over Σ. We construct a comparison table consisting of n + 1 columns, where n is the number of input symbols.

  1. The first column consists of pairs of nodes of the form (q, q1) where q ∈ M and q1 ∈ M1.
  2. If (q, q1) appears in some row of the first column, then the corresponding entry in column a (a ∈ Σ) is (qa, qa1), where qa and qa1 are reachable from q and q1 on a.
  3. The table is constructed by starting with the pair of initial vertices (qin, qin1) of M and M1. We continue the construction by considering the pairs in the second and subsequent columns that are not yet in the first column.
    1. If we reach a pair (q, q1) such that q is a final state of M and q1 is a nonfinal state of M1 (or vice versa), terminate the construction and conclude that M and M1 are not equivalent.
    2. If the construction terminates with no new element appearing in the second and subsequent columns that is not already in the first column, then M and M1 are equivalent.
      image

      By merely looking at the number of states, we cannot conclude whether two FAs are equivalent.

       

                   c          d
      →[q0, q4]    [q3, Ø]    [q1, q5]

      Since we do not get a valid pair for input c (one component is Ø), M1 and M2 are not equivalent.

2.10.17  Reduction of the Number of States in FA

  • Any DFA defines a unique language, but the converse is not true; that is, a given language does not always have a unique DFA.
  • For the same language there can exist many DFAs. So there can be considerable difference in the number of states. By using the comparison method we can test this.

Indistinguishable states:

Two states p and q of a DFA are said to be indistinguishable if

      δ*(p, w) ∊ F implies δ*(q, w) ∊ F

and  δ*(p, w) ∉ F implies δ*(q, w) ∉ F

for all strings w ∊ Σ*. If instead, for some string w ∊ Σ*, δ*(p, w) ∊ F and δ*(q, w) ∉ F (or vice versa), then the states p and q are said to be distinguishable by the string w.

Equivalent classes: The concept of equivalent class is used in minimizing the number of states. States that are equivalent can be combined as one class called equivalent class. Let us see the definition of equivalent states.

Definition 1

Two states q1 and q2 are equivalent (q1 ≡ q2) if both δ(q1, a) and δ(q2, a) are final states, or both of them are nonfinal states, for all a ∊ Σ. Such states are said to be 0-equivalent.

Definition 2

Two states q1 and q2 are K-equivalent (K ≥ 0) if both δ(q1, x) and δ(q2, x) are final states, or both are nonfinal states, for all strings x of length K or less.

Therefore, any two states are K-equivalent if they belong to the same set in the (K−1)th step; otherwise they are not K-equivalent.

Properties:

P1: The K-equivalence relation is reflexive, symmetric, and transitive; hence it is an equivalence relation.

P2: Every equivalence relation partitions a set; the K-equivalence relation partitions Q.

P3: If q1 and q2 are K-equivalent for all K ≥ 0, then they are equivalent.

P4: If q1 and q2 are (K + 1)-equivalent, then they are K-equivalent.

2.10.18  Minimization of DFA

For any given deterministic automaton with many states, we can construct an equivalent DFA with the minimum number of states.

Construction of minimum automation

  1. Initially construct the 0-equivalence classes as Π0 = {Q1^0, Q2^0}, where Q1^0 is the set of final states and Q2^0 = Q – Q1^0 is the set of nonfinal states.
  2. Construct ΠK+1 from ΠK by further partitioning as follows:
    1. Let Qi^K be any subset in ΠK. If q1 and q2 are in Qi^K, they are (K + 1)-equivalent provided δ(q1, a) and δ(q2, a) are K-equivalent for every a ∊ Σ.
    2. That is, find out whether δ(q1, a) and δ(q2, a) are in the same equivalence class in ΠK for every a ∊ Σ. If so, q1 and q2 are (K + 1)-equivalent. This way Qi^K is further divided into (K + 1)-equivalence classes. Repeat this for every Qi^K in ΠK to get all the elements of ΠK+1.
  3. Construct Πn for n = 1, 2, 3, ... until Πn = Πn+1.
  4. For the required minimum-state automaton, the states are the equivalence classes obtained finally. A C sketch of this refinement follows.
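The construction above can be sketched in C by giving each state a signature, namely its current class together with the classes of its successors, and re-partitioning until Πn = Πn+1. The 4-state DFA below is illustrative; states 1 and 2 turn out to be equivalent:

#include <stdio.h>

#define NQ 4
#define NSYM 2
int delta[NQ][NSYM] = { {1,2}, {3,3}, {3,3}, {3,3} };
int is_final[NQ]    = { 0, 0, 0, 1 };

int main(void)
{
    int block[NQ], sig[NQ][NSYM + 1];
    for (int q = 0; q < NQ; q++)          /* Pi_0: final vs nonfinal */
        block[q] = is_final[q];
    for (int changed = 1; changed; ) {
        /* signature: own class plus the classes of the successors */
        for (int q = 0; q < NQ; q++) {
            sig[q][0] = block[q];
            for (int a = 0; a < NSYM; a++)
                sig[q][a + 1] = block[delta[q][a]];
        }
        /* states are (K+1)-equivalent iff their signatures match */
        changed = 0;
        for (int q = 0; q < NQ; q++) {
            int c = q;
            for (int p = 0; p < q; p++) {
                int same = 1;
                for (int a = 0; a <= NSYM; a++)
                    if (sig[p][a] != sig[q][a]) { same = 0; break; }
                if (same) { c = p; break; }  /* join earliest equal class */
            }
            if (block[q] != c) changed = 1;
            block[q] = c;
        }
    }
    for (int q = 0; q < NQ; q++)
        printf("state %d -> class %d\n", q, block[q]);
    return 0;
}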

First approach

Example 11:

Find the minimum finite state automaton for the following DFA.

image
image

Any two final states are 0 – equivalent and any two non final states are also 0 – equivalent.

Π0 (1, 2) = {{c},{a, b, d, e, f, g, h}}

image

From the above table, we find that a, e, and g are 1-equivalent; b and h are 1-equivalent; and d and f are 1-equivalent. Hence Π1 is as follows:

Π1 (1, 3, 4, 5) = {{a, e, g}{b, h} {d, f} {c}}

Using the new classes we find whether they are 2–equivalent.

image

Π2 (1, 6, 7, 8, 9) = {{a, e}, {b, h}, {d, f}, {g}, {c}}

image

Π3 (1, 6, 7, 8, 9) = {{a, e}, {b, h}, {d, f}, {g}, {c}}

image

From the above two partitions, Π2 and Π3 are the same. Hence, the final states of the minimized DFA are the classes {a, e}, {b, h}, {d, f}, {g}, and {c}. The minimized DFA is as follows:

image
image

Second approach

2.10.19  Minimization of DFA Using the Myhill Nerode Theorem

The Myhill-Nerode theorem is used to prove that a given language is not regular and also to eliminate redundant states in a given DFA. The theorem is stated as follows:

  • The language L accepted by a DFA is regular if and only if the number of equivalence classes of RL is finite.
  • The number of states in the smallest automaton accepting L is equal to the number of equivalence classes in RL. Therefore, RL is of finite index.

Let ≡ be an equivalence on the states in M such that

p ≡ q if and only if, for each input string x, δ(p, x) and δ(q, x) are both accepting states or both nonaccepting states

      ⇒ p is equivalent to q.

If δ(p, x) = qa and δ(q, x) = qn for some x, with qa ∊ F and qn ∉ F,

      ⇒ p is distinguishable from q.

Algorithm for finding distinguishable states:

  • For each pair [p, q] where p ∊ F and q ∊ Q – F, mark (p, q) = X.
  • For each pair of distinct states [p, q] in F × F or (Q – F) × (Q – F) do
    • If for some input symbol a, δ([p, q], a) = [r, s] and [r, s] is marked, then
      • Mark [p, q]
      • Recursively mark all unmarked pairs that lead to [p, q] on some input a ∊ Σ.
  • Else
    • for all input symbols ‘a’ do
    • put [p, q] on the list for δ([p, q], a) unless δ([p, q], a) = [r, r].
  • Each pair [p, q] that remains unmarked consists of equivalent states. A C sketch of this procedure is given below.
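This marking procedure can be sketched in C as a fixpoint loop: instead of keeping the recursive dependency lists described above, it simply re-scans until no new pair gets marked, which produces the same final marks. The DFA is the same illustrative 4-state machine used in the minimization sketch earlier:

#include <stdio.h>

#define NQ 4
#define NSYM 2
int delta[NQ][NSYM] = { {1,2}, {3,3}, {3,3}, {3,3} };
int is_final[NQ]    = { 0, 0, 0, 1 };
int marked[NQ][NQ];

int main(void)
{
    /* step 1: mark every (final, nonfinal) pair */
    for (int p = 0; p < NQ; p++)
        for (int q = 0; q < NQ; q++)
            marked[p][q] = is_final[p] != is_final[q];
    /* step 2: mark any pair that reaches a marked pair on some symbol */
    for (int changed = 1; changed; ) {
        changed = 0;
        for (int p = 0; p < NQ; p++)
            for (int q = p + 1; q < NQ; q++) {
                if (marked[p][q]) continue;
                for (int a = 0; a < NSYM; a++)
                    if (marked[delta[p][a]][delta[q][a]]) {
                        marked[p][q] = marked[q][p] = 1;
                        changed = 1;
                        break;
                    }
            }
    }
    /* unmarked pairs are equivalent states */
    for (int p = 0; p < NQ; p++)
        for (int q = p + 1; q < NQ; q++)
            if (!marked[p][q])
                printf("states %d and %d are equivalent\n", p, q);
    return 0;
}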

Example 12:

Find minimum-state automaton equivalent to the transition diagram given.

image

Solution: Distinguishable pairs of states are marked with the symbol x. The relation over all states can be represented as an n × n matrix; since p being distinguishable from q implies that q is distinguishable from p, it is sufficient to keep the lower triangle of the matrix to represent the relation of each state with all the others.

Step 1: First mark all pairs (p, q) where p is a final state and q is a nonfinal state.

image

Step 2: Find the states that are distinguishable with a.

δ([a,b],0) = [b,a]

δ([a,b],1) = [a,c]

 

δ([a,c],0) = [b,d]

δ([a,c],1) = [a,b]

⇒ mark[a,c] = x as [b,d] = x

 

 

⇒ mark[a,b] = x as [a,c] = x

 

 

 

δ([a,e],0) = [b,d]

δ([a,e],1) = [a,f]

⇒ mark[a,e] = x as [b,d] = x

δ([a,f],0) = [b,g]

δ([a,f],1) = [a,e]

⇒ mark[a,f] = x as [a,e] = x

δ([a,g],0) = [b,f]

δ([a,g],1) = [a,g]

 

δ([a,h],0) = [b,g]

δ([a,h],1) = [a,d]

⇒ mark[a,h] = x as [a,d] = x

Find the states that are distinguishable with b

δ([b,c],0) = [a,d]

δ([b,c],1) = [c,b]

⇒ mark[b,c] = x as [a,d] = x

δ([b,e],0) = [a,d]

δ([b,e],1) = [c,f]

⇒ mark[b,e] = x as [a,d] = x

δ([b,f],0) = [a,g]

δ([b,f],1) = [c,e]

 

δ([b,g],0) = [a,f]

δ([b,g],1) = [c,g]

⇒ mark[b,g] = x as [a,f] = x

δ([b,h],0) = [a,g]

δ([b,h],1) = [c,d]

⇒ mark[b,h] = x as [c,d] = x

Find the states that are distinguishable with c

δ([c,e],0) = [d,d]

δ([c,e],1) = [b,f]

 

δ([c,f],0) = [d,g]

δ([c,f],1) = [b,e]

⇒ mark[c,f] = x as [d,g] = x

δ([c,g],0) = [d,f]

δ([c,g],1) = [b,g]

⇒ mark[c,g] = x as [d,f] = x

δ([c,h],0) = [d,g]

δ([c,h],1) = [b,d]

⇒ mark[c,h] = x as [d,g] = x

Find the states that are distinguishable with e

δ([e,f],0) = [d,g]

δ([e,f],1) = [f,e]

⇒ mark[e,f] = x as [d,g] = x

δ([e,g],0) = [d,f]

δ([e,g],1) = [f,g]

⇒ mark[e,g] = x as [d,f] = x

δ([e,h],0) = [d,g]

δ([e,h],1) = [f,d]

⇒ mark[e,h] = x as [d,g] = x

Find the states that are distinguishable with f

δ([f,g],0) = [g,f]

δ([f,g],1) = [e,g]

⇒ mark[f,g] = x as [e,g] = x

δ([f,h],0) = [g,g]

δ([f,h],1) = [e,d]

⇒ mark[f,h] = x as [e,d] = x

Find the states that are distinguishable with g

δ([g,h],0) = [f,g]

δ([g,h],1) = [g,d]

⇒ mark[g,h] = x as [g,d] = x

image

From the lower triangle of the matrix it is clear that the pairs [a,g], [b,f], and [c,e] remain unmarked, so the states in each pair belong to the same class. These states can be merged, and the minimized DFA is

image

2.11  Lex Tool: Lexical Analyzer Generator

2.11.1  Introduction

Lex is a language for specifying lexical analyzers.

Lex generates programs that can be used in simple lexical analysis of text. The input files contain regular expressions for recognizing tokens to be searched for and actions written in C to be executed when expressions are found.

Lex converts the regular expression into table-driven DFA. This deterministic finite automaton generated by Lex performs the recognition of the regular expressions. The order in which the program fragments written by the user are executed is the same as that in which the regular expressions are matched in the input stream.

 

The general form of a Lex source file is:

 

Declarations

%%

Regular expressions {actions}

%%

Subroutines

A Lex program contains three sections: declarations, regular expressions (rules), and subroutines, where the declarations and the user subroutines are often omitted. The second section, the rules part, is compulsory. The declarations part contains C declarations and Lex declarations. C declarations are embedded between %{ and %}. Lex declarations contain token definitions. Token rules, that is, patterns, are defined as regular expressions in the second part; against each regular expression, the action to be performed when the input matches that pattern is described. A Lex program is stored with the extension “.l”, like x.l. Lex turns the user’s expressions and actions into a program in the host general-purpose language. The generated routine is named yylex() and is placed in the file lex.yy.c. This lex.yy.c is the lexer in C. To run this C file, first compile lex.yy.c with cc, then run the executable a.out to test the output.

yylex() recognizes expressions in the input stream and performs the specified action for each expression as it is matched. See the following figure.

Assume that the Lex specification is prepared in the file x.l. Running x.l through lex gives lex.yy.c as output; compiling lex.yy.c with the C compiler gives a.out.

image

Regular expressions in Lex use the following operators:

 

a        the character “a”
"a"      an “a”, even if a is an operator
\a       an “a”, even if a is an operator
[ab]     the character a or b
[a-c]    the characters a, b, or c
[^a]     any character but a
.        any character but newline
^a       an a at the beginning of a line
<a>b     a b when Lex is in start condition a
a$       an a at the end of a line
a?       an optional a
a*       0, 1, 2, … instances of a
a+       1, 2, 3, … instances of a
a|b      an a or a b
(a)      an a
a/b      an a but only if followed by b
{aa}     the translation of aa from the definitions section
a{m,n}   m through n occurrences of a

Solved Problems

1. Write a lexical analyzer for input

 

void main()

{

    int a = 10;

}

Lex Code:

images

2. Here is a program in Lex to convert lowercase to uppercase, delete blanks at the ends of lines, and replace multiple blanks by a single blank.

 

%%

[a-z]  {putchar(yytext[0]+'A'-'a');}

[ ]+$  { /* delete blanks at end of line */ }

[ ]+   {putchar(' ');}

The following program copies an input file while adding 10 to every positive number divisible by 7.

 

%%
      int a;

[0-9]+ {
      a = atoi(yytext);
      if (a%7 == 0)
         printf("%d", a + 10);
      else
         printf("%d", a);
      }

In this program, the rule [0-9]+ recognizes strings of digits. The atoi() function converts the digit string to an integer and stores the result in “a.” The modulus operator % is used to check whether “a” is divisible by 7. If it is divisible, then “a” is incremented by 10; otherwise it is written to the output as it is.

Summary

  • A lexical analyzer reads the input text and divides it into a stream of tokens.
  • Token is a group of characters with logical meaning.
  • Pattern is a rule that describes tokens. Usually it is a regular expression.
  • Lexeme is actual text that matches the pattern and is recognized as token.
  • A lexical analyzer removes comment lines and white space characters.
  • Lex is an automated tool for generating scanners.
  • There are three different sections of LEX program: definitions, translation rules, and subroutines.
  • The output of the Lex compiler is in the lex.yy.c file.
  • Finite automaton is used for recognizing tokens.
  • Tokens are described by regular expressions.

Fill in the Blanks

  1. The lexical analyzer reads source text and divides them into ________.
  2. Pattern is a rule that describes a ________.
  3. In “if a != b”, the lexemes for the token id are ________
  4. ________ and ________ are the routines compulsory in the subroutine section.
  5. The lex code that deletes all blanks or tabs at the ends of lines is ________.
  6. The lex code to count number of lines is ________.
  7. ________ is the default ‘c’ file name Lex produces after compiling the Lex program to c program.
  8. ________ is an example of non token in C language.
  9. Can we combine lexical analysis phase with parsing? Yes or No. ________
  10. ________ is the advantage of having lexical analysis as a separate phase.
  11. ________ is the regular expression that generates all and only the strings over alphabet {0, 1} that ends in 1.

Objective Question Bank

  1. The example of lexical error is
    1. Xyz
    2. _xyz
    3. x2yz
    4. $xyz
  2. Which of the following is an example of non tokens in C program?
    1. if
    2. #ifdef
    3. rate
    4. none
  3. Which of the following is a regular expression for integer numbers and real numbers with scientific notation? Let D be [0–9]
    1. D+(.D+)?(E(+ | –)?D+)
    2. (.D+)?(E(+ | –)?D+)
    3. D+(.D+)?(E(+ | –)?D*)
    4. D+(.D+) | (E(D+)
  4. What are the tasks performed by a lexical analyzer?
    1. Removes comment lines and white space characters
    2. Correlates error messages.
    3. dividing text into tokens
    4. all
  5. Find the number of tokens in the following code if (x > y) z=0;
    1. 8
    2. 9
    3. 10
    4. 7
  6. * Find the number of tokens in following code printf(“&i = % x i = %d”, & i, i);
    1. 7
    2. 8
    3. 9
    4. 10
  7. Following is the function of a lexical analyzer:
    1. Giving errors
    2. removing blanks
    3. correlating err messages
    4. all
  8. Following is a regular expression for the set of all strings over the alphabet {a} that has an even number of a’s.
    1. aa*
    2. (aa)*
    3. aa*a
    4. a(aa)*
  9. Find the number of tokens in the following C code

     

    int max(int x, int y) {

     /* find max */

     return (x > y ? x : y); }

    1. 25
    2. 26 .
    3. 22
    4. 21
  10. Which of the following best describes the output of LEX?
    1. DFA
    2. Tokens
    3. parser
    4. lexemes
  11. * Which of the following strings can definitely be said to be tokens without looking at the next input characters while compiling a Pascal program?

    (I) begin (II) Programs (III) < >

    1. II
    2. III
    3. all
  12. * In a compiler, the module that checks every char of source text is called
    1. Code generation
    2. Code optimization
    3. Lexical analysis
    4. Syntax analysis.
  13. * Match pairs

    (a) Lexical analysis (p) DAGS

    (b) Code optimization (q) Syntax Tree

    (c) Code generation (r) PDA

    (d) Abelian Group. (s) FA

    1. a – s, b – q, c – p, d – r 
    2. a – s, b – p, c – q, d – r
    3. a – s, b – p, c – r, d – q
    4. none
  14. Tables created during lexical analysis are
    1. terminal table
    2. identifier table
    3. literal table
    4. non uniform symbol table
  15. Find the number of tokens in the following code:

    if (x >= y) z = 0;

    1. 8
    2. 9
    3. 10
    4. 11

Exercises

  1. Write regular definitions for specifying a floating point number in a programming language like C.
  2. Write regular definitions for specifying an integer array declaration in a language like C.
  3. Convert the following regular expressions into NFA with ε.
    1. 0 + (1 + 0)*00
    2. zero→0; one→1; bit→zero + one; bits→bit*
    3. (011 + 101 + 110 + 111)*
    4. (000 + 111)* + (101 + 010)+
  4. Write a lex program to scan an input C source file and replace all “float” with “double.”
  5. Write a lex program that scans an input C source file and produces a table of all macros (#defines) and their values.
  6. Write a lex program that reads an input file containing numbers (either integers or floating-point numbers) separated by one or more white spaces and adds all the numbers.
  7. Write a lex program that scans an input C source file and recognizes identifiers and keywords. The scanner should output a list of pairs of the form (token; lexeme), where token is either “identifier” or “keyword” and lexeme is the actual string matched.
  8. Run the lexical analyzer for the following C program and comment on tokens/output.

     

    main()

    {

    int a[3], t1t2;

    t1=2;

    a[0]=1; a[1]=2; a[t1]=3;

    t2 = -(a[2]+t1*6)/(a[2]-t1);

      if t2>5

             print(t2);

       else

       {

          int t3; t3=99; t2=-25;

          print(-t1 +t2*t3); /*this is a comment on 2 lines */

       } endif

    }

  9. Write a program for the lexical analyzer to divide the following C program to tokens.

     

    main()

    {

    int x, y, z;

    x=10;

    y=20;

    z=30;

    }

  10. Run the lexical analyzer for the following C program and comment on tokens/output.

     

    main()

    {

    int i, j;

    while(i<100)

    {

    i=i+10;

    printf(“%d ”, &i);

    }

    }

Key for Fill in the Blanks

  1. stream of tokens 
  2. Tokens
  3. a or b
  4. main(), yywrap()
  5. [ ]+$;
  6. \n {lno++;}
  7. lex.yy.c
  8. #include, #define, /*
  9. yes
  10. Design simplified.
  11. (0 + 1)*1+

Key for Objective Question Bank

  1. d
  2. b
  3. a
  4. d
  5. c
  6. d
  7. d
  8. b
  9. c
  10. a
  11. c
  12. c
  13. b
  14. d
  15. c

CODING:

Design a lexical analyzer that can handle the following sample input. The lexical analyzer should ignore redundant spaces, tabs, newlines, and comments.

 

Input:

main()

{

      int x, y;

      if (x > y)

         printf(“x is greater”);

      else

         y=10; /* this is just a comment */

}

Logic of the program:

  • Read the input text from the file input_text.dat.
  • Read the text, line by line and divide it into different words.
  • Match words with tokens like “opr” for operator, “num” for numeric constant, “id” for identifier, etc.
  • To distinguish between keyword and id, compare the word with the keyword list first, if it does not match then treat it as an id.
  • Print token along with lexeme and line number.

Lexical analyzer developed in C language

 

 /* Lexical Analyzer for any C program as input */

#include <stdio.h>
#include <string.h>
#include <ctype.h>

int main()
{
    int l, i, j, k, sno = 1;
    char *keys[6] = {"main", "int", "printf", "float", "if", "else"};
    char line[100] = {0}, str[20];
    FILE *fptr;

    fptr = fopen("input_txt.dat", "r");          /* input file input_txt */
    printf("S.no  Lexeme  Token  Line no.\n");   /* output format */
    printf("__________________________________\n");
    for (l = 1; !feof(fptr); l++)
    {
        if (fgets(line, 100, fptr) == NULL)
            break;
        for (i = 0; line[i] != '\0';)
        {
            if (line[i] == ' ' || line[i] == '\t')   /* skip blanks */
                i++;
            else if (line[i] == '\n')
                line[i] = '\0';
            else if (isalpha(line[i]))
            {
                j = 0;
                while (isalnum(line[i]))
                {
                    str[j] = line[i];
                    j++; i++;
                }
                if (line[i] == '[')                  /* check for arrays */
                {
                    str[j] = '\0';
                    printf("%d. %s array %d\n", sno, str, l); sno++;
                    i++; j = 0;
                    while (line[i] != ']')
                    {
                        str[j] = line[i];
                        j++; i++;
                    }
                    str[j] = '\0';
                    printf("%d. %s index %d\n", sno, str, l); sno++;
                    i++;
                }
                else                                 /* keyword or id */
                {
                    str[j] = '\0';
                    for (k = 0; k < 6; k++)
                        if (!strcmp(keys[k], str))
                            break;
                    if (k == 6)
                    {
                        printf("%d. %s identifier %d\n", sno, str, l); sno++;
                    }
                    else
                    {
                        printf("%d. %s keyword %d\n", sno, str, l); sno++;
                    }
                }
            }
            else if (isdigit(line[i]))               /* number constants */
            {
                j = 0;
                while (isdigit(line[i]))
                {
                    str[j] = line[i];
                    j++; i++;
                }
                str[j] = '\0';
                printf("%d. %s const %d\n", sno, str, l); sno++;
            }
            else if ((line[i] == '/') && (line[i + 1] == '*'))  /* comments */
            {
                printf("%d. /* start of comment %d", sno, l); sno++;
                i = i + 2;
                while (!(line[i] == '*' && line[i + 1] == '/'))
                {
                    i = i + 1;
                    if (line[i] == '\n' || line[i] == '\0')
                    {                                /* comment spans lines */
                        if (fgets(line, 100, fptr) == NULL)
                            break;
                        l++;
                        i = 0;
                    }
                }
                i = i + 2;
                printf(" %d\n", l);
            }
            else                                     /* operators etc. */
            {
                j = 0;
                str[j] = line[i];
                j++;
                switch (line[i])
                {
                case '+': case '-': case '*': case '/':
                case '=': case '>': case '<':
                case '%':
                    /* %d, %f, %c inside printf are taken as one symbol;
                       two-character operators such as <= are read here
                       as two separate operator tokens */
                    if (line[i] == '%' && (line[i + 1] == 'd' ||
                        line[i + 1] == 'f' || line[i + 1] == 'c'))
                    {
                        str[j] = line[++i];
                        i++; j++;
                        str[j] = '\0';
                        printf("%d. %s special symbol %d\n", sno, str, l); sno++;
                        j = 0;
                    }
                    else
                    {
                        str[j] = '\0';
                        printf("%d. %s operator %d\n", sno, str, l); sno++;
                        j = 0;
                        i++;
                    }
                    break;
                case '(': case ')': case '{': case '}':
                    str[j] = '\0';
                    printf("%d. %s parenthesis %d\n", sno, str, l); sno++;
                    j = 0;
                    i++;
                    break;
                case ';':
                    str[j] = '\0';
                    printf("%d. %s psymbol %d\n", sno, str, l); sno++;
                    j = 0;
                    i++;
                    break;
                default:
                    str[j] = '\0';
                    j = 0;
                    i++;
                    printf("%d. %s special symbol %d\n", sno, str, l); sno++;
                    break;
                }
            }
        }
    }
    fclose(fptr);
    return 0;
}

Output:

image
image

Lexical analyzer using the LEX tool

Program: Write the lexical analyzer using the ‘lex’ tool that can handle the following sample input.

 

Input:

main()

{

      int x,y;

      if (x > y)

         printf(“x is greater”);

      else

         y=10; /* this is just a comment */

}

Logic

  • Define regular expressions for all tokens-constants, keywords, ids, operators, etc.
  • When input matches with that pattern, define the action to be performed against each regular expression.
  • Include main () that calls yylex () function that is created by the lex tool.
  • Include the yywrap () function that deals with EOF.

Lex Program

image
image

After executing the above program for the given input, the OUTPUT of lex is as follows:

 

Lexeme                          Token              Line no
main                            keyword            1
(                               parenthesis        1
)                               parenthesis        1
{                               parenthesis        2
int                             keyword            3
x                               identifier         3
,                               special symbol     3
y                               identifier         3
;                               special symbol     3
if                              keyword            4
(                               parenthesis        4
x                               identifier         4
>                               relational opr     4
y                               identifier         4
)                               parenthesis        4
printf                          keyword            5
(                               parenthesis        5
"                               special symbol     5
x is greater                    comment            5
"                               special symbol     5
)                               parenthesis        5
;                               special symbol     5
else                            keyword            6
y                               identifier         7
=                               Assignment opr     7
10                              number constant    7
;                               special symbol     7
/* this is just a comment */    comment            7
}                               parenthesis        8
