Chapter 3

Syntax Definition – Grammars

The syntax of a high-level language is defined with a context free grammar. Hence, context free grammars are a powerful tool used by parsers to verify the syntax of any high-level language.

CHAPTER OUTLINE 

The structure of a sentence in any language is defined with grammars. Hence, we will discuss what grammar is, how many types of grammar there are, and what are the different representations of grammar. We mainly focus on Context Free Grammars (CFG). What is CFG? How is a language defined with CFG? What are derivation tree, left-most derivation, and right-most derivation? These are all explained in detail in this chapter. We shall also discuss the problems with CFG like left recursion, left factoring, ambiguous grammars, and how to simplify grammars.

3.1  Introduction

Grammar is a set of rules and examples dealing with the syntax and word structures of a language, usually intended as an aid to the learning of that language.

Formally, a grammar is defined as a 4-tuple (V, T, P, S),

where V is the set of nonterminals (variables),

      T is the set of terminals (primitive symbols),

      P is the set of productions (rules), which govern the relationship between nonterminals and terminals,

and   S is the start symbol from which the strings of the grammar are derived.

 

These productions or rules define the strings belonging to the language. The motivation for these grammars was from the description of natural language. Let us examine the rules used to define a sentence in the English language.

 

<sentence> → <noun> <verb>

<noun> → <prop-noun> | <pronoun>

<verb> → ate | sat | ran

<prop-noun> → Rama | Bhama

<pronoun> → She | He

Using this set of rules, many sentences can be derived by substituting for the variables.

  1. Rama ate.
  2. Bhama sat.
  3. She ran.
  4. He sat.

The language defined by a grammar G is denoted by L(G). Here L(G) represents the set of strings w derived from G. The start symbol S derives the strings or sentences of the grammar in one or more steps; this is represented by S ⇒+ w. A sentential form of the grammar is a combination of terminals and nonterminals derived from S in one or more steps; it is represented by S ⇒+ α. Two grammars G1 and G2 are equivalent if L(G1) = L(G2).

Some of the notations used to represent the grammar are:

  1. Terminals symbols: these are represented by
    • Lower case letters of alphabet like a, c, z, etc.
    • Operator symbols like +, –, etc.
    • Punctuation symbols like ( ), { }, comma, semicolon, etc.
    • Digits like 0…9, etc.
    • Bold face strings like int, main, if, else etc.
  2. Nonterminal symbols: these are represented by
    • Upper case letters of alphabet like A, C, Z, etc.
    • Letter S is the start symbol.
    • Lower case strings like expr, stmt, etc.
  3. Grammar symbols can be terminals or nonterminals. They can be represented by Greek letters α, β.
  4.  Lower case letters u, v, w, x, y, z are generally used to represent a string of terminals.

Language acceptance:

The derivation starts with the start symbol; at every step, a nonterminal is replaced by the right hand side of one of its rules. This continues until a string of terminals is derived. The set of terminal strings derived in this way is the language accepted by the grammar.

Consider the language represented by a+, that is, the set {a, aa, aaa, …}. To generate the strings of this language we define the grammar with productions S → a and S → aS. Now we get strings as follows:

Strings derived from S:

S ⇒ a gives {a}

S ⇒ aS ⇒ aa gives {aa}

S ⇒ aS ⇒ aaS ⇒ aaa gives {aaa}
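
This derivation procedure is easy to mechanize. Below is a minimal Python sketch (the grammar encoding, function name, and length bound are illustrative assumptions, not from the text) that repeatedly replaces the leftmost nonterminal to enumerate short strings of the grammar S → a | aS:

      # A minimal sketch of the derivation process: repeatedly replace the leftmost
      # nonterminal by each of its bodies. Encoding and names are illustrative only.
      GRAMMAR = {"S": ["a", "aS"]}          # S -> a | aS

      def derive_all(grammar, start="S", max_len=3):
          """Return the terminal strings of length <= max_len derivable from start."""
          results, work = set(), [start]
          while work:
              form = work.pop()
              # position of the leftmost nonterminal (an upper-case letter)
              pos = next((i for i, ch in enumerate(form) if ch.isupper()), None)
              if pos is None:                           # only terminals left
                  if len(form) <= max_len:
                      results.add(form)
                  continue
              if len(form) > max_len + 1:               # already too long: prune
                  continue
              for body in grammar[form[pos]]:
                  work.append(form[:pos] + body + form[pos + 1:])
          return sorted(results, key=len)

      print(derive_all(GRAMMAR))            # ['a', 'aa', 'aaa'] -- the strings {a, aa, aaa}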

3.2  Types of Grammars—Chomsky Hierarchy

Linguist Noam Chomsky defined a hierarchy of languages, in terms of complexity. This four-level hierarchy, called the Chomsky hierarchy, corresponds to four classes of machines. Each higher level in the hierarchy incorporates the lower levels, that is, anything that can be computed by a machine at the lowest level can also be computed by a machine at the next highest level.

The Chomsky hierarchy classifies grammars according to the form of their productions into the following levels:

  • Type 0 Grammars—Unrestricted Grammars (URG): These grammars include all formal grammars. In a URG, all productions are of the form α → β, where α and β may have any number of terminals and nonterminals; the only restriction is that the left hand side must contain at least one nonterminal. Every grammar is included in this class.

    Example:

    The language L(G) = {a^(2^i) | i ≥ 1} is generated by the unrestricted grammar below. Such a grammar is also called a phrase structured grammar. The grammar is represented by the following productions:

     

            S → A C a B

          C a → a a C

          C B → D B

          C B → E

          a D → D a

          A D → A C

          a E → E a

          A E → ε

    The string “aa” can be derived from the above grammar as follows:

     

    S ⇒ ACaB ⇒ AaaCB ⇒ AaaE ⇒ AaEa ⇒ AEaa ⇒ aa

     

    They generate exactly the languages that can be recognized by a Turing machine. The language recognized by a Turing machine is defined as the set of all strings on which it halts. These languages are also known as the recursively enumerable languages.

  • Type 1 Grammars—Context Sensitive Grammars (CSG): These grammars define the context sensitive languages. In a CSG, all productions are of the form α → β, where |α| ≤ |β|; α and β may have any number of terminals and nonterminals.

    These grammars can have rules of the form αAβ → αγβ, with A a nonterminal and α, β, and γ strings of terminals and nonterminals. A can be replaced by γ only where it lies between α and β; hence the name context sensitive grammar. The strings α and β may be empty, but γ must be nonempty. The rule S → ε is not allowed. These languages are exactly the languages that can be recognized by a linear bounded automaton.

    Example:

    The language L(G) = {a^n b^n | n ≥ 1} can be generated by the context sensitive grammar below. The grammar is represented by the following productions.

            S → S B C | a C

            B → a

         C B → B C

          Ba → aa

          C → b

          The string “aaabbb” can be derived from the above grammar as follows:

          S ⇒ SBC ⇒ SBCBC ⇒ aCBCBC ⇒ aBCCBC ⇒ aaCCBC ⇒ aaCBCC ⇒ aaBCCC ⇒ aaaCCC ⇒ aaabbb

  • Type 2 Grammars—Context Free Grammars (CFG): These grammars define the context free languages. They are defined by rules of the form A → β, where the left hand side is a single nonterminal A and β is a string of terminals and nonterminals. We can replace A by β regardless of where A appears; hence the name context free grammar. These languages are exactly those languages that can be recognized by a nondeterministic pushdown automaton. Context free grammars define the syntax of programming languages.

    Example:

    The language L(G) = {a^n b^n | n ≥ 0} is a context free language. The grammar is represented by the following productions.

    S → aSb | ε

    The string “aabb” can be derived from the above grammar as follows:

    S ⇒ aSb ⇒ aaSbb ⇒ aaεbb⇒ aabb

  • Type 3 Grammars—Regular Grammars (RG): These grammars generate the regular languages. Such a grammar restricts its rules to a single nonterminal on the left hand side. The right hand side consists of either a single terminal or a string of terminals with a single nonterminal at the left or the right end. Here rules can be of the form A → aB | a or A → Ba | a. The rule S → ε is also allowed here. These languages are exactly those languages that can be decided by a finite state automaton. This family of formal languages can also be described by regular expressions (RE). Regular languages are used to define search patterns and the lexical structure of programming languages.

    Example:

    Right linear grammar: A → a A | a

    Left linear grammar: A → A a | a

    An example of a regular grammar G with V = {S, A}, Σ = {a, b, c}, P consists of the following rules:

    S → aS, S → bA, A → ε, A → cA

    and S is the start symbol. This grammar describes the same language as the regular expression a*bc*.

    Every regular language is context free since the regular languages are a subset of the context free languages; every context free language is context sensitive since the context free languages are a subset of the context sensitive languages; and every context sensitive language is recursively enumerable. These are all proper inclusions, meaning that there exist recursively enumerable languages that are not context sensitive, context sensitive languages that are not context free, and context free languages that are not regular.

    RG ⊆ Context Free ⊆ Context sensitive ⊆ Recursively enumerable

Table 3.1 summarizes each of Chomsky’s four types of grammars, the class of languages it generates, the type of automaton that recognizes it, and the form of rules it must have. Here α and β represent a string of grammar symbols, that is, the string can be terminals or nonterminals.

 

Table 3.1 Chomsky’s Hierarchy of Grammars

Type 0 – Unrestricted grammars – recursively enumerable languages – Turing machine – rules α → β, where α contains at least one nonterminal

Type 1 – Context sensitive grammars – context sensitive languages – linear bounded automaton – rules α → β with |α| ≤ |β|

Type 2 – Context free grammars – context free languages – pushdown automaton – rules A → β with a single nonterminal A on the left

Type 3 – Regular grammars – regular languages – finite automaton – rules A → aB | a (or A → Ba | a)

Example 1:

Give a CSG that is not a CFG for {a^n | n ≥ 1}

      S → aS | X

     aS → aa

      X → a

Give a CFG that is not regular for {a^n | n ≥ 1}

      S → A S | a

      A → a

Give a regular grammar for {a^n | n ≥ 1}

      S → aS | a

Every regular grammar is context free, every CFG is context sensitive, and every CSG is unrestricted.

3.3  Grammar Representations

We need a way to represent grammars so that we can talk about them in general. For this, we have some standard representations.

  1. Backus-Naur Form (BNF): A metalanguage, BNF is widely used as a notation for the grammars of computer programming languages, command sets and communication protocols, as well as a notation for representing parts of natural language grammars.

    BNF is the most general representation scheme.

    Example:

        <syntax> ::= <rule>

    We use symbols like ::=, |, < >, ( ), etc.

    The main operators here are:

    1. Alternation "|"
    2. Concatenation "."
    3. Grouping "( )"
    4. Production sign "→" or "::="

    For example, A ::= α | β and B ::= aγ | bγ are in BNF.

    Some more examples are:

     

    <simple_expr> ::= <term> | <sign><term> | <simple_expr> <opr> <term>

    <opr> ::= + | - | or

    <unsign int> ::= <digit> | <unsign int><digit>

    where:

     

    •  ‘::=’ means “is defined as”
    •  ‘|’ means “or”
    •  Terms enclosed in angle brackets, such as <term>, are nonterminal symbols.
    • All other symbols (shown in boldface above) are terminal symbols.
    • The concatenation of strings is represented by the juxtaposition of their descriptions. (In the example above, note that there is no space between <sign> and <term>.)

    Note the recursive definition in the last example above. This is the standard way of specifying sequences of elements in BNF, and for describing nested syntax, as in:

     

    <stmt> ::= <uncond_stmt>

          | if <expr> then < uncond_stmt >

          | if <expr> then < uncond_stmt > else < stmt >

    BNF was first used in the Algol-60 Report.

  2. Extended Backus-Naur Form (EBNF): This improves the readability and conciseness of BNF through extensions:
    • Kleene cross—a sequence of one or more elements of the class marked.

      <unsign int> ::= <digit>+

    • Kleene star—a sequence of zero or more elements of the class marked.

      <id> ::= <letter><alphanumeric>*

    •  Braces can be used for grouping elements, allowing us to avoid having to define intermediate classes such as <alphanumeric>.

      <id> ::= <letter>{<letter>|<digit>}*

      Note: We generally use braces to mean zero or more repetitions of the enclosed elements (i.e., a Kleene star is assumed).

    • Square brackets may be used to indicate optional elements (i.e., zero or one occurrence of the enclosed elements).

      <integer> ::= [ + | – ]<unsigned integer>

      Do not use a Kleene cross or star with square brackets.

      Stacking alternatives results in an even more pictorial representation of the syntax, for example:

      <identifier> ::= <letter> { <letter> | <digit> }*

      <integer> ::= [ + | – ]<unsigned integer>

EBNF is a simple extension of BNF that makes it easier to represent complex forms.

We add two new symbols to our original BNF symbol set viz.

1. { }—Repetition or Closure: Used when a pattern may be repeated any number of times.

2.  []—Optional: Used to represent optional patterns whose inclusion is not necessary.
For example, to represent A → Aα | β:
The corresponding RE is βα*.
We can represent this in EBNF format as A ::= β { α }.
Another example is:
Consider the BNF form as follows:

 

      stmt → if (expr) stmt | if (expr) stmt else stmt

 

The equivalent EBNF form is:

 

      stmt → if (expr) stmt [else stmt]

 

Note: An advantage of EBNF is that it is very easy to write the corresponding parser for it.


Figure 3.1 Syntax Diagrams

3. Syntax Diagrams: These are also called railroad diagrams. Syntax diagrams are a way to represent a context free grammar graphically; they are an alternative to BNF or EBNF. Figure 3.1 shows syntax diagrams of the conditional statement "if."

Syntax diagrams were first used in the Pascal User Manual and Report by Jensen and Wirth. The grammar can be represented as a set of syntax diagrams. Each diagram defines a variable or nonterminal. There is a main diagram that defines the language in the following way: to belong to the language, a word must describe a path in the main diagram.

Each diagram has an entry point and an end point. The diagram describes possible paths between these two points by going through other nonterminals and terminals. Terminals are represented by round boxes, while nonterminals are represented by square boxes.

Example 2:

We use grammar that defines arithmetic expressions as an example. Following is the simplified BNF grammar for arithmetic expressions:

 

<Expr> ::= <Term> | <Term> “+” <Expr>

<Term> ::= <Fact> | <Fact> “*” <Term>

<Fact> ::= <num> | <id> | “(” <Expr> “)”

<id> ::= “x” | “y” | “z”

<num> ::= <digit> | <digit> <num>

<digit> ::= “0”|“1”|“2”|“3”|“4”|“5”|“6”|“7”|“8”|“9”

This grammar can also be expressed in EBNF:

 

Expr = Term, {“+”, Term};

Term = Fact, {“*”, Fact};

Fact = num | id | “(”, Expr, “)”;

id = “x” | “y” | “z”;

num = digit, {digit};

digit = “0”|“1”|“2”|“3”|“4”|“5”|“6”|“7”|“8”|“9”;

The set of railroad diagrams or syntax diagrams for this grammar is shown in Figure 3.2.

Nonterminals are represented by rectangular boxes, whereas terminals are represented by circles or ellipses.


Figure 3.2 Syntax Diagrams for Expressions
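
Because the EBNF rules above use only iteration and simple alternatives, they map directly onto a recursive-descent recognizer: one function per nonterminal, with a loop for each { … } repetition. The following is a minimal Python sketch under that reading; the single-character tokenizer, the end marker "$", and the function names are illustrative assumptions, not part of the grammar:

      # A minimal recursive-descent recognizer for the EBNF expression grammar above.
      def parse(text):
          tokens = list(text.replace(" ", "")) + ["$"]     # "$" marks end of input
          pos = 0

          def peek():
              return tokens[pos]

          def eat(expected):
              nonlocal pos
              if tokens[pos] != expected:
                  raise SyntaxError(f"expected {expected!r}, got {tokens[pos]!r}")
              pos += 1

          def expr():                       # Expr = Term, {"+", Term}
              term()
              while peek() == "+":
                  eat("+"); term()

          def term():                       # Term = Fact, {"*", Fact}
              fact()
              while peek() == "*":
                  eat("*"); fact()

          def fact():                       # Fact = num | id | "(", Expr, ")"
              if peek() == "(":
                  eat("("); expr(); eat(")")
              elif peek() in "xyz":         # id = "x" | "y" | "z"
                  eat(peek())
              elif peek().isdigit():        # num = digit, {digit}
                  while peek().isdigit():
                      eat(peek())
              else:
                  raise SyntaxError(f"unexpected token {peek()!r}")

          expr()
          eat("$")                          # the whole input must be consumed
          return True

      print(parse("(x+3)*y"))               # True: the string conforms to the grammar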

3.4  Context Free Grammars

For a regular grammar, the productions are restricted in two ways: the left side must be a single variable, and the right side has a special form (a single terminal, optionally accompanied by a single nonterminal). To create grammars that are more powerful, we must relax some of these restrictions. By permitting anything on the right side, but retaining the restriction on the left side, we get context free grammars.

A grammar G = (V, T, P, S) is said to be context free if all productions in P have the form A → x, where A ∈ V and x ∈ (V ∪ T)*.

V, T, P, S are the four important components in the grammatical description of a language.

V—the set of variables, also called nonterminals. Each variable represents a set of strings, simply a language.

T—the set of terminals, which are a set of symbols that forms the strings of the language, also called terminal symbols.

P—the finite set of productions or rules that represent the recursive definition of the language. Each production consists of a variable, the production symbol →, and a string of terminals and nonterminals; this string is called the body of the production.

S—the start symbol. It is one of the variables that represent the language being defined.

 

The language generated (defined, derived, produced) by a CFG is the set of all strings of terminals that can be produced from the start symbol S using the productions as substitutions. A language generated by a CFG is called a context free language (CFL).

Example 3:

 

Terminals: a

Nonterminals: S

Productions: S → aS, S → ε

This is a simple CFG that defines L(G) = a*, where V = {S} and T = {a}.

Example 4:

The CFG defining palindromes over {a, b}.

The productions P are:

      S → ε | a | b

      S → aSa

      S → bSb

And the grammar is G = ({S}, {a,b}, P, S)

Example 5:

The CFG for the set of strings with an equal number of a's and b's.

The productions P are:

      S → SaSbS | SbSaS | ε

And the grammar is G = ({S}, {a,b}, P, S)

Example 6:

The context free grammar for syntactically correct infix algebraic expressions in the variables x, y, and z:

And the grammar is G = ({S, T}, {+, *, /, (, ), –, x, y, z}, P, S)

      S → T + S | T – S | T

      T → T * T | T / T

      T → (S)

      T → x | y | z

This grammar can generate the string (x + y) * x – z * y / (x + x).

Example 7:

A context free grammar for the language consisting of all strings over {a, b} which contain a different number of a’s than b’s is

      S → U | V

      U → T a U | T a T

      V → T b V | T b T

      T → a T b T | b T a T | ε

Here, ‘T’ can generate all strings with the same number of a's as b's, ‘U’ generates all strings with more a's than b's, and ‘V’ generates all strings with fewer a's than b's.

Example 8:

a. Give the CFG for RE (011 +1)*(01)*

Solution:

CFG for (011 + 1)* is A → CA | ε

      C → 011 | 1

CFG for (01)* is B → DB | ε

      D → 01

Hence, the final CFG is S → AB

        A → CA | ε

        C → 011 | 1

        B → DB | ε

        D → 01

 

b. Give the CFG for the language L(G) = {a^n b^(2n) | n ≥ 1}.


Solution:

      Each a contributes a pair of b's, so every string has the form a^n followed by (bb)^n.

      Hence, it can be defined as

        S → aSbb | abb

 

c. Give the CFG for the language containing all strings with different first and last symbols over Σ = {0, 1}

Solution:

The strings should start and end with different symbols 0, 1. But in between we can have any string on 0,1 i.e., (0 + 1)*. Hence, the language is

 

      0(0 + 1)*1 | 1(0 + 1)*0. The grammar can be given by

        S → 0A1 | 1A0

        A → 0A | 1A | ε

3.5  Derivation of CFGs

It is the process of deriving a string from a grammar by applying the rules, starting from the start symbol. We can derive terminal strings, beginning with the start symbol, by repeatedly replacing a variable or nonterminal with the body of one of its productions. The language of a CFG is the set of terminal strings we can derive in this way; such a language is called a context free language.

Example 9:

Derive 'a^4' from the grammar given below

 

Terminals: a

Nonterminal: S

Productions: S → aS, S → ε

Solution:

The derivation for a^4 is:

      S ⇒ aS

        ⇒ aaS

        ⇒ aaaS

        ⇒ aaaaS

        ⇒ aaaaε ⇒ aaaa

The language contains the strings {ε, a, aa, aaa, …}

Example 10:

Derive "a^2" from the grammar

 

Terminals: a

Nonterminal: S

Productions: S → SS, S → a, S → ε

Solution:

Derivation of a^2 is as follows:

      S ⇒ SS

        ⇒ SSS

        ⇒ SSa

        ⇒ SSSa

        ⇒ SaSa

        ⇒ εaSa

        ⇒ εaεa = aa

The string can also be derived as

      S ⇒ SS

        ⇒ Sa

        ⇒ aa

Example 11:

Find L(G) and derive “abbab.”

 

Terminals: a, b

Nonterminal: S

Productions: S → aS, S → bS, S → a, S → b

Solution:

More compact notation:

      S → aS | bS | a | b

Derive abbab as follows:

      S ⇒ aS

        ⇒ abS

        ⇒ abbS

        ⇒ abbaS

        ⇒ abbab

L(G) is (a + b)+

Example 12:

Find the language and derive ‘abbaaba’ from the following grammar:

 

Terminals: a, b

Nonterminals: S, X

Productions: S → XaaX, X → aX | bX | ε

Solution:

The CFL is (a + b)* aa (a + b)*

Derive ‘abbaaba’ as follows:

      S ⇒ XaaX

        ⇒ aXaaX

        ⇒ abXaaX

        ⇒ abbXaaX

        ⇒ abbεaaX ⇒ abbaaX

        ⇒ abbaabX

        ⇒ abbaabaX

        ⇒ abbaabaε ⇒ abbaaba

3.6  Language Defined by Grammars

The only way to recognize the language is to derive various strings using the given production rules. Simply by observing the derived strings, one can determine the language generated by the given CFG.

Example 13:

Give the language defined by grammar G.

G = {{S}, {a},{S → SS},S}

Solution: L(G) = Φ

Example 14:

Give the language defined by grammar G.

G = {{S,C},{a,b}, P, S} where P is given by

      S → aCa

      C → aCa | b

Solution:

S ⇒ aCa

      ⇒ aaCaa

      ⇒ aaaCaaa

L(G) = {a^n b a^n | n ≥ 1}

Example 15:

Give the language defined by grammar G.

G = {{S},{0,1},P,S} where P is given by

      S → 0S1 | ε

Solution:

S ⇒ 0S1 ⇒ 00S11 ⇒ 0011

L(G) = {0^n 1^n | n ≥ 0}

3.6.1  Leftmost and Rightmost Derivation

The leftmost nonterminal in a working string is the first nonterminal that we encounter when we scan the string from left to right.

For example, in the string "bbabXbaYSbXbY," the leftmost nonterminal is X.

 

If a word w is generated by a CFG by a certain derivation and at each step in the derivation, a rule of production is applied to the leftmost nonterminal in the working string, then this derivation is called a leftmost derivation (LMD).

 

In practice, whenever we replace the leftmost variable first at every step, the resulting derivation is a leftmost derivation. Similarly, replacing the rightmost variable first at every step gives a rightmost derivation (RMD).

Example 16:

Consider the CFG ({S, X}, {a, b}, P, S)

where productions are:

      S → baXaS | ab

      X → Xab | aa

      Find LMD and RMD for string w = baaaababaab.

Solution:

The following is an LMD:

S ⇒ baXaS

      ⇒ baXabaS

      ⇒ baXababaS

      ⇒ baaaababaS

      ⇒ baaaababaab

The following is an RMD:

S ⇒ baXaS

      ⇒ baXaab

      ⇒ baXabaab

      ⇒ baXababaab

      ⇒ baaaababaab

Any word that can be generated by a given CFG has both an LMD and an RMD.

Example 17:

Consider the CFG:

      S → aB | bA

      A → a | aS | bAA

      B → b | bS | aBB

Find LMD and RMD for string w = aabbabba.

Solution:

The following is an LMD:

      S ⇒ aB ⇒ aaBB

        ⇒ aabSB

        ⇒ aabbAB

        ⇒ aabbaB

        ⇒ aabbabS

        ⇒ aabbabbA

        ⇒ aabbabba

The following is an RMD:

      S ⇒ aB ⇒ aaBB

        ⇒ aaBbS

        ⇒ aaBbbA

        ⇒ aaBbba

        ⇒ aabSbba

        ⇒ aabbAbba

        ⇒ aabbabba

3.6.2  Derivation Tree

The process of derivation can be shown pictorially as a tree, called a derivation tree, that illustrates how a word is derived from a CFG. These trees are also called syntax trees or parse trees. They clearly depict how the symbols of a terminal string are grouped into substrings, each of which belongs to the language defined by one of the variables of the grammar.

For constructing a parse tree for a grammar G = (V, T, P, S)

  • the start symbol S becomes root for the derivation tree
  • variable or nonterminal in set V is marked as interior node
  • Leaf node can be a terminal, or ε.
  • For each production A → Y1Y2…Yk in P, if an interior node is labeled A, then its children are labeled Y1, Y2, …, Yk respectively, from left to right.

Example 18:

CFG:

Terminals: a, b

Nonterminals: S, B

Productions: S → BBB | BB

        B → BB | bB | Ba | a | b

String “ababa” has the following derivation tree:

image

By concatenating the leaves of the derivation tree from left to right, we get a string known as the yield of the derivation tree. The yield is always a string derived from the root of the tree. The derivation trees whose yields are in the language of the underlying grammar are of special importance. Such trees satisfy the following conditions:

  • The root is labeled by the start symbol.
  • The yield is a terminal string; all leaves are labeled with a terminal or ε.
  • The strings obtained in the intermediate steps are called sentential forms.

3.6.3  Equivalence of Parse Trees and Derivations

A terminal string is in the language of a grammar iff it is the yield of at least one parse tree. The existence of leftmost derivation, rightmost derivation, and parse trees are equivalent conditions that define exactly the strings in the language of a CFG.

There are some CFGs for which it is possible to find a terminal string with more than one parse tree or, equivalently, more than one leftmost derivation and more than one rightmost derivation. Such a grammar is called ambiguous.

High-level programming languages are generally the class of languages that can be parsed with a CFG.

Example 19:

The parse tree below represents a leftmost derivation according to the grammar

      S → AB, A → aS | a, B → bA.

image

Give the left-sentential forms in this derivation.

Solution:

To construct a leftmost derivation from a parse tree, we always replace the leftmost nonterminal in a left-sentential form. It helps to remember that each symbol of each left-sentential form corresponds to a node N of the parse tree. When a nonterminal is replaced by a production body, the introduced symbols correspond to the children of node N, in order from the left.

Thus, we start with the left-sentential form S, where the symbol S corresponds to the root of the parse tree. At the first step, S is replaced by AB, the children of the root; A corresponds to the left child of the root, and B to the right child. Since A is the leftmost nonterminal of AB, it is replaced by the string “aS,” formed by the labels of the two children of the left child of the root. The next left-sentential form is thus aSB. We proceed in this manner, next replacing the S. The complete leftmost derivation is:

      S ⇒ AB ⇒ aSB⇒ aABB ⇒ aaBB ⇒ aabAB ⇒ aabaB ⇒ aababA ⇒ aababa.

So S, AB, aSB, aABB, aaBB, aabAB, aabaB, aababA are left-sentential forms, whereas "aSbA" is not a left-sentential form.

This is actually a sentential form, but neither rightmost nor leftmost. One derivation according to this parse tree is S ⇒ AB ⇒ aSB ⇒ aSbA. Remember, in a leftmost derivation, we must replace the leftmost nonterminal at every step.

Example 20:

Draw the derivation tree for the following natural language grammar.

<sentence> → <noun phrase> <verb phrase>

<noun phrase> → <article> <noun>

<verb phrase> → <verb> <noun phrase>

<article> → an|a

<noun> → <adjective> <noun>

<verb> → is | was

<noun> → Tiger | Cat | elephant | animal

<adjective> → small | huge | dangerous

The parse tree for the sentence “An elephant is a huge animal” is given by

image

Issues to resolve when writing a CFG for a programming language:

  1. Left recursion
  2. Indirect left recursion
  3. Left factoring
  4. Ambiguity

3.7  Left Recursion

A grammar is left recursive if it has a nonterminal A such that there is a derivation

A ⇒+ Aα for some string α. If such a grammar is used by some parsers (top-down parsers), the parser may go into an infinite loop. Consider the following left recursive grammar:

A → Ab | a

To derive the string "abbb," a top-down parser cannot tell in advance how many times the nonterminal "A" has to be expanded. As the grammar is left recursive, the tree grows toward the left.

image

Top-down parsing techniques cannot handle left-recursive grammars. So, we have to convert our left-recursive grammar into an equivalent grammar, which is not left-recursive. The left-recursion may appear in a single step of the derivation (immediate left-recursion), or may appear in more than one step of the derivation.

Immediate Left-Recursion:

A grammar G is left recursive if there is rule of the form

A → Aα | β, where β does not start with A; that is, the leftmost symbol on the right hand side is the same as the nonterminal on the left hand side.

To eliminate immediate left recursion rewrite the grammar as

A → βA′ and A′ → αA′ | ε. This is an equivalent grammar G′, which is free of left recursion.

L(G) is βα* and L(G′) is also βα*, treating α and β as single symbols.

In general,

A → A α1 | … | A αm | β1 | … | βn where β1… βn do not start with A

      ⇓ Eliminate immediate left recursion

      A → β1 A′ | … | βn A′

      A′ → α1 A′ | … | αm A′ | ε, giving an equivalent grammar

Immediate Left-Recursion – Example

      E → E + T | T

      T → T * F | F

      F → id | (E)

After eliminating immediate left recursion, we get grammar as

      E → T E′

      E′ → + T E′ | ε

      T → F T′

      T′ → *F T′ | ε

      F → id | (E)

Left-Recursion – Problem

A grammar may not be immediately left-recursive, but it can still be left-recursive. By eliminating only the immediate left-recursion, we may not obtain a grammar that is free of left-recursion.

      S → Aa | b

      A → Sc | d This grammar is not immediately left-recursive, but it is still left-recursive.

      S ⇒ Aa ⇒ Sca        or

      A ⇒ Sc ⇒ Aac causes left-recursion

So, we have to eliminate all left-recursions from our grammar

Eliminate Left-Recursion – Algorithm

Arrange nonterminals in some order: A1…An

 

for i from 1 to n do {

      for j from 1 to i–1 do {

        replace each production

          Ai → Ajγ

           by

          Ai → α1γ | ... | αkγ

          where Aj → α1 | ... | αk

    }

  eliminate immediate left–recursions among Ai productions

}

Example 21:

Eliminate left-recursion in the following grammar:

        S → Aa | b

        A → Ac | Sd | f

Order of nonterminals: S, A

for S:

      We do not enter the inner loop.

      There is no immediate left recursion in S.

For A:

      Replace A → Sd with A → Aad | bd

      So, we will have A → Ac | Aad | bd | f

      Eliminate the immediate left-recursion in A

        A → bdA′ | fA′

        A′ → cA′ | adA′ | ε

So, the resulting equivalent grammar, which is not left-recursive is:

        S → Aa | b

        A → bdA′ | fA′

        A′ → cA′ | adA′ | ε
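
The algorithm (and Example 21) can be checked mechanically. The following is a minimal Python sketch of the procedure; the grammar encoding, the function name, and the use of [] for an ε body are illustrative assumptions, not from the text:

      # A sketch of the left-recursion elimination algorithm above. A grammar is a
      # dict mapping each nonterminal to a list of alternative bodies, each body a
      # list of symbols; names and encoding are illustrative only.
      def eliminate_left_recursion(grammar, order):
          g = {a: [list(b) for b in bodies] for a, bodies in grammar.items()}
          for i, ai in enumerate(order):
              # substitute earlier nonterminals Aj (j < i) appearing first in Ai bodies
              for aj in order[:i]:
                  new_bodies = []
                  for body in g[ai]:
                      if body and body[0] == aj:
                          new_bodies += [alt + body[1:] for alt in g[aj]]
                      else:
                          new_bodies.append(body)
                  g[ai] = new_bodies
              # eliminate immediate left recursion among the Ai productions
              recursive = [b[1:] for b in g[ai] if b and b[0] == ai]      # Ai -> Ai alpha
              others = [b for b in g[ai] if not (b and b[0] == ai)]       # Ai -> beta
              if recursive:
                  ai_new = ai + "'"
                  g[ai] = [b + [ai_new] for b in others]
                  g[ai_new] = [alt + [ai_new] for alt in recursive] + [[]]   # [] is epsilon
          return g

      # S -> Aa | b,  A -> Ac | Sd | f   (the grammar of Example 21)
      g = {"S": [["A", "a"], ["b"]], "A": [["A", "c"], ["S", "d"], ["f"]]}
      print(eliminate_left_recursion(g, ["S", "A"]))
      # S -> Aa | b,  A -> bdA' | fA',  A' -> cA' | adA' | epsilon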

Example 22:

Eliminate left-recursion in the following grammar

        S → Aa | b

        A → Ac | Sd | ε

        Order of nonterminals: A, S

For A:

      We do not enter the inner loop.

      Eliminate the immediate left-recursion in A

        A → SdA′ | A′

        A′ → cA′ | ε

For S:

      Replace S → Aa with S → SdA′a | A′a

      So, we will have S → SdA′a | A′a | b

      Eliminate the immediate left-recursion in S

        S → A′aS′ | bS′

        S′ → dA′aS′ | ε

So, the resulting equivalent grammar, which is not left-recursive is:

        S → A′aS′ | bS′

        S′ → dA′aS′ | ε

        A → SdA′ | A′

        A′ → cA′ | ε

So, by eliminating left recursion, we can prevent a top-down parser from going into an infinite loop.

3.8  Left-Factoring

Sometimes we find a common prefix in several productions, as in A → αβ1 | αβ2 | αβ3, where α is the common prefix. After processing α we cannot decide whether to expand A by αβ1 or by αβ2, so the parser needs backtracking. To avoid this problem, the grammar can be left factored.

Left factoring is a process of factoring the common prefixes of the alternatives of grammar rule. If the production of the form A → αβ1 | αβ2 | αβ3 has α as common prefix, by left factoring we get the equivalent grammar as

      A → αA′

      A′ → β1 | β2 | β3

So, we can immediately expand A to αA′.

One can also perform left factoring to avoid backtracking or for factoring the common prefixes. If the end result has no look ahead or backtracking needed, the resulting CFG can be solved by a “predictive parser” and coded easily in a conventional language. If backtracking is needed, a recursive descent parser takes more work to implement, but is still feasible.

As a more concrete example:

      S → if E then S1

      S → if E then S1 else S2

can be factored to:

      S → if E then S1 S′

      S′ → else S2 | ε

A predictive parser (a top-down parser without backtracking) insists that the grammar must be left-factored.

Example:

Consider the following grammar

      stmt → if expr then stmt else stmt

        | if expr then stmt

When we see “if,” we cannot know which production rule to choose to rewrite stmt in the derivation. To left factor the above grammar, take out the common prefix and rewrite the grammar as follows:

      stmt → if expr then stmt stmt′

      stmt′ → else stmt | ε

Left-Factoring – Algorithm

  • For each nonterminal A with two or more alternatives (production rules) with a common nonempty prefix, let us say

          A → αβ1 | … | αβn | γ1 | … | γm

    Convert it into

          A → αA′ | γ1 | ... | γm

          A′ → β1 | … | βn

Example 23:

Left-factor the following grammar

      A → abB | aB | cdg | cdeB | cdfB

Solution:

Here the common prefixes are "a" and "cd." So first take out "a" and rewrite the grammar as

      A → aA′ | cdg | cdeB | cdfB

      A′ → bB | B

Now take out the common prefix ‘cd’ and rewrite the grammar as

      A → aA′ | cdA″

      A′ → bB | B

      A″ → g | eB | fB
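
A single left-factoring pass such as the one used in Example 23 can be sketched in Python as follows; the grammar encoding, the function names, and the "eps" marker for an ε body are illustrative assumptions:

      # One pass of left factoring, following the algorithm above: group the
      # alternatives of a rule by their first symbol and factor out the longest
      # common prefix of each group.
      def common_prefix(bodies):
          """Longest common prefix of a list of symbol tuples."""
          prefix = []
          for symbols in zip(*bodies):
              if len(set(symbols)) != 1:
                  break
              prefix.append(symbols[0])
          return tuple(prefix)

      def left_factor(nonterminal, bodies):
          """One pass of left factoring; 'eps' stands for an epsilon body."""
          rules, counter, groups = {}, 0, {}
          for body in bodies:
              groups.setdefault(body[:1], []).append(body)    # group by first symbol
          new_bodies = []
          for first, group in groups.items():
              if len(group) < 2 or not first:
                  new_bodies.extend(group)                    # nothing to factor here
                  continue
              counter += 1
              prefix = common_prefix(group)
              fresh = nonterminal + "'" * counter             # A', A'', ...
              new_bodies.append(prefix + (fresh,))
              rules[fresh] = [body[len(prefix):] or ("eps",) for body in group]
          rules[nonterminal] = new_bodies
          return rules

      # A -> abB | aB | cdg | cdeB | cdfB   (the grammar of Example 23)
      print(left_factor("A", [("a", "b", "B"), ("a", "B"), ("c", "d", "g"),
                              ("c", "d", "e", "B"), ("c", "d", "f", "B")]))
      # A -> aA' | cdA'',  A' -> bB | B,  A'' -> g | eB | fB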

Example 24:

Left-factor the following grammar

      A → ad | a | ab | abc | b

      ⇓

Solution:

Here common prefixes are “a” and “ab.” So first take out “a.”

      A → aA′ | b

      A′ → d | ε | b | bc

Now take out the common prefix “b” and rewrite the grammar as

        ⇓

      A → aA′ | b

      A′→ d | ε | bA″

      A″ → ε | c

The above problem can also be solved by taking the longest match “ab” first.

      A → ad | a | ab | abc | b

        ⇓

      A → abA′ | b | ad | a

      A′→ ε | c

Now there is a common prefix “a.” So take that out now

      A → aA″ | b

      A′ → ε | c

      A″ → ε | bA′ |d

Finally, both ways we get the same solution.

So by left factoring we can avoid backtracking.

3.9  Ambiguous Grammar

A CFG is ambiguous if there exists more than one parse tree or, equivalently, more than one leftmost derivation (or more than one rightmost derivation) for at least one word in its CFL.

image

Remember that there is no algorithm that automatically checks whether a grammar is ambiguous or not. The only way to check ambiguity is “to choose an appropriate input string and by trial and error find the number of parse trees.” If more than one parse tree exists, the grammar is ambiguous.

There is no algorithm that converts an ambiguous grammar to its equivalent unambiguous grammar.

Example 25:

Show that the following grammar is ambiguous.

      E → id | E + E

      | E * E | E – E

Solution:

LMD: for string id + id * id is

      E ⇒ E + E

        ⇒ id + E

        ⇒ id + E * E

        ⇒ id + id * E

        ⇒ id + id * id

RMD: for string id + id * id is

      E ⇒ E * E

        ⇒ E * id

        ⇒ E + E * id

        ⇒ E + id * id

        ⇒ id + id * id

Parse trees represented by above LMD and RMD are as follows:

image

As there is more than one parse tree, the grammar is ambiguous.
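
The trial-and-error check described above can also be mechanized for a chosen string: count the distinct leftmost derivations and report ambiguity if there are two or more. The following is a minimal Python sketch for the grammar of Example 25; the encoding, the function name, and the pruning rule (which assumes the grammar has no ε-productions, true here) are illustrative assumptions:

      # Count the distinct leftmost derivations of one chosen target string.
      def count_leftmost_derivations(grammar, start, target):
          def count(form):
              i = 0                                      # find the leftmost nonterminal
              while i < len(form) and form[i] not in grammar:
                  i += 1
              if i == len(form):                         # all terminals
                  return 1 if form == target else 0
              # prune: the terminal prefix must match, and the form can never shrink
              if form[:i] != target[:i] or len(form) > len(target):
                  return 0
              head, nt, tail = form[:i], form[i], form[i + 1:]
              return sum(count(head + body + tail) for body in grammar[nt])
          return count(start)

      # E -> id | E + E | E * E | E - E   (the grammar of Example 25)
      g = {"E": [("id",), ("E", "+", "E"), ("E", "*", "E"), ("E", "-", "E")]}
      print(count_leftmost_derivations(g, ("E",), ("id", "+", "id", "*", "id")))  # 2

The answer 2 confirms the two parse trees found above for id + id * id.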

Example 26:

The grammar G: S→ SS | a | b is ambiguous. This means at least some of the strings in its language have more than one leftmost derivation. However, it may be that some strings in the language have only one derivation. Identify the string that has exactly two leftmost derivations in G.

Solution:

A string of length 1 has only one leftmost derivation, for example, S ⇒ a. A string of length 2 has only one derivation, for example, S ⇒ SS ⇒ aS ⇒ ab.

However, a string of length 3 has exactly two derivations, for example, S ⇒ SS ⇒ SSS ⇒ aSS ⇒ abS ⇒ aba and S ⇒ SS ⇒ aS ⇒ aSS ⇒ abS ⇒ aba. In general, we can decide whether the first S generates a single terminal or two S's.

On the other hand, strings of length four or more have more than two derivations. We can either start S ⇒ SS ⇒ SSS ⇒ SSSS or S ⇒ SS ⇒ SSS ⇒ aSS ⇒ aSSS or S ⇒ SS ⇒ aS ⇒ aSS, and there are other options as well.

Hence, strings like "aaa," "bbb," "aba," "bab," etc., have exactly two leftmost derivations.

Example 27:

Consider the grammar ‘G’ with

      terminals: a, b

      nonterminals: S

      productions: S → aS | Sa | a.

Show that G is ambiguous.

Solution:

The word “aa” can be generated by two different trees:

image

Therefore, this grammar is ambiguous.

Example 28:

Consider the grammar “G” with

      terminals: a, b

      nonterminals: S, X

      productions: S→ aS | aSb | X

        X → Xa | a

      Show that G is ambiguous.

Solution:

The word “aa” has two different derivations that correspond to different syntax trees:

      S ⇒ X ⇒ Xa ⇒ aa    S ⇒ aS ⇒ aX ⇒ aa

image

Hence, the grammar is ambiguous.

Example 29:

The grammar ‘G’ for ‘PALINDROMES’ is

      S → aSa | bSb | a | b | ε. Check if G is ambiguous.

Solution:

The grammar can generate the word “babbab” as follows:

      S ⇒ bSb

        ⇒ baSab

        ⇒ babSbab

        ⇒ babbab

which has the following derivation tree:

image

Since every string of the language has only one parse tree, the grammar is unambiguous.

Example 30:

Check whether the given grammar is ambiguous or not.

        S → i C t S | i C t S e S | a

        C → b

Solution:

To check the ambiguities take an input string. Let us say the input string is “ibtibtaea.” Let us draw the derivation trees.

image

Hence, grammar is ambiguous.

3.10  Removing Ambiguity

There is no algorithm that directly converts an ambiguous grammar to an equivalent unambiguous grammar. But on analyzing the grammar, if we identify what is missing in the grammar and why it is ambiguous, then we can write an equivalent unambiguous grammar.

For example, consider the expression grammar given below:

 

Expr → Expr + Expr | Expr * Expr | id

 

If we take a string id + id * id or id + id + id, we get two parse trees.

So if we analyze the grammar with the above two strings, we can understand the following.

  1. Precedence of operators is not taken care of. Hence, you can derive the string replacing either Expr with Expr + Expr or with Expr * Expr.
  2. Associative property is also not taken care of.

    If we take a string id + id + id, we can replace Expr with Expr + Expr.

    Now we can replace either left Expr or right Expr.

So write equivalent unambiguous grammar by taking care of precedence and associativity.

To take care of precedence rewrite the grammar by defining rules starting with lowest precedence to highest precedence.

For example, in the given grammar 'id' binds most tightly, * has the next highest precedence, and + has the least.

So do not define all of them at the same level but separate them into different levels by introducing extra nonterminals. Start defining the rules with +, then *, and finally with id as shown below.

 

      E → E + T | T

      T → T * F | F

      F → id

 

For example, suppose we now want to add the unary minus operator. As unary minus has higher precedence than * or +, add the rule for – after * by introducing a new nonterminal P. So the resulting grammar is

 

      E → E + T | T

      T → T * F | F

      F → -P | P

      P → id

 

To ensure associativity, define the rule as left recursive if the operator is left associative, and as right recursive if the operator is right associative. In the given grammar, + and * are left associative, so the rules must be left recursive, that is, of the form A → Aa | b. The equivalent unambiguous grammar is

 

      E → E + T | T

      T → T * F | F

      F → id

 

This procedure can be used for any expression grammar.
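
For instance, with the layered grammar above the string id + id * id has exactly one leftmost derivation, and the parse forces * below + in the tree:

      E ⇒ E + T ⇒ T + T ⇒ F + T ⇒ id + T ⇒ id + T * F ⇒ id + F * F ⇒ id + id * F ⇒ id + id * id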

Example:

Convert the following grammar to equivalent unambiguous grammar.

      bExpr → bExpr and bExpr

        | bExpr or bExpr

        | not bExpr

        | true

        | false

Solution:

This is the grammar that defines Boolean expressions with the basic operators 'and', 'or', and 'not'. Here the priority of operators is highest for 'not', then 'and', and least for 'or'. So start defining 'or', then 'and', and finally 'not'. The resulting grammar is

      E → E or T | T

      T → T and F | F

      F → not F | true | false

Example:

Convert the following grammar to equivalent unambiguous grammar.

      R → R + R | RR | R* | a | b

Solution:

This is the grammar that defines regular expressions with the basic operations union, concatenation, and Kleene closure. Here, the priority of operators is highest for closure, next highest for concatenation, and least for union.

So start defining least to highest. So the resulting grammar is

      E → E + T | T

      T → T F | F

      F → F* | a | b

Conversely, given such a grammar, we can use the same principles to read off the precedence and associativity of the operators. For example,

      A → A # B | B

      B → C $ B | C

      C → C @ D | D

      D → a | b

Here "#," "$," and "@" are "unknown" binary operators. From the grammar, we can get precedence and associativity of operators as follows:

      "@" has the highest precedence and is left associative

      "$" has the next highest precedence and is right associative

      "#" has the least precedence and is left associative

3.11  Inherent Ambiguity

A CFL is said to be inherently ambiguous if it can be defined only by ambiguous grammars; it cannot be defined by any unambiguous grammar. Following is a grammar for an inherently ambiguous language: L = {a^i b^j c^k | i = j or j = k}

Here L = {a^i b^i c^k} ∪ {a^i b^k c^k}

Grammar for the above language is given by

 

S → S1 | S2

 

 

S1 → S1 c | A

S2 → a S2 | B

 

A → aAb | ε

B → b Bc | ε

For example, consider a string "abc" from the language; it can be derived in two ways.

image

A CFG can be used to specify the syntax of programming languages.

3.12  Non-context Free Language Constructs

There are some language constructs found in many programming languages that are not context free. This means that we cannot write a context free grammar for these constructs. Following are a few examples:

  • L1 = { ωcω | ω is in (a | b)*} is not context free

    → L1 abstracts the problem of declaring an identifier and checking whether it has been declared before being used. We cannot do this with a context free language. We need a semantic analyzer (which is not context free) to check that identifiers are declared before use.

  • L2 = {a^n b^m c^n d^m | n ≥ 1 and m ≥ 1} is not context free

    → L2 abstracts declaring two functions (one with n parameters, the other with m parameters), and then calling them with the matching numbers of actual parameters.

    It is interesting to note that languages close to L1 and L2 are context free. For example,

  • L1′ = {ωcω^R | ω is in (a | b)*} is context free

    It is generated by grammar

    S → aSa | bSb | c.

  • L2′ = {a^n b^m c^m d^n | n ≥ 1 and m ≥ 1} is context free.

    It is generated by grammar

    S → aSd | aAd

    A → bAc | bc

  • Also L2″ = {a^n b^n c^m d^m | n ≥ 1 and m ≥ 1} is context free.

    It is generated by grammar

    S → AB

    A → aAb | ab

    B → cBd | cd

3.13  Simplification of Grammars

As we have seen, various languages can be represented effectively by CFGs. However, not every grammar is optimized: a grammar may contain extra (unnecessary) symbols, which increase its size. Simplification of the grammar involves removing all these unnecessary symbols. For example, look at the following grammar:

      S → AB

      A → a

      B → b | C

      E → c | ε

Here C never derives any terminal string.

E does not appear in any sentential form derived from S.

E → ε is a null production.

B → C simply replaces B by C (a unit production).

Hence, we simplify the grammar as follows:

      S → AB

      A → a

      B → b

Simplification of the grammar generally includes the following:

  1. Elimination of useless symbols
  2. Elimination of ε productions
  3. Elimination of unit productions of the form A → B

a. Elimination of useless symbols

A symbol is useless if it cannot derive a terminal string or if it is not reachable from the start symbol.

To check if the symbol is reachable from the start symbol, we can use the dependency graph or the following lemma.

Lemma 1: Given a grammar G = (V, T, P, S) with L(G) ≠ Φ, we can effectively find an equivalent grammar G′ = (V′, T, P′, S) such that for each A in V′ there is some w in T* for which A ⇒* w.

If A → w is a production where w ∊ T*, then A is in V′. If A → X1X2…Xn is a production where each Xi is in V′ ∪ T, then A is also in V′. The set V′ can be computed by the following algorithm:

  1. Initialize V1 = Φ.
  2. Include A to V2 where A → w for some w in T*.
  3. Repeat 4 and 5 while V1 ≠ V2.
  4. V1 = V2
  5. V2 = V1 ∪ {A} where A → α for some α in (T ∪ V1)*
  6. V′ = V2

From the above algorithm, find all A ∈ V′; now retain only those productions whose symbols all belong to V′ ∪ T.

Lemma 2: Given a grammar G = (V, T, P, S), we can effectively find an equivalent grammar G′ = (V′, T′, P′, S) such that for each X in V′ ∪ T′ there exist α and β in (V′ ∪ T′)* for which S ⇒* αXβ.

  1. Initially place S in V′.
  2. For each A in V′ with productions A → α1 | α2 | … | αn, add all the variables appearing in α1, α2, …, αn to V′ and all the terminals to T′.
  3. P′ is the set of productions, which includes symbols of V′ ∪ T′.

Example 31:

Eliminate useless symbols and productions from the following grammar.

      S → ABa | BC, A → aC | BCC, C → a, B → bcc,

      D → E, E → d, F → e

Solution:

Step 1: Eliminate nongenerating symbols: All variables are generating.

Step 2: Elimination of nonreachable variables: Draw the dependency graph.

image
  • D, E, and F are nonreachable from S.
  • After removing useless symbols

      S → ABa | BC

      A → aC | BCC

      C → a

      B → bcc

Example 32:

Eliminate useless symbols in G.

      S → AB | CA

      S → BC | AB

      A → a

      C → aB | b

Solution:

Here B is not defined; hence B is a useless symbol.

C and A are reachable and are deriving terminals. Hence, useful.

So the reduced grammar, without useless symbols, is

      S → CA

      A → a

      C → b

Example 33:

Eliminate useless symbols in G.

      S → aAa

      A → bBB

      B → ab

      C → a b

Solution:

Here C is useless as it is not reachable from the start symbol.

So reduced grammar is one without useless symbols.

      S → aAa

      A → bBB

      B → ab

Example 34:

Eliminate useless symbols in G.

      S → aS | A | BC

      A → a

      B → aa

      C → aCb

Solution:

Here C is useless as it does not derive any terminal string; after removing C, B becomes unreachable.

So reduced grammar is the one without useless symbols.

      S → aS | A

      A → a
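
The computation of V′ described in Lemma 1 can also be sketched directly from the algorithm; the following Python fragment (encoding and names are illustrative assumptions) reproduces the result of Example 34, where C is non-generating:

      # Compute V', the generating variables: keep adding every variable that has a
      # body made only of terminals and variables already known to generate.
      def generating_variables(productions, terminals):
          """productions: list of (head, body) pairs, body a tuple of symbols."""
          v2 = set()
          while True:
              v1 = set(v2)
              for head, body in productions:
                  if all(sym in terminals or sym in v1 for sym in body):
                      v2.add(head)
              if v1 == v2:
                  return v2

      # S -> aS | A | BC, A -> a, B -> aa, C -> aCb   (the grammar of Example 34)
      prods = [("S", ("a", "S")), ("S", ("A",)), ("S", ("B", "C")),
               ("A", ("a",)), ("B", ("a", "a")), ("C", ("a", "C", "b"))]
      print(generating_variables(prods, {"a", "b"}))   # S, A and B generate; C does not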

b. Elimination of ε-Productions

If a CFL contains the word ε, then any CFG for it must have an ε-production. However, if a CFG has an ε-production, the CFL does not necessarily contain ε.

  e.g.,

      S → aX

      X → ε

  which defines the CFL {a}.

 

Nullable Variable: In a given CFG, a nonterminal X is nullable if

  1. there is a production X → ε
  2. there is a derivation that starts at X and leads to ε:
X ⇒ … ⇒ ε, i.e., X ⇒* ε.

For any language L, define the language L0 as follows:

  1. If ε ∉ L, then L0 is the entire language L, i.e., L0 = L.
  2. If ε ∈ L, then L0 is the language L – {ε}, so L0 is all words in L except ε.

If L is a CFL generated by a CFG G1 that includes ε-productions, then there is another CFG G2 with no ε-productions that generates L0.

Procedure for eliminating ε-productions:

  • Construct Vn, the set of all nullable variables.
  • For each production, if its right hand side contains nullable variables, add new productions obtained by replacing the nullable variables by ε in all possible combinations on the RHS.
  • Do not add the production

          A → ε

Example 35:

Eliminate null production in the grammar

      S → ABaC

      A → BC

      B → b | ε

      C → D | ε

      D → d

Solution:

Nullable variables are Vn = {B,C,A}

Hence, equivalent grammar without null productions is

      S → ABaC | BaC | AaC | ABa | aC | Aa | Ba | a

      A → BC | B | C

      B → b

      C → D

      D → d

Example 36:

Eliminate null production in the grammar

      S → aA

      A → BB

      B → aBb | ε

Solution:

Nullable variables are Vn = {A,B}

Hence, equivalent grammar without null productions is

      S → aA | a

      A → BB | B

      B → aBb | ab
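
The procedure for eliminating ε-productions can be sketched as follows; the encoding, the names, and the use of () for an ε body are illustrative assumptions. Applied to Example 36, it reproduces the grammar above:

      # Compute the nullable variables, then add, for every production, each variant
      # with nullable symbols dropped (never adding A -> epsilon itself).
      from itertools import combinations

      def eliminate_epsilon(productions):
          """productions: list of (head, body); body is a tuple, () meaning epsilon."""
          nullable, changed = set(), True
          while changed:
              changed = False
              for head, body in productions:
                  if head not in nullable and all(s in nullable for s in body):
                      nullable.add(head)
                      changed = True
          new_prods = set()
          for head, body in productions:
              positions = [i for i, s in enumerate(body) if s in nullable]
              for r in range(len(positions) + 1):
                  for drop in combinations(positions, r):
                      new_body = tuple(s for i, s in enumerate(body) if i not in drop)
                      if new_body:                        # never add A -> epsilon
                          new_prods.add((head, new_body))
          return nullable, sorted(new_prods)

      # S -> aA, A -> BB, B -> aBb | epsilon   (the grammar of Example 36)
      prods = [("S", ("a", "A")), ("A", ("B", "B")), ("B", ("a", "B", "b")), ("B", ())]
      print(eliminate_epsilon(prods))
      # nullable = {A, B};  S -> aA | a,  A -> BB | B,  B -> aBb | ab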

c. Eliminating unit productions

 

      A unit production is a production of the form A → B, where both A and B are nonterminals.

 

If a language L is generated by a CFG G1 that has no ε-productions, then there is also a CFG G2 for L with no ε-productions and no unit productions.

Procedure for eliminating unit productions:

  • For each pair of nonterminals A and B such that there is a production

          A → B and the non unit productions from B are

          B → s1 | s2 | . . . | sn

    where the si ∈ (Σ ∪ N)* are strings of terminals and nonterminals, then create the new productions as

          A → s1 | s2 | . . . | sn

    –Do the same for all such pairs A and B simultaneously.

    –Remove all unit productions.

    Example 37:

    Eliminate unit productions in the grammar

          S → A | bb

          A → B | b

          B → S | a

    Solution:

    After eliminating unit productions S → A, A → B, B → S, we get

          S → a | b | bb, A → a | b | bb, B → a | b | bb
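
    The unit-production procedure can likewise be sketched; the encoding and names are illustrative assumptions. Applied to Example 37, it reproduces the result above:

      # Compute all unit pairs (A, B) with A =>* B by unit steps, then copy B's
      # non-unit bodies to A and drop the unit productions.
      def eliminate_unit(productions, nonterminals):
          unit = {(a, a) for a in nonterminals}
          changed = True
          while changed:                                 # close the unit pairs transitively
              changed = False
              for head, body in productions:
                  if len(body) == 1 and body[0] in nonterminals:
                      for a, b in list(unit):
                          if b == head and (a, body[0]) not in unit:
                              unit.add((a, body[0]))
                              changed = True
          result = set()
          for a, b in unit:
              for head, body in productions:
                  if head == b and not (len(body) == 1 and body[0] in nonterminals):
                      result.add((a, body))              # copy the non-unit bodies of B to A
          return sorted(result)

      # S -> A | bb, A -> B | b, B -> S | a   (the grammar of Example 37)
      prods = [("S", ("A",)), ("S", ("b", "b")), ("A", ("B",)), ("A", ("b",)),
               ("B", ("S",)), ("B", ("a",))]
      print(eliminate_unit(prods, {"S", "A", "B"}))
      # S -> a | b | bb,  A -> a | b | bb,  B -> a | b | bb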

    Example 38:

    Eliminate unit production from the grammar below:

          S → Aa | B, B → A | bb, A → a | bc | B

    Solution:

    Unit productions are S → B, B → A, and A → B

    – A, B, and S are derivable

    – Eliminating B in the A production gives A → a | bc | bb.

    – Eliminating A in the B production gives B → a | bc | bb.

    – Eliminating B in the S production gives S → Aa | a | bc | bb.

  • The final set of productions after eliminating unit productions is given below:

          S → Aa | a | bc | bb

          B → a | bc | bb

          A → a | bc | bb

    Example 39:

    Simplify the following grammar.

          S → aA | aBB

          A → aAA | ε

          B → bB | bbC

          C → B

    Solution:

    Here it is better to eliminate null productions first, as this may introduce useless symbols and unit productions; next, eliminate unit productions; and at the end, eliminate useless symbols.

      Removing ε-productions gives resulting grammar as

          S → aA | a | aBB

          A → aAA | aA | a

          B → bB | bbC

          C → B

      Eliminating unit productions we get the resulting grammar as

          S → aA | a | aBB

          A → aAA | aA | a

          B → bB | bbC

          C → bB | bbC

      B and C are identified as useless symbols. Eliminating these we get

          S → aA | a

          A → aAA | aA | a

    Finally, the reduced grammar is S → aA | a, A → aAA | aA | a, which defines strings of one or more a's.

3.14  Applications of CFG

  • Grammars are useful in specifying syntax of programming languages. They are mainly used in the design of programming languages.
  • They are also used in natural language processing.
  • Tamil poetry called "Venpa" is described by context free grammar.
  • Efficient parsers can be automatically constructed from a CFG.
  • CFGs are also used in speech recognition in processing the spoken word.
  • The expressive power of CFG is too limited to adequately capture all natural language phenomena. Therefore, extensions of CFG are of interest for computational linguistics.

Example 40:

As an example, a CFG for Pascal statements is given below.

 

Stmt → begin optional_stmts end

optional_stmts → list_of_stmt | ε

list_of_stmt → list_of_stmt; Stmt | Stmt

Stmt → if Expr then Stmt

      |if Expr then Stmt else Stmt

      |while Expr do Stmt

      |id = Expr

Expr → Expr + Term | Term

Term → Term * Fctr | Fctr

Fctr → id | num

Solved Problems

  1. Give the CFG, which generates all positive even integers up to 998.

    Solution:

    We need to generate positive integers with 1 digit, or 2 digits, or 3 digits.

    One digit numbers are 0, 2, 4, 6, 8

    Two-digit and three-digit positive integers can have any number in 10's or 100's place. Hence, we can define grammar as follows

     

    S → A          ……single-digit numbers
      | BA         ……double-digit numbers
      | BBA        ……three-digit numbers

    B → 0 | 1 | 2 | 3 … 9

    A → 0 | 2 | 4 | 6 | 8

  2. Give CFG for RE (a + b)*cc(a + b)*

    Solution:

    The grammar for all the strings on a, b i.e., (a + b)* is A → aA | bA | ε

    The string "cc" occurs in the middle, with any string of a's and b's on either side.

    So the grammar for the given regular expression is

          S → AccA

          A → aA | bA | ε

  3. Give CFG for RE {0^m 1^n | m, n ≥ 0}

    Solution:

    The regular expression is 0* 1*.

    The grammar for strings a* is A → aA | ε

    Hence the grammar for the given regular expression is:

          S → AB

          A → 0A | ε

          B → 1B | ε

  4. Give CFG for RE {0^(2m) 1^n | m, n ≥ 0}

    Solution:

    The given language consists of zero or more pairs of 0's followed by zero or more 1's.

    The language is (00)*1*. So the final grammar is

          S → AB

          A → 00A | ε

          B → 1B | ε

  5. Find the language defined by CFG.

          S → aB | bA

          A → a | aS | bAA

          B → b | bS | aBB

    Solution:

    To find the language defined by grammar, list the words defined by grammar.

          S ⇒ aB ⇒ ab

          S ⇒ bA ⇒ ba

          S ⇒ aB ⇒ abS ⇒ abaB ⇒ abab

          S ⇒ bA ⇒ bbAA ⇒ bbaSA ⇒ bbabAA ⇒ bbabaA ⇒ bbabaa

    Hence, the language defined by the given grammar is the set of strings with an equal number of a's and b's.

  6. Find LMD and RMD for string 00101 in the grammar given below:

          S → B | A

          A → 0A | ε

          B → 1B | 0B | ε

    Solution:

    LMD, RMD are the same.

          S ⇒ B ⇒ 0B ⇒ 00B ⇒ 001B ⇒ 0010B ⇒ 00101B

            ⇒ 00101ε

            ⇒ 00101

  7. Find LMD, RMD, and derivation tree for string 00110101 in the grammar given below.

          S → 0B | 1A

          A → 0 | 0S | 1AA

          B → 1 | 1S | 0BB

    Solution:

    LMD

          S ⇒ 0B ⇒ 00BB ⇒ 001SB ⇒ 0011AB

            ⇒ 00110SB

            ⇒ 001101AB

            ⇒ 0011010B

            ⇒ 00110101

    RMD

    image
  8. Check whether the following grammar is ambiguous.

            S → 0S1 | SS | ε

    Solution:

    Let us consider the string 01. This string can be generated in two ways.

    • S ⇒ 0S1 ⇒ 01
    • S ⇒ SS ⇒ 0S1S ⇒ 01S ⇒ 01

    Hence, the given grammar is ambiguous grammar.

  9.  Check whether the following grammar is ambiguous for w = ab

          S → aB | ab

          A → aAB | a

          B → ABb | b

    Solution:

    The string "ab" can be derived using LMD as follows:

    1. S ⇒ ab
    2. S ⇒ aB ⇒ ab

    Since there are two possible ways to derive the string, it is ambiguous for w.

  10. Check whether the following grammar is ambiguous for w = aab.

          S → AB | aaB

          A → a | Aa

          B → b

    Solution:

    The string "aab" can be derived using LMD as follows:

    1. S ⇒ AB ⇒ AaB ⇒ aaB ⇒ aab
    2. S ⇒ aaB ⇒ aab

    Since there are two possible ways to derive the string, it is ambiguous for w.

  11. Check whether the following grammar is ambiguous for w = abababa.

    S → SbS | a

    Solution:

    The string "abababa" can be derived using LMD in two ways, as follows:

    1. S ⇒ SbS ⇒ SbSbS ⇒ SbSbSbS

      ⇒ abSbSbS ⇒ ababSbS ⇒ abababS ⇒ abababa

    2. S ⇒ SbS ⇒ abS ⇒ abSbS ⇒ ababS ⇒ ababSbS

      ⇒ abababS ⇒ abababa

    Since there are two possible ways to derive the string, it is ambiguous for w.

  12. Eliminate useless symbols in G.

          S → a Aa

          A → Sb | bCc | DaA

          C → abb | DD

          E → a C

          D → a DA

    Solution:

    Here D is useless as it is not deriving any string. E is not reachable.

    So reduced grammar is the one without useless symbols.

          S → aAa

          A → Sb | bCc

          C → abb

  13. Eliminate useless symbols in G.

          S → a A | bB

          A → aA | a

          B → bB

          D → ab | Ea

          E → aC | d

    Solution:

    Here B is useless as it loops forever and never derives a terminal string. There is no rule for C, so C is also useless. E and D are not reachable from S.

    Hence, the reduced grammar is the one without useless symbols.

          S → a A

          A → aA | a

  14. Left factor of the grammar S → abc | abd | ae | f

    Solution:

    Here the common prefixes are "ab" and "a." So first take out the longest match "ab" from S.

          S → abS′ | ae | f

          S′ → c | d

    Now take out the common prefix "a" and rewrite the grammar as

            ⇓

          S → aS″ | f

          S″ → bS′ | e

          S′ → c | d

  15. Left factor of the grammar S → AaS | AaA, A → a | ab

    Solution:

    Here the common prefixes are "Aa" and "a." So first take out "Aa" from S.

          S → AaA′

          A′ → S | A

    Now take out the common prefix "a" from A and rewrite the grammar as

            ⇓

          A → aA″

       A″ → ε | b

  16. Eliminate left recursion and left factor of the grammar

            E → aba | abba | Ea | EbE

    Solution:

    Here first eliminate left recursion; then we get equivalent grammar as

          E → abaE′ | abbaE′

          E′ → aE′ | bEE′ | ε

    Now left factor the grammar and rewrite it as

            ⇓

          E → abA

          A → aE′ | baE′

          E′ → aE′ | bEE′ | ε

Summary

  • Context free grammars mainly define the syntax of programming languages, and a push-down automaton is used to recognize the language.
  • A context free grammar can be simplified by eliminating useless symbols or null productions or unit productions.
  • For any context free language that does not contain ε, an equivalent grammar with no null productions can be constructed by eliminating the null productions.
  • Context free grammars can be represented in standard form using Chomsky or Greibach normal forms.

Fill in the Blanks

  1. _________verifies if the tokens are properly sequenced in accordance with the grammar of the language.
  2. The language defined by S → SS is __________.
  3. A nonterminal is useless if it is __________.
  4. A variable that derives ε is called the __________ variable.
  5. Left linear grammars are __________ of CFG.
  6. CFGs are __________ of CSG.
  7. If there is a unique LMD, then the grammar is __________.
  8. The grammar S → bbba | ε, A → ε is a CFG. (True | False)
  11. Context free languages are described by type __________ grammars.
  12. In Type 1 grammars, if α → β, then the relation between α and β is __________.
  13. To simplify the grammar S → ST | ε, T → abT | ab we apply elimination of __________.
  14.  __________ is a context free grammar to generate expression with balanced parenthesis.
  15. The grammar S → aSb | SS | ε is __________ grammar.
  16. For every ambiguous grammar there is equivalent grammar for the same language which is unambiguous. (True | False)
  17.  __________ languages are the subset of context free languages.
  18. Every context free language is a context sensitive language. (True | False)
  19. Elimination of null productions results in simplified grammar with no unit productions and useless symbols. (True | False)
  20. If a grammar has different LMD or RMD, then it is ambiguous. (True | False)

Objective Question Bank

  1. A context free grammar is ambiguous if
    •  The grammar contains useless nonterminals.
    •  It produces more than one parse tree for some sentence.
    • Some production has two nonterminals side by side on the right-hand side.
    • None of the above
  2. Find the number of parse trees for an input string "aaa" in S → Sa | aS | a
    • infinite
  3.  Find the number of parse trees for an input string "abab" in S → aSbS | bSaS | ε
    • infinite
  4. What is the language generated by the CFG?

    S → aSb

    S → aAb

    S → aBb

    A → aA | a

    B → Bb | b

    Here V = {S, A, B} and T = {a, b}

    • {aⁿbᵐ, m > 0, | n – m | > 2}
    • {aⁿbᵐ, m > 1, | n – m | > 1}
    • {aⁿbᵐ, m > 0, | n – m | > 1}
    • {aⁿbᵐ, m > 0, | n – m | > 0}
  5. * Consider the grammar

    Stmt → if id then stmt

    | if id then stmt else stmt

    | id = id

    Which of the following is not true?

    •  The string "if a then if b then c=d" has two parse trees.
    • The LMD, RMD of "if a then if b then c=d" represent different parse trees.
    •  The string "if a then if b then c = d else e = f" has more than two parse trees.
    •  The string "if a then if b then c=d else e=f" has two parse trees.
  6. Let L denote the language generated by the grammar S → 0S0 | 00. Which of the following is true?
    • L = 0+ 
    • L is regular but not 0+
    • L is context free but not regular 
    • L is not context free
  7. Aliasing in the context of programming languages refers to
    •  multiple variables having the same memory location
    • multiple variables having the same values
    •  multiple variables having the same memory identifier
    •  multiple uses of the same variable
  8. Let G = ({S}, {a, b}, R, S) be a context free grammar where the rule set R is

    S → aSb | SS | ε
    Which of the following is true?

    •  G is not ambiguous
    • There exist x, y ∈ L(G) such that xy ∉ L(G).
    •  There is a deterministic PDA that accepts L(G).
    •  We can find a DFA that accepts L(G).
  9. The grammar S → Sa | aS | a is
    • left recursive 
    • right recursive
    • unambiguous grammar 
    • ambiguous
  10. The following CFG S → aS | bS | a | b is equivalent to the regular expression
    • (a* + b)* 
    • (a + b)* 
    • (a + b)(a + b)*
    • (a + b)*(a + b)
  11. Any string of terminals that can be generated by the following CFG
    S → XY, X → aX | bX | a, Y → Ya | Yb | a
    • has at least one b 
    • should end in an 'a'
    • has no consecutive a's or b's 
    • has at least two a's
  12. The following CFG

    S → aB | bA, A → a | aS | bAA, B → b | bS | aBB

    generates strings of terminals that have

    •  Equal number of a's and b's
    • odd number of a's and odd number of b's
    •  even number of a's and even number of b's
    •  odd number of a's and even number of b's
  13. The set {aⁿbⁿ | n = 1, 2, 3, …} can be generated by the CFG
    • S → ab | aSb 
    • S → aaSbb | ab
    • S → ab | aSb | ε 
    • S → aaSbb | ab | aabb
  14. Choose the correct statements.
    •  All languages can be generated by CFG.
    • Any regular language has an equivalent CFG.
    •  Some non-regular languages can't be generated by any CFG.
    •  Some regular languages can't be generated by any CFG.
  15. Here is a context free grammar G:

    S → AB

    A → 0A1 | 2

    B → 1B | 3A

    Which of the following strings is in L(G)?

    • 021300211 
    • 00213021 
    • 002111300211 
    • 0211300021
  16. The grammar

    A → B ↑ A | A * B | B

    B → B - B | C / B | C

    C → i

    • reflects that - has higher precedence than -
    • reflects that - has higher precedence than /
    • reflects that - has higher precedence than ↑
    • none of the above
  17. Consider the following four grammars:
    1.  S → abS | ab
    2.  S → SS | ab
    3.  S → aB; B → bS | b
    4.  S → aB | ab; B → bS

    The initial symbol is S in all cases. Determine the language of each of these grammars. Then, find, in the list below, the pair of grammars that define the same language.

    • G1: S → aB, B → bS, B → ab;  G2: S → SS, S → ab
    • G1: S → SS, S → ab;  G2: S → aB, B → bS, B → b
    • G1: S → abS, S → ab;  G2: S → aB, B → bS, S → a
    • G1: S → aB, B → bS, B → ab;  G2: S → aB, B → bS, S → ab
  18.  Consider the grammar S → aSb | bSa | ε

    The number of parse trees the grammar generates for an input string "abab" is

    • 4
  19. The grammar R → RR | R + R | R* | a | b is
    • ambiguous 
    • unambiguous
    • inherently ambiguous
    • None of the above
  20. Consider the following CFG G, defined by the productions

    S → aSbS | bSaS | ε

    The language generated by this CFG is

    •  the set of all strings which contains even number of a's and even number of b's
    •  the set of all strings which contains odd number of a's and even number of b's
    •  the set of all strings which contains odd number of a's and odd number of b's
    •  the set of all strings with an equal number of a's and b's
  21. Consider the following grammar productions

    S → AB,

    A → BB | a,

    B → AB | b,

    Choose the incorrect statement.

    •  aabbb can be derived from the above grammar.
    • aabb can be derived from the above grammar.
    •  ababab can be derived from the above grammar.
    •  aaabb can be derived from the above grammar.
  22. Consider the following context free grammars
    1.  S → aSbb | a
    2.  S → aSA | a, A → bB, B → b

    Which of the following is correct?

    •  The language generated by '1' is subset of '2'.
    • The language generated by '2' is subset of '1'.
    •  The language generated by both the grammars '1' and '2' is one and the same.
    •  None of the above
  23. The grammar

    S → aSb | SS | ε is ambiguous as "ab" has

    • one LMD and one RMD 
    • two LMDs
    • No RMD 
    • all true
  24. A context free grammar is said to be ambiguous if it has
    •  w ∈ L(G) which has at least two distinct derivation trees
    • w ∈ L(G) which has at least two left-most derivations
    • w ∈ L(G) which has at least two right-most derivations
    • Any one of the above
  25. Consider the grammar G with the start symbol S:

    S → bS | aA | b

    A → bA | aB

    B → bB | aS | a

    Which of the following is a word in L(G)?

    • ababba 
    • bbaabb
    • babbbabaaaa 
    • aabb

Exercises

  1. Define leftmost and rightmost derivations. Give an example of each.
  2. Consider the grammar G: S → S + S | S * S | (S) | a. Show that the string "a + a * a" has
    • two parse trees
    • two leftmost derivations
  3. Find an unambiguous grammar G′ equivalent to G and show that L(G) = L(G′) and that G′ is unambiguous.
  4. Convert the following CFG to equivalent unambiguous grammar.

    E → E + E

    E → E * E

    E → a | b | (E)

  5. Check whether the grammar is ambiguous or not.
    • S → 0S1 | SS | ε w = '0011'
    • S → AB | aaB, A → a | AA, B → b w = "aab"
    • S → SbS | a w = "abababa"
    • S → aSb | ab
    • R → R + R | RR | R* | a | b | c w = a + b * c
  6. Construct the reduced grammar from the regular grammar given below.

    S → Aa | bS | ε

    A → aA | bB | ε

    B → aA | bC | ε

    C → aC | bC

  7.  Find a CFG without ε-productions, unit productions, and useless productions, equivalent to the grammar defined by

    S → ABaC

    A → BC

    B → b | ε

    C → D | ε

    D → d.

  8. Left factor the given grammar

    S → aSSbS | aSaSb | abb | b

  9. Left factor the given grammar

    S → 0SS1S | 0SS0S | 01 | 1

  10. Remove ε productions from the following grammar:

    S → XYX

    X → 0X | ε

    Y → 1Y | ε

  11. Eliminate useless symbols and productions from the following grammar:

    G = (V, T, P, S) where V = {S, A, B, C}, T = {a,b} and productions P given below:

    S → ab | A | C, A → a, B → aC, C → aCb

  12.  Remove ε productions from the following grammar:

    S → A0B | BB | 0C | CBA

    A → C0 | 1C | CC | CBA

    B → B0 | AB | ε

    C → 0A | BBB

  13. Remove unit productions from the grammar

    S → 0A | 1B | C

    A → 0S | 00

    B → 1 | A

    C → 01

  14. Remove unit productions from the grammar

    S → AB

    A → a

    B → C | b

    C → D

    D → E | bC

    E → d | Ab

  15. Optimize the following grammar.

    S → A | 0C1

    A → B | 01 | 10

    C → CD | ε

  16. Optimize the following grammar.

    S → aAa

    A → Sb | bCc | DaA

    C → abb | DD

    E → aC

    D → aDA

  17. Write an equivalent unambiguous grammar for the following CFG.

    R → R + R | RR | R* | a | b

  18. Write an equivalent unambiguous grammar for the following CFG.

    bExpr → bExpr and bExpr | bExpr or bExpr | not bExpr | a | b

  19. Eliminate left recursion in the grammar.

    S → Aa | b

    A → Ac | Sd | ε

  20.  Check whether the grammar

    S → AaAb | BaBb

    A → ε

    B → ε

    is ambiguous or unambiguous.

  21.  Eliminate left recursion from the following grammar:

    A → Bd

          | Aff

          | CB

    B → Bd

          | Bf

          | df

    C → h

          | g

Key for Fill in the Blanks

  1. Syntax analyzer
  2. Ø
  3. not deriving any terminal 
  4. nullable 
  5. Subset 
  6. Subset 
  7. unambiguous 
  8. True 
  9. Type 2
  10.  | α | <= | β | & | α | = 1. 
  11. Null rule
  12. False
  13. S → CaA | b, A → CaA | CaCb

    Ca → a, Cb → b

  14. S → ( S ) | ε
  15. Ambiguous
  16. False
  17. RG
  18. True
  19. False
  20. True

Key for Objective Question Bank

  1. b
  2. a
  3. c
  4. b
  5. c
  6. b
  7. a
  8. c
  9. d
  10. b
  11. d
  12. a
  13. a
  14. b,c
  15. a
  16. c
  17. b
  18. a
  19. a
  20. d
  21. d
  22. a
  23. b
  24. d
  25. a