2.2 Introduction to Formal Languages

An alphabet is a finite set of symbols denoted by Σ. A string is a combination of symbols, also called characters, over an alphabet. For instance, strings over the alphabet Summation equals, left curly brace, a comma b comma c, right curly brace. include a, aa, aaa, bb, aba, and abc. The empty string (i.e., a string of zero characters) is represented as ∈. The Kleene closure operator of an alphabet (i.e., Σ*) represents the set of all possible strings that can be constructed through zero or more concatenations of characters from the alphabet. Thus, the set of all possible strings from the alphabet Summation equals, left curly brace, a comma b comma c, right curly brace. is Σ*. While Σ is always finite, Σ* is always infinite and always contains . The strings in Σ* are candidate sentences.

A formal language is a set of strings. Specifically, a formal language L is a subset of Σ*, where each string from Σ* in L is called a sentence. Thus, a formal language is a set of sentences. For instance, {a, aa, aaa, bb, aba, abc} is a formal language. There are finite and infinite languages. Finite languages have a finite number of sentences. The language described previously is a finite language (i.e., it has six sentences), whereas the Scheme programming language is an infinite language. Most interesting languages are infinite.

Determining whether a string s from Σ* is in L (i.e., whether the candidate sentence s is a valid sentence) depends on the complexity of L. For instance, determining if a string s from Σ* is in the language of all three-character strings is simpler than determining if s is in the language of palindromes (i.e., strings that read the same both forward and backward; e.g., dad, eye, or noon). Thus, determining if a string is a sentence is a set-membership problem.

Recall that syntax refers to the structure or form of language and semantics refers to the meaning of language. Formal notational systems are available to define the syntax and semantics of formal languages. This chapter is concerned with establishing an understanding of those formal systems and how they are used to define the syntax of programming languages. Armed with an understanding of the theory of formal language definition mechanisms and methods, we can turn to practice and study how those devices can be used to recognize a valid program prior to interpretation or compilation in Chapter 3.

There are three progressive types of sentence validity. A sentence is lexically valid if all the words of the sentence are valid. A sentence is syntactically valid if it is lexically valid and the ordering of the words is valid. A sentence is semantically valid if it is lexically and syntactically valid and has a valid meaning.

Consider the sentences in Table 2.1. The first candidate sentence is not lexically valid because “saintt” is not a word; therefore, the sentence cannot be syntactically or semantically valid. The second candidate is lexically valid because all of its words are valid, but it is not syntactically valid because the arrangement of those words does not conform to the subject–verb–article–object structure of English sentences; thus, it cannot be semantically valid. The third candidate is lexically valid because all of its words are valid and syntactically valid because the arrangement of those words conforms to the subject–verb–article–object structure of English sentences, but it is not semantically valid because the sentence does not make sense. The fourth candidate sentence is lexically, syntactically, and semantically valid. Notice that these types of sentence validity are progressive. Once a candidate sentence fails any test for validity, it automatically fails a more stringent test for validity. In other words, if a candidate sentence does not even have valid words, those words can never be arranged correctly. Similarly, if the words of a candidate sentence are not arranged correctly, that sentence can never make semantic sense. For instance, the second sentence in Table 2.1 is not syntactically valid so it can never be semantically valid.

Table 2.1 Progressive Types of Sentence Validity

A table of validity of different candidate sentences.
Description

Recall that validating a string as a sentence is a set-membership problem. We saw previously that the first step to determining if a string of words, where a word is a string of non-whitespace characters, is a sentence is to determine if each individual word is a sentence (in a simpler language). Only after the validity of every individual word in the entire string is established can we examine whether the words are arranged in a proper order according to the particular language in which this particular, entire string is a candidate sentence. Notice that these steps are similar to the steps an interpreter or compiler must execute to determine the validity of a program (i.e., to determine if the program has any syntax errors). Table 2.2 illustrates these steps of determining program expression validity. Next, we examine those steps through a formal lens.

Table 2.2 Progressive Types of Program Expression Validity

A table of validity of different candidate expressions.
Description
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.222.182.66