Chapter 8. Lexical Structure

Programs

An Mg program consists of one or more source files, known formally as compilation units. A compilation unit is an ordered sequence of Unicode characters. Compilation units typically have a one-to-one correspondence with files in a file system, but this correspondence is not required. For maximal portability, it is recommended that files in a file system be encoded with the UTF-8 encoding.

Conceptually speaking, a program is compiled using four steps:

  1. Lexical analysis, which translates a stream of Unicode input characters into a stream of tokens. Lexical analysis evaluates and executes pre-processing directives.

  2. Syntactic analysis, which translates the stream of tokens into an abstract syntax tree.

  3. Semantic analysis, which resolves all symbols in the abstract syntax tree, type checks the structure, and generates a semantic graph.

  4. Code generation, which generates instructions from the semantic graph for some target runtime, producing an image.

Further tools may link images and load them into a runtime.

Grammars

This specification presents the syntax of the Mg programming language using two grammars. The lexical grammar defines how Unicode characters are combined to form line terminators, white space, comments, tokens, and pre-processing directives. The syntactic grammar defines how the tokens resulting from the lexical grammar are combined to form Mg programs.

Grammar Notation

The lexical and syntactic grammars are presented using grammar productions. Each grammar production defines a non-terminal symbol and the possible expansions of that non-terminal symbol into sequences of non-terminal or terminal symbols. In grammar productions, non-terminal symbols are shown in italic type, and terminal symbols are shown in a fixed-width font.

The first line of a grammar production is the name of the non-terminal symbol being defined, followed by a colon. Each successive indented line contains a possible expansion of the non-terminal given as a sequence of non-terminal or terminal symbols. For example, the production:

IdentifierVerbatim:
  [ IdentifierVerbatimCharacters  ]

defines an IdentifierVerbatim to consist of the token “[”, followed by IdentifierVerbatimCharacters, followed by the token “]”.

When there is more than one possible expansion of a non-terminal symbol, the alternatives are listed on separate lines. For example, the production:

DecimalDigits:
  DecimalDigit
  DecimalDigits  DecimalDigit

defines DecimalDigits to either consist of a DecimalDigit or consist of DecimalDigits followed by a DecimalDigit. In other words, the definition is recursive and specifies that a decimal-digits list consists of one or more decimal digits.

A subscripted suffix “opt” is used to indicate an optional symbol. The production:

DecimalLiteral:
  IntegerLiteral  . DecimalDigit  DecimalDigitsopt

is shorthand for:

DecimalLiteral:
  IntegerLiteral  . DecimalDigit
  IntegerLiteral  . DecimalDigit  DecimalDigits

and defines a DecimalLiteral to consist of an IntegerLiteral followed by a '.', a DecimalDigit, and optional DecimalDigits.

Alternatives are normally listed on separate lines, though in cases where there are many alternatives, the phrase “one of” may precede a list of expansions given on a single line. This is simply shorthand for listing each of the alternatives on a separate line. For example, the production:

Sign: one of
  +  -

is shorthand for:

Sign:
  +
  -

Conversely, exclusions are designated with the phrase “none of”. For example, the production

TextSimple: none of
  "
  \
  NewLineCharacter

permits all characters except '"', '\', and new line characters.

Lexical Grammar

The lexical grammar of Mg is presented in Section 8.3. The terminal symbols of the lexical grammar are the characters of the Unicode character set, and the lexical grammar specifies how characters are combined to form tokens, white space, and comments (Section 8.3.2).

Every source file in an Mg program must conform to the Input production of the lexical grammar.

Syntactic Grammar

The syntactic grammar of Mg is presented in the chapters that follow this chapter. The terminal symbols of the syntactic grammar are the tokens defined by the lexical grammar, and the syntactic grammar specifies how tokens are combined to form Mg programs.

Every source file in an Mg program must conform to the CompilationUnit production of the syntactic grammar.

Lexical Analysis

The Input production defines the lexical structure of an Mg source file. Each source file in an Mg program must conform to this lexical grammar production.

Input:
  InputSectionopt
InputSection:
  InputSectionPart
  InputSection  InputSectionPart
InputSectionPart:
  InputElementsopt  NewLine
InputElements:
  InputElement
  InputElements  InputElement
InputElement:
  Whitespace
  Comment
  Token

Four basic elements make up the lexical structure of an Mg source file: line terminators, white space, comments, and tokens. Of these basic elements, only tokens are significant in the syntactic grammar of an Mg program.

The lexical processing of an Mg source file consists of reducing the file into a sequence of tokens, which becomes the input to the syntactic analysis. Line terminators, white space, and comments can serve to separate tokens, but otherwise these lexical elements have no impact on the syntactic structure of an Mg program.

When several lexical grammar productions match a sequence of characters in a source file, the lexical processing always forms the longest possible lexical element. For example, the character sequence // is processed as the beginning of a single-line comment because that lexical element is longer than a single / token.
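
The longest-match rule can be illustrated with a minimal Python sketch. The candidate table below is a small hypothetical subset chosen for illustration, not the full Mg lexical grammar, and the name longest_match is an assumption of this sketch.

```python
# Hypothetical subset of lexical elements, for illustration only.
CANDIDATES = ["//", "/*", "/", "=>", "=", "..", "."]


def longest_match(text, pos=0):
    """Return the longest candidate that matches text at pos, or None."""
    best = None
    for candidate in CANDIDATES:
        if text.startswith(candidate, pos):
            if best is None or len(candidate) > len(best):
                best = candidate
    return best
```

With this table, the input // is reported as one two-character element rather than as two / elements, which is the behavior the paragraph above describes.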

Line Terminators

Line terminators divide the characters of an Mg source file into lines.

NewLine:
  NewLineCharacter
  U+000D  U+000A
NewLineCharacter:
  U+000A  // Line Feed
  U+000D  // Carriage Return
  U+0085  // Next Line
  U+2028  // Line Separator
  U+2029  // Paragraph Separator

For compatibility with source code editing tools that add end-of-file markers, and to enable a source file to be viewed as a sequence of properly terminated lines, the following transformations are applied, in order, to every compilation unit:

  • If the last character of the source file is a Control-Z character (U+001A), this character is deleted.

  • A carriage-return character (U+000D) is added to the end of the source file if that source file is non-empty and if the last character of the source file is not a carriage return (U+000D), a line feed (U+000A), a line separator (U+2028), or a paragraph separator (U+2029).
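
The two transformations above can be sketched in Python as follows. This is a minimal illustration, assuming the compilation unit is already available as a decoded string; the function name normalize_source is hypothetical.

```python
CTRL_Z = "\u001A"
# Final characters that already count as a terminator for this rule.
TERMINATORS = ("\u000D", "\u000A", "\u2028", "\u2029")


def normalize_source(text):
    """Apply the end-of-file transformations, in order."""
    # Step 1: delete a trailing Control-Z end-of-file marker, if present.
    if text.endswith(CTRL_Z):
        text = text[:-1]
    # Step 2: terminate a non-empty file that lacks a final terminator.
    if text and not text.endswith(TERMINATORS):
        text += "\u000D"
    return text
```

Note that the order matters: a file consisting only of a Control-Z character becomes empty in the first step, so no carriage return is added in the second.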

Comments

Two forms of comments are supported: single-line comments and delimited comments. Single-line comments start with the characters // and extend to the end of the source line. Delimited comments start with the characters /* and end with the characters */. Delimited comments may span multiple lines.

Comment:
  CommentDelimited
  CommentLine
CommentDelimited:
  /* CommentDelimitedContentsopt  */
CommentDelimitedContent:
  *  none of  /
CommentDelimitedContents:
  CommentDelimitedContent
  CommentDelimitedContents  CommentDelimitedContent
CommentLine:
  // CommentLineContentsopt

CommentLineContent: none of
  NewLineCharacter
CommentLineContents:
  CommentLineContent
  CommentLineContents  CommentLineContent

Comments do not nest. The character sequences /* and */ have no special meaning within a // comment, and the character sequences // and /* have no special meaning within a delimited comment.

Comments are not processed within text literals.

The example

// This defines a
// Logical literal
//
syntax LogicalLiteral
   = "true"
   | "false" ;

shows three single-line comments.

The example

/* This defines a
   Logical literal
*/
syntax LogicalLiteral
   = "true"
   | "false" ;

includes one delimited comment.
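
The comment rules can be made concrete with a small Python sketch that removes comments from source text. It is an illustration under simplifying assumptions, not a conforming lexer: regular text literals are skipped using backslash escapes, verbatim (@-prefixed) literals are not handled specially, and the name strip_comments is hypothetical.

```python
NEWLINES = "\u000A\u000D\u0085\u2028\u2029"


def strip_comments(src):
    """Remove // and /* */ comments, leaving regular text literals intact."""
    out = []
    i, n = 0, len(src)
    while i < n:
        ch = src[i]
        if src.startswith("//", i):
            # Single-line comment: skip up to, but not including, the line end.
            while i < n and src[i] not in NEWLINES:
                i += 1
        elif src.startswith("/*", i):
            # Delimited comment: ends at the first */ because comments do not nest.
            end = src.find("*/", i + 2)
            if end == -1:
                raise ValueError("unterminated delimited comment")
            out.append(" ")  # keep the comment's role as a separator
            i = end + 2
        elif ch in "'\"":
            # Regular text literal: copy unchanged, stepping over backslash escapes.
            j = i + 1
            while j < n and src[j] != ch:
                j += 2 if src[j] == "\\" else 1
            out.append(src[i:j + 1])
            i = j + 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

A delimited comment is replaced by a single space so that the tokens on either side stay separated.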

Whitespace

Whitespace is defined as any character with Unicode class Zs (which includes the space character) as well as the horizontal tab character, the vertical tab character, and the form feed character.

Whitespace:
  WhitespaceCharacters
WhitespaceCharacter:
  U+0009  // Horizontal Tab
  U+000B  // Vertical Tab
  U+000C  // Form Feed
  U+0020  // Space
  NewLineCharacter
WhitespaceCharacters:
  WhitespaceCharacter
  WhitespaceCharacters  WhitespaceCharacter

Tokens

There are several kinds of tokens: identifiers, keywords, literals, operators, and punctuators. White space and comments are not tokens, though they act as separators for tokens.

Token:
  Identifier
  Keyword
  Literal
  OperatorOrPunctuator

Identifiers

A regular identifier begins with a letter or underscore, followed by any sequence of letters, underscores, dollar signs, or digits. An escaped identifier is enclosed in square brackets and contains any sequence of Text literal characters.

Identifier:
  IdentifierBegin  IdentifierCharactersopt
  IdentifierVerbatim
IdentifierBegin:
  _
  Letter
IdentifierCharacter:
  IdentifierBegin
  $
  DecimalDigit
IdentifierCharacters:
  IdentifierCharacter
  IdentifierCharacters  IdentifierCharacter
IdentifierVerbatim:
  [ IdentifierVerbatimCharacters  ]
IdentifierVerbatimCharacter:
  none of  ]
  IdentifierVerbatimEscape
IdentifierVerbatimCharacters:
  IdentifierVerbatimCharacter
  IdentifierVerbatimCharacters  IdentifierVerbatimCharacter
IdentifierVerbatimEscape:
  \\
  \]
Letter:
  a..z
  A..Z
DecimalDigit:
  0..9
DecimalDigits:
  DecimalDigit
  DecimalDigits  DecimalDigit
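
As an informal check of the identifier rules, the following Python sketch expresses both forms as regular expressions. It follows the prose description (a letter or underscore first). Two points are assumptions of the sketch rather than statements of the specification: inside an escaped identifier, a backslash is taken to escape a following backslash or closing bracket, and an unescaped backslash is rejected. The name is_identifier is hypothetical.

```python
import re

# Regular form: a letter or underscore, then letters, underscores, $ or digits.
REGULAR = r"[A-Za-z_][A-Za-z_$0-9]*"
# Escaped form: [ ... ] with at least one character between the brackets.
VERBATIM = r"\[(?:\\[\\\]]|[^\]\\])+\]"
IDENTIFIER = re.compile(rf"(?:{REGULAR}|{VERBATIM})")


def is_identifier(text):
    """Return True if text is a regular or escaped identifier."""
    return IDENTIFIER.fullmatch(text) is not None
```

The sketch does not exclude keywords; a real lexer would classify a match such as syntax as a Keyword unless it is written in the escaped form [syntax].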

Keywords

A keyword is an identifier-like sequence of characters that is reserved, and cannot be used as an identifier except when escaped with square brackets [].

Keyword: one of
  any empty error export false final id import interleave language labelof
  left module null precedence right syntax token true valuesof

The following keywords are reserved for future use:

checkpoint identifier nest override new virtual partial

Literals

A literal is a source code representation of a value.

Literal:
  DecimalLiteral
  IntegerLiteral
  LogicalLiteral
  NullLiteral
  TextLiteral

Literals may be ascribed with a type to override the default type ascription.

Decimal Literals

Decimal literals are used to write real-number values.

DecimalLiteral:
  DecimalDigits . DecimalDigits

Examples of decimal literals follow:

0.0
12.3
999999999999999.999999999999999

Integer Literals

Integer literals are used to write integral values.

IntegerLiteral:
  -opt  DecimalDigits

Examples of integer literals follow:

0
123
999999999999999999999999999999
-42
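
The two numeric literal forms can be recognized with a pair of regular expressions. The Python sketch below mirrors the productions given in the Decimal Literals and Integer Literals sections (digits on both sides of the point, and an optional leading minus sign for integers only); the name classify_number is hypothetical.

```python
import re

DECIMAL = re.compile(r"[0-9]+\.[0-9]+")   # DecimalDigits . DecimalDigits
INTEGER = re.compile(r"-?[0-9]+")         # optional '-' then DecimalDigits


def classify_number(text):
    """Return the literal kind that text matches in full, or None."""
    if DECIMAL.fullmatch(text):
        return "DecimalLiteral"
    if INTEGER.fullmatch(text):
        return "IntegerLiteral"
    return None
```

Under these productions, inputs such as 1. and .5 match neither form, because digits are required on both sides of the decimal point.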

Logical Literals

Logical literals are used to write logical values.

LogicalLiteral: one of
  true  false

Examples of logical literals follow:

true
false

Text Literals

Mg supports two forms of Text literals: regular text literals and verbatim text literals. In certain contexts, text literals must be of length one (single characters). However, Mg does not distinguish syntactically between strings and characters.

A regular text literal consists of zero or more characters enclosed in single or double quotes, as in "hello" or 'hello', and may include both simple escape sequences (such as \t for the tab character), and hexadecimal and Unicode escape sequences.

A verbatim text literal consists of a “commercial at” character (@) followed by a single- or double-quote character (' or "), zero or more characters, and a closing quote character that matches the opening one. A simple example is @"hello". In a verbatim text literal, the characters between the delimiters are interpreted exactly as they occur in the compilation unit, the only exception being a SingleQuotedVerbatimCharacterEscape or a DoubleQuotedVerbatimCharacterEscape, depending on the opening quote. In particular, simple escape sequences, and hexadecimal and Unicode escape sequences are not processed in verbatim text literals. A verbatim text literal may span multiple lines.
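
The interpretation of a verbatim text literal can be sketched in Python as follows. The sketch assumes that the only escape is the opening quote character written twice, standing for one occurrence of that character; the name decode_verbatim is hypothetical.

```python
def decode_verbatim(literal):
    """Return the text value denoted by a verbatim text literal."""
    if len(literal) < 3 or literal[0] != "@" or literal[1] not in "'\"":
        raise ValueError("not a verbatim text literal")
    quote = literal[1]
    if literal[-1] != quote:
        raise ValueError("missing closing quote")
    body = literal[2:-1]
    out = []
    i = 0
    while i < len(body):
        if body[i] == quote:
            # Inside the body, a quote is only legal as a doubled pair.
            if body[i:i + 2] != quote * 2:
                raise ValueError("unescaped quote inside literal")
            out.append(quote)
            i += 2
        else:
            # Everything else, including backslashes and new lines, is literal.
            out.append(body[i])
            i += 1
    return "".join(out)
```

For instance, the literal @"a\tb" denotes the four characters a, backslash, t, b, because simple escape sequences are not processed in verbatim literals.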

A simple escape sequence represents a Unicode character encoding, as described in the following table.

Escape Sequence   Character Name    Unicode Encoding
\'                Single quote      0x0027
\"                Double quote      0x0022
\\                Backslash         0x005C
\0                Null              0x0000
\a                Alert             0x0007
\b                Backspace         0x0008
\f                Form feed         0x000C
\n                New line          0x000A
\r                Carriage return   0x000D
\t                Horizontal tab    0x0009
\v                Vertical tab      0x000B

Since Mg uses a 16-bit encoding of Unicode code points in Text values, a Unicode character in the range U+10000 to U+10FFFF is not considered a Text literal of length one (a single character), but is represented using a Unicode surrogate pair in a Text literal.
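
The effect of the 16-bit encoding on text length can be checked with a short Python sketch. It counts UTF-16 code units, which is what the paragraph above describes; the function names are hypothetical.

```python
def text_length(value):
    """Number of 16-bit code units needed to represent value."""
    # Each UTF-16 code unit occupies two bytes.
    return len(value.encode("utf-16-le")) // 2


def is_single_character(value):
    """True if value occupies exactly one 16-bit code unit."""
    return text_length(value) == 1
```

A character such as U+2323 has length one, while a character in the range U+10000 to U+10FFFF has length two because it is stored as a surrogate pair.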

Unicode characters with code points above 0x10FFFF are not supported.

Multiple translations are not performed. For instance, the text literal \u005Cu005C is equivalent to the six characters \u005C rather than to the single character \. The Unicode value U+005C is the character \.

A hexadecimal escape sequence represents a single Unicode character, with the value formed by the hexadecimal number following the prefix.

TextLiteral:
  ' SingleQuotedCharactersopt '
  " DoubleQuotedCharactersopt "
  @ ' SingleQuotedVerbatimCharactersopt  '
  @ " DoubleQuotedVerbatimCharactersopt   "
CharacterEscape:
  CharacterEscapeHex
  CharacterEscapeSimple
  CharacterEscapeUnicode
Character:
  CharacterSimple
  CharacterEscape
Characters:
  Character
  Characters Character
CharacterEscapeHex:
  CharacterEscapeHexPrefix  HexDigit
  CharacterEscapeHexPrefix  HexDigit HexDigit
  CharacterEscapeHexPrefix  HexDigit  HexDigit  HexDigit
  CharacterEscapeHexPrefix  HexDigit  HexDigit  HexDigit  HexDigit
CharacterEscapeHexPrefix: one of
  \x  \X
CharacterEscapeSimple:
  \ CharacterEscapeSimpleCharacter
CharacterEscapeSimpleCharacter: one of
  '  "  \  0  a  b  f  n  r  t  v
CharacterEscapeUnicode:
  \u HexDigit  HexDigit  HexDigit  HexDigit
  \U HexDigit  HexDigit  HexDigit  HexDigit  HexDigit  HexDigit  HexDigit  HexDigit
DoubleQuotedCharacter:
  DoubleQuotedCharacterSimple
  CharacterEscape
DoubleQuotedCharacters:
  DoubleQuotedCharacter
  DoubleQuotedCharacters  DoubleQuotedCharacter
DoubleQuotedCharacterSimple: none of
  "
  \
  NewLineCharacter
SingleQuotedCharacterSimple: none of
  '
  \
  NewLineCharacter
DoubleQuotedVerbatimCharacter:
  none of  "
  DoubleQuotedVerbatimCharacterEscape
DoubleQuotedVerbatimCharacterEscape:
  " "
DoubleQuotedVerbatimCharacters:
  DoubleQuotedVerbatimCharacter
  DoubleQuotedVerbatimCharacters  DoubleQuotedVerbatimCharacter
SingleQuotedVerbatimCharacter:
  none of  '
  SingleQuotedVerbatimCharacterEscape
SingleQuotedVerbatimCharacterEscape:
  ' '
SingleQuotedVerbatimCharacters:
  SingleQuotedVerbatimCharacter
  SingleQuotedVerbatimCharacters  SingleQuotedVerbatimCharacter

Examples of text literals follow:

'a'
'\u2323'
'\x2323'
'2323'
"Hello World"
@"""Hello,
World"""
"\u2323"

Null Literal

The null literal is equal to no other value.

NullLiteral:
  null

An example of the null literal follows:

null

Operators and Punctuators

There are several kinds of operators and punctuators. Operators are used in expressions to describe operations involving one or more operands. For example, the expression a + b uses the + operator to add the two operands a and b. Punctuators are for grouping and separating.

OperatorOrPunctuator: one of
  [ ]  ( )  .  ,  :  ;  ?  =  =>  +  -  *  &  |  ^  { }  #  ..  @  '  "

Pre-processing Directives

Pre-processing directives provide the ability to conditionally skip sections of source files, to report error and warning conditions, and to delineate distinct regions of source code as a separate pre-processing step.

PPDirective:
  PPDeclaration
  PPConditional
  PPDiagnostic
  PPRegion

The following pre-processing directives are available:

  • #define and #undef, which are used to define and undefine, respectively, conditional compilation symbols.

  • #if, #else, and #endif, which are used to conditionally skip sections of source code.

A pre-processing directive always occupies a separate line of source code and always begins with a # character and a pre-processing directive name. White space may occur before the # character and between the # character and the directive name.

A source line containing a #define, #undef, #if, #else, or #endif directive may end with a single-line comment. Delimited comments (the /* */ style of comments) are not permitted on source lines containing pre-processing directives.

Pre-processing directives are neither tokens nor part of the syntactic grammar of Mg. However, pre-processing directives can be used to include or exclude sequences of tokens and can in that way affect the meaning of an Mg program. For example, when pre-processed, the source text:

#define A
#undef B
language C
{
#if A
    syntax F = "ABC";
#else
    syntax G = "HIJ";
#endif
#if B
    syntax H = "KLM";
#else
    syntax I = "DEF";
#endif
}

results in the exact same sequence of tokens as the source text:

language C
{
    syntax F = "ABC";
    syntax I = "DEF";
}

Thus, whereas lexically the two programs are quite different, syntactically they are identical.

Conditional Compilation Symbols

The conditional compilation functionality provided by the #if, #else, and #endif directives is controlled through pre-processing expressions and conditional compilation symbols.

ConditionalSymbol:
  Any IdentifierOrKeyword except true or false

A conditional compilation symbol has two possible states: defined or undefined. At the beginning of the lexical processing of a source file, a conditional compilation symbol is undefined unless it has been explicitly defined by an external mechanism (such as a command-line compiler option). When a #define directive is processed, the conditional compilation symbol named in that directive becomes defined in that source file. The symbol remains defined until an #undef directive for that same symbol is processed, or until the end of the source file is reached. An implication of this is that #define and #undef directives in one source file have no effect on other source files in the same program.

When referenced in a pre-processing expression, a defined conditional compilation symbol has the Logical value true, and an undefined conditional compilation symbol has the Logical value false. There is no requirement that conditional compilation symbols be explicitly declared before they are referenced in pre-processing expressions. Instead, undeclared symbols are simply undefined and thus have the value false.

Conditional compilation symbols can only be referenced in #define and #undef directives and in pre-processing expressions.

Pre-processing Expressions

Pre-processing expressions can occur in #if directives. The operators !, ==, !=, &&, and || are permitted in pre-processing expressions, and parentheses may be used for grouping.

PPExpression:
  Whitespaceopt PPOrExpression Whitespaceopt
PPOrExpression:
  PPAndExpression
  PPOrExpression Whitespaceopt || Whitespaceopt PPAndExpression
PPAndExpression:
  PPEqualityExpression
  PPAndExpression Whitespaceopt && Whitespaceopt PPEqualityExpression
PPEqualityExpression:
  PPUnaryExpression
  PPEqualityExpression Whitespaceopt == Whitespaceopt PPUnaryExpression
  PPEqualityExpression Whitespaceopt != Whitespaceopt PPUnaryExpression
PPUnaryExpression:
  PPPrimaryExpression
  ! Whitespaceopt PPUnaryExpression
PPPrimaryExpression:
  true
  false
  ConditionalSymbol
  ( Whitespaceopt PPExpression Whitespaceopt )

When referenced in a pre-processing expression, a defined conditional compilation symbol has the Logical value true, and an undefined conditional compilation symbol has the Logical value false.

Evaluation of a pre-processing expression always yields a Logical value. The rules of evaluation for a pre-processing expression are the same as those for a constant expression, except that the only user-defined entities that can be referenced are conditional compilation symbols.
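
The PPExpression grammar maps directly onto a small recursive-descent evaluator. The Python sketch below is illustrative only: symbols are assumed to have the regular identifier form, the set of defined symbols is passed in by the caller, and the name evaluate is hypothetical.

```python
import re

TOKEN = re.compile(r"\s*(\|\||&&|==|!=|!|\(|\)|[A-Za-z_][A-Za-z_$0-9]*)")


def evaluate(expression, defined):
    """Evaluate a pre-processing expression against a set of defined symbols."""
    text, pos, tokens = expression.rstrip(), 0, []
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    tokens.append(None)  # end-of-input marker
    index = 0

    def peek():
        return tokens[index]

    def advance():
        nonlocal index
        token = tokens[index]
        index += 1
        return token

    def primary():
        token = advance()
        if token == "(":
            value = or_expr()
            if advance() != ")":
                raise ValueError("expected )")
            return value
        if token in ("true", "false"):
            return token == "true"
        if token is None or not (token[0].isalpha() or token[0] == "_"):
            raise ValueError("expected an operand")
        return token in defined  # defined -> true, undefined -> false

    def unary():
        if peek() == "!":
            advance()
            return not unary()
        return primary()

    def equality():
        value = unary()
        while peek() in ("==", "!="):
            op, rhs = advance(), unary()
            value = (value == rhs) if op == "==" else (value != rhs)
        return value

    def and_expr():
        value = equality()
        while peek() == "&&":
            advance()
            rhs = equality()
            value = value and rhs
        return value

    def or_expr():
        value = and_expr()
        while peek() == "||":
            advance()
            rhs = and_expr()
            value = value or rhs
        return value

    result = or_expr()
    if peek() is not None:
        raise ValueError("unexpected trailing input")
    return result
```

Each nested function corresponds to one production, so operator precedence (!, then == and !=, then &&, then ||) falls out of the call structure.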

Declaration Directives

The declaration directives are used to define or undefine conditional compilation symbols.

PPDeclaration:
  Whitespaceopt # Whitespaceopt define Whitespace   ConditionalSymbol   PPNewLine
  Whitespaceopt # Whitespaceopt undef Whitespace   ConditionalSymbol   PPNewLine
PPNewLine:
  Whitespaceopt SingleLineCommentopt NewLine

The processing of a #define directive causes the given conditional compilation symbol to become defined, starting with the source line that follows the directive. Likewise, the processing of an #undef directive causes the given conditional compilation symbol to become undefined, starting with the source line that follows the directive.

A #define may define a conditional compilation symbol that is already defined, without there being any intervening #undef for that symbol. The following example defines a conditional compilation symbol A and then defines it again.

#define A
#define A

A #undef may “undefine” a conditional compilation symbol that is not defined. The following example defines a conditional compilation symbol A and then undefines it twice; although the second #undef has no effect, it is still valid.

#define A
#undef A
#undef A
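
Because a symbol is either defined or undefined, the declaration directives behave like insertion into and removal from a set. The following Python sketch illustrates this; it assumes each input string is one source line without its line terminator, and the name apply_declarations is hypothetical.

```python
import re

# Optional white space, '#', optional white space, the directive name,
# a symbol, and an optional trailing single-line comment.
DECLARATION = re.compile(
    r"\s*#\s*(define|undef)\s+([A-Za-z_][A-Za-z_$0-9]*)\s*(?://.*)?"
)


def apply_declarations(lines, defined=()):
    """Return the set of symbols defined after processing lines in order."""
    symbols = set(defined)
    for line in lines:
        match = DECLARATION.fullmatch(line)
        if match is None:
            continue  # not a declaration directive
        if match.group(1) == "define":
            symbols.add(match.group(2))      # defining twice changes nothing
        else:
            symbols.discard(match.group(2))  # undefining twice changes nothing
    return symbols
```

The defined parameter stands in for symbols supplied by an external mechanism such as a command-line option.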

Conditional Compilation Directives

The conditional compilation directives are used to conditionally include or exclude portions of a source file.

PPConditional:
  PPIfSection  PPElseSectionopt  PPEndif
PPIfSection:
  Whitespaceopt # Whitespaceopt if Whitespace   PPExpression   PPNewLine ConditionalSectionopt
PPElseSection:
  Whitespaceopt # Whitespaceopt else PPNewLine ConditionalSectionopt
PPEndif:
  Whitespaceopt # Whitespaceopt endif PPNewLine
ConditionalSection:
  InputSection
  SkippedSection
SkippedSection:
  SkippedSectionPart
  SkippedSection  SkippedSectionPart
SkippedSectionPart:
  SkippedCharactersopt  NewLine
  PPDirective
SkippedCharacters:
  Whitespaceopt  NotNumberSign InputCharactersopt
NotNumberSign:
  Any InputCharacter except #

As indicated by the syntax, conditional compilation directives must be written as sets consisting of, in order, an #if directive, zero or one #else directive, and an #endif directive. Between the directives are conditional sections of source code. Each section is controlled by the immediately preceding directive. A conditional section may itself contain nested conditional compilation directives provided these directives form complete sets.

A PPConditional selects at most one of the contained ConditionalSections for normal lexical processing:

  • The PPExpressions of the #if directives are evaluated in order until one yields true. If an expression yields true, the ConditionalSection of the corresponding directive is selected.

  • If all PPExpressions yield false, and if an #else directive is present, the ConditionalSection of the #else directive is selected.

  • Otherwise, no ConditionalSection is selected.

The selected ConditionalSection, if any, is processed as a normal InputSection: the source code contained in the section must adhere to the lexical grammar; tokens are generated from the source code in the section; and pre-processing directives in the section have the prescribed effects.

The remaining ConditionalSections, if any, are processed as SkippedSections: except for pre-processing directives, the source code in the section need not adhere to the lexical grammar; no tokens are generated from the source code in the section; and pre-processing directives in the section must be lexically correct but are not otherwise processed. Within a ConditionalSection that is being processed as a SkippedSection, any nested ConditionalSections (contained in nested #if...#endif and #region...#endregion constructs) are also processed as SkippedSections.
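
The selection of conditional sections can be sketched as a line-based filter in Python. This is a simplified illustration, not a conforming implementation: conditions are limited to true, false, or a symbol, each optionally prefixed by !; multi-line comments and text literals are not recognized; and the names preprocess and condition are hypothetical.

```python
import re

DIRECTIVE = re.compile(
    r"\s*#\s*(define|undef|if|else|endif)\b\s*([^/]*?)\s*(?://.*)?"
)


def condition(expr, symbols):
    """Evaluate a restricted condition: [!] (true | false | Symbol)."""
    expr = expr.strip()
    if expr.startswith("!"):
        return not condition(expr[1:], symbols)
    if expr in ("true", "false"):
        return expr == "true"
    return expr in symbols


def preprocess(lines, defined=()):
    """Return the lines that belong to selected (non-skipped) sections."""
    symbols = set(defined)
    stack = []     # one (parent_active, if_was_true) pair per open #if
    active = True  # whether the current section is being processed
    output = []
    for line in lines:
        match = DIRECTIVE.fullmatch(line)
        if match is None:
            if active:
                output.append(line)
            continue
        name, arg = match.group(1), match.group(2)
        if name == "if":
            taken = active and condition(arg, symbols)
            stack.append((active, taken))
            active = taken
        elif name in ("else", "endif"):
            if not stack:
                raise ValueError(f"#{name} without matching #if")
            if name == "else":
                parent_active, if_was_true = stack[-1]
                active = parent_active and not if_was_true
            else:
                active, _ = stack.pop()
        elif active and name == "define":
            symbols.add(arg)
        elif active and name == "undef":
            symbols.discard(arg)
    if stack:
        raise ValueError("missing #endif")
    return output
```

Directives inside a skipped section are recognized so that nesting stays balanced, but #define and #undef take effect only in sections that are actually selected.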

Except for pre-processing directives, skipped source code is not subject to lexical analysis. For example, the following is valid despite the unterminated comment in the #else section:

#define Debug        // Debugging on
module HelloWorld {
    language HelloWorld {
        syntax Main =
#if Debug
             "Hello World"
         ;
#else
         /* Unterminated comment!
#endif
    }
}

Note that pre-processing directives are required to be lexically correct even in skipped sections of source code.

Pre-processing directives are not processed when they appear inside multi-line input elements. For example, the program:

module HelloWorld {
    language HelloWorld {
        syntax Main = @'
#if Debug
            "Hello World"
        ;
#else
        /* Unterminated comment!
#endif'
    }
}

generates a language which recognizes the value:

#if Debug
            "Hello World"
        ;
#else
        /* Unterminated comment!
#endif

In peculiar cases, the set of pre-processing directives that is processed might depend on the evaluation of the PPExpression. The example:

#if X
    /*
#else
    /* */ syntax Q = empty;
#endif

always produces the same token stream (syntax Q = empty;) regardless of whether or not X is defined. If X is defined, the only processed directives are #if and #endif, due to the multi-line comment. If X is undefined, then three directives (#if, #else, #endif) are part of the directive set.
