Chapter 7. Introduction to MGrammar Language

Text is often the most natural way to represent information for presentation and editing by people. However, the ability to extract that information for use by software has been an arcane art practiced only by the most advanced developers. The success of XML is evidence that there is significant demand for using text to represent information—this evidence is even more compelling considering the relatively poor readability of XML syntax and the decade-long challenge to make XML-based information easily accessible to programs and stores. The emergence of simpler technologies like JSON and the growing use of meta-programming facilities in Ruby to build textual domain specific languages (DSLs) such as Ruby on Rails or Rake speak to the desire for natural textual representations of information. However, even these technologies limit the expressiveness of the representation by relying on fixed formats to encode all information uniformly, resulting in text that has very few visual cues from the problem domain (much like XML).

The MGrammar Language (Mg) was created to enable information to be represented in a textual form that is tuned for both the problem domain and the target audience. The Mg language provides simple constructs for describing the shape of a textual language—that shape includes the input syntax as well as the structure and contents of the underlying information. To that end, Mg acts as both a schema language that can validate that textual input conforms to a given language as well as a transformation language that projects textual input into data structures that are amenable to further processing or storage. The data that results from Mg processing is compatible with Mg’s sister language, the “Oslo” Modeling Language, “M,” which provides a SQL-compatible schema and query language that can be used to further process the underlying information.

Language Basics

An Mg-based language definition consists of one or more named rules, each of which describes some part of the language. The following fragment is a simple language definition:

language HelloLanguage {
  syntax Main = "Hello, World";
}

The language being specified is named HelloLanguage and it is described by one rule named Main. A language may contain more than one rule; the name Main is used to designate the initial rule that all input documents must match in order to be considered valid with respect to the language.

Rules use patterns to describe the set of input values that the rule applies to. The Main rule above has only one pattern, "Hello, World", which describes exactly one legal input value:

Hello, World

If that input is fed to the Mg processor for this language, the processor will report that the input is valid. Any other input will cause the processor to report the input as invalid.

Typically, a rule will use multiple patterns to describe alternative input formats that are logically related. For example, consider this language:

language PrimaryColors {
  syntax Main = "Red" | "Green" | "Blue";
}

The Main rule has three patterns—input must conform to one of these patterns in order for the rule to apply. That means that the following is valid:

Red

as well as this:

Green

and this:

Blue

No other input values are valid in this language.

Most patterns in the wild are more expressive than those we’ve seen so far—most patterns combine multiple terms. Every pattern consists of a sequence of one or more grammar terms, each of which describes a set of legal text values. Pattern matching has the effect of consuming the input as it sequentially matches the terms in the pattern. Each term in the pattern consumes zero or more initial characters of input—the remainder of the input is then matched against the next term in the pattern. If the terms in a pattern cannot all be matched, the consumption is “undone” and the original input is used as a candidate for matching against other patterns within the rule.
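
To see this in action, consider the following sketch (the rule and its literals are illustrative and not drawn from the examples above). Given the input abd, the first pattern consumes ab but then fails to match c; that consumption is undone and the input is tried against the second pattern, which succeeds:

// Input "abc" matches the first pattern; input "abd" backtracks to the second.
syntax Main = "ab" "c" | "ab" "d";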

A pattern term can either specify a literal value (like in our first example) or the name of another rule. The following language definition matches the same input as the first example:

language HelloLanguage2 {
  syntax Main = Prefix ", " Suffix;
  syntax Prefix = "Hello";
  syntax Suffix = "World";
}

Like functions in a traditional programming language, rules can be declared to accept parameters. A parameterized rule declares one or more “holes” that must be specified to use the rule. The following is a parameterized rule:

syntax Greeting(salutation, separator) = salutation separator "World";

To use a parameterized rule, one simply provides actual rules as arguments to be substituted for the declared parameters:

syntax Main = Greeting(Prefix, ", ");

A given rule name may be declared multiple times provided each declaration has a different number of parameters. That is, the following is legal:

syntax Greeting(salutation, sep, subject) = salutation sep subject;
syntax Greeting(salutation, sep) = salutation sep "World";
syntax Greeting(sep) = "Hello" sep "World";
syntax Greeting = "Hello" ", " "World";

The selection of which rule is used is determined based on the number of arguments present in the usage of the rule.
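
For example, in the following sketch (the language name Greetings and the Prefix rule are illustrative), the use of Greeting supplies two arguments, so the two-parameter declaration is selected and the language accepts the input "Hi, World":

language Greetings {
  syntax Greeting(salutation, sep) = salutation sep "World";
  syntax Greeting(sep) = "Hello" sep "World";
  syntax Greeting = "Hello" ", " "World";

  // Two arguments are supplied, so the two-parameter Greeting is chosen.
  syntax Main = Greeting(Prefix, ", ");
  syntax Prefix = "Hi";
}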

A pattern may indicate that a given term is optional or may match repeatedly using the standard Kleene operators (?, *, and +). For example, consider this language:

language HelloLanguage3 {
  syntax Main = Prefix ", "? Suffix*;
  syntax Prefix = "Hello";
  syntax Suffix = "World";
}

This language considers all the following to be valid:

Hello
Hello, World
Hello, WorldWorld
HelloWorldWorldWorld

Terms can be grouped using parentheses to indicate that a group of terms must be repeated:

language HelloLanguage4 {
  syntax Main = Prefix (", " Suffix)+;
  syntax Prefix = "Hello";
  syntax Suffix = "World";
}

which considers all the following to be valid input:

Hello, World
Hello, World, World
Hello, World, World, World

The use of the + operator indicates that the group of terms must match at least once.
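
By contrast, a sketch using the * operator instead (the language name HelloLanguage5 is illustrative) would also accept the bare input Hello, because the group may repeat zero times:

language HelloLanguage5 {
  syntax Main = Prefix (", " Suffix)*;
  syntax Prefix = "Hello";
  syntax Suffix = "World";
}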

Character Processing

In the previous examples of the HelloLanguage, the pattern term for the comma separator included a trailing space. That trailing space was significant, as it allowed the input text to include a space after the comma:

Hello, World

More importantly, the pattern indicates that the space is not only allowed, but is required. That is, the following input is not valid:

Hello,World

Moreover, exactly one space is required, making this input invalid as well:

Hello,   World

To allow any number of spaces to appear either before or after the comma, we could have written the rule like this:

syntax Main = "Hello" " "* "," " "* "World";

While this is correct, in practice most languages have many places where secondary text, such as whitespace or comments, can be interleaved with constructs that are primary in the language. To simplify specifying such languages, a language may specify one or more named interleave patterns.

An interleave pattern specifies text streams that are not considered part of the primary flow of text. When processing input, the Mg processor implicitly injects interleave patterns between the terms in all syntax patterns. For example, consider this language:

language HelloLanguage {
  syntax Main = "Hello"  ","  "World";
  interleave Secondary = " "+;
}

This language now accepts any number of space characters before or after the comma. That is,

Hello,World
Hello, World
Hello   ,             World

are all valid with respect to this language.

Interleave patterns simplify defining languages that have secondary text like whitespace and comments. However, many languages have constructs in which such interleaving needs to be suppressed. To specify that a given rule is not subject to interleave processing, the rule is written as a token rule rather than a syntax rule.

Token rules identify the lowest level textual constructs in a language—by analogy token rules identify words, and syntax rules identify sentences. Like syntax rules, token rules use patterns to identify sets of input values. Here’s a simple token rule:

token BinaryValueToken  = ("0" | "1")+;

It identifies sequences of 0 and 1 characters much like this similar syntax rule:

syntax BinaryValueSyntax = ("0" | "1")+;

The main distinction between the two rules is that interleave patterns do not apply to token rules. That means that if the following interleave rule was in effect:

interleave IgnorableText = " "+;

then the following input value:

0 1011 1011

would be valid with respect to the BinaryValueSyntax rule but not with respect to the BinaryValueToken rule, as interleave patterns do not apply to token rules.

Mg provides a shorthand notation for expressing alternatives that consist of a range of Unicode characters. For example, the following rule:

token AtoF = "A" | "B" | "C" | "D" | "E" | "F";

can be rewritten using the range operator as follows:

token AtoF = "A".."F";

Ranges and alternation can compose to specify multiple non-contiguous ranges:

token AtoGnoD = "A".."C" | "E".."G";

which is equivalent to this longhand form:

token AtoGnoD = "A" | "B" | "C" | "E" | "F" | "G";

Note that the range operator only works with text literals that are exactly one character in length.

The patterns in token rules have a few additional features that are not valid in syntax rules. Specifically, a token pattern can be negated to match anything not included in a given set by using the difference operator (-). The following example combines difference with any, a term that matches any single character. The expression below matches any single character that is not one of the listed vowels:

any - ('A'|'E'|'I'|'O'|'U')
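
Such an expression typically appears inside a token rule. The following sketch (the rule name NonVowel is illustrative) accepts runs of one or more characters, none of which is one of the listed vowels:

// Matches one or more characters, excluding the uppercase vowels.
token NonVowel = (any - ('A'|'E'|'I'|'O'|'U'))+;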

Token rules are named and may be referred to by other rules:

token AorBorCorEorForG = (AorBorC | EorForG)+;
token AorBorC = 'A'..'C';
token EorForG = 'E'..'G';

Because token rules are processed before syntax rules, token rules cannot refer to syntax rules:

syntax X = "Hello";
token HelloGoodbye = X | "Goodbye"; // illegal

However, syntax rules may refer to token rules:

token X = "Hello";
syntax HelloGoodbye = X | "Goodbye"; // legal

The Mg processor treats all literals in syntax patterns as anonymous token rules. That means that the previous example is equivalent to the following:

token X = "Hello";
token temp = "Goodbye";
syntax HelloGoodbye = X | temp;

Operationally, the difference between token rules and syntax rules is when they are processed. Token rules are processed first against the raw character stream to produce a sequence of named tokens. The Mg processor then processes the language’s syntax rules against the token stream to determine whether the input is valid and optionally to produce structured data as output. The next section describes how that output is formed.

Output

Mg processing transforms text into structured data. The shape and content of that data is determined by the syntax rules of the language being processed. Each syntax rule consists of a set of productions, each of which consists of a pattern and an optional projection. Patterns, which were discussed in the previous sections, describe a set of legal character sequences that are valid input. Projections describe how the information represented by that input should be produced.

Each production is like a function from text to structured data. The primary way to write projections is to use a simple construction syntax that produces graph-structured data suitable for programs and stores. For example, consider this rule:

syntax Rock =
    "Rock" => Item { Heavy { true }, Solid { true } } ;

This rule has one production that has a pattern that matches "Rock" and a projection that produces the following value (using a notation known as D graphs):

Item {
  Heavy { true },
  Solid { true }
}

Rules can contain more than one production in order to allow different input to produce very different output. Here’s an example of a rule that contains three productions with very different projections:

syntax Contents
    = "Rock" => Item { Heavy { true }, Solid { true } }
    | "Water" => Item { Consumable { true }, Solid { false } }
    | "Hamster" => Pet { Small { true }, Legs { 4 } } ;

When a rule with more than one production is processed, the input text is tested against all of the productions in the rule to determine whether the rule applies. If the input text matches the pattern from exactly one of the rule’s productions, then the corresponding projection is used to produce the result. In this example, when presented with the input text "Hamster", the rule would yield:

Pet {
  Small { true },
  Legs { 4 }
}

as a result.

To allow a rule to match no matter what input it is presented with, a syntax rule may include a production that uses the empty pattern, which is selected if and only if none of the other productions in the rule match:

syntax Contents
    = "Rock" => Item { Heavy { true }, Solid { true } }
    | "Water" => Item { Consumable { true }, Solid { false } }
    | "Hamster" => Pet { Small { true }, Legs { 4 } }
    | empty => NoContent { } ;

When the production with the empty pattern is chosen, no input is consumed as part of the match.

To allow projections to use the input text that was consumed during pattern matching, a variable name can be associated with an individual pattern term by prefixing the term with an identifier followed by a colon. These variable names are then made available to the projection. For example, consider this language:

language GradientLang {
  syntax Main
    = from:Color ", " to:Color => Gradient { Start { from }, End { to } } ;
  token Color
    = "Red" | "Green" | "Blue";
}

Given this input value:

Red, Blue

the Mg processor would produce this output:

Gradient {
  Start { "Red" },
  End { "Blue" }
}

As in all of the projection expressions we’ve looked at so far, literal values may appear in the output graph. The set of literal types supported by Mg, with a few examples of each, follows:

  • Text literals—"ABC", 'ABC'

  • Integer literals—25, -34

  • Real literals—0.0, -5.0E15

  • Logical literals—true, false

  • Null literal—null

The projections we’ve seen so far all attach a label to each graph node in the output (e.g., Gradient, Start, and so on). The label is optional and can be omitted:

syntax Naked = t1:First t2:Second => { t1, t2 };

The label can be an arbitrary string—to allow labels to be escaped, one uses the id operator:

syntax Fancy = t1:First t2:Second => id("Label with Spaces!"){ t1, t2 };

The id operator works with either literal strings or with variables that are bound to input text:

syntax Fancy = name:Name t1:First t2:Second => id(name){ t1, t2 };

Using id with variables allows the labeling of the output data to be driven dynamically from input text rather than statically defined in the language. This example works when the variable name is bound to a literal value. If the variable was bound to a structured node that was returned by another rule, that node’s label can be accessed using the labelof operator:

syntax Fancier = p:Point => id(labelof(p)) { 1, 2, 3 };

The labelof operator returns a string that can be used both in the id operator and as a node value.
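
For instance, the following sketch (the rule names PointInfo, Kind, and Data are illustrative, and Point is assumed to be defined elsewhere) uses labelof(p) as an ordinary value inside the projection rather than as a label:

// The label of the node bound to p becomes a plain text value in the output.
syntax PointInfo = p:Point => Info { Kind { labelof(p) }, Data { p } };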

The projection expressions shown so far have no notion of order. That is, this projection expression:

A { X { 100 }, Y { 200 } }

is semantically equivalent to this:

A { Y { 200 }, X { 100 } }

and implementations of Mg are not required to preserve the order specified by the projection. To indicate that order is significant and must be preserved, brackets are used rather than braces. This means that this projection expression:

A [ X { 100 }, Y { 200 } ]

is not semantically equivalent to this:

A [ Y { 200 }, X { 100 } ]

The use of brackets is common when the sequential nature of information is important and positional access is desired in downstream processing.

Sometimes it is useful to splice the values of one node into another to form a single collection. The valuesof operator returns the values of a node (labeled or unlabeled) as top-level values, which can then be combined with other values as the values of a new node.

syntax ListOfA
     = a:A => [a]
     | list:ListOfA "," a:A => [ valuesof(list), a ];

Here, valuesof(list) returns all the values of the list node, combinable with a to form a new list.
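
As a complete sketch (the language name Letters and the token A are assumptions, since the fragment above does not define rule A), the following grammar, given the input A,A,A, repeatedly splices the existing list and appends the new value, yielding the flat sequence [ "A", "A", "A" ] rather than a nested one:

language Letters {
  // Main re-wraps the spliced values of the list node in a new ordered node.
  syntax Main = list:ListOfA => [ valuesof(list) ];
  syntax ListOfA
     = a:A => [ a ]
     | list:ListOfA "," a:A => [ valuesof(list), a ];
  token A = "A";
}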

Productions that do not specify a projection get the default projection.

For example, consider this simple language whose productions do not specify projections:

language GradientLanguage {
  syntax Main = Gradient | Color;
  syntax Gradient = from:Color " on " to:Color;
  token Color = "Red" | "Green" | "Blue";
}

When presented with the input "Blue on Green" the language processor returns the following output:

Main [ Gradient [ "Blue", " on ", "Green" ] ]

These default semantics allow grammars to be authored rapidly while still yielding understandable output. However, in practice explicit projection expressions provide language designers complete control over the shape and contents of the output.

Modularity

All of the examples shown so far have been “loose Mg” that is taken out of context. To write a legal Mg document, all source text must appear in the context of a module definition. A module defines a top-level namespace for any languages that are defined.

Here is a simple module definition:

module Literals {
  // declare a language
  language Number {
    syntax Main = ('0'..'9')+;
  }
}

In this example, the module defines one language named Literals.Number.

Modules may refer to declarations in other modules by using an import directive to name the module containing the referenced declarations. For a declaration to be referenced by other modules, the declaration must be explicitly exported using an export directive.

Consider this module:

module MyModule {
  import HerModule; // declares HerLanguage

  export MyLanguage1;
  language MyLanguage1 {
    syntax Main = HerLanguage.Options;
  }
  language MyLanguage2 {
    syntax Main = "x"+;
  }
}

Note that only MyLanguage1 is visible to other modules. This makes the following definition of HerModule legal:

module HerModule {
  import MyModule; // declares MyLanguage1
  export HerLanguage;

  language HerLanguage {
    syntax Options = (('a'..'z')+ ('on'|'off'))*;
  }
  language Private { }
}

As this example shows, modules may have circular dependencies.
