2.9. Grammars

Grammar support is formalized with VoiceXML 2.0. This Specification provides support for built-in grammars and explicit grammars. We introduced built-in grammars in 2.1.4, “Grammars: mapping a user response to a value,” on page 31. Explicit grammars can be written inline or referenced externally.

VoiceXML 2.0 states that an interpreter should support at least one of the following three grammar standards:

  • XML form of the W3C Speech Recognition Grammar Format (GRXML),

  • Augmented BNF (ABNF) form of the W3C Speech Recognition Grammar Format,

  • Java Speech Grammar Format (JSGF).

Note that the first two formats, GRXML and ABNF, are both developed by the W3C; they are equivalent in expressive power, and a grammar written in one form can be transformed into the other. We will use the GRXML form in this section. Details of the grammars are covered in Appendix D, “Speech Recognition Grammar Format,” on page 438.

As of the writing of this book, the VoiceXML grammar specification is being replaced with the Speech Recognition Grammar Specification (SRGS). The fundamental concepts covered in this section should remain true but you may discover that certain details of the syntax may change.

2.9.1. General concepts

All grammars, whether GRXML, ABNF, or JSGF, share the same fundamental grammar concepts, but use different syntax to represent them. In this section we'll look at the common grammar concepts and define the terms used by them.

A grammar is a compound set of rules that together define what can be recognized by a grammar processor, where a grammar processor is a software or hardware component that performs recognition on speech utterances or DTMF tones. In other words, the grammar tells the underlying recognizer what utterances to recognize.

A grammar has a header and a body. The grammar header is where the grammar name is defined and depending on the grammar specification, the grammar version, character encoding, language/locale, mode, and root are declared. The grammar body is where rules are defined. Each rule is defined within a rule definition. The scope of a rule can be public or private.
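
To make the header/body split concrete, here is a minimal GRXML skeleton. It is only a sketch: it follows the SRGS 1.0 form of the format, the rule names greeting and name are invented for illustration, and the draft syntax may differ in detail.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Header: version, language/locale, mode, and root rule are
     declared as attributes of the grammar element. The character
     encoding is declared in the XML declaration above. -->
<grammar version="1.0" xml:lang="en-US" mode="voice" root="greeting"
         xmlns="http://www.w3.org/2001/06/grammar">

  <!-- Body: one or more rule definitions. -->
  <rule id="greeting" scope="public">
    hello <ruleref uri="#name"/>
  </rule>

  <!-- No scope attribute, so this rule is private. -->
  <rule id="name">
    <one-of>
      <item>world</item>
      <item>everyone</item>
    </one-of>
  </rule>
</grammar>
```

Only the public greeting rule would be visible to other grammars; name can be referenced only from inside this grammar.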

A rule is a combination of speakable text and references to other rules. Each rule has a rulename, which must be unique within a grammar. When two grammars define rules with the same rulename, an unqualified reference is ambiguous; the ambiguity is resolved by qualifying each rulename reference with its grammar.

There are at least two special rules defined, VOID and NULL; additionally, GARBAGE is defined in the W3C formats (GRXML and ABNF).

  • VOID defines a rule that can never be spoken. Inserting VOID into a sequence will make it unrecognizable.

  • NULL defines a rule that is automatically matched without the user speaking.

  • GARBAGE defines a rule that will match all speech until the end of spoken input, the next rule, or the next token.
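
As a rough sketch of how the special rules might appear in GRXML, the grammar below uses the special attribute of ruleref as defined in SRGS 1.0 (an assumption; earlier drafts may spell this differently). The rule name order and its vocabulary are invented for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar version="1.0" xml:lang="en-US" mode="voice" root="order"
         xmlns="http://www.w3.org/2001/06/grammar">
  <rule id="order" scope="public">
    <!-- GARBAGE absorbs leading filler such as "um, could I get". -->
    <ruleref special="GARBAGE"/>
    <one-of>
      <item>coffee</item>
      <item>tea</item>
      <!-- VOID can never be spoken, so this whole alternative
           ("decaf") is effectively disabled. -->
      <item>decaf <ruleref special="VOID"/></item>
    </one-of>
    <!-- NULL matches without any speech, so it changes nothing
         here; it is mainly useful as a placeholder. -->
    <ruleref special="NULL"/>
  </rule>
</grammar>
```

Placing VOID inside an item is a convenient way to switch off an alternative temporarily without deleting it.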

A rule expansion is a regular expression that defines patterns of tokens, rule references and combinations of these. A token is one or more spoken words that reference a recognizer's vocabulary, a.k.a. lexicon, and the associated pronunciation.

Rules may contain the following constructs:

  • sequence - an expansion used to define an exact phrase,

  • alternative - a set of expansions with optional weighting, to define choices,

  • precedence - used to define grouping,

  • optional rule expansions,

  • repetition operators,

  • recursion (implementation is optional for GRXML and ABNF),

  • tagging - an application aid for providing semantic interpretation, for example allowing an application to use a recognizer's results.
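
Several of these constructs can be seen together in one small rule. The following GRXML sketch assumes SRGS 1.0 syntax; the rule name call and its phrases are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar version="1.0" xml:lang="en-US" mode="voice" root="call"
         xmlns="http://www.w3.org/2001/06/grammar">
  <rule id="call" scope="public">
    <!-- optional expansion: zero or one occurrence -->
    <item repeat="0-1">please</item>
    <!-- sequence: tokens must be spoken in this order -->
    call
    <!-- alternatives with optional weights; each item also
         groups its contents and carries a tag -->
    <one-of>
      <item weight="3">home <tag>home</tag></item>
      <item>the office <tag>office</tag></item>
    </one-of>
    <!-- repetition: "now" may be said up to two times -->
    <item repeat="0-2">now</item>
  </rule>
</grammar>
```

This rule would accept, for example, "call home," "please call the office," and "call home now now."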

2.9.2. Mapping user responses to string values

Grammars take a user response as input and return a string value that represents the match. Using grammars, complex word patterns can be defined and tested for. Here we will look at some examples using GRXML to define our grammars.

A simple example is a bank application where we need to define a grammar that will test for the type of account. Let's assume the bank has three account types: savings, checking, and money market. To accomplish this, we could use the rule shown in Example 2-56.

Example 2-56. Defining a simple bank transaction rule
<rule id="accountType">
  <one-of>
    <item>savings</item>
    <item>checking</item>
    <item>money market</item>
  </one-of>
</rule>

The id attribute is a unique identifier for the rule; in this case it is accountType. The rule simply states that the account type must be one of the three types of accounts.

Similar to the definition of account types, we can make a rule for transaction types, shown in Example 2-57. We will make two transaction types, one for complex transactions that require an amount and another for simple transactions that have no associated amount. Notice the use of the weight attribute on the item element. This attribute lets us associate a relative weight with each item as a hint to the speech recognizer. The default weight is 1.0, so in the rule transactionSimple the phrase "check balance" is four times as likely to occur as "check the balance" or "inquire," and twice as likely to occur as simply "balance." A weight of less than one, e.g. 0.25, makes the corresponding item less likely than an unweighted item. The weights do not need to sum to 1; they are merely relative. In practice, adjusting weights often has no noticeable effect on actual recognition performance.

Example 2-57. Defining weighted rules
<rule id="transactionComplex">
  <one-of>
    <item>transfer</item>
    <item>deposit</item>
    <item>withdraw</item>
  </one-of>
</rule>

<rule id="transactionSimple">
  <one-of>
    <item weight="4">check balance</item>
    <item weight="1">check the balance</item>
    <item weight="2">balance</item>
    <item>inquire</item>
  </one-of>
</rule>
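
Because weights are only relative, the same preferences can be written with values that sum to 1. The weights 4, 1, 2, and 1 (the default for "inquire") total 8, so dividing each by 8 gives the variant below. This is a sketch that assumes the recognizer treats weights purely as ratios, which the text above implies but which a given platform may not guarantee.

```xml
<rule id="transactionSimple">
  <one-of>
    <!-- 4/8 -->
    <item weight="0.5">check balance</item>
    <!-- 1/8 -->
    <item weight="0.125">check the balance</item>
    <!-- 2/8 -->
    <item weight="0.25">balance</item>
    <!-- 1/8: the default weight of 1.0, now made explicit -->
    <item weight="0.125">inquire</item>
  </one-of>
</rule>
```

Note that once the other weights are scaled down, "inquire" needs an explicit weight; left at the default of 1.0 it would become the most likely item.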

Now we can use the grammar rules defined so far and combine them with the built-in grammar rule for currency to make our userAction rule, as shown in Example 2-58. You can reference a rule defined in the same document using the # prefix. An item's repeat attribute can contain a range like "1-2", meaning "this item may be repeated one or two times."

Example 2-58. Referencing other rules within the same document
<rule id="userAction">
  <one-of>
    <item>
      <ruleref uri="#transactionComplex"/>
      <ruleref uri="builtin:/grammar/currency"/>
    </item>
    <item><ruleref uri="#transactionSimple"/></item>
  </one-of>
  <item repeat="1-2">
    <one-of>
      <item>in</item>
      <item>from</item>
      <item>into</item>
      <item>to</item>
    </one-of>
    <item repeat="0-1">my</item>
    <item><ruleref uri="#accountType"/></item>
    <item repeat="0-1">account</item>
  </item>
</rule>

The userAction rule defined in Example 2-58 will recognize any of the following user directives:

  • "Transfer $300 from my savings to checking."

  • "Transfer $20 from my savings account to my money market account."

  • "Check balance in savings account."

  • "Deposit $40 into checking."

If we can adequately explain to the users how they should describe their actions, the above could work. However, if we were to deploy this grammar in production, then we would want to ensure that nothing ambiguous is spoken. We would need to refine our grammar so that, for example, the following would not be considered valid: "Check balance in savings from checking." Writing flexible and unambiguous grammars is not trivial. We'll look at this in more detail in 2.9.4, “Whole-word speech grammars,” on page 113 and 2.9.5, “Natural dialogs and continuous speech grammars,” on page 113, but first we will look at a simpler case of DTMF grammars.
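
Before moving on, here is one way the userAction rule from Example 2-58 might be tightened so that a simple transaction names exactly one account. It is a sketch, not the only possible refinement; accountPhrase is a helper rule name introduced here for illustration, and the accountType, transactionSimple, and transactionComplex rules are assumed to be those of Examples 2-56 and 2-57.

```xml
<grammar version="1.0" xml:lang="en-US" mode="voice" root="userAction"
         xmlns="http://www.w3.org/2001/06/grammar">
  <!-- The rules from Examples 2-56 and 2-57 are assumed to be
       included here unchanged. -->

  <!-- Hypothetical helper: "savings", "my savings account", ... -->
  <rule id="accountPhrase">
    <item repeat="0-1">my</item>
    <ruleref uri="#accountType"/>
    <item repeat="0-1">account</item>
  </rule>

  <rule id="userAction" scope="public">
    <one-of>
      <!-- Simple transactions: exactly one account, after "in". -->
      <item>
        <ruleref uri="#transactionSimple"/>
        in
        <ruleref uri="#accountPhrase"/>
      </item>
      <!-- Complex transactions: an amount and one or two accounts. -->
      <item>
        <ruleref uri="#transactionComplex"/>
        <ruleref uri="builtin:/grammar/currency"/>
        <item repeat="1-2">
          <one-of>
            <item>in</item>
            <item>from</item>
            <item>into</item>
            <item>to</item>
          </one-of>
          <ruleref uri="#accountPhrase"/>
        </item>
      </item>
    </one-of>
  </rule>
</grammar>
```

With this version "Check balance in savings from checking" no longer matches, while the four directives listed above still do. The complex branch remains loose (it would still accept "transfer $5 from savings from checking"), which illustrates why unambiguous grammars take effort.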

2.9.3. The DTMF grammar

The JSGF Specification does not cover DTMF support, while the W3C GRXML and ABNF Specifications state that DTMF grammar processor implementation is optional. In GRXML, DTMF is enabled through the grammar element's mode attribute, which can be voice or dtmf. The Specification does not define a mechanism for a mixed voice and DTMF grammar; instead, each grammar must be defined separately as either voice or DTMF. A DTMF and a voice grammar can both be active at the same time. However, a grammar of one type cannot reference a grammar of the other type.
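
As an illustration of a voice grammar and a DTMF grammar being active at the same time, a field can simply reference one grammar of each mode. The sketch below assumes VoiceXML 2.0 Recommendation syntax; the file names pin-voice.grxml and pin-dtmf.grxml are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="login">
    <field name="pin">
      <prompt>Please say or key in your four digit PIN.</prompt>
      <!-- Two separate grammars, one per mode. Both are active
           while the field is listening, but neither may
           reference rules in the other. -->
      <grammar mode="voice" type="application/srgs+xml"
               src="pin-voice.grxml"/>
      <grammar mode="dtmf" type="application/srgs+xml"
               src="pin-dtmf.grxml"/>
      <filled>
        <prompt>Thank you.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```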

Let's take a look at a DTMF grammar that accepts a four-digit PIN, with an option for the user to get help by pressing the star key followed by nine, i.e. "* 9". Example 2-59 uses GRXML to specify the desired DTMF grammar (we could just as easily have used the built-in digits grammar instead).

Example 2-59. GRXML grammar definition for PIN entry
<grammar mode="dtmf" version="1.0" 
         xmlns="http://www.w3.org/2001/06/grammar">
  <rule id="digit">
    <one-of>
      <item> 0 </item> 
      <item> 1 </item> 
      <item> 2 </item>
      <item> 3 </item>
      <item> 4 </item>
      <item> 5 </item>
      <item> 6 </item>
      <item> 7 </item>
      <item> 8 </item>
      <item> 9 </item>
    </one-of>
  </rule>
  <rule id="pin" scope="public">
    <one-of>
      <item>
        <item repeat="4"><ruleref uri="#digit"/></item>
      </item>
      <item>
        * 9
      </item>
    </one-of>
  </rule>
</grammar>

To build this grammar, we first set the grammar mode to dtmf. Next, we define two rules within our grammar, one for a single digit and the other for the four-digit PIN. The digit rule defines ten items representing the digits 0 through 9. Because no scope is given, digit is private by default, which is appropriate since it is only used locally by the pin rule. The pin rule, on the other hand, is made public so that it can be referenced from other grammars. It has two alternatives, one that matches when four digits are entered (<item repeat="4"><ruleref uri="#digit"/></item>) and another that matches when the "* 9" combination is entered (<item>* 9</item>).
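
For comparison, here is roughly how the same PIN entry might look with the built-in digits grammar instead of a hand-written one. This sketch assumes the parameterized built-in types and the dtmf attribute of link as described in the VoiceXML 2.0 Recommendation; the form id and prompt wording are invented.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="getPin">
    <!-- Pressing "* 9" while this form is active throws the
         help event, which the field's help handler catches. -->
    <link dtmf="* 9" event="help"/>

    <!-- The built-in digits grammar, restricted to 4 digits. -->
    <field name="pin" type="digits?length=4">
      <prompt>
        Enter your four digit PIN, or press star nine for help.
      </prompt>
      <help>
        Your PIN is the four digit number you chose when you
        opened your account.
      </help>
      <filled>
        <prompt>Thank you.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```

The trade-off is control: the explicit grammar in Example 2-59 states exactly which key sequences are legal, while the built-in leaves those details to the platform.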

2.9.4. Whole-word speech grammars

Speech recognizer grammars are used to let an application specify to a speech recognizer what words or word patterns to recognize as well as what language to use for recognition. The specification does not address any of the following:

  • timeouts,

  • recognition thresholds,

  • search sizes,

  • n-best result counts,

  • specification of speaker voice or other meta data,

  • loading of lexicons or word pronunciations.

Taking Example 2-59, it is straightforward to rewrite it as a whole-word grammar as shown in Example 2-60.

Example 2-60. PIN grammar rewritten using whole-word format
<grammar mode="voice" version="1.0"
         xmlns="http://www.w3.org/2001/06/grammar">
  <rule id="digit">
    <one-of>
      <item> zero </item> 
      <item> one </item> 
      <item> two </item>
      <item> three </item>
      <item> four </item>
      <item> five </item>
      <item> six </item>
      <item> seven </item>
      <item> eight </item>
      <item> nine </item>
    </one-of>
  </rule>
  <rule id="pin" scope="public">
    <one-of>
      <item>
        <item repeat="4"><ruleref uri="#digit"/></item>
      </item>
      <item>
        help
      </item>
    </one-of>
  </rule>
</grammar>

2.9.5. Natural dialogs and continuous speech grammars

Having no grammar and passing every word recognized to an application wouldn't be feasible. Nor is it desirable to have a limited dialog. Applications will want to balance between natural dialogs and a finite set of match results that the grammar processor will return. In 2.9.2, “Mapping user responses to string values,” on page 108 we described a grammar needed for a bank transaction application. We can enhance that grammar to accommodate natural dialogs by modifying the userAction rule. This time we will rewrite our grammar using JSGF; this will serve to demonstrate the similarities and differences between JSGF and GRXML.

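As shown in Example 2-61, we modify the userAction rule to allow for more natural dialogs. The rule now begins with a lead-in phrase: "Please," "Kindly," "I'd like to," or "I want to." The intent is that anything the user says before the lead-in is ignored. Note, however, that the special <NULL> rule used here matches without the user speaking (see 2.9.1); absorbing arbitrary leading speech is the job of a GARBAGE-style rule, which JSGF does not define, so whether leading filler is actually skipped depends on the platform. We've also attached two grammar tags: {xa} represents that either transactionSimple or transactionComplex is selected, and {exit} represents that exit is selected. This will come into play later when we look at returning results back to the VoiceXML document.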

Example 2-61. A JSGF banking grammar
#JSGF V1.0;

grammar userAction;

public <userAction> = <NULL> (Please | Kindly | I'd like to | I want to)
    ((<transactionComplex> | <transactionSimple>) {xa} | exit {exit});

<accountType> = savings | checking | money market;

<transactionSimple> = (/4/ check balance | /3/ check the balance |
    /2/ balance | /1/ inquire) ([in] | [in my])
    <accountType> [account];

<transactionComplex> =
    (transfer <builtin:/grammar/currency> from [my] <accountType> [account]
      (to | in | into) [my] <accountType> [account])
  | (transfer <builtin:/grammar/currency> (to | into) [my]
      <accountType> [account] from [my] <accountType> [account])
  | (withdraw <builtin:/grammar/currency> ([from] | [from my]) <accountType>
      [account] [and] deposit ([in] | [into]) [my] <accountType>
      [account])
  | (deposit <builtin:/grammar/currency> ([in] | [into]) [my] <accountType>
      [account] [and] withdraw ([from] | [from my]) <accountType>
      [account]);

Some of the possible natural-sounding dialogs accepted by our new grammar are:

  • "Please transfer $22.50 from my savings account to my checking."

  • "Yes, I'd like to check the balance in my money market account."

  • "With the greatest of pleasure I'd like to deposit $500 into my savings account."

  • "At this point I'd like to transfer $20 into my checking account from my savings account."

  • "What is the balance in my money market account?"
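
For comparison with the JSGF version, the top-level userAction rule might be written in GRXML roughly as follows. This is a sketch under two assumptions: that the platform supports the GARBAGE special rule with SRGS 1.0 syntax, and that transactionComplex and transactionSimple have been translated to GRXML separately.

```xml
<grammar version="1.0" xml:lang="en-US" mode="voice" root="userAction"
         xmlns="http://www.w3.org/2001/06/grammar">
  <!-- GRXML versions of transactionComplex, transactionSimple,
       and accountType are assumed to be defined here. -->

  <rule id="userAction" scope="public">
    <!-- Absorb leading filler such as "Yes," or "At this point". -->
    <ruleref special="GARBAGE"/>
    <!-- The lead-in phrase. -->
    <one-of>
      <item>please</item>
      <item>kindly</item>
      <item>I'd like to</item>
      <item>I want to</item>
    </one-of>
    <!-- Either a transaction (tagged xa) or exit (tagged exit). -->
    <one-of>
      <item>
        <one-of>
          <item><ruleref uri="#transactionComplex"/></item>
          <item><ruleref uri="#transactionSimple"/></item>
        </one-of>
        <tag>xa</tag>
      </item>
      <item>exit <tag>exit</tag></item>
    </one-of>
  </rule>
</grammar>
```

The structure is the same in both formats; the visible differences are the XML verbosity, the use of nested one-of elements for grouping, and the availability of GARBAGE for filler.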

2.9.6. A comparative look at different grammars

Table 2-2 summarizes some of the popular grammar markup languages and the syntax they use for building and describing grammar rules.

The rule types and meanings used in the table are as follows:

rule name

Shows how to define a rule; "ruleName" is the name assigned to the rule.

rule reference

Shows how one rule can reference another rule (e.g. rule r1 references rule color).

composition

Shows how a rule can be composed of two or more rules (e.g. rule r1 references the color and object rules).

alternatives

Shows how a rule can be one of a number of choices (e.g. r1 can be either red, blue, or green).

grouping

Shows a combination of a rule with one of the choices (e.g. r1 can be Please open, Please close, or Please delete).

optional grouping

Shows a group with optional elements (e.g. r1 can be Please confirm your answer, or Please confirm answer, or confirm your answer, or confirm answer).

weights

Shows how to specify relative probabilities of elements. Higher values indicate greater likelihood of element occurrence (e.g. blue is more likely to occur than green, which is more likely to occur than red).

grammar tags

Shows how to allow a grammar to return a "{tag}" associated with a word match (e.g. the grammar will return {fruit} if apple matches and {dairy} if milk matches).

repetition

Shows how to specify the number of occurrences allowed (e.g. rule digit can occur one or more, zero or more, or between three and six times).

recursion

Shows how a rule can reference itself (e.g. rule palette references color and palette).

scope

Shows how to specify private or public scope for a rule.

alias

Shows a shorthand for references (e.g. contacts is an alias of http://www.domain.com/names.gram).

Table 2-2. Grammar language comparison chart
Each entry lists the syntax for one rule type in GRXML, ABNF, JSGF, and GSL.

rule name
  GRXML:  <rule id="ruleName">
  ABNF:   $ruleName
  JSGF:   <ruleName>
  GSL:    ruleName

rule reference
  GRXML:
    <rule id="r1">
      references the <ruleref uri="#color"/> car
    </rule>
  ABNF:   $r1 = references the $color car
  JSGF:   <r1> = references the <color> car
  GSL:    R1: (references the) color (car)

composition
  GRXML:
    <rule id="r1">
      The <ruleref uri="#object"/> is the color <ruleref uri="#color"/>
    </rule>
  ABNF:   $r1 = The $object is the color $color
  JSGF:   <r1> = The <object> is the color <color>
  GSL:    R1: (The) object (is the) color

alternatives
  GRXML:
    <rule id="r1">
      <one-of>
        <item>red</item>
        <item>blue</item>
        <item>green</item>
      </one-of>
    </rule>
  ABNF:   $r1 = red | blue | green
  JSGF:   <r1> = red | blue | green
  GSL:    R1: [red blue green]

grouping
  GRXML:
    <rule id="r1">
      Please
      <one-of>
        <item>open</item>
        <item>close</item>
        <item>delete</item>
      </one-of>
    </rule>
  ABNF:   $r1 = Please (open | close | delete)
  JSGF:   <r1> = Please (open | close | delete)
  GSL:    R1: (Please) [open close delete]

optional grouping
  GRXML:
    <rule id="r1">
      <item repeat="0-1">Please</item>
      confirm
      <item repeat="0-1">your</item>
      answer
    </rule>
  ABNF:   $r1 = [Please] confirm [your] answer
  JSGF:   <r1> = [Please] confirm [your] answer
  GSL:    R1: (?Please confirm ?your answer)

weights
  GRXML:
    <rule id="r1">
      <one-of>
        <item weight="2">red</item>
        <item weight="8">blue</item>
        <item weight="4">green</item>
      </one-of>
    </rule>
  ABNF:   $r1 = /2/ red | /8/ blue | /4/ green
  JSGF:   <r1> = /2/ red | /8/ blue | /4/ green
  GSL:    R1: [red~2 blue~8 green~4]

grammar tags
  GRXML:
    <rule id="r1">
      <one-of>
        <item>apples <tag>fruit</tag></item>
        <item>milk <tag>dairy</tag></item>
      </one-of>
    </rule>
  ABNF:   $r1 = apples {fruit} | milk {dairy}
  JSGF:   <r1> = apples {fruit} | milk {dairy}
  GSL:    R1: [[apples] {<result "fruit">} [milk] {<result "dairy">}]

repetition (one or more)
  GRXML:  <item repeat="1-"><ruleref uri="#digit"/></item>
  ABNF:   $digit <1->
  JSGF:   <digit>+
  GSL:    +digit

repetition (zero or more)
  GRXML:  <item repeat="0-"><ruleref uri="#digit"/></item>
  ABNF:   $digit <0->
  JSGF:   <digit>*
  GSL:    *digit

repetition (three to six)
  GRXML:  <item repeat="3-6"><ruleref uri="#digit"/></item>
  ABNF:   $digit <3-6>
  JSGF:   no shorthand
  GSL:    no shorthand

recursion
  GRXML:
    <rule id="palette">
      <one-of>
        <item>
          <ruleref uri="#color"/>
        </item>
        <item>
          <ruleref uri="#color"/>
          and
          <ruleref uri="#palette"/>
        </item>
      </one-of>
    </rule>
  ABNF:   $palette = $color | ($color and $palette)
  JSGF:   <palette> = <color> | <color> and <palette>
  GSL:    palette: [color (color and palette)]

scope (public)
  GRXML:  <rule id="r1" scope="public"/>
  ABNF:   public $r1 = ...
  JSGF:   public <r1> = ...
  GSL:    R1:public [...]

scope (private)
  GRXML:  <rule id="r2" scope="private"/>
  ABNF:   private $r2 = ...
  JSGF:   <r2> = ... (no keyword; rules are private unless declared public)
  GSL:    not available

alias
  GRXML:  <alias uri="http://www.domain.com/names.gram" name="contacts">
  ABNF:   alias $(http://www.domain.com/names.gram) $$ contacts
  JSGF:   not available
  GSL:    not supported

2.9.7. Using grammars in your VoiceXML application

Once grammars are defined, you can begin using them within a VoiceXML application. Let's take the userAction grammar we defined earlier in Example 2-61. We can access this grammar from a VoiceXML document, such as the one shown in Example 2-62. Here we have defined a form with a single field and referenced userAction as the field grammar. When the field has received valid speech matching the grammar, the filled section of the form will be activated. Notice that the field variable, mainChoice, now takes one of two values: either "xa" or "exit," as defined by the grammar tags.

Example 2-62. The VoiceXML document that uses the userAction grammar
<?xml version="1.0" encoding="iso-8859-1"?>
<vxml version="1.0">
  <form id="main">
    <field name="mainChoice">
      <prompt>
        Welcome to Virtual Bank's enhanced services.
        Please say the transaction you would like to perform.
      </prompt>
      <grammar src="userAction.jsgf"/>
    </field>
    <filled>
      <if cond="mainChoice=='xa'">
        Your transaction is being processed.
        <elseif cond="mainChoice=='exit'"/>
          Exiting the transaction menu.
        <exit/>
      </if>
    </filled>
  </form>
</vxml>
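
The grammar does not have to live in a separate file. As a sketch of the inline alternative, the field below carries a small GRXML grammar directly inside the VoiceXML grammar element. It assumes VoiceXML 2.0 Recommendation syntax, uses a deliberately reduced vocabulary (the rule name choice and its phrases are invented), and assumes the platform returns the tag string as the field value, as described for Example 2-62.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="main">
    <field name="mainChoice">
      <prompt>Say check balance, or say exit.</prompt>
      <!-- Inline grammar: the rules are written in place
           instead of being fetched through a src attribute. -->
      <grammar version="1.0" xml:lang="en-US" mode="voice"
               root="choice">
        <rule id="choice" scope="public">
          <one-of>
            <item>check balance <tag>xa</tag></item>
            <item>exit <tag>exit</tag></item>
          </one-of>
        </rule>
      </grammar>
      <filled>
        <if cond="mainChoice=='exit'">
          <exit/>
        </if>
      </filled>
    </field>
  </form>
</vxml>
```

Inline grammars are convenient for short, field-specific vocabularies; larger grammars such as userAction are usually easier to maintain and reuse as external files.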

In Example 2-62, we used field-level grammars to return grammar results to a VoiceXML field. It is possible to also have form-level grammars. Form-level grammars are useful when an application wants to give the caller the option to give two or more answers for a single question. In the example we will look at, the caller is prompted to say the city and state. The caller can either initially give both answers at once, or if not, he or she will be prompted separately for the city and state.

Example 2-63 shows the grammar for returning the city and state. It is used by the VoiceXML document shown in Example 2-64. The form cityState uses an initial element to get both city and state. The grammar ensures that the values of the grammar tags {city} and {state} are passed to the VoiceXML document. If either is undefined, the field city and/or the field state will be visited to collect the missing information. Using the same name (i.e. city and state) for the grammar tag and the field guarantees the mapping.

Example 2-63. A grammar that returns city and state
#JSGF V1.0;

grammar citystate;

public <cityandstate> =
        <city> {this.city=$} [<state> {this.state=$}] [please] |
        <state> {this.state=$} [<city> {this.city=$}] [please];

<city> = Los Angeles | San Francisco | San Jose | Albany 
                                             | Buffalo | New York;

<state> = California | New York;

Example 2-64. A VoiceXML document that uses the city/state grammar
<?xml version="1.0" encoding="iso-8859-1"?>
<vxml version="1.0">
  <var name="docCity" expr="''"/>
  <var name="docState" expr="''"/>

  <form id="cityState">
    <grammar src="citystate.gram" type="application/jsgf"/>
    <initial name="getBoth">
      <prompt>Say the city and state</prompt>
      <filled>
        <prompt>You said both!!</prompt>
        <assign name="getBoth" expr="true"/>
      </filled>
      <nomatch>
        <prompt>What city and state?</prompt>
        <reprompt/>
      </nomatch>
    </initial>

    <field name="city">
      <prompt>The city is?</prompt>
      <filled>
        <assign name="docCity" expr="city"/>
      </filled>
    </field>
    <field name="state">
      <prompt>The state is?</prompt>
      <filled>
        <assign name="docState" expr="state"/>
      </filled>
    </field>
    <block>
      <prompt>
        City is set to <value expr="docCity"/>.
        State is set to <value expr="docState"/>.
      </prompt>
    </block>
  </form>
</vxml>

2.9.8. Field types and built-in grammars

The built-in grammars supported by VoiceXML 2.0 are:

  • digits,

  • boolean,

  • currency,

  • date,

  • number,

  • phone,

  • time.

Setting a field element's type attribute to one of the built-in grammar types controls how the field data is interpreted. Built-in grammars must support both voice and DTMF. In Example 2-65 the ticket_num field has the type digits, which ensures that the input will be interpreted as a string of digits.

Example 2-65. A VoiceXML field that uses the built-in digits grammar
<field name="ticket_num" type="digits">
  <prompt> Read the 12 digit number from your ticket. </prompt> 
  <help>The 12 digit number is to the lower left. </help> 
  <filled> 
    <if cond="ticket_num.length != 12"> 
      <prompt> Sorry, I didn't hear exactly 12 digits. </prompt> 
      <clear namelist="ticket_num"/>
    </if>   
  </filled> 
</field>
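
Built-in types can also take parameters, which can replace some hand-written checks. The following sketch assumes the parameterized built-ins of the VoiceXML 2.0 Recommendation (digits?length=12, and a boolean type whose default DTMF keys are 1 for yes and 2 for no); the field name confirmed and the prompt wording are invented.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="ticket">
    <!-- Only exactly 12 digits match, spoken or keyed in, so no
         length test is needed in a filled section. -->
    <field name="ticket_num" type="digits?length=12">
      <prompt>Read the 12 digit number from your ticket.</prompt>
      <nomatch>
        Sorry, I didn't hear exactly 12 digits.
        <reprompt/>
      </nomatch>
    </field>

    <!-- boolean yields true or false from voice or DTMF input. -->
    <field name="confirmed" type="boolean">
      <prompt>
        I heard <value expr="ticket_num"/>. Is that right?
        Say yes or no, or press 1 for yes or 2 for no.
      </prompt>
      <filled>
        <if cond="!confirmed">
          <!-- Start over if the caller rejects the number. -->
          <clear namelist="ticket_num confirmed"/>
        </if>
      </filled>
    </field>
  </form>
</vxml>
```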

As VoiceXML continues to evolve, the specification of built-in grammars may be removed from the VoiceXML specification and added to the grammar language specifications.
