Grammar support is formalized with VoiceXML 2.0. This Specification provides support for built-in grammars and explicit grammars. We introduced built-in grammars in 2.1.4, “Grammars: mapping a user response to a value,” on page 31. Explicit grammars can be written inline or referenced externally.
VoiceXML 2.0 states that an interpreter should support at least one of the following three grammar standards:
XML form of the W3C Speech Recognition Grammar Format (GRXML),
Augmented BNF (ABNF) form of the W3C Speech Recognition Grammar Format,
Java Speech Grammar Format (JSGF).
Note that the first two grammars, GRXML and ABNF, are both developed by the W3C and are completely compatible; each can be transformed into the other. We will use the GRXML form in this section. Details of the grammars are covered in Appendix D, “Speech Recognition Grammar Format,” on page 438.
As of the writing of this book, the VoiceXML grammar specification is being replaced with the Speech Recognition Grammar Specification (SRGS). The fundamental concepts covered in this section should remain true but you may discover that certain details of the syntax may change.
All grammars, whether GRXML, ABNF, or JSGF, share the same fundamental grammar concepts, but use different syntax to represent them. In this section we'll look at the common grammar concepts and define the terms used by them.
A grammar is a compound set of rules that together define what can be recognized by a grammar processor, where a grammar processor is a software or hardware component that performs recognition on speech utterances or DTMF tones. In other words, the grammar tells the underlying recognizer what utterances to recognize.
A grammar has a header and a body. The grammar header is where the grammar name is defined and where, depending on the grammar specification, the grammar version, character encoding, language/locale, mode, and root are declared. The grammar body is where rules are defined. Each rule is defined within a rule definition. The scope of a rule can be public or private.
A rule is a combination of speakable text and references to other rules. Each rule has a rulename, which must be unique within a grammar. A reference to two grammars using the same rulename is ambiguous and is resolved by qualifying the rulename reference to each grammar.
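The header and body described above can be sketched in GRXML as follows. This is a minimal illustration only: the rule names greeting and salutation are hypothetical, and the attribute set shown assumes the W3C XML form used elsewhere in this section.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Header: version, language, mode, and root rule are declared on <grammar>;
     the character encoding comes from the XML declaration above -->
<grammar version="1.0" xml:lang="en-US" mode="voice" root="greeting"
         xmlns="http://www.w3.org/2001/06/grammar">

  <!-- Body: a public rule that other grammars may reference -->
  <rule id="greeting" scope="public">
    <ruleref uri="#salutation"/> world
  </rule>

  <!-- A private rule, visible only inside this grammar -->
  <rule id="salutation" scope="private">
    <one-of>
      <item>hello</item>
      <item>goodbye</item>
    </one-of>
  </rule>
</grammar>
```

A reference from another grammar would qualify the rulename with the grammar's location, for example uri="greetings.grxml#greeting", where the file name is again only an assumption.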
There are at least two special rules defined, VOID and NULL; additionally, GARBAGE is defined in GRXML.
VOID defines a rule that can never be spoken. Inserting VOID into a sequence will make it unrecognizable.
NULL defines a rule that is automatically matched without the user speaking.
GARBAGE defines a rule that will match all speech until the end of spoken input, the next rule, or the next token.
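As a rough illustration of the three special rules in GRXML, the sketch below uses the special attribute of ruleref. The rule name politeYes and the words in it are made up for this example.

```xml
<rule id="politeYes" scope="public">
  <!-- GARBAGE absorbs any speech that comes before the word "yes" -->
  <ruleref special="GARBAGE"/>
  yes
  <one-of>
    <!-- NULL is matched without speech, so nothing needs to follow "yes" -->
    <item><ruleref special="NULL"/></item>
    <item>please</item>
    <!-- VOID can never be spoken, so this alternative is effectively disabled -->
    <item>thanks <ruleref special="VOID"/></item>
  </one-of>
</rule>
```

Inserting VOID is a convenient way to switch off an alternative temporarily without deleting it from the grammar.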
A rule expansion is a regular expression that defines patterns of tokens, rule references and combinations of these. A token is one or more spoken words that reference a recognizer's vocabulary, a.k.a. lexicon, and the associated pronunciation.
Rules may contain the following constructs:
sequence - an expansion used to define an exact phrase,
alternative - a set of expansions with optional weighting, to define choices,
precedence used to define grouping,
optional rule expansions,
repetition operators,
recursion (implementation is optional for GRXML and ABNF),
tagging - an application aid for providing semantic interpretation, for example allowing an application to use a recognizer's results.
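Several of these constructs can appear together in a single rule. The following GRXML sketch is illustrative only: the rule name order and its vocabulary are hypothetical, and a digit rule is assumed to be defined elsewhere in the same grammar.

```xml
<rule id="order" scope="public">
  <!-- optional expansion: repeat="0-1" means zero or one occurrence -->
  <item repeat="0-1">please</item>
  <!-- sequence: these tokens must be spoken in this order -->
  I would like
  <!-- repetition: between one and three digits -->
  <item repeat="1-3"><ruleref uri="#digit"/></item>
  <!-- alternatives with an optional weight, each carrying a tag -->
  <one-of>
    <item weight="2">coffee <tag>hot</tag></item>
    <item>lemonade <tag>cold</tag></item>
  </one-of>
</rule>
```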
Grammars take a user response as input and return a string value that represents the match. Using grammars, complex word patterns can be defined and tested for. Here we will look at some examples using GRXML to define our grammars.
A simple example is a bank application where we need to define a grammar that will test for the type of account. Let's assume the bank has three account types: savings, checking, and money market. To accomplish this, we could use the rule shown in Example 2-56.
    <rule id="accountType">
      <one-of>
        <item>savings</item>
        <item>checking</item>
        <item>money market</item>
      </one-of>
    </rule>
The id attribute is a unique identifier of the rule, in this case it is accountType. The rule simply states that the account type must be one of the three types of accounts.
Similar to the definition of account types, we can make a rule for transaction types, shown in Example 2-57. We will make two transaction types, one for complex transactions that require an amount and another for simple transactions that have no associated amount. Notice the use of the weight attribute on the item element. This attribute lets us associate a relative weight with each item to give a hint to the speech recognizer. The default weight is 1.0, so in the rule transactionSimple the phrase "check balance" is four times as likely to occur as "check the balance" or "inquire," and twice as likely to occur as simply "balance." A weight of less than one, e.g. 0.25, makes the corresponding item less likely than an unweighted item. The weights do not need to sum to 1; they are merely relative. In practice, one often finds that adjusting weights has no noticeable effect on the actual recognition performance.
    <rule id="transactionComplex">
      <one-of>
        <item>transfer</item>
        <item>deposit</item>
        <item>withdraw</item>
      </one-of>
    </rule>
    <rule id="transactionSimple">
      <one-of>
        <item weight="4">check balance</item>
        <item weight="1">check the balance</item>
        <item weight="2">balance</item>
        <item>inquire</item>
      </one-of>
    </rule>
Now we can use the grammar rules defined so far and combine them with the built-in grammar rule for currency to make our userAction rule, as shown in Example 2-58. You can reference a rule defined in the same document using the # prefix. An item's repeat attribute can contain a range like "1-2", meaning "this item may be repeated one or two times."
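A userAction rule along these lines might be sketched as follows. This is not a definitive listing: the URI used to reach the built-in currency grammar is an assumption (platforms differ in how built-ins are referenced from GRXML), and the choice of prepositions and filler words is ours.

```xml
<rule id="userAction" scope="public">
  <one-of>
    <!-- complex transaction: a verb followed by an amount -->
    <item>
      <ruleref uri="#transactionComplex"/>
      <!-- assumed, platform-specific reference to the built-in currency grammar -->
      <ruleref uri="builtin:grammar/currency"/>
    </item>
    <!-- simple transaction: a verb with no amount -->
    <item><ruleref uri="#transactionSimple"/></item>
  </one-of>
  <!-- one or two account phrases, e.g. "from my savings account" -->
  <item repeat="1-2">
    <one-of>
      <item>from</item>
      <item>to</item>
      <item>into</item>
      <item>in</item>
    </one-of>
    <item repeat="0-1">my</item>
    <ruleref uri="#accountType"/>
    <item repeat="0-1">account</item>
  </item>
</rule>
```

Because the account phrase may repeat for either kind of transaction, a sketch like this is deliberately loose: it accepts more phrasings than a production grammar should.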
The action grammar rule defined in Example 2-58 will recognize any of the following user directives:
"Transfer $300 from my savings to checking."
"Transfer $20 from my savings account to my money market account."
"Check balance in savings account."
"Deposit $40 into checking."
If we can adequately explain to the users how they should describe their actions, the above could work. However, if we were to deploy this grammar in production, then we would want to ensure that nothing ambiguous is spoken. We would need to refine our grammar so that, for example, the following would not be considered valid: "Check balance in savings from checking." Writing flexible and unambiguous grammars is not trivial. We'll look at this in more detail in 2.9.4, “Whole-word speech grammars,” on page 113 and 2.9.5, “Natural dialogs and continuous speech grammars,” on page 113, but first we will look at a simpler case of DTMF grammars.
The JSGF Specification does not cover DTMF support, while the W3C GRXML and ABNF Specifications state that DTMF grammar processor implementation is optional. In GRXML, DTMF is enabled through the grammar element's mode attribute, which can be voice or dtmf. The Specification does not define a mechanism for a mixed voice and DTMF grammar; instead, each grammar must be defined separately as either voice or DTMF. A DTMF and a voice grammar can both be active at the same time. However, a grammar of one type cannot reference a grammar of the other type.
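Since a single grammar cannot mix modes, a VoiceXML field that accepts both spoken and keyed input would typically reference two grammars. The fragment below is a sketch under that assumption; the file names pin-voice.grxml and pin-dtmf.grxml are hypothetical.

```xml
<field name="pin">
  <prompt>Please say or key in your four digit PIN.</prompt>
  <!-- Two separate grammars, one per mode; both are active while the field is visited -->
  <grammar mode="voice" src="pin-voice.grxml"/>
  <grammar mode="dtmf" src="pin-dtmf.grxml"/>
</field>
```

Each referenced file declares its own mode in its grammar header, and neither may contain a ruleref into the other.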
Let's take a look at a DTMF grammar to accept a four digit PIN number with an option for the user to get help by entering a star key followed by a nine, i.e. "* 9". Example 2-59 uses GRXML to specify the desired DTMF grammar (we could have easily used the built-in DTMF grammar instead).
    <grammar mode="dtmf" version="1.0" xmlns="http://www.w3.org/2001/06/grammar">
      <rule id="digit">
        <one-of>
          <item> 0 </item>
          <item> 1 </item>
          <item> 2 </item>
          <item> 3 </item>
          <item> 4 </item>
          <item> 5 </item>
          <item> 6 </item>
          <item> 7 </item>
          <item> 8 </item>
          <item> 9 </item>
        </one-of>
      </rule>
      <rule id="pin" scope="public">
        <one-of>
          <item>
            <item repeat="4"><ruleref uri="#digit"/></item>
          </item>
          <item> * 9 </item>
        </one-of>
      </rule>
    </grammar>
To accomplish this, we first need to set the grammar mode to dtmf. Next, we need to define two rules within our grammar, one for a single digit and the other for the four digit PIN. The rule for one digit defines ten items representing the numbers 0 through 9. Because no scope is given, digit is private by default, which is appropriate since it is only used locally by the pin rule. The pin rule, on the other hand, is made public so that it can be referenced from other grammars. It has two items defined, one that matches when four digits are entered (<item repeat="4"><ruleref uri="#digit"/></item>) and another that matches when the "* 9" combination is entered (<item> * 9 </item>).
Speech recognizer grammars are used to let an application specify to a speech recognizer what words or word patterns to recognize as well as what language to use for recognition. The specification does not address any of the following:
timeouts,
recognition thresholds,
search sizes,
n-best result counts,
specification of speaker voice or other meta data,
loading of lexicons or word pronunciations.
Taking Example 2-59, it is straightforward to rewrite it as a whole-word grammar as shown in Example 2-60.
    <grammar mode="voice" version="1.0" xmlns="http://www.w3.org/2001/06/grammar">
      <rule id="digit">
        <one-of>
          <item> zero </item>
          <item> one </item>
          <item> two </item>
          <item> three </item>
          <item> four </item>
          <item> five </item>
          <item> six </item>
          <item> seven </item>
          <item> eight </item>
          <item> nine </item>
        </one-of>
      </rule>
      <rule id="pin" scope="public">
        <one-of>
          <item>
            <item repeat="4"><ruleref uri="#digit"/></item>
          </item>
          <item> help </item>
        </one-of>
      </rule>
    </grammar>
Having no grammar and passing every word recognized to an application wouldn't be feasible. Nor is it desirable to have a limited dialog. Applications will want to balance between natural dialogs and a finite set of match results that the grammar processor will return. In 2.9.2, “Mapping user responses to string values,” on page 108 we described a grammar needed for a bank transaction application. We can enhance that grammar to accommodate natural dialogs by modifying the action rule. This time we will rewrite our grammar using JSGF; this will serve to demonstrate the similarities and differences between JSGF and GRXML.
As shown in Example 2-61, we modify the userAction rule grammar to allow for more natural dialogs. Using the special <NULL> rule, all utterances will be consumed up until the moment the user says "Please," "Kindly," "I'd like to," or "I want to." We've also attached two grammar tags: {xa} represents that either transactionSimple or transactionComplex is selected, and {exit} represents that exit is selected. This will come into play later when we look at returning results back to the VoiceXML document.
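For comparison, a similar effect could be sketched in GRXML, where the special rule that absorbs arbitrary leading speech is GARBAGE. The rule name userActionNatural is hypothetical, and the tag contents are treated here as plain strings, which is an assumption about how the platform interprets tags.

```xml
<rule id="userActionNatural" scope="public">
  <!-- GARBAGE absorbs leading filler such as "With the greatest of pleasure" -->
  <ruleref special="GARBAGE"/>
  <!-- optional polite lead-in -->
  <item repeat="0-1">
    <one-of>
      <item>please</item>
      <item>kindly</item>
      <item>I'd like to</item>
      <item>I want to</item>
    </one-of>
  </item>
  <one-of>
    <!-- any transaction matched by the userAction rule is tagged xa -->
    <item><ruleref uri="#userAction"/> <tag>xa</tag></item>
    <!-- leaving the menu is tagged exit -->
    <item>exit <tag>exit</tag></item>
  </one-of>
</rule>
```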
Some of the possible natural-sounding dialogs accepted by our new grammar are:
"Please transfer $22.50 from my savings account to my checking."
"Yes, I'd like to check the balance in my money market account."
"With the greatest of pleasure I'd like to deposit $500 into my savings account."
"At this point I'd like to transfer $20 into my checking account from my savings account."
"What is the balance in my money market account?"
Table 2-2 summarizes some of the popular grammar markup languages and the syntax they use for building and describing grammar rules.
The rule types and meanings used in the table are as follows:
rule name
Shows how to define a rule; "ruleName" is the name assigned to the rule.
rule reference
Shows how one rule can reference another rule (e.g. rule r1 references rule color).
composition
Shows how a rule can be comprised of two or more rules (e.g. rule r1 references the color and object rules).
alternatives
Shows how a rule can be one of a number of choices (e.g. r1 can be either red, blue, or green).
grouping
Shows a combination of a rule with one of the choices (e.g. r1 can be Please open, Please close, or Please delete).
optional grouping
Shows a group with optional elements (e.g. r1 can be Please confirm your answer, or Please confirm answer, or confirm your answer, or confirm answer).
weights
Shows how to specify relative probabilities of elements. Higher values indicate greater likelihood of element occurrence (e.g. blue is more likely to occur than green, which is more likely to occur than red).
grammar tags
Shows how to allow a grammar to return a "{tag}" associated with a word match (e.g. the grammar will return {fruit} if apple matches and {dairy} if milk matches).
repetition
Shows how to specify the number of occurrences allowed (e.g. rule digit can occur one or more, zero or more, or between three and six times).
recursion
Shows how a rule can reference itself (e.g. rule palette references color and palette).
scope
Shows how to specify private or public scope for a rule.
alias
Shows a shorthand for references (e.g. contacts is an alias of http://www.domain.com/names.gram).
rule type | GRXML | ABNF | JSGF | GSL |
---|---|---|---|---|
rule name | `<rule id="ruleName">` | `$ruleName` | `<ruleName>` | `ruleName` |
rule reference | `<rule id="r1"> references the <ruleref uri="#color"/> car </rule>` | `$r1= references the $color car` | `<r1>= references the <color> car` | `R1: (references the) color (car)` |
composition | `<rule id="r1"> The <ruleref uri="#object"/> is the color <ruleref uri="#color"/> </rule>` | `$r1= The $object is the color $color` | `<r1>= The <object> is the color <color>` | `R1: (The) object (is the) color` |
alternatives | `<rule id="r1"> <one-of> <item>red</item> <item>blue</item> <item>green</item> </one-of> </rule>` | `$r1=red \| blue \| green` | `<r1>=red \| blue \| green` | `R1: [red blue green]` |
grouping | `<rule id="r1"> Please <one-of> <item>open</item> <item>close</item> <item>delete</item> </one-of> </rule>` | `$r1=Please (open \| close \| delete)` | `<r1>=Please (open \| close \| delete)` | `R1: (Please) [open close delete]` |
optional grouping | `<rule id="r1"> <item repeat="0-1">Please</item> confirm <item repeat="0-1">your</item> answer </rule>` | `$r1=[Please] confirm [your] answer` | `<r1>=[Please] confirm [your] answer` | `R1: (?Please confirm ?your answer)` |
weights | `<rule id="r1"> <one-of> <item weight="2">red</item> <item weight="8">blue</item> <item weight="4">green</item> </one-of> </rule>` | `$r1= /2/ red \| /8/ blue \| /4/ green` | `<r1>= /2/ red \| /8/ blue \| /4/ green` | `R1: [red~2 blue~8 green~4]` |
grammar tags | `<rule id="r1"> <one-of> <item>apples <tag>fruit</tag></item> <item>milk <tag>dairy</tag></item> </one-of> </rule>` | `$r1= apples{fruit} \| milk{dairy}` | `<r1>= apples{fruit} \| milk{dairy}` | `R1: [[apples] {<result "fruit">} [milk] {<result "dairy">}]` |
repetition (one or more) | `<item repeat="1-"> <ruleref uri="#digit"/> </item>` | `$digit <1->` | `<digit>+` | `+digit` |
repetition (zero or more) | `<item repeat="0-"> <ruleref uri="#digit"/> </item>` | `$digit <0->` | `<digit>*` | `*digit` |
repetition (three to six) | `<item repeat="3-6"> <ruleref uri="#digit"/> </item>` | `$digit <3-6>` | no shorthand | no shorthand |
recursion | `<rule id="palette"> <one-of> <item> <ruleref uri="#color"/> </item> <item> <ruleref uri="#color"/> and <ruleref uri="#palette"/> </item> </one-of> </rule>` | `$palette= $color \| ($color and $palette)` | `<palette>= <color> \| <color> and <palette>` | `palette: [color (color and palette)]` |
scope (public) | `<rule id="r1" scope="public"/>` | `public $r1= ...` | `public <r1>= ...` | `R1:public [...]` |
scope (private) | `<rule id="r2" scope="private"/>` | `private $r2= ...` | `private <r2>= ...` | not available |
alias | `<alias uri="http://www.domain.com/names.gram" name="contacts">` | `alias $(http://www.domain.com/names.gram) $$ contacts` | not available | not supported |
Once grammars are defined, you can begin using them within a VoiceXML application. Let's take the userAction grammar we defined earlier in Example 2-61. We can access this grammar from a VoiceXML document, such as the one shown in Example 2-62. Here we have defined a form with a single field and referenced userAction as the field grammar. When the field has received valid speech matching the grammar, the filled section of the form will be activated. Notice that the field variable mainChoice now holds one of two values, "xa" or "exit", as defined by the grammar tags.
    <?xml version="1.0" encoding="iso-8859-1"?>
    <vxml version="1.0">
      <form id="main">
        <field name="mainChoice">
          <prompt>
            Welcome to Virtual Bank's enhanced services.
            Please say the transaction you would like to perform.
          </prompt>
          <grammar src="userAction.jsgf"/>
        </field>
        <filled>
          <if cond="mainChoice=='xa'">
            Your transaction is being processed.
          <elseif cond="mainChoice=='exit'"/>
            Exiting the transaction menu.
            <exit/>
          </if>
        </filled>
      </form>
    </vxml>
In Example 2-62, we used field-level grammars to return grammar results to a VoiceXML field. It is possible to also have form-level grammars. Form-level grammars are useful when an application wants to give the caller the option to give two or more answers for a single question. In the example we will look at, the caller is prompted to say the city and state. The caller can either initially give both answers at once, or if not, he or she will be prompted separately for the city and state.
Example 2-63 shows the grammar for returning the city and state. It is used by the VoiceXML document shown in Example 2-64. The form cityState uses an initial element to get both city and state. The grammar ensures that the values of the grammar tags {city} and {state} are passed to the VoiceXML document. If either is undefined, the field city and/or the field state will be visited to collect the missing information. Using the same name (i.e. city and state) for the grammar tag and the field guarantees the mapping.
    #JSGF V1.0;
    grammar citystate;
    public <cityandstate> = <city> {this.city=$} [<state> {this.state=$}] [please]
                          | <state> {this.state=$} [<city> {this.city=$}] [please];
    <city> = Los Angeles | San Francisco | San Jose | Albany | Buffalo | New York;
    <state> = California | New York;
    <?xml version="1.0" encoding="iso-8859-1"?>
    <vxml version="1.0">
      <var name="docCity" expr="''"/>
      <var name="docState" expr="''"/>
      <form id="cityState">
        <grammar src="citystate.gram" type="application/jsgf"/>
        <initial name="getBoth">
          <prompt>Say the city and state</prompt>
          <filled>
            <prompt>You said both!!</prompt>
            <assign name="getBoth" expr="true"/>
          </filled>
          <nomatch>
            <prompt>What city and state?</prompt>
            <reprompt/>
          </nomatch>
        </initial>
        <field name="city">
          <prompt>The city is?</prompt>
          <filled>
            <assign name="docCity" expr="city"/>
          </filled>
        </field>
        <field name="state">
          <prompt>The state is?</prompt>
          <filled>
            <assign name="docState" expr="state"/>
          </filled>
        </field>
        <block>
          <prompt>
            City is set to <value expr="docCity"/>.
            State is set to <value expr="docState"/>.
          </prompt>
        </block>
      </form>
    </vxml>
The built-in grammars supported by VoiceXML 2.0 are:
digits,
boolean,
currency,
date,
number,
phone,
time.
Setting a field element's type attribute to any of the built-in grammar types controls how the field data is interpreted. In Example 2-65 the ticket_num field has the type digits, which ensures that the input will be interpreted as a string of numeric digits. Built-in grammars must support both voice and DTMF.
    <field name="ticket_num" type="digits">
      <prompt>
        Read the 12 digit number from your ticket.
      </prompt>
      <help>The 12 digit number is to the lower left.</help>
      <filled>
        <if cond="ticket_num.length != 12">
          <prompt>
            Sorry, I didn't hear exactly 12 digits.
          </prompt>
          <clear namelist="ticket_num"/>
        </if>
      </filled>
    </field>
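The other built-in types are used the same way. As a small sketch, a field of type boolean yields an ECMAScript true or false value that can be tested directly in a cond attribute; the field name confirm and the prompts are hypothetical.

```xml
<field name="confirm" type="boolean">
  <prompt>Do you want to hear your ticket number again?</prompt>
  <filled>
    <!-- the boolean built-in fills the field with true or false -->
    <if cond="confirm">
      <prompt>Okay, repeating it now.</prompt>
    <else/>
      <prompt>Okay, moving on.</prompt>
    </if>
  </filled>
</field>
```

In VoiceXML 2.0 the DTMF form of this built-in defaults to 1 for yes and 2 for no.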
As VoiceXML continues to evolve, the specification of built-in grammars may be removed from the VoiceXML specification and added to the grammar language specifications.