Chapter 4. Creating and editing XML documents

Word Power User Task

  • Creating and using schemas

  • Opening a document

  • Validation

  • Working with attributes

  • Saving a document

  • Combining documents

This chapter describes how to use Word to create and edit XML documents. We don’t include WordML documents in this category because Word treats them just like .doc or RTF documents – as native Word documents.[1]

Skills required

Experience using Microsoft Word to perform basic tasks such as creating and editing documents.

Creating and using schemas

The real power of XML lies in using a vocabulary that describes the meaning of the document, not its outward appearance. Example 4-1, for instance, shows Doug’s article marked up using a custom schema called article.[2]

Example 4-1. An article using the article schema (article.xml)

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<article xmlns="http://xmlinoffice.com/article"
         type="sales" id="A123">
  <title>Sales Update</title>
  <author>Doug Jones</author>
  <date>February 3, 2004</date>
  <body>
   <section>
     <header>A great month!</header>
     <para>This month's figures are a <em>huge</em>
improvement over this month last year. We sold 1,342 widgets for a
total revenue of $14,327.</para>
   </section>
   <section>
     <header>More work to do</header>
     <para>Let's not rest on our past success. Let's get out there
and sell, sell, sell!</para>
   </section>
  </body>
</article>

This XML document identifies the meaning of the data elements, not just their location in the document. The author is identified by an <author> tag, and the title is identified by a <title> tag. The document uses a namespace, http://xmlinoffice.com/article, to identify the vocabulary used.

Vocabularies and schemas

A schema defines an XML vocabulary for documents of a particular type (such as article documents). The vocabulary includes the element-type names, such as author and title. The schema also constrains where elements and attributes can appear, and the order in which child elements must occur.

Industry organizations and standards committees have defined XML vocabularies for subjects as varied as computer graphics and accounting statements, many with one or more schemas that employ the vocabularies.[3] Any of those schemas – or any other – can be used with Word; there is no specific set of “supported schemas.” However, a definition of the schema must be available in the W3C XML Schema definition language (XSDL), as no other schema language is supported.

Alternatively, you can define your own vocabulary by writing your own schema. Microsoft Office does not provide a GUI editor for defining a schema; you will have to create yours with a text editor or an available schema editing tool. Chapter 22, “XML Schema (XSDL)”, on page 466 explains schemas in more detail and gives some guidelines for writing your own schemas.

Word uses schemas both to validate documents and to provide hints on the structure of a document while it is being edited.

The article schema

The article schema definition is shown in Example 4-2.

Example 4-2. Schema for article documents (article.xsd)

<?xml version="1.0"?>
<xs:schema targetNamespace="http://xmlinoffice.com/article"
           xmlns="http://xmlinoffice.com/article"
           xmlns:xs="http://www.w3.org/2001/XMLSchema"
           elementFormDefault="qualified">
  <xs:element name="article" type="ArticleType"/>
  <xs:complexType name="ArticleType">
    <xs:sequence>
      <xs:element name="title" type="xs:string"/>
      <xs:element name="author" type="xs:string"/>
      <xs:element name="date" type="xs:string"/>
      <xs:element name="body" type="BodyType"/>
    </xs:sequence>
    <xs:attribute name="id" type="xs:ID"/>
    <xs:attribute name="type" type="xs:string"/>
  </xs:complexType>
  <xs:complexType name="BodyType">
    <xs:sequence>
      <xs:element name="section" type="SectionType"
                  maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="SectionType">
    <xs:sequence>
      <xs:element name="header" type="xs:string"/>
      <xs:element name="para" type="ParaType"
                  maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="ParaType" mixed="true">
    <xs:choice minOccurs="0" maxOccurs="unbounded">
      <xs:element name="em" type="xs:string"/>
      <xs:element name="cite" type="xs:string"/>
      <xs:element name="url" type="xs:string"/>
    </xs:choice>
  </xs:complexType>
</xs:schema>

The schema definition uses xs:element elements to declare the element types allowed in an article document. For example:

<xs:element name="title" type="xs:string"/>

declares that title elements contain data characters (“strings”).

Some of the element types, such as article and section, are complex, which means that their elements can have child elements and/or attributes.

Adding a schema to the library

When Word opens an XML document, it checks whether the document is associated with any of the known schemas in its Schema Library. It does so by comparing the namespace of the document (i.e., of its root element) with the target namespace of each schema in the Schema Library until it finds a match.

For example, the schema in Example 4-2 has a target namespace, specified by the targetNamespace attribute, of:

http://xmlinoffice.com/article

This is the same namespace that is declared for Doug’s article document in Example 4-1. By adding the schema to Word’s Schema Library, we can ensure that it will be associated with Doug’s article or any other article document that is opened in Word.
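To make the matching rule concrete, here is a hypothetical document (not one of the book's example files) that uses the same element-type names as Example 4-1 but declares a slightly different namespace. Assuming Word compares namespace names as exact strings, as XML namespace processing generally does, the article schema would not be associated with this document, because its root namespace is not the schema's target namespace.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical document: the namespace name ends in "articles",
     so it differs from the target namespace of article.xsd -->
<article xmlns="http://xmlinoffice.com/articles"
         type="sales" id="A124">
  <title>Sales Update</title>
  <author>Doug Jones</author>
  <date>February 3, 2004</date>
  <body>
    <section>
      <header>A great month!</header>
      <para>We sold 1,342 widgets.</para>
    </section>
  </body>
</article>
```

Changing the xmlns value back to http://xmlinoffice.com/article is all it would take for the root element to fall in the article namespace again.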

To add a schema:

  1. On the Tools menu, click Templates and Add-Ins.

  2. Click the XML Schema tab, shown in Figure 4-1.

    Figure 4-1. The XML Schema tab

  3. Click Add Schema.

  4. Select the schema file, in this case article.xsd, and click Open. This will bring up the Schema Settings dialog shown in Figure 4-2.

    Figure 4-2. The Schema Settings dialog

  5. Type the word article in the Alias box. This will serve as a nickname for the article namespace. It is good practice to use the document type (i.e. the root element-type name) as the alias.

If you attempt to add an invalid schema, you will be advised that the schema is invalid and prevented from adding it. Once you have added the schema, it will appear in the Available XML schemas list, as shown in Figure 4-3. When it is selected, the pane will show the namespace URI and the path to the schema definition file.

Figure 4-3. Available XML schemas

These settings will be saved in your Word configuration. From now on, every time you open an XML document whose root element is in the http://xmlinoffice.com/article namespace, the article.xsd schema will be used automatically for that document. It is not possible to add more than one schema for a given namespace.

Using the Schema Library

The Schema Library, shown in Figure 4-4, allows you to add and delete schemas, as well as give them mnemonic aliases (usually the document-type name). Word uses the alias as the name of the schema; without one it will use the entire namespace URI.

Figure 4-4. The Schema Library

To access the Schema Library, click Schema Library on the XML Schema tab of the Templates and Add-Ins dialog.

The Schema Library also allows you to associate solutions with schemas. These are XSLT stylesheets that can be used to transform XML documents when Word opens or saves them, as we will see in 5.3.2.1, “Associating stylesheets with schemas”, on page 108.

Opening a document

Once you have added the necessary schemas to the Schema Library, there are several ways to open an XML document for editing:

  • Open an existing or new Word document and mark it up according to a schema.

  • Open an existing XML document.

  • Open an XML template that you’ve created for a particular schema.

First, let’s look at converting Doug’s article.doc to article.xml.

Opening a Word document

To convert a Word document to XML – either an existing one or a new empty one – there are two steps. First you associate a schema with the document, and then you apply markup to its elements.

Associating a schema

A new document, or a Word document that has never been converted to XML, has no schema associated with it. Before you can begin to mark it up, you need to associate a schema. To do this (using article.doc as an example):

  1. Open article.doc in Word.

  2. On the Tools menu, click Templates and Add-Ins.

  3. Click the XML Schema tab.

  4. Check the box next to article and click OK.

You will now see the XML Structure pane on the right side of the Word window, as shown in Figure 4-5.

Figure 4-5. The XML Structure Pane

The XML Structure pane has two parts:

  • The top part shows the hierarchical structure of the current document, based on the elements that have already been identified. When we first apply the schema to article.doc, no structure is shown because the document has not yet been tagged.

  • The bottom part of the XML Structure task pane shows the available element types from the vocabulary. By default, it only shows the types of elements that can be validly inserted at the point in the document where the cursor is positioned.

Tip

If you ever lose the XML Structure task pane, or close it accidentally, click Task Pane on the View menu. Then, click the top bar of the task pane and click XML Structure.

Applying markup to elements

The first order of business is to identify the root element, in this case article. To do this:

  1. Select the word article in the bottom part of the XML Structure task pane.

  2. This will bring up a dialog that asks whether the element to be tagged is the entire document, or the current selection only. Click Apply to Entire Document.

You will notice that a start-tag icon appears at the beginning of the article, and an end-tag icon at the end of the article. In addition, article now appears in the top part of the XML Structure pane, indicating that an article element has been identified in the document.

To tag the rest of the document, you can simply select a section of the text and choose its element-type in the bottom part of the XML Structure task pane. Alternatively, you can select the text and right-click, click Apply XML Element, then choose the element-type name.

Note

You aren’t really “applying an element” to anything; the element is there in the document. Think of the term as shorthand for “apply markup to XML element”.

You can also insert a new element. To do this, position the cursor at the desired location and choose the desired element-type name in the bottom part of the XML Structure task pane. You can then type text between the start-tag and end-tag icons. This procedure is also followed when creating an XML document from a new empty Word document and when adding elements to an existing XML document.

Once marked up, our document looks something like Figure 4-6.

Figure 4-6. A marked up article document

Tip

Start applying markup at the root element and work down the tree structure. For example, tag the body element before either of the section elements. This technique allows the appropriate valid element-types to appear in the bottom part of the XML Structure task pane.

Opening an XML document

You can open and edit any XML document in Word, even documents that have no schema in Word’s Schema Library (or even no schema at all).

To do so, you use the standard procedure for opening a document: that is, clicking Open on the File menu and selecting the document of interest.

The XML document must have an XML declaration as its first line. This is the only way Word will identify it as an XML document; the file extension is ignored. For example:

<?xml version="1.0" encoding="UTF-8"?>

In addition, the document must be well-formed, meaning that it follows all the rules of the XML language.[4] Upon opening the document, Word displays the XML Structure task pane.

Word then attempts to associate a schema with the document, as described in 4.1.3, “Adding a schema to the library”, on page 63. If it fails, it allows all element types found in the document to occur anywhere in the document; no other element types are permitted. You can associate a schema with the document after it is opened, using the procedure described in 4.2.1.1, “Associating a schema”, on page 69.

If there is an associated schema, the document need not be valid with respect to it. Word will flag the errors in the XML Structure pane (see 4.3, “Validation”, on page 74).

Opening a skeleton document

Rather than starting a new XML document from scratch, you can take advantage of a document type’s predictable structure by creating a skeleton document. A skeleton serves the same purpose as a Word template. It is a sample document that an end user can modify to create the actual document he wants.

Figure 4-7 shows a skeleton document for the article schema. It uses placeholder text for all the elements, allowing the article writer simply to fill in the fields. When the user wants to enter content for an element, he just clicks the placeholder text and starts typing, much as if it were a field in a form.

Figure 4-7. Skeleton document for articles (article struct.doc)

Placeholder text is only displayed when the tag icons are not shown, and only for elements that have empty content. Note that Show XML tags in the document is unchecked. Checking it will show the icons and hide the placeholder text (although the text will not be deleted permanently).

You can create a skeleton by starting with an existing XML document and deleting the data content. Alternatively, you can start with a new empty document, associate a schema, and insert the desired elements (again without any data content).

You can then specify placeholder text for each of the elements. When an element is empty, the placeholder text appears where the data content would normally appear.

To specify placeholder text for an element:

  1. Right-click that element in the top part of the XML Structure pane, or in the document pane, and click Attributes on the context menu.

  2. Type the placeholder text at the bottom of the dialog, and click OK.

Note that placeholder text is associated with an individual element in the document, not with an element type. In Figure 4-7, for example, the first header element has the placeholder text “[SECTION HEAD 1]” while the second header element has the placeholder text “[SECTION HEAD 2]”.
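Seen purely as XML, a skeleton for the article schema could look something like the sketch below. It is our own illustration, modeled on Example 4-1 with the data content removed; it is not the content of the file shown in Figure 4-7, and it shows only the article markup, not the placeholder text.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative skeleton: article markup with no data content -->
<article xmlns="http://xmlinoffice.com/article">
  <title></title>
  <author></author>
  <date></date>
  <body>
    <section>
      <header></header>
      <para></para>
    </section>
    <section>
      <header></header>
      <para></para>
    </section>
  </body>
</article>
```

Two section elements are included so that, as in Figure 4-7, each header element can be given its own placeholder text.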

Tip

Choose placeholder text that is clearly not the actual content, and that explains what data to enter if it isn’t obvious. Enclose the text in square brackets to make it look more like a field for data entry.

Validation

You may have noticed that Word validates your document as you mark it up. Word is constantly checking to make sure that the document conforms to the schema.

Schema rules

Rules that a schema can specify include:

  • The allowed children of an element. For example, according to the article schema, there can be only one article element, which can contain title, author, date and body elements.

  • The order of the children. The children of the article element must appear in the order specified in the schema.

  • The number of occurrences of each child. A section element must contain one and only one header element, but it may contain one or more para elements.

  • The presence or absence of character data content in an element. The title element can contain character data, but a section element cannot.

  • The datatype of an element or attribute. The id attribute of the article element must contain a valid ID.

Using datatypes, you can specify additional rules that limit the length of a string, require that a string match a particular pattern, limit a number to a particular range, specify a list of valid values, and apply various other constraints. For more information on the capabilities of datatypes, see Chapter 21, “Datatypes”, on page 442.
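As a rough illustration of such datatype rules, the following schema defines four simple types. All of the type names and limits are hypothetical; none of them appear in article.xsd (Example 4-2), and to use one you would reference it from the type attribute of an element or attribute declaration.

```xml
<?xml version="1.0"?>
<!-- Hypothetical simple types, for illustration only -->
<xs:schema targetNamespace="http://xmlinoffice.com/article"
           xmlns="http://xmlinoffice.com/article"
           xmlns:xs="http://www.w3.org/2001/XMLSchema"
           elementFormDefault="qualified">

  <!-- A list of valid values -->
  <xs:simpleType name="ArticleKindType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="sales"/>
      <xs:enumeration value="marketing"/>
      <xs:enumeration value="engineering"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- A limit on the length of a string -->
  <xs:simpleType name="ShortTitleType">
    <xs:restriction base="xs:string">
      <xs:maxLength value="80"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- A pattern: the letter A followed by exactly three digits -->
  <xs:simpleType name="ArticleIdType">
    <xs:restriction base="xs:ID">
      <xs:pattern value="A[0-9]{3}"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- A number limited to a range -->
  <xs:simpleType name="VolumeType">
    <xs:restriction base="xs:integer">
      <xs:minInclusive value="1"/>
      <xs:maxInclusive value="99"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
```

For instance, declaring the id attribute with type="ArticleIdType" instead of xs:ID would still accept A123 from Example 4-1 but would reject an otherwise valid ID such as sales-01.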

Validity errors

When a document (or part of a document) you are editing is invalid according to the schema, Word shows this in two ways:

  • A purple wavy line (similar to the one that shows spelling and grammatical errors). For errors that span multiple lines, it appears down the left side of the document. Otherwise, it underlines the error.

  • The top part of the XML Structure task pane shows icons next to invalid elements, as described in the next section.

Some of the possible errors that could be found are:

  • The element has invalid content. For example, character data appears where it should not, or the character data it contains is invalid according to its datatype.

  • The element is not expected to appear here. This could be because it is not a valid child, it does not appear in the correct sequence, or a required element that should appear before it is missing.

  • The element is empty, but according to the schema, it should have children, or some required content is missing.

  • One or more of the element’s attributes is missing or invalid.
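The hypothetical document below is well-formed but breaks several rules of the article schema in Example 4-2, one for each kind of error listed above. The comments describe why each spot is invalid according to the schema; they are not the messages Word displays.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Deliberately invalid against article.xsd (Example 4-2) -->
<!-- The id value 123 is not a valid xs:ID: it begins with a digit -->
<article xmlns="http://xmlinoffice.com/article" type="sales" id="123">
  <!-- author appears before title: wrong sequence -->
  <author>Doug Jones</author>
  <title>Sales Update</title>
  <!-- date is missing, so body is not expected at this point -->
  <body>
    <section>
      <!-- character data directly inside section is not allowed -->
      Stray text
      <header>A great month!</header>
      <para>We sold 1,342 widgets.</para>
    </section>
    <section>
      <!-- the required header element is missing -->
      <para>Let's get out there and sell!</para>
    </section>
  </body>
</article>
```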

The XML Structure task pane

Now that we have marked up the elements, let’s take a closer look at the XML Structure pane.

Document structure

The top part of the XML Structure pane shows the structure of your XML document. When you select an element in the task pane, it selects the contents of that element in the document itself. This provides an easy way to navigate the document. In addition, when you position the cursor in the document, the appropriate element is selected on the task pane.

Error signals

When there are errors in the document, a diamond-shaped yellow icon appears next to the element in the task pane. Hovering over the icon will reveal an explanation of the problem.

Available element types

The bottom part of the XML Structure task pane normally lists the element types available for children of the current element. You can see all element-type names in the vocabulary by unchecking the List only child elements of current element box.[5]

Despite the list, you can still add an element that is not valid. That is because the list does not take into account any required sequence of child elements, nor the number of times child elements are allowed to occur. Fortunately, such errors do appear in the top part of the task pane.

Viewing tag icons

There is an alternative view in Word that does not show the tag icons in the document pane. You can switch to this view by unchecking the Show XML tags in the document box on the XML Structure pane. The document will then look like a typical Word document, although the XML Structure pane will still be usable to identify and tag elements shown in the document pane.

Working with attributes

So far, our discussion has focused on elements. XML attributes are less visible in Word. You must go to a separate Attributes dialog to assign, edit or remove values for the attributes of an element.

To assign a value to an attribute:

  1. Right-click the article element, either in its start-tag icon or in the top part of the XML Structure task pane.

  2. Click Attributes. This will bring up the Attributes dialog as shown in Figure 4-8.

    Figure 4-8. The Attributes dialog

  3. In the top part of the dialog, select the attribute for which you want to specify a value. Only the attributes that are declared for the element type are listed. If no schema is associated, the individual element’s actual attributes are shown.

  4. The dialog then displays the type of the attribute. Specify a value in the Value box.

  5. Once you have entered a value, click Add. The button text now turns to Modify, which is the way it is shown in the dialog.

If you enter a value that is invalid for that type, you will receive the message “This attribute value is not allowed.” In our example, the value of the id attribute must conform to the type ID. This means that it must start with a letter or underscore and contain only letters, digits, underscores, hyphens and periods.

You can also modify or delete the value for an attribute by selecting it in the bottom section of the dialog and clicking the appropriate button.

Saving a document

To save a document as XML:

  1. On the File menu, click Save As.

  2. Select XML Document (*.xml) from the Save as type list.

  3. Click Save.

There are two options, depending on whether you check the Save data only box on the Save As dialog.[6]

Caution

The Save data only box is confusingly named since you will be saving markup even if you check it. Think of it as “Save without WordML” or “Save custom XML only”.

Saving XML without WordML

When the Save data only box is checked, none of the Word formatting information is saved. In the case of Doug’s article document, the result is identical to Example 4-1. It contains only the article tags and is valid according to the article.xsd schema.

Any other information has been lost; when you reopen the document in Word, there will be no styles or formatting applied to the text, and the settings such as page margins and tracked changes will be wiped out.

Saving mixed vocabularies

If the Save data only box is not checked, the resulting XML document contains elements from both the WordML and article schemas, mixed together. Namespaces are used to distinguish between the two vocabularies. This option is discussed further in 5.2, “Mixing WordML with other vocabularies”, on page 102.
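To give a feel for what “mixed together” means, here is a heavily abbreviated sketch of such a document. The WordML namespace and the w:wordDocument, w:body, w:p, w:r and w:t element types are part of WordML, but the ns0 prefix and the exact nesting shown here are assumptions made for illustration; real Word output contains many more elements and attributes.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!-- Abbreviated sketch only, not actual Word output -->
<w:wordDocument
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    xmlns:ns0="http://xmlinoffice.com/article">
  <w:body>
    <ns0:article type="sales" id="A123">
      <ns0:title>
        <w:p><w:r><w:t>Sales Update</w:t></w:r></w:p>
      </ns0:title>
      <ns0:author>
        <w:p><w:r><w:t>Doug Jones</w:t></w:r></w:p>
      </ns0:author>
      <!-- date and body elements omitted for brevity -->
    </ns0:article>
  </w:body>
</w:wordDocument>
```

The article elements carry the meaning of the content, while the WordML elements inside them carry the paragraphs and text runs that Word needs to preserve formatting.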

The marked up article document could also be saved as a Word document in the .doc binary format. As with WordML, the .doc representation retains all the article markup. However, it is less useful than WordML because it is not easily processable by other products.

Combining documents

In Chapter 1, “Desktop XML: The reason why”, on page 4, we learned about Denise, who is responsible for putting the newsletter together from many articles. One of the nice things about using XML is that all the articles Denise receives have the same XML markup, making her compilation job easier. Simply by opening the incoming articles in Word, she can validate them against the article schema to ensure that they meet her standards; for example, that they all have a title.

Combining elements from multiple namespaces

Denise has a newsletter skeleton document, shown in Example 4-3, that she uses when creating a newsletter. As we saw in 4.2.3, “Opening a skeleton document”, on page 73, a skeleton is simply an XML document without the data content. Because Denise is the only user, and prefers to see the tag icons so that she knows what is going on with the XML representation, she did not bother with placeholder text for the skeleton.

Example 4-3. Newsletter skeleton document (newsletter temp.xml)

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<newsletter xmlns="http://xmlinoffice.com/newsletter">
  <volume></volume>
  <number></number>
  <date></date>
  <body> <!--body must contain one or more article elements -->
  </body>
</newsletter>

She also has a newsletter schema, shown in Example 4-4, whose target namespace is http://xmlinoffice.com/newsletter. According to the BodyType definition, the body element of a newsletter can contain one or more article elements.

Example 4-4. Newsletter schema (newsletter.xsd)

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<xs:schema targetNamespace="http://xmlinoffice.com/newsletter"
           xmlns="http://xmlinoffice.com/newsletter"
           xmlns:art="http://xmlinoffice.com/article"
           xmlns:xs="http://www.w3.org/2001/XMLSchema"
           elementFormDefault="qualified">
  <xs:import namespace="http://xmlinoffice.com/article"
             schemaLocation="article.xsd"/>
  <xs:element name="newsletter" type="NewsletterType"/>

  <xs:complexType name="NewsletterType">
    <xs:sequence>
      <xs:element name="volume" type="xs:integer"/>
      <xs:element name="number" type="xs:integer"/>
      <xs:element name="date" type="xs:string"/>
      <xs:element name="body" type="BodyType"/>
    </xs:sequence>
  </xs:complexType>

  <xs:complexType name="BodyType">
    <xs:sequence>
      <xs:element ref="art:article" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>

This combining of elements from more than one namespace is not unusual, and Word supports it. When Word opens a newsletter document, it knows it must also use the article schema, because of the import element in the newsletter schema.
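A compiled newsletter that follows both schemas might look like the sketch below. The volume, number and date values are invented for illustration, and the article is a shortened version of Example 4-1. Note how the default namespace is redeclared on the article element, so that the inner title, date and body elements belong to the article vocabulary rather than the newsletter vocabulary.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<newsletter xmlns="http://xmlinoffice.com/newsletter">
  <volume>3</volume>
  <number>2</number>
  <date>February 2004</date>
  <body>
    <!-- From here down, the default namespace is the article namespace -->
    <article xmlns="http://xmlinoffice.com/article"
             type="sales" id="A123">
      <title>Sales Update</title>
      <author>Doug Jones</author>
      <date>February 3, 2004</date>
      <body>
        <section>
          <header>A great month!</header>
          <para>We sold 1,342 widgets.</para>
        </section>
      </body>
    </article>
  </body>
</newsletter>
```

Both vocabularies define date and body element types; the namespaces are what keep the newsletter's body distinct from the article's body.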

Compiling document fragments

Denise has three choices for compiling the articles: cutting and pasting elements, copying and inserting documents, and linking to documents.

Cutting and pasting elements

One straightforward way to put the newsletter together is to cut and paste article elements into the newsletter document. Word supports the cutting and pasting of XML fragments to and from documents.

When Denise opens her newsletter skeleton in Word, it looks something like Figure 4-9. She can then cut the article elements from the various article XML documents that were submitted to her, and paste them into the newsletter document.

Figure 4-9. Word presentation of the newsletter skeleton

Because the newsletter schema requires the children of body to be article elements (i.e. root elements from the article schema), inserting the articles will not cause any errors to be raised. If the elements had been from any other vocabulary, she would still be able to paste them, but errors would be flagged in the XML Structure task pane.

Copying and inserting documents

Another method is to insert entire article documents into the newsletter document. To do this:

  1. Position your cursor in the desired insert location.

  2. On the Insert menu, click File.

  3. Select an article XML file, such as article.xml.

  4. Click Insert.

Word will insert the entire document as a child of the current element. If a default solution stylesheet is specified (in the Schema Library) for the article document, it is applied automatically when the article is inserted into the newsletter.

Linking to documents

Another possibility is to insert links to the article files, so that if the articles are updated in the future, the newsletter will be updated as well. This method is useful if Denise needs to begin compiling the newsletter before the articles are finished. To do this:

  1. Position your cursor in the desired insert location.

  2. On the Insert menu, click File.

  3. Select an article XML file, such as article.xml.

  4. Click the down arrow just to the right of the Insert button.

  5. Click Insert as Link.

To refresh the newsletter after an article file has been updated, you can select the linked article elements and press F9. Alternatively, you can right-click an article start-tag and click Update Field, as shown in Figure 4-10.

Figure 4-10. Refreshing a linked document



[1] Which is not to say that WordML isn’t important. There are a lot of useful things that can be done with WordML, as we’ll show you in Chapter 5, “Rendering and presenting XML documents”, on page 86.

[2] You can download all example files used in the book from the companion Web site, http://www.XMLinOffice.com.

[3] The XML Handbook has a directory of more than 300 such public vocabulary projects.

[4] These are described in Chapter 15, “The XML language”, on page 350.

[5] This dialog can be confusing because it mixes element with element type. For example, the first and third occurrences of “element” are correct while the second should be “element type”. See 20.2, Tag vs. element, on page 431.

[6] The Apply transform check box on the Save as dialog offers another option: to specify a transformation to execute upon saving. It is discussed in 5.3.2.3, “Saving a document using a transformation”, on page 112.
