The XmlValidatingReader class is an implementation of the XmlReader class that provides support for several types of XML validation: document type definitions (DTDs), XML-Data Reduced (XDR) schemas, and XML Schemas. The XML Schema language is also referred to as XML Schema Definition (XSD). DTD and XSD are official recommendations issued by the W3C, whereas XDR is simply the Microsoft implementation of an early working draft of XML Schemas that will be superseded by XSD as time goes by.
You can use the XmlValidatingReader class to validate entire XML documents as well as XML fragments. An XML fragment is a string of XML code that does not have a root node. For example, the following XML string turns out to be a valid XML fragment but not a valid XML document. XML documents must have a root node.
<firstname>Dino</firstname> <lastname>Esposito</lastname>
The XmlValidatingReader class works on top of an XML reader—typically an instance of the XmlTextReader class. The text reader is used to walk through the nodes of the document, and then the validating reader gets into the game, validating each piece of XML based on the requested validation type.
What are the key differences between the validation mechanisms (DTD, XDR, and XSD) supported by the XmlValidatingReader class? Let’s briefly review the main characteristics of each mechanism.
DTD
A DTD is a text file whose syntax stems directly from the Standard Generalized Markup Language (SGML)—the ancestor of XML as we know it today. A DTD follows a custom, non-XML syntax to define the set of valid tags, the attributes each tag can support, and the dependencies between tags. A DTD allows you to specify the children for each tag, their cardinality, their attributes, and a few other properties for both tags and attributes. Cardinality specifies the number of occurrences of each child element.
XDR
XDR is a schema language based on a proposal submitted by Microsoft to the W3C back in 1998. (For more information, see http://www.w3.org/TR/1998/NOTE-XML-data-0105.) XDRs are flexible and overcome some of the limitations of DTDs. Unlike DTDs, XDRs describe the structure of the document using the same syntax as the XML document. Additionally, in a DTD, all the data content is character data. XDR language schemas allow you to specify the data type of an element or an attribute.
XSD
XSD defines the elements and attributes that form an XML document. Each element is strongly typed. Based on a W3C recommendation, XSD describes the structure of XML documents using another XML document. XSDs include an all-encompassing type system composed of primitive and derived types. The XSD type system is also at the foundation of the Simple Object Access Protocol (SOAP) and XML Web services.
DTD was considered the cross-platform standard until a couple of years ago. Then the W3C ratified a newer standard, XSD, which is, technically speaking, far superior to DTD. Today, XSD is supported by almost all parsers on all platforms. Although the support for DTD will not be deprecated anytime soon, you’ll be better positioned if you start migrating to XSD or building new XML-driven applications based on XSD instead of DTD or XDR.
As mentioned, XDR is an early hybrid specification that never reached the status of a W3C recommendation. It then evolved into XSD. The XmlValidatingReader class supports XDR mostly for backward compatibility, as XDR is fully supported by the Component Object Model (COM)–based Microsoft XML Core Services (MSXML).
Note
The .NET Framework provides a handy utility, named xsd.exe, that among other things can automatically convert an XDR schema to XSD. If you pass an XDR schema file (typically, a .xdr extension), xsd.exe converts the XDR schema to an XSD schema, as shown here:
xsd.exe myoldschema.xdr
The output file has the same name as the XDR schema, but with the .xsd extension.
The XmlValidatingReader class inherits from the base class XmlReader but implements internally only a small subset of the functionality that an XML reader exposes. The class always works on top of an existing XML reader, and many methods and properties are simply mirrored.
The dependency of validating readers on an existing text reader is particularly evident if you look at the class constructors. An XML validating reader, in fact, can’t be directly initialized from a file or a URL. The list of available constructors comprises the following overloads:
public XmlValidatingReader(XmlReader);
public XmlValidatingReader(Stream, XmlNodeType, XmlParserContext);
public XmlValidatingReader(string, XmlNodeType, XmlParserContext);
A validating reader can parse only XML documents for which a reader is provided, plus XML fragments accessible through a string or an open stream. In the section “Under the Hood of the Validation Process,” we’ll look more closely at the internal architecture of an XML validating reader. In the meantime, let’s analyze more closely the programming interface of such a class, starting with properties.
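As a minimal sketch of the string-based constructor, the following code reads the fragment shown earlier and counts its element nodes. The class name FragmentReaderSketch and the method CountElements are hypothetical names chosen for illustration, and validation is switched off so that the example focuses only on how a fragment is handed to the reader.

```csharp
using System;
using System.Xml;

#pragma warning disable 618 // XmlValidatingReader is flagged obsolete in later framework versions

// Hypothetical helper, not part of the .NET Framework
public static class FragmentReaderSketch
{
    // Counts the element nodes found in an XML fragment (no root node required)
    public static int CountElements(string fragment)
    {
        // Fragments need a parser context; this one carries only an
        // empty namespace manager and default settings
        NameTable nameTable = new NameTable();
        XmlNamespaceManager nsManager = new XmlNamespaceManager(nameTable);
        XmlParserContext context =
            new XmlParserContext(null, nsManager, null, XmlSpace.None);

        // XmlNodeType.Element tells the reader to expect element content
        XmlValidatingReader reader =
            new XmlValidatingReader(fragment, XmlNodeType.Element, context);
        reader.ValidationType = ValidationType.None;

        int count = 0;
        try
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element)
                    count++;
            }
        }
        finally
        {
            reader.Close();
        }
        return count;
    }
}
```

Calling FragmentReaderSketch.CountElements with the string "<firstname>Dino</firstname> <lastname>Esposito</lastname>" walks both top-level elements even though the text has no root node.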
Table 3-1 lists the key public properties exposed by the XmlValidatingReader class. This table does not include those properties defined in the XmlReader base class for which the XmlValidatingReader class simply mirrors the behavior of the underlying reader. Refer to Chapter 2 for more information about the base properties of XmlReader.
The validating reader uses the underlying reader to move around the document and implements most of its XmlReader-derived properties by simply mirroring the corresponding properties of the worker reader.
Table 3-2 lists the methods exposed by the XmlValidatingReader class that are either new or whose behavior significantly differs from the corresponding methods of the XmlReader class.
As you can see, the programming interface of the XmlValidatingReader class does not explicitly provide a single method that can validate the entire contents of a document. The validating reader works incrementally, node by node, as the underlying reader does. Each validation error found along the way results in a particular event notification being returned to the caller application. The application is then responsible for defining an ad hoc event handler and behaving as needed.
The XmlValidatingReader class contains a public event named ValidationEventHandler, which is defined as follows:
public event ValidationEventHandler ValidationEventHandler;
This event is used to pass information about any DTD, XDR, or XSD schema validation errors that have been detected. The handler for the event (also named ValidationEventHandler) has the following signature:
public delegate void ValidationEventHandler(
    object sender,
    ValidationEventArgs e
);
The ValidationEventArgs class is described by the following pseudocode:
public class ValidationEventArgs : EventArgs
{
    public XmlSchemaException Exception;
    public string Message;
    public XmlSeverityType Severity;
}
The Message field returns a description of the error. The Exception field, on the other hand, returns an ad hoc exception object (XmlSchemaException) with details about what happened. The schema exception class contains information about the line that originated the error, the source file, and, if available, the schema object that generated the error. The schema object (the SourceSchemaObject property) is available for XSD validation only.
The Severity field represents the severity of the validation event. The XmlSeverityType defines two levels of severity—Error and Warning. Error indicates that a serious validation error occurred when processing the document against a DTD, an XDR, or an XSD schema. If the current instance of the XmlValidatingReader class has no validation event handler set, an exception is thrown. Typically, a warning is raised when there is no DTD, XDR, or XSD schema to validate a particular element or attribute against. Unlike errors, warnings do not throw an exception if no validation event handler has been set.
Let’s see how to validate an XML document. As mentioned, the XmlValidatingReader class is still a reader class, so it proceeds with an incremental validation as nodes are actually read. The caller is notified of any schema exception found for a node by raising the ValidationEventHandler event. This section describes in detail how to validate an XML document, including initializing an XML reader, handling validation errors, and setting and detecting the validation types.
To validate the contents of an XML file, you must first create an XML text reader to work on the file and then use this reader to initialize an instance of a validating reader. A validating reader can be initialized using a live instance of an XmlReader class (typically, an XmlTextReader object) or using an XML fragment taken from a stream or a memory string. The following code shows the first approach:
XmlTextReader _coreReader = new XmlTextReader(fileName);
XmlValidatingReader reader = new XmlValidatingReader(_coreReader);
You move around the input document using the Read method as usual. Actually, you use the validating reader as you would any other .NET XML reader. At each step, however, the structure of the currently visited node is validated against the specified schema; if an error is found, the ValidationEventHandler event fires or, when no handler is registered, an exception is thrown.
To validate an entire XML document, you simply loop through its contents, as shown here:
private bool ValidateDocument(string fileName)
{
    // Initialize the validating reader
    XmlTextReader _coreReader = new XmlTextReader(fileName);
    XmlValidatingReader reader = new XmlValidatingReader(_coreReader);

    // Prepare for validation
    reader.ValidationType = ValidationType.Auto;
    reader.ValidationEventHandler += new ValidationEventHandler(MyHandler);

    // Parse and validate all the nodes in the document
    while(reader.Read()) {}

    // Close the reader
    reader.Close();
    return true;
}
The ValidationType property is set to the default value—ValidationType.Auto. In this case, the reader determines what type of validation (DTD, XDR, or XSD) is required by looking at the contents of the file. The caller application is notified of any error through a ValidationEventHandler event. In the preceding code, the MyHandler procedure runs whenever a validation error is detected, as shown here:
private void MyHandler(object sender, ValidationEventArgs e)
{
    // Logs the error that occurred
    PrintOut(e.Exception.GetType().Name, e.Message);
}
Figure 3-1 shows the output of the sample program ValidateDocument. The list box tracks down all the errors that have been detected. The complete code listing for the sample application showing how to set up a validating parser is available in this book’s sample files.
When you’ve finished with the validation process, you close the reader using the Close method. This operation also resets the reader’s internal state to Closed. Closing the validating reader automatically closes the underlying text reader. However, no exception is raised if you also attempt to programmatically close the internal reader. The Close method simply returns when it is called on a reader that is already closed.
If you need to know the details of validation errors, you must necessarily define an event handler and pass it along to the validating reader. Whenever an error is found, the reader fires the event and then continues to parse. As a result, the event fires for all the errors detected, thus giving the caller application a chance to handle the errors separately.
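The following sketch illustrates this collect-everything behavior. It assumes a hypothetical class named ValidationErrorCollector that stores one formatted entry per validation event, and it requests DTD validation explicitly rather than relying on Auto mode. The source is passed as a TextReader so that the same code works for files and in-memory strings.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Schema;

#pragma warning disable 618 // XmlValidatingReader is flagged obsolete in later framework versions

// Hypothetical helper that gathers every validation event raised by the reader
public class ValidationErrorCollector
{
    private readonly List<string> m_entries = new List<string>();

    // One formatted entry per validation event, in document order
    public IList<string> Entries
    {
        get { return m_entries; }
    }

    // Runs once per validation event; the reader keeps parsing afterward
    private void OnValidationEvent(object sender, ValidationEventArgs e)
    {
        m_entries.Add(String.Format("{0} (line {1}, position {2}): {3}",
            e.Severity,
            e.Exception.LineNumber,
            e.Exception.LinePosition,
            e.Message));
    }

    // Walks the whole document so that all events are collected
    public void Validate(TextReader source)
    {
        XmlTextReader coreReader = new XmlTextReader(source);
        XmlValidatingReader reader = new XmlValidatingReader(coreReader);
        reader.ValidationType = ValidationType.DTD;
        reader.ValidationEventHandler +=
            new ValidationEventHandler(OnValidationEvent);
        try
        {
            while (reader.Read()) {}
        }
        finally
        {
            reader.Close();
        }
    }
}
```

Because the handler never stops the loop, a document with several problems produces several entries, which the caller can then display or log as needed.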
In some situations, you might want to know simply whether a given XML document complies with a given schema. In this case, you don’t need to know anything about the error other than the fact that it occurred. The following code provides a class with a static method named ValidateXmlDocument. This method takes the name of an XML file, figures out the most appropriate validation schema, and returns a Boolean value.
using System;
using System.Xml;
using System.Xml.Schema;

public class XmlValidator
{
    private static bool m_isValid = false;

    // Handle any validation errors detected
    private static void ErrorHandler(object sender, ValidationEventArgs e)
    {
        // Go on in case of warnings
        if (e.Severity == XmlSeverityType.Error)
            m_isValid = false;
    }

    // Validate the specified XML document (using Auto mode)
    public static bool ValidateXmlDocument(string fileName)
    {
        XmlTextReader _coreReader = new XmlTextReader(fileName);
        XmlValidatingReader reader = new XmlValidatingReader(_coreReader);
        reader.ValidationType = ValidationType.Auto;
        reader.ValidationEventHandler +=
            new ValidationEventHandler(XmlValidator.ErrorHandler);

        // Parse the document
        try
        {
            m_isValid = true;
            while(reader.Read() && m_isValid) {}
        }
        catch
        {
            m_isValid = false;
        }

        reader.Close();
        return m_isValid;
    }
}
The ValidateXmlDocument method loops through the nodes of the document until the internal member m_isValid is false or the end of the stream is reached. The m_isValid member is set to true at the beginning of the loop and changes to false the first time an error is found. At this point, the document is certainly invalid, so there is no reason to continue looping.
Because the ValidateXmlDocument method is declared static (or Shared in Microsoft Visual Basic .NET), you don’t need a particular instance of the base class to issue the call, as shown here:
if(!XmlValidator.ValidateXmlDocument("data.xml"))
    MessageBox.Show("Not a valid document!");
Note
The reader’s internal mechanisms responsible for checking a document’s well-formedness and schema compliance are distinct. So if a validating reader happens to work on a badly formed XML document, no event is fired, but an XmlException exception is raised.
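To make the distinction concrete, the following sketch reports the two kinds of failure separately. The CheckResult enumeration and the DocumentChecker class are hypothetical names introduced for this example, and the code assumes DTD validation of an in-memory string.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Schema;

#pragma warning disable 618 // XmlValidatingReader is flagged obsolete in later framework versions

// Hypothetical result type for this sketch
public enum CheckResult
{
    Valid,      // well-formed and compliant with the DTD
    Invalid,    // well-formed, but at least one validation error was reported
    Malformed   // not well-formed; parsing stopped with an XmlException
}

public static class DocumentChecker
{
    public static CheckResult Check(string xml)
    {
        bool hasErrors = false;
        XmlValidatingReader reader =
            new XmlValidatingReader(new XmlTextReader(new StringReader(xml)));
        reader.ValidationType = ValidationType.DTD;

        // Validation problems arrive through the event; warnings are ignored
        reader.ValidationEventHandler +=
            delegate(object sender, ValidationEventArgs e)
            {
                if (e.Severity == XmlSeverityType.Error)
                    hasErrors = true;
            };

        try
        {
            while (reader.Read()) {}
        }
        catch (XmlException)
        {
            // Well-formedness problems bypass the event and surface here
            return CheckResult.Malformed;
        }
        finally
        {
            reader.Close();
        }

        return hasErrors ? CheckResult.Invalid : CheckResult.Valid;
    }
}
```

Catching XmlException specifically, instead of using a bare catch block, lets the caller tell a broken document apart from one that merely fails validation.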
The ValidationType property indicates what type of validation must be performed on the current document. To be effective, the property must be set before the first call to Read. Setting the property after the first call to Read throws an InvalidOperationException exception. If no value is explicitly assigned to the property, it defaults to the ValidationType.Auto value.
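The timing rule can be observed with a short experiment. In the following sketch, the class ValidationTypeTiming and its method are hypothetical names; the code sets the property once before reading, which is allowed, and once after the first Read, which is expected to fail.

```csharp
using System;
using System.IO;
using System.Xml;

#pragma warning disable 618 // XmlValidatingReader is flagged obsolete in later framework versions

// Hypothetical helper used only to demonstrate the timing rule
public static class ValidationTypeTiming
{
    // Returns true if changing ValidationType after Read is rejected
    public static bool SettingAfterReadThrows(string xml)
    {
        XmlValidatingReader reader =
            new XmlValidatingReader(new XmlTextReader(new StringReader(xml)));
        try
        {
            // Allowed: the reader has not started reading yet
            reader.ValidationType = ValidationType.None;

            reader.Read();

            try
            {
                // Too late: the first node has already been read
                reader.ValidationType = ValidationType.DTD;
                return false;
            }
            catch (InvalidOperationException)
            {
                return true;
            }
        }
        finally
        {
            reader.Close();
        }
    }
}
```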
The ValidationType enumeration defines all the feasible values for the property, as listed in Table 3-3.
Type | Description |
---|---|
None | Creates a nonvalidating reader and ignores any validation errors |
Auto | Determines the most appropriate type of validation by looking at the contents of the document |
DTD | Validates according to the specified DTD |
Schema | Validates according to the specified XSD schemas, including in-line schemas |
XDR | Validates according to XDR schemas, including in-line schemas |
When the validation type is set to Auto, the reader first attempts to locate a DTD declaration in the document. The DTD validation always takes precedence over other validation types. If a DTD is found, the document is validated accordingly. Otherwise, the reader looks for an XSD, either referenced or in-line. If no XSD is found, the reader makes a final attempt to find a referenced or an in-line XDR schema. If a schema is still not found, a nonvalidating reader is created. If more than one validation schema is specified in the document, only the first occurrence, in accordance with the order just discussed, is taken into account.
When the ValidationType property is set to Auto, you know at the end of the process whether the semantics of your XML document are valid. But valid against which schema? The Auto mode forces the parser to make various attempts until a validation schema type is found in the source code—whether it be DTD, XSD, or XDR. Is there a way to know what type of validation the parser is actually performing when working in Auto mode?
The validating reader class provides no help on this point, but with a bit of creativity you can easily identify the information you need. This information is not directly exposed, but it is right under your nose and can be inferred from the node type and the schema type without too much effort.
If the parser detects a node of type DocumentType, it can only be validating against a DTD. By definition, the DOCTYPE node must appear outside the information set (infoset). If no DOCTYPE node is found, check whether the SchemaType property evaluates to an XmlSchemaType object. This can happen only if an XML Schema Object Model (SOM) has been created, and hence only if XSD validation is taking place. The XmlSchemaType object has even more in store. By checking the contents of the SourceUri property, you can also determine whether the schema is in-line or a reference. If the schema is in-line, the SourceUri property matches the URI of the XML document being processed. Finally, if the validation type is neither DTD nor XSD, it can only be XDR! The following source code illustrates a function that determines the actual validation type:
// Requires System.IO (for the Path class) in addition to System.Xml and System.Xml.Schema
string GetActualValidationType(XmlValidatingReader reader, string filename)
{
    string realValidationType = "";
    if(reader.ValidationType == ValidationType.Auto)
    {
        if(reader.NodeType == XmlNodeType.DocumentType)
            realValidationType = "Auto.DTD";
        else
        {
            if(reader.SchemaType is XmlSchemaType)
            {
                XmlSchemaType xst = (XmlSchemaType) reader.SchemaType;
                string xsd = Path.GetFileName(xst.SourceUri);
                string doc = Path.GetFileName(filename);
                if (xsd == doc)
                    realValidationType = "Auto.Schema.Inline";
                else
                    realValidationType = "Auto.Schema.Ref (" + xsd + ")";
            }
        }
    }
    return realValidationType;
}
This code alone is not sufficient to produce the desired effect. It must be used in combination with the main parsing loop, as shown in the following code. The function should be called from within the loop as you read nodes, and at the end of the loop, you should check the results. If neither DTD nor XSD has been detected, the document can be validated only through XDR.
string valtype = "";
while(reader.Read())
{
    if (valtype == "")
        valtype = GetActualValidationType(reader, filename);
}

// No DTD, no XSD, so it must be XDR...
if (valtype == "" && reader.ValidationType == ValidationType.Auto)
    valtype = "Auto.XDR";
Figure 3-2 shows how the ValidateDocument application implements this feature.
Although it’s easy to use, the Auto option is the most expensive of all in terms of performance because it must first figure out what type of validation to apply. Whenever possible, you should indicate explicitly the type of validation required.
Note
When the ValidationType property is set to None, the DTD-specific DOCTYPE node, if present, is not used for validation purposes. However, default attributes in the DTD are correctly reported. General entities are not automatically expanded but can be resolved using the ResolveEntity method.
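The following sketch illustrates the point about default attributes. It assumes a small in-memory document whose DTD declares a default value for a lang attribute; the class name DefaultAttributeSketch and the note element are hypothetical and serve only this example.

```csharp
using System;
using System.IO;
using System.Xml;

#pragma warning disable 618 // XmlValidatingReader is flagged obsolete in later framework versions

// Hypothetical helper showing default attributes under ValidationType.None
public static class DefaultAttributeSketch
{
    // Sample document: the DTD supplies "en" as the default for lang,
    // and the note element itself carries no attributes
    public const string SampleXml =
        "<!DOCTYPE note [" +
        "<!ELEMENT note (#PCDATA)>" +
        "<!ATTLIST note lang CDATA \"en\">" +
        "]>" +
        "<note>Hello</note>";

    // Returns the value of lang as reported by a nonvalidating reader,
    // or null if the note element is never found
    public static string ReadLangAttribute(string xml)
    {
        XmlValidatingReader reader =
            new XmlValidatingReader(new XmlTextReader(new StringReader(xml)));
        reader.ValidationType = ValidationType.None;
        try
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element &&
                    reader.Name == "note")
                {
                    return reader.GetAttribute("lang");
                }
            }
            return null;
        }
        finally
        {
            reader.Close();
        }
    }
}
```

No validation takes place in this mode, so a document that violates the DTD would be read without any event being fired; only the attribute defaults are taken from the DOCTYPE.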
The typical way to detect validation errors is by means of a validation event handler. If a validation event handler is specified, no validation exception is ever raised. In practice, once the reader has found an error, it looks for an event handler. If a handler is found, the reader raises the event; otherwise, it throws an XmlSchemaException exception.
For the reader class, handling an exception is much more expensive than firing an event, so use the ValidationEventHandler event whenever possible and do not abuse exceptions. Using exceptions automatically stops the validation process after the first error. As the ValidateXmlDocument method shown earlier demonstrates, you can obtain the same behavior from the event by using a slightly smarter Boolean guard for the loop. Instead of using the following statement:
while(reader.Read());
you resort to this:
while(reader.Read() && !m_errorFound)
where the m_errorFound private member is updated in the body of the event handler according to what you want to do.
So far, we’ve looked exclusively at how the validation process works for XML readers. But what about the XmlDocument class for XML Document Object Model (XML DOM) parsing? How can you validate against a schema while building an XML DOM? We’ll examine XML DOM classes in detail in Chapter 5, but for now a quick preview, limited to validation, is in order.
The XmlDocument class—the key .NET Framework class for XML DOM parsing—uses the Load method to parse the entire contents of a document into memory. The Load method does not validate the XML source code against a DTD or a schema, however—Load can only check whether the XML is well-formed.
If you want to validate the in-memory tree while building it, use the following overload for the XmlDocument class’s Load method:
public virtual void Load(XmlReader);
You can create an XML DOM from a variety of sources, including a stream, a text reader, and a file name. If you load the document through an XML validating reader, you hit your target and obtain a fully validated in-memory DOM, as shown here:
XmlTextReader _coreReader = new XmlTextReader(fileName);
XmlValidatingReader reader = new XmlValidatingReader(_coreReader);
XmlDocument doc = new XmlDocument();
doc.Load(reader);
As you’ll see in Chapter 5, in the .NET Framework, an XML DOM is built using an internal reader. The programming interface of the XmlDocument class, however, in some cases allows you to specify the reader to use. If this reader happens to be a validating reader, you are automatically provided with a fully validated in-memory DOM.
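When you load a DOM this way, registering a validation event handler is still worthwhile; without one, the first validation error interrupts the Load call with an exception. The following sketch assumes a hypothetical ValidatedDomLoader class that builds the DOM from an in-memory string with DTD validation and reports how many errors were found along the way.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Schema;

#pragma warning disable 618 // XmlValidatingReader is flagged obsolete in later framework versions

// Hypothetical helper that builds a DOM while validating against a DTD
public static class ValidatedDomLoader
{
    // Returns the loaded document; errorCount receives the number of
    // validation errors (warnings are not counted)
    public static XmlDocument Load(string xml, out int errorCount)
    {
        int errors = 0;
        XmlValidatingReader reader =
            new XmlValidatingReader(new XmlTextReader(new StringReader(xml)));
        reader.ValidationType = ValidationType.DTD;

        // With a handler in place, Load continues past validation errors
        reader.ValidationEventHandler +=
            delegate(object sender, ValidationEventArgs e)
            {
                if (e.Severity == XmlSeverityType.Error)
                    errors++;
            };

        XmlDocument doc = new XmlDocument();
        try
        {
            // The DOM is built node by node as the reader validates
            doc.Load(reader);
        }
        finally
        {
            reader.Close();
        }

        errorCount = errors;
        return doc;
    }
}
```

The caller can inspect errorCount after the call and decide whether to keep working with the returned DOM or discard it.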