2

XML from the Builder’s Perspective: Using XML to Support DM

Chapter Overview

We use this chapter to present an overview of how XML works and how it has been used to support DM from the system builder’s perspective. We also want to introduce some concepts that are specific to XML, and are used repeatedly in XML-based solutions. We will start by introducing XML terminology and the structural characteristics of XML. We will then outline the basic XML “ground rules” about how the documents are structured. We will not spend much time on the actual structure and syntax of XML documents since that is a topic very capably handled by a number of other texts.* We will, however, provide a set of basic characteristics of XML documents, and point out how they relate to data management and what they say about XML as a language. Toward the end of the chapter, we will cover a number of different uses of XML, from its use as an integration language, to performing data translation, and a host of other examples, along with discussions of their business benefits. By the end of the chapter, you should have a firm idea not only of what XML is and how it works, but where it has been successfully used in the past to solve specific problems.

XML Builder’s Overview

Before jumping into the terms, it will be useful to note that we expect users to already be aware of the wealth of information, shareware, and standards upon which to build. There already exist metadata markets, XML databases, stacks of standards, lots of free software, interoperability testing, etc. In this section, our discussion begins with a number of XML-related terms and definitions, along with descriptions of how the concepts behind them benefit data management. Understanding these concepts is important in grasping the “hows” and the “whys” of XML’s impact on data management.

XML Terms

Embedded Tagging

When people refer to “embedded tagging,” they are talking about the powerful ability to add structure to data elements—grouping them in various ways to provide support for organizational objectives. Knowledge workers are familiar with relational and tabular data storage using the Relational Database Management System, or RDBMS. While XML is fully capable of representing these types of data, most XML documents are stored hierarchically. An XML document is simply a text document that contains at least one XML element (pair of tags). Embedded tagging is the process of including sub-elements within a primary data element in a document. This means that any data element in an XML document can have subordinate data elements underneath it, which inherit the higher-level “container” element properties. As an example, consider a communications framework expressed in XML. The communications system uses the framework to keep track of who communicated with whom, and when the communication took place. All of this is metadata and it should be maintained using modern data management technologies. The actual content of the communication (in this case, a quick note) is stored in a separate location and is often the only portion of the system to use any sort of database at all. Figure 2.1 shows how the data and the metadata can be integrated and structured into a result.


Figure 2.1 Combining two different XML documents into a single, unified document.

As the need arises to have a more complete picture of a particular communication that has taken place, it would be helpful to combine the two pieces of data into a unified document. Figure 2.1 shows how the XML message is combined with the framework to produce a resulting document that contains all of the data of both. While this should be considered a reasonably advanced use of XML, it really isn’t that complex. The XML message is “embedded” inside the <content> element of the framework document. In this way, XML documents are quite like Russian nesting dolls; inside of one XML element is another, which itself contains another, and so on. With nesting dolls, there is a limit based on how small or large a doll you can actually pick up and hold, but with XML, there is no predefined limit.
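A minimal sketch of the kind of embedding Figure 2.1 describes; the element names here are invented for illustration rather than taken from an actual framework:

<communication>
  <from>jsmith</from>
  <to>mjones</to>
  <timestamp>2003-06-14T09:30:00</timestamp>
  <content>
    <message>
      <subject>Quick note</subject>
      <body>The quarterly numbers are ready for review.</body>
    </message>
  </content>
</communication>

The <message> document could just as easily be extracted again, leaving the surrounding metadata intact.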

Tag embedding in general simply allows XML documents to be mixed and matched, combined, taken apart, and reassembled. Just as we inserted one document into another in this example, we might also have extracted one document from another. The business value that arises from this is that no direct one-to-one mapping from an XML document to a data source or system is necessary. One document may represent several sources, or several XML documents may be combined to represent one source. The impact to data management is that data can be structured according to how it will be understood and used, rather than according to artificial platform or representational constraints.

Meta-Language

A meta-language is a language that describes another language. XML is a meta-language that has several important capabilities:

• Referenceability: Data in one XML document can refer to the element names in a different XML document.

• Structurability: The ability to nest itself. As described in the section on embedded tagging, XML documents can be packaged up and included in larger documents.

• Layerability: The ability to layer itself. The core XML language is quite small and compact. Other technologies in the XML “family” are all built on the features of base XML. With each additional layer of technology, the toolset becomes more powerful and flexible.

Data managers understand that XML provides two specific opportunities. Organizations can develop languages for very specific tasks, tailoring their language to the requirements at hand, and they can also develop “bridge languages” to permit communication in ways that are less cumbersome than strict application of standards.

XML Parser/XML Processor

The term “XML parser” is frequently heard in discussions of XML. Another term frequently heard in its place is “XML processor.” An XML parser is simply a piece of software that is designed to take the raw text of an XML document and convert it into a data structure that can be used by just about any tool that is capable of dealing with XML. Parsers are used to get data out of a document for actual use. There are generally two different types of parsers: SAX parsers and DOM parsers.

SAX parsers tend to present the data to the application that is using the parser as a sequence of individual events. Those events might be a particular element in the document, or data within an element. DOM parsers read the entire XML document and represent it as a hierarchical tree internally. Basically, SAX parsers allow for very selective, sequential access to XML documents, while DOM parsers allow for global random access to documents. Broadly speaking, SAX parsers tend to be faster and more memory efficient, while DOM parsers provide more options for accessing data in documents. It is important to note that a parser acts as the intermediary between an XML document and the software that uses the document for some purpose. This piece of software is the “magic” that happens between XML that is stored somewhere and data from the document that an application is using. In other words, the parser makes the data in the document available in terms that the application understands.
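To make the distinction concrete, here is a minimal sketch using the SAX and DOM parsers from Python’s standard library; the <flight> document is a hypothetical example:

import xml.sax
import xml.dom.minidom

doc = ('<flight number="559">'
       '<route from="Washington" to="New York City"/>'
       '<departure>9:30</departure>'
       '</flight>')

class EventPrinter(xml.sax.ContentHandler):
    # SAX: the parser pushes a stream of events (start tags, text, end
    # tags) to this handler as it reads the document.
    def startElement(self, name, attrs):
        print("start of element:", name, dict(attrs))

xml.sax.parseString(doc.encode("utf-8"), EventPrinter())

# DOM: the parser reads the whole document and hands back a tree that
# can be navigated randomly.
tree = xml.dom.minidom.parseString(doc)
print(tree.getElementsByTagName("departure")[0].firstChild.data)  # 9:30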

For data managers, the parser is the most basic piece of XML software necessary to make XML documents useful. Fortunately, most XML toolkits and just about any application that knows how to deal with XML either comes with its own built-in parser, or uses one that may already be distributed with the platform being used.

Vocabularies

The vocabulary of an XML document often refers to the concepts, elements, and attributes that are present in the document. Since XML elements and attributes are just text names, it is possible to label each item of data in a document with a name that best represents the data’s meaning or applicability. The vocabulary of a document is then the collection of all of the elements that are found in a document, along with their meanings. XML as a whole has a way to formalize and express these vocabularies—this is usually done in the form of a DTD (Document Type Definition) or an XML Schema, although these two are only the most recognized among many options.

Vocabularies are important because they are intricately tied to the meaning of the XML document. When converting from one form of XML document to another, it is crucial to understand the vocabulary of both documents in order to make sure that the document comparison and conversion is apples-to-apples, and not apples-to-oranges. This problem arises when similar terms mean different things in different contexts, as described in Chapter 1’s example of “secure the building.” Data managers are often justifiably interested in these vocabulary reconciliation issues, since they are crucial to ironing out issues of semantics in data.

XML-Wrapped

The term “XML-wrapped” is often used as a buzzword, but at its core, it just means that the data in question has XML elements surrounding key data items, and that the document has been given some sort of XML structure. It is important to keep in mind that just because data is “XML-wrapped” does not mean that it is useful or easy to work with, just that the data is represented as XML.
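For instance, a raw delimited record and an “XML-wrapped” version of the same record might look like the following; this is a hypothetical example, and being wrapped says nothing about whether the tags chosen are good ones:

559,Washington,New York City,9:30

<flight>
  <number>559</number>
  <origin>Washington</origin>
  <destination>New York City</destination>
  <departure>9:30</departure>
</flight>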

Many data managers have seen quite a bit of data that is coming “XML-wrapped,” and will see even more of it as time goes on. Data managers can understand “XML-wrapped” to mean that someone along the line is attempting to pair meaning with his or her data.

HTML Drawbacks

Many people have pointed out that visually, XML looks quite similar to HTML, so they wonder why XML was even needed. Why couldn’t they just stick with HTML?

“What’s odd about HTML is that while it does a perfectly good job of describing how a web page will look and function and what words it will contain, HTML has no idea at all what any of those words actually mean. The page could be a baby food ad, or plans to build an atomic bomb. While HTML knows a great deal about words, it knows nothing at all about information.”—Robert X. Cringely

When representing data in computers, there are essentially two bases to cover—the structure and the semantics. The structure of the data is physically how it is put together, and it includes what format particular data items are in, and where they occur. Facts paired with context and meaning produce data. The semantics of the data are what the data actually mean—is the isolated piece of data “1.86” a person’s height expressed in meters? Is it the price of a pound of fruit? Is it a particular company’s yearly revenue, in billions of dollars? The semantics represent the rhyme and reason of the data. The problem with HTML is that it does not cover semantics at all. And as for structure, HTML only knows how to represent the visual structure of the data and nothing else.
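To make the contrast concrete, consider how the same fact might appear in the two languages; this is a hypothetical fragment:

<!-- HTML: the markup is purely visual; "1.86" could mean anything -->
<b>1.86</b>

<!-- XML: the markup carries the semantics -->
<height unit="meters">1.86</height>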

In summary, the drawbacks of HTML include the following:

• There is no effective way to identify the content of a page. HTML tags describe the visual layout of a page. Web browsers use the tags for presentation purposes, but the actual text content has no meaning associated with it. To a browser, text is only a series of words to be presented visually.

• HTML creates problems locating content with search engines. Because there is no semantic data, there is no automatic way that search engines can determine meaning, except by indexing relevant words. Search engines have improved tremendously over the past few years, but clever search algorithms still occasionally fail, and are no substitute for representing data properly.

• Many capabilities require special extensions to HTML. In order to perform more complex functions such as dynamic content, special extensions are needed that don’t necessarily have much in common with HTML. For example, ask your local web developer, and you may find that he or she is familiar with a number of different computer languages, all of which have different syntaxes that must be cobbled together to produce the desired effect. Examples of this are DHTML, JavaScript, CSS, and others.

• HTML has encouraged “bad habits.” In the early days of HTML, browsers and other software that interpreted HTML came with a variety of different exceptions and special cases to correctly interpret badly written HTML that didn’t conform to the standard. As users got accustomed to this, taking shortcuts in HTML became the norm, to the point that the standard has warped slightly. Allowing bad habits in HTML has had the strange side effect of turning the parsing of what was originally a very simple markup language into an incredibly complex task.

XML “Rules”

Before diving into a discussion of the rules associated with XML documents, let us first take a look at the technological foundations of XML. The Extensible Markup Language can be considered a descendant of an earlier markup language, SGML (Standard Generalized Markup Language). Developed in 1974 by Charles F. Goldfarb and a host of others, SGML was meant to create a single basis for any type of markup language. In 1986 it became an international standard (ISO), and was in use in numerous organizations for many years. XML can be thought of as a subset of SGML. The two are similar in some ways, but SGML is a more complicated language; both provide the ability to create arbitrary markup languages, but XML does so with far fewer special cases. SGML acted as the foundation for the development of HTML and later XML, but is not widely used today. Most SGML documents seen today are either legacy documents, or deal with the typesetting of printed documents in different ways. SGML never achieved the penetration into the realm of representation of organizational data that XML has.

At its core, XML is a very simple language. One could say that the two main architectural goals of XML were simplicity and power. Simplicity is important because the simpler a language is, the easier it is to learn and use, and the less likely that it will contain errors, problems, or internal contradictions that make the language inconsistent. To maximize the power of a language, it must be created in such a way that it can be applied to a range of different problems efficiently, not designed to serve a few niche cases. XML accomplished these two goals admirably. It is built in a very simple way, and its similarity to other existing markup languages such as HTML means that the learning curve is about as good as it could possibly be. In this section, we will discuss the core set of “rules” that dictate how XML documents are structured and interpreted. The designers of XML attempted to minimize the number of rules that users must know in order to use the language. This simplicity has led to many of the advantages for which XML is known; namely, the fewer rules a language has, the easier it is to learn, and the easier it is to build software to work with the language.

Tags Must Come in Pairs

In the XML vernacular, a tag or an element is the marker that is put around a piece of data. It effectively acts like parentheses in writing, showing the reader where some comment or piece of information starts, and where it ends. Figure 2.2 shows how XML doesn’t have a tabular or field structure like other data representation formats. In other formats, where the data begins and ends is largely a function of where it falls in a grid-like structure. Because there is no strictly enforced grid-like structure in XML, data must be delimited clearly. There are no exceptions to this rule, although in some XML documents you may see what are called “singleton” tags—these appear as XML tags that contain a slash (“/”) character after their name. This is XML shorthand for an opening and a closing tag; it just means that the implied content within the tag is empty. Singleton tags are not exceptions to the rule that all tags must come in pairs; they are simply a shortcut in the case where the tag contains no content. In the example provided here, the singleton tag for “name” is used—note the slash before the end of the tag. This is the same as expressing the “name” tag as <name first="David" last="Allen"></name>.


Figure 2.2 An example of XML tag pairing.

Tag Pairs Can Be Nested Inside One Another at Multiple Levels

The first rule, that tags always come in pairs, essentially acts to identify the XML concept of a piece of data. The second rule, that tags can be nested within one another, is what provides XML documents with their real structure. If one tag is nested within another, we can say that there is a “parent” tag (the outer tag) and a “child” tag (the inner tag). When this principle is applied to complex data structures, the result is an XML tree. The very first element in an XML document can be seen as the root of the tree, with various branches corresponding to the child tags of the first element. Those sub-tags may in turn have their own child tags, each of which represents further branching of the tree.

There are several reasons why this hierarchical structure is used. When documents are structured in this way, it allows software to treat the entire document (or subsections of it) as a tree internally. Trees are data structures that are extremely well understood in computer science terms—there are already a multitude of algorithms in existence that can effectively process trees in a variety of ways. This allows a large body of previous research on trees to be applied to XML documents. But that is a low-level technical reason. At the user level, it permits easy representation of common parent-child relationships in data. It also allows related data items to be “grouped” together under a common parent tag for semantic reasons. In the relational model, this logical grouping of related pieces of data is often done at the record level, with one record holding a number of different data points that logically hang together. Figure 2.3 illustrates a conversion between tabular data and XML-wrapped data. In XML documents, that might translate into an XML structure where each record of data is captured as a “chunk” of XML with an enclosing tag, such as the “flight” tag seen in this example.


Figure 2.3 Comparing tabular and XML representations of data.
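A minimal sketch of such a conversion in Python; the record layout and tag names are invented in the spirit of Figure 2.3:

import xml.etree.ElementTree as ET

rows = [
    {"number": "559", "origin": "Washington",
     "destination": "New York City", "departure": "9:30"},
]

flights = ET.Element("flights")
for row in rows:
    flight = ET.SubElement(flights, "flight")  # one "chunk" per record
    for column, value in row.items():
        ET.SubElement(flight, column).text = value

print(ET.tostring(flights, encoding="unicode"))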

One interesting point about the hierarchical nature of XML that rarely gets mentioned is this: While there are established methods of converting tabular and relational data structures into hierarchical structures, the opposite is not true. Converting hierarchical structures into relational or tabular data can be extremely difficult, not because the hierarchical structure is somehow more powerful than the relational form of data storage, but simply because of differences between the two models. Converting from one model to the other is a challenge often faced when organizations want to store XML data in relational databases. In later chapters, we will discuss XML databases and how their storage model may be better in certain cases.

XML Is Unicode Based

Unicode is a relatively new form of encoding characters in computers. The need for Unicode resulted from the recognition that the primary standard for representing characters (known as ASCII) was extremely limited and basically only allowed representation of characters from Latin alphabets. As computers crossed more international boundaries and came to be used in almost every human language on the planet, it became desirable to have a single way of representing characters that could cover languages that do not necessarily use a Latin alphabet. In short, Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language. **

XML uses UTF-8 as the default character encoding for all documents. UTF-8 in turn is a way of representing the numbers assigned to characters by Unicode as bytes that can be stored. This basic feature of XML makes a lot of sense when examined together with XML’s other features. It certainly wouldn’t do to create a generalized data representation markup language that was unable to express characters from different languages. XML’s use of Unicode makes it easy to exchange documents over national and cultural boundaries, and does not require the romanization of foreign words. XML element names, attribute values, and document data can be specified in a way that is easiest for the author and is still interoperable, making XML truly an international format.
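For example, a document can declare its encoding explicitly and use non-Latin element names directly; the fragment below is a hypothetical illustration:

<?xml version="1.0" encoding="UTF-8"?>
<挨拶 言語="ja">こんにちは</挨拶>

Here the element name, attribute name, and content are all Japanese, and the document is still perfectly well-formed XML.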

Designing Your Own Language

XML allows element names to be whatever the author prefers. Data publishers are free to refer to pieces of data in their own terms. But with this capability comes some responsibility. In order for a defined language to make sense, it too must conform to several rules that help give the language meaning for other computers and software packages. It wouldn’t do to have everyone choose their own unique set of 20 different elements, mixing them freely and in any order. The confusion that would result from this would be analogous to people randomly ordering the words in their sentences, perhaps occasionally mixing in words from other languages. The results might sometimes be humorous, but would not often be of much use. So the definition of any language within XML should include the following three pieces of information that form the rules of the language.

1. The list of allowed tags must be defined.

2. The nesting structure of the defined tags must be laid out.

3. There should be descriptions of how to process the tags.

These rules are not strictly required. It is possible to use XML documents without following any of these rules, but it is rarely a good idea. If XML documents could be thought of as cities, with individual data items as citizens, then these rules are like a sheriff and a court system. While it’s technically possible to do without them, nobody will like the resulting anarchy.

Figure 2.4 illustrates how creating a list of allowed tags in the document is similar to writing a dictionary for the language. It lays out which terms can be used. Words that are in this dictionary make sense in the language, just like the words in an English dictionary are understandable to speakers of English. The nesting structure of the defined tags must also be defined, to give some meaning to the collection of “words” in the language. This is similar to grammar rules in the English language. Speakers of English cannot simply order words in any way they want, since result it would sentences in garbled!


Figure 2.4 Why nesting rules matter.

In this example, we see two documents that contain identical data, with different nesting conventions. On the left of the figure is a standard representation with a “route” tag clearly specifying that the flight goes from Washington to New York City. The right side contains the same data, but it is rearranged, so it has a different meaning. Without solid and enforced nesting rules, this second document might be interpreted as a flight with no departure time that goes from one place named “559” to another place named “9:30”!

As an illustration of how XML itself is a “meta-language,” consider how most of the time when new XML languages are created, the tags that are allowed in the document and the nesting rules that must be in place are themselves expressed in the XML document. The two major XML architectural components used to accomplish this are referred to as DTDs and XML Schemas. Later, we will go into more depth about these particular components and how they fit into the overall XML component architectural model.
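As a small taste of what is coming, a minimal DTD for the flight language used in this chapter’s examples might define the allowed tags (rule 1) and their nesting (rule 2) like this; the declarations are a hypothetical sketch, not a published standard:

<!ELEMENT flight (route, departure)>
<!ATTLIST flight number CDATA #REQUIRED>
<!ELEMENT route EMPTY>
<!ATTLIST route from CDATA #REQUIRED
                to   CDATA #REQUIRED>
<!ELEMENT departure (#PCDATA)>

Note that rule 3, how the tags are to be processed, is not something a DTD can express; it is typically documented separately or embodied in the software that consumes the documents.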

The last part of defining any created XML language is specifying how the tags should be processed. At times, it may be important to deal with certain data items before others, or even to skip processing some items depending on the values of others. The need for formalized rules about how the tags are processed varies with the complexity of the language that is created. The more optional tags and special cases that exist in the language to represent particular data structures, the wiser it is to lay out a set of rules that describe how the language is processed, and what that processing means in terms of the phenomenon the data represent.

XML is to data what Java is to programming languages. The excitement over Java was that you could “write once, run anywhere.” With XML capabilities you can “wrap your data once in XML and utilize it anywhere!” XML represents what Mike Hammer terms a disruptive technology—that is, a technology that has the potential to impact related processes (Hammer & Champy, 1993). XML has the potential to simplify our data management operations in much the same way that EDI simplified data communications exchange.

HTML versus XML

With regard to the way the rules of the two markup languages work, there are a number of differences between HTML and XML. In HTML, there is a fixed set of tags that are allowed to be included in HTML documents, while all other tags are ignored. In XML documents, on the other hand, tag formatting can be specified for any application—there is no fixed set of tags that are acceptable in base XML. While tag limitations can be put in place on XML documents for particular applications, as discussed earlier, there are no inherent limitations on XML itself as there are in HTML. Another difference between the two concerns how browsers interpret the data. HTML documents are interpreted in terms of a visual layout, whereas XML documents can be interpreted in a number of different ways according to what data is actually in the document. XML technologies exist that allow browsers to display data documents in a visually pleasing way just as HTML would have done, while at the same time preserving the metadata and semantic context of the data.

In an excellent article, two of XML’s creators describe two of the motivations that led them to develop XML (Bosak & Bray, 1999). The first was Internet congestion. With HTML, updating even a single field required accessing a server and reloading the page. With XML, more work can be accomplished on the client side. The second motivation was to address information location difficulties. With HTML, you cannot search for anything marked as “price” on the web beyond searching for the string “price.” XML-based data is “self-describing,” which means that devices will be more capable of local analysis of information. As we shall see, the goals of XML’s creators directly complement several data management objectives.

Now that we have taken a look at some of the important terms to understand with XML and what rules of the road need to be followed, it’s time to take a look at examples of how XML is actually used in practice.

XML Usage in Support of DM: Builder’s Perspective

We begin this section by showing how understanding the way a browser works can be an important tool for learning XML. We next present a number of increasingly detailed examples of how XML has been used to support DM. The examples are presented from the perspective of someone responsible for developing and implementing these types of solutions.

Integration at the Browser

XML has been making its way to the browser—support for rudimentary XML structures existed in the Microsoft Internet Explorer (IE) browser in version 5 and in Netscape version 6. In spite of frustration with popup windows, viruses, and tracking software, the majority of knowledge workers today are running Microsoft Internet Explorer version 6, which includes support for a number of XML technologies. Browser designers as a group have wisely decided that as the browser moves toward being a standard application deployment platform, XML will not only be a nice feature, it will be pivotal. The reason it’s safe to bet on XML is that the core of the technology is stable and will not change. The “layered” aspect of the related standards also allows implementers to put into place a core functionality that allows subsequent higher layers to be easily added.

Currently, XML integration at the browser is not being used to its full potential. Part of the reason for this is the worry that different browsers implement some of the XML technologies in slightly different ways, so the proliferation of different browsers makes effective use more difficult. In fact, XML within the browser is more frequently used in situations where the browser population is largely known, such as for internal applications rather than those distributed over the wider Internet to the general public. Open up an XML document with Internet Explorer 5 or higher and you will see some XML features illustrated. Figure 2.5 shows that Internet Explorer correctly laid out the hierarchical structure of XML—for example, notice how the “<head>” and “</head>” tags align. Because browsers understand the structure of XML, they can be used to illustrate how XML can be used to integrate disparate data items.


Figure 2.5 Opening an XML document (in this case an XML style sheet—XSLT) with IE 5.5 for Mac.

Figure 2.6 illustrates how browsers can be used to achieve integration on a small scale if both sources of data (in this case from applications 1 and 2) are “wrapped” in the appropriate XML. This is not to say that you should depend on your browser to manage your data integration, but a browser can be a valuable means of prototyping and debugging XML solutions.


Figure 2.6 Browser-based XML integration is useful for prototyping and debugging XML solutions.

Integration via Hub and Spoke

We can extend the model of this example to further illustrate the power of XML. Figure 2.7 shows how the addition of an XML parser or processor can be used to ensure that application 3 can “understand” the data of any application wrapped in the appropriate XML.


Figure 2.7 XML application integration.

Extending the previous example leads to one of the most important XML architectural devices—the hub and spoke integration model. XML and its transformation capabilities permit organizations to take advantage of interconnectivity. Not every application is connected to every other application, but it is useful to understand the range of interconnection that is of concern here.

Figure 2.8 shows that the possible number of point-to-point interconnections that an organization would be required to maintain is given by the following formula:

number of interfaces = N(N − 1) / 2

where N is the number of applications; interconnecting 6 applications completely might therefore require a total of 15 interfaces. When the number of applications that might need interconnecting reaches 60, the total becomes 1,770. (Some would argue for higher numbers because interfaces are typically developed to work in one direction only.) That the same interconnection could be resolved using just 60 interfaces illustrates the efficacy of the adoption of a hub and spoke solution to this problem, since the XML hub approach requires only N connections for N applications; the number of connections grows proportionally with the number of applications present. This shift in thinking about the way systems are connected has actually been around for quite some time. What XML really brings to the table here is the ability to use freely available software components to facilitate the connection to the hub, and the tremendous benefits that come from the “ripple effect” as all systems start to use this interface.
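A quick back-of-the-envelope check of these figures; the function name is ours:

# Point-to-point wiring needs one interface per pair of applications;
# a hub needs only one connection per application.
def point_to_point_interfaces(n):
    return n * (n - 1) // 2

print(point_to_point_interfaces(6))   # 15
print(point_to_point_interfaces(60))  # 1770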


Figure 2.8 Interconnections between applications using point-to-point organization versus an XML hub.

A virtual hub and spoke implementation is illustrated in Figure 2.9. As each application becomes aware of XML, it is able to “understand” and translate the mapping between the internally required data formats and the XML-wrapped data from each application. This is not an intuitive understanding of concepts like humans have, but rather a structural understanding of the pairing of metadata with data in XML. In Figure 2.9, just the finance data is wrapped initially. Once at least one other application (for example, the Mfg. data) can understand the XML-wrapped finance data, the business value from the use of XML becomes apparent. This is because data engineers can correctly specify the transformations required to make data from one application useful to other applications.


Figure 2.9 Virtual hub and spoke implementation.

There are two main ways to expand the way that organizations deal with XML.

• One way is to increase the number of applications that can understand the XML-wrapped finance data. The parser itself does not understand the data, but allows the application to understand it by using the metadata.

• The other way of expanding XML capabilities is to increase the number of tags that each application is able to “understand.” Adding new sets of tags might be as simple as replacing a table. Technically, all that is happening is that you are expanding the rules that each application understands and trying to maintain them so each application might implement the same rules using different technologies. Adding “understanding” of XML-wrapped data from new applications is done using a single source of entry and maintenance.

Figure 2.9 depicts a situation that can be described in terms of three different organizational levels. First, consider the situation as presented: multiple applications working in concert to support attainment of objectives. We will refer to this as the 100% view. The smaller view (say 25%) is that instead of applications, the code components are subroutines that need to communicate with each other. XML in this view facilitates the interconnectivity between application subsystems. There is also an interorganizational view (say 500%) where instead of cooperating systems within an organization, the applications are cooperating business partners and the interfaces are connected to external entities. In each instance, the XML hub and spoke architectural model supports the business objectives, decreasing the effort spent in dealing with data interchange and transformation.

B2B Example

While the term B2B (business to business) may not seem as useful now that we are on the bust side of the .COM boom, the next example will show how XML was used to solve a problem at a major airline by implementing the hub that we just described.

World Wide Airlines needed to solve the data integration problem described in Figure 2.10. World Wide (WW) needed to get large amounts of data into its Mileage Accounting and Reporting System (MARS).


Figure 2.10 Before and after XML implementation supporting MARS data importation.

The data was provided by many different business partners, each sending in data using its own format. Transforming the data from each business partner was originally done using standard “conversion programs” consisting of spaghetti code implementing countless transformation rules.

Replacing the transformation portion of the system required separating the conversion rules from the remainder of the code. The business value added by the program was now focused on

• Reward rates (within and across partners)

• Distinguishing different status program participants

• Integrating rotating promotional offers

• Integrating with airline booking systems

• Integrating with affiliated airline web sites

• Printing accurate and timely statement packages

World Wide simply asked that each affiliate wrap the data using XML before sending it to MARS. This enabled them to decode the tags used, clarify any confusion, and feed the XML into a big parsing engine that focused on putting the data into the correct format required by MARS. There were two main benefits to this approach—not only were the savings substantial since there was much less manual data analysis of problematic input, but call center volumes decreased as well, as fewer customers had problems with what did go through.

For other data managers, let us abstract this to something more generally useful. Many organizations must deal with the flow of various forms of data—a credit card company that pulls credit reports from many different bureaus; an insurance carrier that electronically resells and purchases policies; or a pharmaceutical company that receives data related to studies on its upcoming products. In any case, we know that this data is often difficult to deal with because the producers usually send it whichever way is most convenient for them. The MARS example points out that it is possible to agree on a central format—XML—that both parties can deal with. That format in turn is specified in terms of the vital metadata involved in the process. XML as a format is a compromise, and a good one. The vendor is not imposing its own pet data format on the consumer of data, and the consumer is not imposing its format on the vendor. Both are able to get what is needed from the transaction by compromising, with the crucial metadata at the center of that compromise.

Legacy Application Conversion

In this next example, the hub is replaced by an XML message bus that is used to manage the flow of XML-wrapped data between system components.

XML played a key role in selling a legacy application conversion to management and technically performing it. This effort has been partially described elsewhere (see Swafford et al., 2000). The core legacy system was more than 16 million lines of COBOL. The overall plan of the project was broken down into three steps:

1. Split the system into smaller chunks. Enable better understanding of it using the “divide and conquer” approach.

2. Allow communication between areas of the system using an XML message bus. By using this bus, components could talk to one another in the way seen in Figure 2.8.

3. Reengineer the components simultaneously using modern methods and technologies. By breaking down the system into parts, different groups could work on each part individually, speeding progress.

This approach was ultimately successful (see Krantz, 2003). Figure 2.11 illustrates how each individual component was added to the XML-based repository. The value of this repository lies in its ability to store XML descriptions of the components that the system was broken down into. These components were used to develop improved functionality using object- and non-object-oriented technologies. By storing the components individually and examining their descriptions, it was also possible to see where effort was being duplicated. Components could then be altered to work together more efficiently. When a project is undertaken that breaks a system up into components and changes them while the system is still functioning, we refer to this as “changing the tires on the bus while it rolls down the road.”


Figure 2.11 Overview of legacy architecture conversion project—changing the tires on the bus as it is rolling down the road—replace existing, understood components with more maintainable/better-performing components.

The value of this example is that data managers do not have to pull the plug on a system and tear it down to the ground in order to make substantial improvements. These systems can be broken down into logical components that are examined individually. By understanding the relationships between them, individual components can be changed and plugged back into the system.

XML relates to this approach in two ways. First, it can facilitate communication between components when there is an “XML hub” through which the components can send information. Second, it provides a method for actually representing the components, so that their metadata and interaction with the rest of the system can be better understood. The information that goes into an XML representation of the components is typically taken from the subject matter experts who work with the system. The process of representing this in XML is situation specific and nontrivial, but when it is accomplished, it provides a level of architectural freedom that can be used to take old, groaning systems and make them do things one might not have thought possible.

XML Conversion

Next are two longer examples illustrating the importance of understanding XML architecture when using it to solve an organizational data interchange problem.

Since the move to XML will inevitably involve a substantial amount of work converting existing data to XML, it is appropriate to take a look at an example of such an effort, as well as to point out the potential pitfalls and benefits. This particular example involved two companies: one will be referred to as MedCorp, a large multinational health company, and the other is the actual firm Data Blueprint, our university spin-off company that develops and delivers data and metadata solutions.

The Challenge

MedCorp approached Data Blueprint about converting approximately 15,000 legacy HTML documents that contained internal documentation into an XML format. MedCorp actually stored their documents in a number of different formats, but most often referred to the HTML edition. One problem they were experiencing was that the format used to create the print edition of their documentation was not the HTML edition, so that when corrections were made to the HTML edition, they often didn’t find their way back to the original. With several different formats floating around the organization, change management was rapidly becoming a nightmare with the growing need for resources to keep all editions in sync enough for them to be useable.

Display of the documents was also an issue. Ideally, MedCorp wanted to be able to deploy these documents to other platforms as well, such as wireless phones or PDAs for salesmen in the field, but they weren’t about to try to manage yet another edition of the documentation, given all the problems they were already having. What was really needed was a way to store the documents centrally in one place, where all modifications would be made. The actual presentation of the documents could then be regenerated automatically from the source documents by pressing a button as the need arose. Enter XML. One of its most basic strengths is write it once and read it everywhere (see Figure 2.12).


Figure 2.12 Write it once and read it everywhere! (Source unknown.)

The Process

The project consisted of different parts, which would in the end yield a master set of documentation for MedCorp that could be easily managed, edited, and repurposed. The steps of the conversion process included

1. Requirements Gathering and Planning—Before even starting the project, it is critical to have a solid idea of exactly what is desired as a result of the project. This allows efforts to be focused on critical goals and non-value-added work to be avoided.

2. Construction of the Data Model—A solid representation of the data in the documents was needed. The data model must not only accurately reflect the current contents of the documents, it must also take into account relationships within and between documents, as well as potential future enhancements and modifications.

3. Creation of an Automated XML Conversion Process—In this step, a toolset had to be developed that would extract the data out of the source documents and populate a new XML document. (A sketch of one such extraction step follows this list.)

4. Quality Assurance—In order to assure that documents are converted properly, random samples must be taken and inspected to ensure that all of the source data has been transitioned.

5. Presentation—A presentation framework must be created to allow for rapid repurposing of the resulting XML documents. This phase involves the distribution of the data in a number of different formats on different platforms.
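As a rough illustration of step 3, the sketch below shows one way such a conversion toolset might begin: pulling titled sections out of legacy HTML and re-emitting them as XML. This is a minimal sketch in Python, assuming a simple <h2>-based section structure; the tag names and sample content are hypothetical, not MedCorp’s actual formats.

from html.parser import HTMLParser
from xml.sax.saxutils import escape

class SectionSiphon(HTMLParser):
    # Collects each <h2> title and the text that follows it as one section.
    def __init__(self):
        super().__init__()
        self.sections = []
        self._in_title = False
        self._title = ""
        self._body = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._flush()
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self._title += data
        elif data.strip():
            self._body.append(data.strip())

    def _flush(self):
        if self._title:
            self.sections.append((self._title.strip(), " ".join(self._body)))
        self._title = ""
        self._body = []

siphon = SectionSiphon()
siphon.feed("<h2>Dosage</h2><p>Take twice daily.</p>"
            "<h2>Storage</h2><p>Keep away from heat.</p>")
siphon._flush()  # capture the final section
for title, body in siphon.sections:
    print("<section><title>%s</title><body>%s</body></section>"
          % (escape(title), escape(body)))

The real project’s “data siphoning” was considerably more involved, but the principle is the same: recognize structure in the source, then emit it with explicit XML markup.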

Important Outcomes and Insights

There were several important project outcomes:

• The material was completely converted to XML, delivered on time and under budget.

• Data siphoning was developed as a new semi-automated technology for wrapping XML—capturing the content of some object in order to be able to subsequently manage the content in XML form. This often requires content re-architecting (see Figure 2.13).


Figure 2.13 XML siphoning technique.

• Conversion facilities were slowly developed and made available to the business team participants who had initial problems “viewing” the conversion results. These conversions in turn were built on other XML components.

• Indexing and other restructuring decisions dictated use of “conversion freezes” and limited subsequent architectural options.

The last one is hardly typical but bears repeating. Earlier recognition of unanticipated business requirements would have made it possible to increase efficiency. Instead, many business requirements surfaced after conversion freezes—other business requirements were assumed; for example, the need to create new materials, new linking and navigation structures within XML, and code retention from the original HTML through to the XML.

Given this last bullet point, the real lesson was that it is important to identify all architectural decisions before the process begins, to prevent the substantial amount of work that it takes to change those decisions and assumptions midstream. When organizations underestimate the value of the baseline architecture, they are more likely to neglect proper planning in the initial stages, and later in the project they may want to change some of their earlier faulty assumptions. One prominent example that required a work-around was the determination of the actual physical XML file sizes. Selection of a physical size below a certain “object size” required the development of a programmatic work-around. The difference forced by the rework is illustrated in Figure 2.14.


Figure 2.14 Planned and actual XML conversions due to incomplete architectural requirements.

To wrap up this example, we offer a few important points on the process of converting data from its original format to XML:

• The process can and has been successfully accomplished by many organizations.

• It is important to identify the use of the data that is being converted—its metadata—along with the architectural requirements of how the resulting XML will be used. It can make the difference between a project that runs smoothly, and a much more costly engagement.

• If the process is performed correctly, the resulting XML data will be more valuable than the source was, in part because of the work put into the metadata and architectural aspects of that data.

Metadata Management Example

We close the chapter with an example of how the role of XML was made visible by one organization that used it to make some of its legacy metadata more useable.

The preceding example makes obvious the reasons for carefully integrating your XML and your DM operations. The next example describes a more practical introductory approach to organizational XML learning, in which we have taken some rather cumbersome legacy documentation and made it more accessible by evolving it into XML.

Imagine a typical legacy system that an organization is trying to eliminate before moving to a newer system that will provide more flexibility. As a major component of the migration process, the data and processes from the legacy system must be thoroughly understood, extracted, and cleaned before they can be moved over to the new system. Typical for most systems, the data ideally should have been online yesterday. The primary documentation for the functions and processes of the legacy system is a 700-page Word document that describes the functional decomposition of the system (see Figure 2.15). The legacy system itself is composed of a total of 1,887 units, of which 35 are high-level functions and 1,852 are processes. New programmers were given copies of this document in MS-Word format to help them understand the functionality.


Figure 2.15 MMS functional decomposition.

Metadata Identification and Extraction

The gigantic document itself is fairly well structured, laying out the hierarchy of functions and processes and how they relate to each other. Still, there is no way to effectively take advantage of this structure since searching through the documents for particular strings is slow, and often returns “false positives” of a particular term used in a context other than what was meant. (For example, searching for a particular function might first find it in the context of being a subordinate of some other function, rather than finding its actual definition.) Ideally, the document would be extensively cross-referenced, allowing the reader to jump back and forth from any point to any other point according to what was relevant at the time. Since the functions and processes are arranged hierarchically, it would be useful to be able to move up the tree and down it, but also across it, enabling the analyst to take a look at the sibling nodes in the tree for any particular item.

What is needed here is effective management and application of the metadata, since that is really all the document is. XML in and of itself does not automatically do this, but what it does provide is a solid framework in which the concepts of interest can be articulated and exploited. The first step is to take the original document and use its inherent structure to build a corresponding XML document that marks each element of data in the document appropriately. The purpose of this stage is to attach names and meanings to individual pieces of data, rather than having them float around in amorphous blocks of text. The second stage then is to exploit this newly created structure to make the document more readable and searchable, so that the analysts can better understand the legacy system they are working with and spend more of their time solving problems, and less of it working around the annoyances of the original manual’s inadequate data representation.

Figure 2.16 shows a conversion example. It is clear that the text of the document implies a certain structure to the data without actually specifying that structure. Converting the information into XML allows particular data items to be explicitly referred to according to their metadata. For example, when searching the original Word document for the term “Function_5,” many dozens of matches might have been found—one where Function_5 was defined, one where it was listed as a child of another function, yet another where it was listed as a reference from some other process. By wrapping the data in XML, the analyst can essentially “ask” the document more specific questions using the metadata—“Show me the item that refers to Function_5 as a child,” instead of “Show me everything about Function_5.”


Figure 2.16 Converting a structured Word document into XML.
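A minimal sketch of what such targeted “questions” might look like once the manual is in XML, using Python’s standard ElementTree module; the element names are invented for illustration:

import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<decomposition>"
    "<function name='Function_2'><child name='Function_5'/></function>"
    "<function name='Function_5'>"
    "<description>Validates incoming orders.</description>"
    "</function>"
    "</decomposition>"
)

# "Show me the item that refers to Function_5 as a child."
for f in doc.findall("function"):
    if f.find("child[@name='Function_5']") is not None:
        print(f.get("name"), "refers to Function_5 as a child")

# "Show me where Function_5 is actually defined."
defn = doc.find("function[@name='Function_5']/description")
print(defn.text)

Each query returns only the matches that play the requested role, which is precisely what string searching in the Word document could not do.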

Presentation

Once the data has been extracted from its original semi-structured format, some way of presenting it to users must be devised. The original Word document itself was in something of a presentation format in that the document was page oriented, and had the concept of particular fonts and sizes attached to portions of text. Even though the original document was meant for presentation, now that the data has been moved to an XML format, a number of additional presentation options are available. Normally in Word or HTML format, the presentation would have been hard-wired and would not be modifiable without a serious reworking of the document. Putting the data into an XML format helps us focus on the data of the document, and to see the presentation as arising as a consequence of the relationships inside of the data, instead of seeing presentation as a unique way to simply throw the data up onto the screen.

Access to the metadata that is clearly marked in the XML document opens up new presentation possibilities. The original Word document contained a seemingly endless procession of blocks of text, each describing a particular function or process. With the addition of contextual data in the XML document comes the ability to have the presentation layer take advantage of it. In this case, the presentation of the document was completely reengineered to take advantage of the data about the relationships between the data items. Instead of presenting “chunks” of information about each process, the data was rearranged into a family tree, and navigation became possible up, down, and across the tree. The processes and functions in the document were no longer thought of as blocks of text, but as valuable links between other processes and functions.

Further, since the complete list of what constitutes a function or a process was available, it became possible to have an “index” into the document at any time. From within a browser, the user could view all of the data about a particular function or process, while also having links to related information, and an index on the left-hand side of the screen showing all other data items that were in the same family. The goal of presentation should be to provide a number of different ways of looking at the data wherever possible, since the more options the information consumers have in terms of drilling into the data, the more likely they are to find what they need quickly. The only way these different options can be provided is by understanding the relationships inherent in the data, and how everything fits together. Understanding the relationships in the data and formalizing that understanding is the process of identifying and documenting the metadata.

Converting Documents to XML

For structured documents like the one in this example, the process of converting the data from its original form to XML is relatively straightforward. Particular items in predetermined places in the source data correspond to well-known data items. There are a number of up-and-coming technologies that allow unstructured data in a number of different formats to be converted into XML documents. Generally, there are few problems with converting tabular and relational data into XML; most of the work in that case deals with finding the appropriate tags and nesting rules that form an overall coherent data model. However, in the case of unstructured information, the source data can be quite a bit more unruly. Take for example most current HTML pages, which describe how the page looks, but nothing at all about what is actually in the page. The analyst can either choose to use one of several widely available software packages to parse the data out of the unstructured format and wrap it in XML, or develop a small custom program to extract particular pieces of information out of documents. Typically, the amount of effort that is required to convert documents is directly related to how structured the document already was. For highly structured documents, less effort will be needed, while for very unstructured documents, the process is far more challenging.

In the context of extracting data, we sometimes refer to the idea of “metadata markers,” or particular “flags” seen in unstructured data that would normally signify to humans that a particular type of data is coming. In the above example, the word “Function” followed by a colon might act as a metadata marker for the function data. Converting unstructured information to XML is largely an exercise of locating the appropriate metadata markers, extracting the data that it is marking, and wrapping the result in a more appropriate XML tag. The XML tags in a document are themselves basically metadata markers. The advantage to using them over unstructured formats is that they were designed expressly for that purpose, while unstructured formats generally have ad hoc metadata markers that are frequently misleading or incomplete. Figures 2.17 and 2.18 illustrate the conversion outcome.


Figure 2.17 After conversion to XML, the contents are rendered as HTML by passing them through an XSLT transformation.


Figure 2.18 Conversion from a structured document to an immediately useful XML utility.
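As a minimal sketch of the marker-driven extraction just described, the fragment below locates a “Label:” marker at the start of each line and wraps what it marks in a correspondingly named XML element; the sample text and tag names are ours:

import re

legacy_text = """Function: Function_5
Description: Validates incoming orders.
Parent: Function_2"""

# Each "Label:" at the start of a line acts as a metadata marker.
for label, value in re.findall(r"^(\w+):\s*(.+)$", legacy_text, re.MULTILINE):
    print("<{0}>{1}</{0}>".format(label.lower(), value))

This yields <function>Function_5</function> and so on, turning ad hoc metadata markers into explicit, queryable XML tags.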

Overall, this example touches on several different issues. For data managers, it illustrates how XML can be useful for presenting, storing, and manipulating metadata about systems. It also provides another example of how document conversion can be accomplished, and how this conversion is not always done from source data, but sometimes from semi-structured information that represents system documentation. XML is a valuable tool for data managers in many different phases of system migration projects. In this example, we refer to the metadata of the system being migrated; in other cases, XML may be involved with the actual data being migrated, it might be used as a method of communication between the new system and other systems, or in any number of other areas.

Chapter Summary

This concludes a chapter that has attempted to provide you with a good overview of how XML works and how it is used in practice. It has covered basic XML rules and format, how XML differs from other technologies such as HTML, and a number of examples of XML’s architectural concepts at work. When data managers embark on systems work, the work usually falls into one or more of several categories:

• Building a new system

• Modifying an existing system

• Simply understanding how something that is already in place operates

• Converting data for use in a system

• Facilitating communication between systems

In all of these situations, there is a potential application for XML. For the data manager, it is more important that he or she understands how XML fits into the architectural “big picture” than it is to know the nuances of XML syntax. This chapter has provided a number of examples of how XML fits into systems work. With this basic understanding, we can now build on it to take a look at the different technology components of XML, and how these can be applied to build higher-level solutions. Many other usages of XML are described in other upcoming chapters, particularly in these two areas, which have entire chapters devoted to them:

• Toolkits and XML frameworks (RosettaNet, BizTalk, ebXML, etc.). Companies build toolkits based on XML, creating further “layering” of standards to provide particular services.

• XML-based portal implementation

These examples present a number of ways that XML has been applied in support of DM functions. While there isn’t enough room here to catalog the specific benefits accruing to the developing organization, we believe you will see how you as a system builder can consider XML as a new tool in your builder’s toolkit.

References

Ahmed, K., Ayers, D., et al. Professional XML meta data. Birmingham, AL: Wrox Press; 2001.

Aiken, P., XML conversion observations. Proceedings of the XML and Meta-Data Management & Technical Conference. Dallas, TX, Wiltshire Conferences. 2001.

Bean, J. XML for data architects. New York: Morgan Kaufmann/Elsevier; 2003.

Bosak, J., Bray, T. XML and the second-generation Web: The combination of hypertext and a global Internet started a revolution. A new ingredient, XML, is poised to finish the job. Scientific American. 1999.

Chaudhri, A., Rashid, A., et al. XML data management: Native XML and XML-enabled data systems. Boston: Addison-Wesley; 2003.

Dick, K. XML: A manager’s guide. Boston: Addison-Wesley; 2000.

Hammer, M., Champy, J. Reengineering the corporation. New York: Harper Business Press; 1993.

Krantz, G. Architecting your data center with XML. TechTarget Network (search390.com); 2003.

Swafford, D., Elman, D., et al. Experiences reverse engineering manually. CSMR 2000. Zurich, Switzerland: IEEE Publishing; 2000.


*For example, see Bean (2003); Dick (2000); Chaudhri, Rashid, et al. (2003); Ahmed, Ayers, et al. (2001). Also just google the term “XML” and you will gain access to much good technical content.

**http://www.unicode.org/
