Creating a Schema for a Set of Documents: Document Analysis

20.2. Creating a Schema for a Set of Documents: Document Analysis

The process of examining a collection of documents and describing its “shape” by using a DTD or schema is called “document analysis.” The process works best when a schema specialist works with a cadre of people who already deal with the documents whose structure is to be analyzed. It works less well when the specialist works alone from sample documents, or when an untrained person does the analysis without a schema specialist. In this section, we call the cadre plus the specialist the “team.”

In the following sections, we cover a sample scenario showing how a document analysis might proceed and a few examples of things a document analyst must look for.

20.2.1. Scenario: A Document Analysis

The following kinds of people should be in the document analysis cadre. It generally works best if there are two or three from each group, except perhaps the senior editor, resulting in a cadre of about ten people.

Those who have been responsible for writing or rewriting the documents to be analyzed. They might be called “authors,” “writers,” or “editors,” depending on the corporate culture. They will have a good, albeit possibly subconscious, understanding of the kinds of content found in various parts of the documents.
Those who have been responsible for formatting and publishing the documents to be analyzed. They will complement the authors and will have different insights into the real structure of the documents. Two subgroups should be represented: editors, who make decisions about style, and compositors, who implement those decisions and make sure the final output is formatted correctly. (Sometimes composition is called “production”.)
A manager, possibly called a “senior editor” or “executive editor” (again, depending on the corporate culture) who is familiar with the documents and has the authority to make decisions when structure anomalies are found.
Those who will be responsible for maintaining the schema as it evolves over time and/or will be responsible for configuring the various XML processing products to be used. They might not yet be content/structure experts but will have to become so. Participation in the document analysis is the best and fastest way to immerse them in this area.

20.2.1.1. The First Meeting: Roughing Out the Structure

The first meeting is scheduled for an entire week but probably terminates by the fourth day. Two things are done during this first meeting:

During the first day and a half or so, the team engages in a structured XML exercise. The purpose of this exercise is to quickly familiarize everyone with some common terminology and—most important—help them understand the “XML point of view” toward textual data. This often involves quite a shift of mental gears—a “paradigm shift.”
The information will come too fast and too furiously—like drinking from a fire hose, some say—but the attendees don’t need to get all the details consciously sorted out right away. Only the schema maintainers have to worry about the details, and they will get extra help and training down the road. What is important is that attendees start to pick up the new paradigm intuitively. Reinforcement comes during the rest of the week.
Sometime during the second day, the exercise ends and the real document analysis begins. The specialist helps find the various parts of the documents and get descriptions of those parts on paper (or PC file). Within a couple of days, all will have a good understanding and description of what the cadre members (the subject-matter experts) believe is the structure of the documents. Together, the team will also have found, by then, a number of anomalies: things that are in existing documents that don’t fit the structure cadre members originally thought the documents were following. (Anomalies will be found. It has never failed yet.)

This terminates the first week’s effort for most of the cadre. The schema-maintainers-in-training, at least one of whom serves as the scribe for the team, will get some extra help from the specialist to correctly cast their notes in the form of a good schema (or possibly, to begin with, a good DTD, since that will be easier to input without fancy tools).

20.2.1.2. Between Meetings: The Anomaly Search

By now the cadre members need a chance to sort out what’s been done, let their heads clear, and take care of all the other work that piled up on their desks while they were meeting all day, every day. The schema specialist is available by phone if anything needs clarification, but stays out of the way for two to four weeks.

There is work to be done before the next formal meeting of the team, however. Specifically, the cadre members now make a page-by-page, document-by-document search for more kinds of anomalies. This search can be done part time and be spread over several weeks.
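Where a sample of the documents has already been roughly tagged, part of the search can be assisted by a tally of the structures actually in use. The following is a minimal sketch under that assumption; the element names and the three sample fragments are hypothetical, and a real search still needs human judgment about what counts as an anomaly.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Hypothetical, roughly tagged samples; real legacy documents would first
# need some conversion before a tally like this is possible.
SAMPLES = [
    "<section><title/><para/><para/></section>",
    "<section><title/><para/></section>",
    "<section><para/><title/><para/></section>",  # title after the first paragraph
]

def child_patterns(documents, tag):
    """Count the distinct sequences of child element types under `tag`,
    collapsing runs of the same child type into one entry."""
    counts = Counter()
    for text in documents:
        for element in ET.fromstring(text).iter(tag):
            pattern = []
            for child in element:
                if not pattern or pattern[-1] != child.tag:
                    pattern.append(child.tag)
            counts[" ".join(pattern)] += 1
    return counts

patterns = child_patterns(SAMPLES, "section")
for pattern, count in patterns.most_common():
    print(count, pattern)
```

Patterns that occur only rarely are candidates to bring to the next meeting, where the team decides whether each one is useful or merely a deviation.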

20.2.1.3. The Second Meeting: “Sorting Sheep”

The second team meeting probably occurs two to four weeks after the first one ends. It might be scheduled for a week, just to be safe, but more likely takes about three days. The team must work through the collection of anomalies and separate the sheep from the goats—decide which anomalies are useful and should be accounted for in the schema and which are just the result of a writer’s “getting the bit in his teeth” and deviating unnecessarily from the way other writers handle similar material.

It’s important to remember at this time that similar material should have similar presentations, especially when the material is technical. Otherwise, readers will waste time trying to figure out how the differently presented material is different when it isn’t different at all. This is why a manager with decision-making authority should already be involved in the analysis. If not involved, that person probably won’t have picked up the “XML point of view” and is liable not to understand the decisions that must be made.

20.2.1.4. Moving Onward

At the end of the second meeting, a pretty good schema emerges. It won’t anticipate every unusual situation that might come down the pike in the future, but it can be used as is. That’s the object of a document analysis: a schema that is usable. A well-designed schema can be extended, normally without requiring any modification of existing documents, when new requirements come to light.

Caution

Peterson’s Maxim: “A cast-in-concrete schema is an obsolete schema.” Always plan your schema and your processes that use that schema to be easily extended.

What is next? With a usable schema in hand, you can get on with developing your applications. And you can begin conversion of legacy data: all the existing documents that have to be converted to XML and to the newly developed schema. In most cases there is legacy data to be converted; these documents are the foundations on which revised and republished documents will be built. The schema specialist initially helps make the choices and develop applications. You might decide to convert your legacy data yourself, but more likely you will decide to have a conversion-specialist firm convert it for you. You need a DTD or schema to begin that effort, because that is what defines the form the documents must have after conversion. You can now begin converting—or start up your conversion contractor.

20.2.2. Structures to Look For

Some structures are obvious—especially because the structure of SGML documents (and hence of XML) was originally designed to capture the structure of “real, published documents.” Nonetheless, some structures seem to be well hidden, at least at first glance. Discerning hidden but important structures can make the difference between a mediocre document analysis and a useful one.

20.2.2.1. Big Pieces

Most documents are logically hierarchical: A book has front matter, a body, and back matter. The body has parts, which have chapters, which have sections, which have subsections, and so on. Terminology might differ within a corporate culture; for example, some might call the large parts “sections” and the small sections “numbered paragraphs.” Some of the levels (such as parts) might be omitted. Some or all of these levels will begin with a title, sometimes called a “heading.” Most writing, at least in most Western European languages, winds up in paragraphs. But it’s all hierarchical: Each part is made up of some smaller parts. Ultimately, some parts are just plain raw text characters.
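The hierarchy described above can be sketched with a small sample and a recursive walk. This is only an illustration: the element names (book, body, chapter, section, title, para) are assumptions chosen for the example, not names from any particular schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the element names are illustrative assumptions.
SAMPLE = """
<book>
  <title>Sample Book</title>
  <body>
    <chapter>
      <title>Getting Started</title>
      <section>
        <title>Installation</title>
        <para>Plain text lives at the bottom of the hierarchy.</para>
      </section>
      <section>
        <title>Configuration</title>
        <para>More text.</para>
      </section>
    </chapter>
    <chapter>
      <title>Reference</title>
      <para>A chapter with no sections.</para>
    </chapter>
  </body>
</book>
"""

# Element types that can contain further structural levels.
CONTAINERS = {"book", "body", "chapter", "section"}

def outline(element, depth=0):
    """Return (depth, title) pairs for every titled container, in document order."""
    entries = []
    title = element.find("title")
    if title is not None:
        entries.append((depth, title.text))
        depth += 1
    for child in element:
        if child.tag in CONTAINERS:
            entries.extend(outline(child, depth))
    return entries

toc = outline(ET.fromstring(SAMPLE))
for depth, title in toc:
    print("  " * depth + title)
```

Note that the second chapter omits the section level entirely, and the walk still works, because each level is simply made up of whatever smaller parts it contains.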

20.2.2.2. Specialized Pieces

Certain kinds of documents have structure elements that have specific semantics: Each chapter might have an introduction or a summary. There might be other special-purpose sections, such as in this book’s Chapter 12, where each section that describes one or more specific datatypes has a subsection listing the facets available for derivations, a subsection telling about the datatypes derived from the subject datatypes and the datatypes from which they are derived, and a subsection that suggests when another datatype might be more appropriate. If many documents—or even chapters or sections in a document—have the same structure, you can ensure that nothing is overlooked by creating special element types for those specific-topic subsections and creating content models that require them.
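In a schema, a content model that requires the special subsections does this checking for you. The sketch below imitates that effect by hand so the idea is visible; it is not schema validation, and the element type names (datatype-section, facets, derivations, alternatives) are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical element type names for the three required subsections.
REQUIRED = ["facets", "derivations", "alternatives"]

SAMPLE = """
<chapter>
  <datatype-section name="string">
    <facets/><derivations/><alternatives/>
  </datatype-section>
  <datatype-section name="decimal">
    <facets/><alternatives/>
  </datatype-section>
</chapter>
"""

def missing_subsections(section):
    """List the required subsection types that do not appear as children."""
    present = {child.tag for child in section}
    return [name for name in REQUIRED if name not in present]

# Map each section's name to the required subsections it lacks.
report = {
    section.get("name"): missing_subsections(section)
    for section in ET.fromstring(SAMPLE).iter("datatype-section")
}
print(report)
```

Here the second section would be reported as lacking its derivations subsection, which is exactly the kind of oversight a requiring content model is meant to catch.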

20.2.2.3. Paragraphs and Things That Break Them

Paragraphs typically include running text and things that break running text. Running text includes phrases that need emphasis in one form or another: foreign phrases, titles, new terms being introduced, short quotations, or words being used with weird, unusual, or special meanings—let your imagination run! Then there are the subject-matter-dependent, specialized words and phrases. For example, some parts of this book were originally authored in XML: We separately tracked element type names, attribute type names, XML samples, names of facets, and a number of other XML- and Schema-specific phrases. When we began authoring, we didn’t know what display format would be used for all of them. And we didn’t have to. (That is a big advantage of XML-based authoring.)
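The advantage of deferring display decisions can be sketched as follows. The inline element names and the style table are hypothetical; the point is only that the paragraph records what each phrase is, while a separate and replaceable mapping decides how it looks.

```python
import xml.etree.ElementTree as ET

# Hypothetical inline element names; the mapping to display styles is a
# separate decision that can be made (and changed) after authoring.
STYLE = {"elemname": "`{}`", "newterm": "*{}*", "foreign": "_{}_"}

PARA = ('<para>The <elemname>title</elemname> element holds a '
        '<newterm>heading</newterm>, which is <foreign>de rigueur</foreign> '
        'in most books.</para>')

def render(para, style):
    """Flatten mixed content to text, wrapping each inline element per its style.

    Handles one level of inline elements only; nested inlines are not covered.
    """
    parts = [para.text or ""]
    for child in para:
        template = style.get(child.tag, "{}")
        parts.append(template.format(child.text or ""))
        parts.append(child.tail or "")
    return "".join(parts)

styled = render(ET.fromstring(PARA), STYLE)
plain = render(ET.fromstring(PARA), {})
print(styled)
print(plain)
```

Changing the display format means editing the style table, not the authored paragraph.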

What breaks running text? Displayed formulas, long quotations, lists, small tables, code listings, and so on. (You can have fun debating whether lists and tables occur within paragraphs or between paragraphs. But it is particularly easy to make up examples where, say, an equation should be displayed for emphasis but is grammatically part of a sentence that extends both fore and aft of the equation. If the equation is part of the sentence, it must be part of the paragraph, because sentences do not span paragraph boundaries.)

Some things clearly do not belong to paragraphs. The Notes, Tips, and Warnings you find in this book clearly occur between paragraphs (and occasionally “float”). Other things, usually for appearance reasons, regularly float when published; figures, large tables, long listings, footnotes or endnotes, and “sidebars” (separate mini-articles closely attached to a primary article in a magazine) are all examples. They are usually printed near where they are first referenced but may be moved to the top of the page, or even to another page. Such things are referenced by number or title in the text, and provision for a number or title must be made in the content model of corresponding element types.

20.2.2.4. Specialized and Non-obvious Structures

Accommodating automatic table-of-contents generation is quite easy when every chapter, section, and subsection is marked and its title identified. Automatic indexing requires additional markup and a specialized processor to generate the index, as does automatic generation of cross-references (identifying chapter and section numbers and/or titles and then automatically generating copies where the reference occurs). This is especially useful when titles and sequence of chapters and sections have not been locked in when the reference was written. You can use the capabilities of ID/IDREF or the more flexible capabilities of schema-based identity constraints and a special post-processor that can generate the copies automatically.
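A post-processor of the kind described might look roughly like the sketch below. It assumes an ID/IDREF-style pairing in which each chapter carries an id attribute and each xref element names its target in a ref attribute; those names, and the wording of the generated text, are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: chapters carry an id attribute, and xref elements
# point at one through a ref attribute (an ID/IDREF-style pairing).
SAMPLE = """
<book>
  <chapter id="ch-intro"><title>Introduction</title>
    <para>Details are in <xref ref="ch-analysis"/>.</para>
  </chapter>
  <chapter id="ch-analysis"><title>Document Analysis</title>
    <para>See also <xref ref="ch-intro"/>.</para>
  </chapter>
</book>
"""

def resolve_xrefs(root):
    """Fill each xref with generated text built from the target's number and title."""
    labels = {}
    for number, chapter in enumerate(root.iter("chapter"), start=1):
        title = chapter.findtext("title")
        labels[chapter.get("id")] = 'Chapter {}, "{}"'.format(number, title)
    for xref in root.iter("xref"):
        # A dangling reference raises KeyError here, loosely mirroring
        # the validity error an unmatched IDREF would produce.
        xref.text = labels[xref.get("ref")]
    return root

book = resolve_xrefs(ET.fromstring(SAMPLE))
generated = [xref.text for xref in book.iter("xref")]
print(generated)
```

Because the numbers and titles are looked up when the processor runs, reordering or retitling chapters requires no change to the references themselves.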

Recognizing logical structure, rather than just physical appearance, can be tricky. Some years ago, the U.S. Department of Defense CALS documentation included a table that actually had eight entries for each thing that it had data about: Each thing had eight “properties” whose values were given in the table, but the physical display in the book used a three-column table, with five pieces of data listed in each cell in the last column. If you look, you can find even more interesting logical-structure-versus-display examples. Find a copy of the Alaska Fish and Game Regulations—or just get the owner’s manual for your car and try to figure out the logical structure (not just the display structure). With many car brands, the latter is an excellent exercise. Catching all these unexpected structures in your documents is one reason it pays to have an experienced document analyst help with your document analysis.
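The distinction between logical structure and display structure can be illustrated with a small sketch. The record below is invented for the purpose and is not the CALS table; it shows a logical structure in which every property is a separate item, and a display routine that packs the trailing properties into the last cell of a three-column row.

```python
# Hypothetical record; the property names are invented for illustration
# and are not taken from the CALS table described above.
part = {
    "number": "A-100",
    "name": "Bracket",
    "material": "steel",
    "finish": "zinc",
    "weight": "0.4 kg",
}

def display_row(record, own_columns=2):
    """Build a three-cell row: the first properties each get a cell,
    and the remaining properties share the last cell."""
    items = list(record.items())
    cells = [value for _, value in items[:own_columns]]
    packed = "; ".join("{}: {}".format(key, value) for key, value in items[own_columns:])
    return cells + [packed]

row = display_row(part)
print(row)
```

Someone analyzing only the printed row would see three cells; the analysis has to recover the five separate properties behind them.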

20.2.3. More Detail

The problem of correctly finding an appropriate description of the structure of a set of books or magazines is complex enough to warrant an entire book. Fortunately, an excellent text, Developing SGML DTDs ^[1], covers the process. Although it was written for SGML, the processes it describes are the same for XML; XML and SGML were cut from the same cloth.

^[1] Maler, Eve and Jeanne El Andaloussi. Developing SGML DTDs: From Text to Model to Markup. Englewood Cliffs, NJ: Prentice Hall, 1995.
