Document analysis

Document analysis is usually relevant only to publishing applications of XML, and not to the development of data exchange applications.

Existing rules

A new XML-based publishing system typically replaces a system built on procedural markup principles, whether an old-fashioned typesetting system or a more modern DTP package. Existing books or documents conform, to some degree, to in-house style guides (whether formally defined, loosely described in notes, or existing simply in the heads of senior editorial staff), and these rules naturally form the foundation of document analysis. The better the style guide, and the more rigorously it was previously applied, the easier it is to define a suitable model (and to convert the legacy documents to XML format).

Human expertise

Studying existing documents reveals much about the required structure. However, an author or editor with extensive experience of the content of these documents should still be involved in the process. When the document collection is vast, only a small proportion of the material can realistically be assessed, so it is very important that the selected material be representative of the whole collection.

Another important principle to adopt is to be realistic about the technology and its capabilities. DTP operators have become accustomed to a degree of artistic freedom that cannot be sustained in a controlled XML environment.

Formatting constraints

XML is usually implemented in order to improve the efficiency of publishing and republishing to different target audiences on a variety of media. Software is used to provide the necessary automation, but programs require predictable input. Utilities that locate, extract, manipulate and present information from XML documents must be given manageable tasks to perform. In particular, the limitations of stylesheets and structure-oriented publishing products must be taken into account.

Creating complex coding structures to deal with document structures that appear infrequently may not be practical. Compromise may be necessary: perhaps these structures can be formatted slightly differently, more simply, without any loss of legibility. One common example of such a problem is a small, vertically aligned fragment, as shown below:

   300
    25.6
and  1.3
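
To see why such a fragment is awkward to automate, consider what a formatter must do to line the figures up on the decimal point. The following is a minimal Python sketch, not taken from any publishing product; the function name align_on_decimal and the (label, number) row layout are assumptions made purely for illustration.

```python
def align_on_decimal(rows):
    """Return text lines in which the decimal points line up.

    rows is a list of (label, number_text) pairs; this layout is a
    hypothetical one chosen for the example.
    """
    # Width of the widest leading label (e.g. "and") and integer part.
    label_width = max(len(label) for label, _ in rows)
    int_width = max(len(number.split(".")[0]) for _, number in rows)

    lines = []
    for label, number in rows:
        integer, _, fraction = number.partition(".")
        # Pad the label on the right and the integer part on the left.
        text = label.ljust(label_width) + " " + integer.rjust(int_width)
        if fraction:
            text += "." + fraction
        lines.append(text)
    return lines


lines = align_on_decimal([("", "300"), ("", "25.6"), ("and", "1.3")])
print("\n".join(lines))
```

Even this small case needs two passes over the data and knowledge of every row before any row can be placed, which is the kind of special handling that a simpler presentation would avoid.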

Relevant features

For every feature identified in existing documents, the following set of questions may be asked:

  • can it be given a name?

  • does it always appear?

  • may there be more than one?

  • must it always appear before (or after) some other feature?

  • does it deconstruct into smaller objects (to which these same questions apply)?

  • is some of the textual content always the same (if so, it could be generated automatically)?

The answers to these questions form the basis of a document specification. Every object in the document is given a descriptive name, and is assigned rules governing where and how often it may appear, and what it may contain.
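
As a rough illustration of how such answers can become rules, the sketch below records two of the answers for each feature (does it always appear, and may there be more than one) and turns them into a DTD-style element declaration. The use of DTD syntax, the Feature class, the function names and the chapter example are all assumptions introduced here, not part of any particular specification method. The sketch also assumes the features must appear in the order listed.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    """Answers recorded for one feature found in the documents (hypothetical)."""
    name: str             # can it be given a name?
    always_appears: bool  # does it always appear?
    may_repeat: bool      # may there be more than one?


def occurrence_indicator(feature):
    """Map the two yes/no answers to a DTD occurrence indicator."""
    if feature.always_appears:
        return "+" if feature.may_repeat else ""   # one or more / exactly one
    return "*" if feature.may_repeat else "?"      # zero or more / optional


def element_declaration(parent, children):
    """Build a declaration whose children appear in the order given."""
    model = ", ".join(c.name + occurrence_indicator(c) for c in children)
    return f"<!ELEMENT {parent} ({model})>"


# Hypothetical analysis results for a chapter.
chapter_features = [
    Feature("title", always_appears=True, may_repeat=False),
    Feature("subtitle", always_appears=False, may_repeat=False),
    Feature("para", always_appears=True, may_repeat=True),
    Feature("note", always_appears=False, may_repeat=True),
]

declaration = element_declaration("chapter", chapter_features)
print(declaration)
```

Each feature that deconstructs into smaller objects would receive its own declaration in the same way, with the same questions applied to its parts.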
