Line ending interpretation

Separate lines of text are often found in XML documents, and indeed in other text-based document formats. Text may be broken into lines either for convenience or to signify and isolate important sub-units of information. Either way, the presence of multiple text lines can cause complications.

Line-end codes

The points at which line-end codes appear may have been carefully chosen to avoid corrupting the text. It is possible to interpret line-end codes in three ways. The line-end code can:

  • be retained, and used to force a line-break when presented

  • be removed

  • be replaced by a space.

These interpretations can be illustrated with three examples:

<software>10 PRINT "Hello World"[CR]
20 GOTO 10.</software>


<geneSequence>TCTCGATTACACCGC[CR]
TAATCGCGATTACAC</geneSequence>


<para>This is a normal[CR]
paragraph.</para>

These examples each require a different interpretation, giving the following output:


       10 PRINT "Hello World"
       20 GOTO 10.

       TCTCGATTACACCGCTAATCGCGATTACAC

       This is a normal paragraph.

Each application of XML requires a clear policy on this issue. In most cases, line-end codes are deemed to stand in for spaces, as this is standard practice in publishing, the application area in which XML has its historical roots.
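
The three interpretations can be sketched as a per-element policy table. This is only an illustration, not a standard mechanism: the policy names ('retain', 'remove', 'space'), the LINE_END_POLICY table, the apply_line_end_policy function and the wrapping <doc> element are all hypothetical, and the default of 'space' simply mirrors the common case described above.

```python
import xml.etree.ElementTree as ET

# Hypothetical policy table: element names are taken from the examples in
# the text; the policy names are invented for this sketch.
LINE_END_POLICY = {
    "software": "retain",      # keep the line-end code as a line break
    "geneSequence": "remove",  # delete the line-end code
    "para": "space",           # treat the line-end code as a space
}


def apply_line_end_policy(element_name, text, default="space"):
    """Return text with line-end codes handled per the element's policy."""
    # Normalise CR LF and bare CR to a single LF before applying the policy.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    policy = LINE_END_POLICY.get(element_name, default)
    if policy == "retain":
        return text
    if policy == "remove":
        return text.replace("\n", "")
    if policy == "space":
        return text.replace("\n", " ")
    raise ValueError(f"unknown line-end policy: {policy}")


# The <doc> wrapper is an assumption, added only so the fragment parses.
doc = ET.fromstring(
    "<doc>"
    "<software>10 PRINT \"Hello World\"\n20 GOTO 10.</software>"
    "<geneSequence>TCTCGATTACACCGC\nTAATCGCGATTACAC</geneSequence>"
    "<para>This is a normal\nparagraph.</para>"
    "</doc>"
)

rendered = {child.tag: apply_line_end_policy(child.tag, child.text) for child in doc}
for tag, text in rendered.items():
    print(f"{tag}: {text!r}")
```

Keeping the policy in a table rather than in the processing code means that a new element type needs only a new entry, which suits the point that each application must decide its own policy.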

A further issue is the identification of line-end codes that do not belong to the document text at all (discussed below).

Hyphenation

To complicate matters further, many documents are converted into SGML or XML format directly from previously typed or published material (possibly using OCR or ICR technology). This material often contains hyphens at the end of lines, where the author or publishing software has chosen to split a word so as to better balance the text over lines:

This paragraph is too long to comfor-[CR]
tably fit on one line of text.

In this case, an application may be intelligent enough to simply remove the line-end code (not replace it with a space), and also remove the hyphen:


       This paragraph is too long to comfortably fit on one line of text.

But it must be careful not to remove the hyphen from a double-barrelled word, where doing so produces an error such as 'lineend':

The hyphen must not be removed from 'line-[CR]
end' when re-formatting.


       The hyphen must not be removed from 'lineend' when re-formatting.
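
The difficulty can be reproduced with a deliberately naive rule. The following is a minimal sketch, assuming that a line-end code is a line feed (optionally preceded by a carriage return) and that any indentation on the following line should be discarded; the function name naive_dehyphenate is hypothetical.

```python
import re


def naive_dehyphenate(text):
    """Remove every hyphen that sits directly before a line-end code."""
    # Drops the hyphen, the line end, and any indentation on the next line.
    return re.sub(r"-\r?\n[ \t]*", "", text)


split_word = "This paragraph is too long to comfor-\ntably fit on one line of text."
compound = "The hyphen must not be removed from 'line-\nend' when re-formatting."

print(naive_dehyphenate(split_word))
# This paragraph is too long to comfortably fit on one line of text.
print(naive_dehyphenate(compound))
# The hyphen must not be removed from 'lineend' when re-formatting.
```

The rule repairs the split word but damages the double-barrelled one, because nothing in the text distinguishes the two hyphens.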

Because software is not usually sufficiently intelligent to make this distinction, an alternative strategy is to manually separate the two kinds of hyphen. Extended character sets include a special 'soft hyphen' character (character 173, '&#173;'), which looks the same as a normal hyphen, but is interpreted as one that can be safely removed. The normal hyphen is assumed to be a 'hard' hyphen, which must be retained. Each hyphen that merely splits a word at a line end is manually identified and changed to the soft hyphen character.
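
Once the two kinds of hyphen have been separated by hand, the re-joining step becomes mechanical. The sketch below is one possible approach, not a prescribed algorithm: it assumes the soft hyphen is U+00AD (decimal 173, as in ISO 8859-1 and Unicode), that a hard hyphen at a line end is followed directly by the rest of the word, and that every other line-end code stands in for a space. The function name rejoin_lines is hypothetical.

```python
import re

SOFT_HYPHEN = "\u00ad"  # U+00AD, decimal 173


def rejoin_lines(text):
    """Join lines, dropping soft hyphens and keeping hard hyphens."""
    # Soft hyphen at a line end: the word was split, so drop hyphen and line end.
    text = re.sub(SOFT_HYPHEN + r"\r?\n[ \t]*", "", text)
    # Hard hyphen at a line end: the hyphen belongs to the word, so keep it.
    text = re.sub(r"-\r?\n[ \t]*", "-", text)
    # Any remaining line end is assumed to stand in for a single space.
    return re.sub(r"[ \t]*\r?\n[ \t]*", " ", text)


soft = "This paragraph is too long to comfor\u00ad\ntably fit on one line of text."
hard = "The hyphen must not be removed from 'line-\nend' when re-formatting."

print(rejoin_lines(soft))
# This paragraph is too long to comfortably fit on one line of text.
print(rejoin_lines(hard))
# The hyphen must not be removed from 'line-end' when re-formatting.
```

The order of the substitutions matters: the soft and hard hyphen cases must be handled before the general rule, otherwise a space would be inserted in the middle of the word.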
