Create a Text File from an XML Document

Use this stylesheet to extract only the text from any XML document.

Sometimes you just want to leave the XML behind and keep only the text found in a document. The stylesheet text.xsl, shown in Example 3-15, can do that for you. (There’s an even easier way; see “Built-in Templates,” following.) It can be applied to any XML document, including XHTML.

Example 3-15. text.xsl

<xsl:stylesheet version="1.0" 
            xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
   
<xsl:template match="/">
 <xsl:apply-templates select="*"/>
</xsl:template>
   
</xsl:stylesheet>

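This stylesheet finds the root node and then selects its element children (*) for processing, which in practice means the single document element. Because the stylesheet has no other templates, the built-in templates take over from there and copy the text through. To test, apply this stylesheet to the XHTML document magnacarta.html, which contains the Magna Carta, the pact between King John and the barony in England that was first signed at Runnymede on June 15, 1215 (see http://www.cs.indiana.edu/statecraft/magna-carta.html):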

xalan magnacarta.html text.xsl

A small portion of the output is shown in Example 3-16. The source document, as displayed in IE, is shown in Figure 3-18.

Example 3-16. A portion of the Magna Carta

Magna Carta
   
The Magna Carta
JOHN, by the grace of God King of England, Lord of Ireland, 
Duke of Normandy and Aquitaine, and Count of Anjou, to his 
archbishops, bishops, abbots, earls, barons, justices, 
foresters, sheriffs, stewards, servants, and to all his 
officials and loyal subjects, Greeting.
   
KNOW THAT BEFORE GOD, for the health of our soul and those of 
our ancestors and heirs, to the honour of God, the exaltation 
of the holy Church, and the better ordering of our kingdom, at 
the advice of our reverend fathers Stephen, archbishop of 
Canterbury, primate of all England, and cardinal of the holy 
Roman Church, Henry archbishop of Dublin, William bishop of 
London, Peter bishop of Winchester, Jocelin bishop of Bath and 
Glastonbury, Hugh bishop of Lincoln, Walter Bishop of Worcester, 
William bishop of Coventry, Benedict bishop of Rochester, Master 
Pandulf subdeacon and member of the papal household, Brother 
Aymeric master of the knighthood of the Temple in England, 
William Marshal earl of Pembroke, William earl of Salisbury, 
William earl of Warren, William earl of Arundel, Alan de 
Galloway constable of Scotland, Warin Fitz Gerald, Peter Fitz 
Herbert, Hubert de Burgh seneschal of Poitou, Hugh de Neville, 
Matthew Fitz Herbert, Thomas Basset, Alan Basset, Philip Daubeny, 
Robert de Roppeley, John Marshal, John Fitz Hugh, and other loyal 
subjects:

Figure 3-18. The Magna Carta (magnacarta.html) in IE
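
The output in Example 3-16 starts with the line "Magna Carta" ahead of the heading "The Magna Carta"; presumably the first of these comes from the title in the document's head. As a minimal sketch of how text.xsl could be extended to drop it, the variant below adds one empty template. The added template is a hypothetical illustration, not part of the original stylesheet, and it matches on the local name so that it works whether or not magnacarta.html declares the XHTML namespace:

```xml
<xsl:stylesheet version="1.0"
            xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>

<!-- Unchanged from text.xsl: start at the root and process its element children -->
<xsl:template match="/">
 <xsl:apply-templates select="*"/>
</xsl:template>

<!-- Hypothetical addition: an empty template produces no output for
     the head element, so its title text never reaches the result -->
<xsl:template match="*[local-name()='head']"/>

</xsl:stylesheet>
```

An empty template works here because an explicit template overrides the built-in one for the nodes it matches, and since it contains no xsl:apply-templates, the children of head are never visited.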

Built-in Templates

You can also extract text from a document just by relying on XSLT’s built-in templates. A stylesheet as simple as this single line:

<xsl:stylesheet version="1.0" 
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"/>

will invoke the built-in templates because there is no explicit template for any node that might be found in the source document. The built-in templates process all the children of the root and of every element, and copy text through for attribute and text nodes (the built-in templates do nothing for comment, processing-instruction, or namespace nodes). The benefit of using text.xsl over the built-in templates alone is that text.xsl gives you a framework for exercising some control over the output (e.g., by adding templates). However, templates added to text.xsl make a difference only if they explicitly match nodes in the source document, and so override the built-in template that matches *. An empty stylesheet is the simplest one to start from if you want to add more precise templates.
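
To make the behavior of the built-in templates more concrete, they can be written out as explicit templates. The following is a sketch based on the built-in template rules given in the XSLT 1.0 specification, not a listing from this hack, and it leaves out the mode-specific variants of the element rule. Applied to a document, it should produce the same result as the empty stylesheet:

```xml
<xsl:stylesheet version="1.0"
            xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<!-- Root node and elements: keep going by processing all child nodes -->
<xsl:template match="*|/">
 <xsl:apply-templates/>
</xsl:template>

<!-- Text and attribute nodes: copy the string value to the result -->
<xsl:template match="text()|@*">
 <xsl:value-of select="."/>
</xsl:template>

<!-- Comments and processing instructions: produce nothing -->
<xsl:template match="processing-instruction()|comment()"/>

</xsl:stylesheet>
```

Note that xsl:apply-templates without a select attribute visits only child nodes, and attributes are not children, so attribute values do not show up in the output unless a template selects them explicitly. Also, because neither this sketch nor the empty stylesheet declares the text output method, a processor will typically serialize the result as XML and may write an XML declaration ahead of the text; adding the xsl:output element from text.xsl avoids that.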
