A.3. The Harvard-Kyoto Classics Project with Vedic Literature

The academic work done with XML is varied and still in its exploratory stages. An international collaboration between Dr. Michael Witzel of Harvard University and colleagues at Kyoto University, Japan, led by H. Nakani and M. Tokunaga, is working to reconstruct an entire catalogue of the classics of Asian, East Asian, and South Asian sacred literature under the title “Towards a Reconstitution of Classical Studies.” The work will be done in XML, with unprecedented complexity and richness of links, and will draw on other aspects of XML technology as well.

A multitude of ancient texts have been entered into computer formats over the past decades, beginning with the valiant and noble efforts of Lehman and Ananthanarayanan in 1971 on the Rig Veda and Shatapatha Brahmana, two ancient texts of India dating as far back as pre-1800 b.c.e. Formats were a problem: even when markup was used, differing DTDs and tag names were employed, with some texts in SGML/TEI (Text Encoding Initiative) tags, some in plain text, and some in HTML. In addition, there are multiple versions of the texts, particularly of important ones like the Rig Veda (RV). In this applied example, we will see that XML and XSLT can still be used readily, even when the actual content is not known, as long as the logical structure of the tags is.

In Example A-8a, we see one such case, in which the version by Lubotsky, designed to remove changes in spelling due to sound combinations of words, is combined with a version maintained in the TITUS project at Frankfurt University by Jost Gippert (http://titus.uni-frankfurt.de/texte/texte.htm). These versions have been woven together with TEI <div> tags, nonconforming IDs (remember, an ID must begin with a letter, an underscore, or a colon, never a digit), and HTML tags.
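To make the ID rule concrete, here is a minimal sketch of a one-element document with an internal DTD subset that declares id as type ID. The element and attribute declarations are illustrative assumptions, not the project's actual DTD.

```xml
<?xml version="1.0"?>
<!DOCTYPE verse [
  <!-- Hypothetical declarations, for illustration only -->
  <!ELEMENT verse (#PCDATA)>
  <!ATTLIST verse id ID #REQUIRED>
]>
<!-- "rv1.1.1" starts with a letter, so it is a legal value for an ID attribute. -->
<!-- A value such as "1.1.1" starts with a digit; a validating parser would reject it. -->
<verse id="rv1.1.1">agni;m ILe puro;hitam</verse>
```

A non-validating parser accepts either form, which is why the numeric IDs in the source file go unnoticed until validation is attempted.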

In this project, we wanted to separate the “L” (Lubotsky) version from the “T” (TITUS) version, for some high-precision searching in pure XML, with no HTML tags, but with proper IDs (which could be validated if needed) and tag names that reflected the actual common naming among scholars (like paada for each little part of a verse, also called a mantra). Notice that if you open the resulting file in a browser, it looks like

Example A-8a. XML input from the Rig Veda.
<?xml version="1.0"?>

<html>
<body bgcolor="#ffffff">

<div class="Rgveda">

<hr size="8" />
<br />
<font size="5"><b>Mandala I</b></font>
<br />
<hr width="200" />
<br />

<div1 class="maNDala" id="1">
<dl>
<div2 class="hymn" id="1.1">

<div3 class="verse" id="1.1.1">
<a name="1.1.1"></a>
<dt>
1.1.1
</dt>

<dd>
<ol class="mantra" type="a">
      <li class="T">
agni;m ILe puro;hitaM yajJa;sya deva;m Rtvi;jam /
<ul>
<li class="L">
agni;m ILe puro;hitam
</li>
<li class="L">
yajJa;sya deva;m Rtvi;jam /
</li>
</ul>
</li>
</ol>
<ol type="a" start="3" class="mantra">
<li class="T">
ho;tAraM ratnadhA;tamam //
<ul>
<li class="L">
ho;tAram ratnadhA;tamam //
</li>
</ul>
</li>
</ol>
</dd>
</div3>

<div3 class="verse" id="1.1.2">
<a name="1.1.2"></a>
<dt>
1.1.2
</dt>

<dd>
<ol class="mantra" type="a">
      <li class="T">
agni;H pU;rvebhir R;Sibhir I;Dyo nU;tanair uta; /
<ul>
<li class="L">
agni;H pU;rvebhiH R;SibhiH
</li>
<li class="L">
I;DyaH nU;tanaiH uta; /
</li>
</ul>
</li>
</ol>
<ol type="a" start="3" class="mantra">
<li class="T">
sa; devA;M; e;ha; vakSati //
<ul>
<li class="L">
sa; devA;n A; iha; vakSati //
</li>
</ul>
</li>
</ol>
</dd>
</div3>
</div2>
</dl>
</div1>
</div>
</body>
</html>

Figure A-1, with definition lists (<dl>) used to format the verse numbers and so forth.

Figure A-1. Browser view of the hybrid XML-HTML Rig Veda.


This HTML formatting is not necessary for raw processing in XML, where we wanted to add tags for specific rhythms of meter, the deities addressed in each hymn, the author, and so forth, and to search individual word strings. That work did not require the HTML format, the extra <dl>, <dt>, and <dd> tags, or the “T” version. A simple series of XPath matches gives us the “L” versions in the output by selecting them with xsl:apply-templates. We can just as easily remove the “T,” or “TITUS,” versions, because their text sits directly in the list items of the ordered lists (<ol>), while the “L” versions sit in unordered lists (<ul>) nested inside them. Any node that no template copies and no xsl:apply-templates selects is simply left out of the output, so when the template for <dd> selects only its <ul> descendants, the <ol> wrappers and the “T” text are removed. We are still preserving the TEI <div> tag structure, however. The first template matches on the root and processes all children with <xsl:apply-templates>, as shown in Example A-8b.

Example A-8b. Stylesheet to create HTML.
<?xml version="1.0"?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="xml" indent="yes" omit-xml-declaration="no"/>
<xsl:strip-space elements="*"/>
<!-- 1 -->
  <xsl:template match="/">
    <xsl:apply-templates />
  </xsl:template>
<!-- 2 -->
  <xsl:template match="li">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:apply-templates select="text()"/>
    </xsl:copy>
  </xsl:template>
<!-- 3 -->
  <xsl:template match="div3">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:apply-templates select="dt"/>
      <xsl:apply-templates select="dd"/>
    </xsl:copy>
  </xsl:template>
  <!-- 4 -->
  <xsl:template match="dd">
    <xsl:copy>
      <xsl:apply-templates select=".//ul"/>
    </xsl:copy>
  </xsl:template>
  <!-- 5 -->
  <xsl:template match="dl">
    <xsl:copy>
      <xsl:apply-templates select=".//div2"/>
    </xsl:copy>
  </xsl:template>

The second template matches on <li>, or list items, and copies them with <xsl:copy>. The use of <xsl:apply-templates> preserves the attributes and actual text nodes.

<!-- 6 -->
  <xsl:template match="div">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:apply-templates select=".//div1"/>
    </xsl:copy>
  </xsl:template>
  <!-- 7 -->
  <xsl:template match="*|@*|text()">
    <xsl:copy>
      <xsl:apply-templates select="*|@*|text()"/>
    </xsl:copy>
  </xsl:template>
</xsl:transform>

The <div3> element is matched in the third template and copied, and its definition list children, <dt> and <dd>, are processed along with the attributes.

Those <dd> elements are matched and copied in the fourth template, which then processes only their <ul> descendants. This preserves the “L” versions, which are contained in the <ul> tags, and sends them to the result tree.

With the fifth template, the main definition list (<dl>) is matched and preserved with <xsl:copy>, and all its <div2> descendants are processed with <xsl:apply-templates>. This maintains the basic TEI tag format (though it is not a valid TEI document in this use).

The sixth template matches the outer <div>, copies it along with its attributes, and processes its <div1> descendants (we'll use these attributes for IDs later).

Finally, the seventh template copies any remaining elements, attributes, and text nodes to the output.
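The same cleanup can also be sketched with an identity rule plus a few overrides, which shows the empty-template idiom directly. This is an alternative arrangement, not the stylesheet used in the project; for the sample in Example A-8a it should yield the same Lubotsky-only structure, but it has not been checked against the full text.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:strip-space elements="*"/>

  <!-- Identity rule: copy every node and attribute unless a rule below overrides it -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Unwrap each <ol>: skip its "T" text and continue only with the nested <ul> -->
  <xsl:template match="ol">
    <xsl:apply-templates select=".//ul"/>
  </xsl:template>

  <!-- Empty template: these presentational elements produce no output at all -->
  <xsl:template match="hr | br | font | a"/>
</xsl:stylesheet>
```

The named element rules take precedence over the identity rule because a plain element name has a higher default priority than node().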

The result is a Lubotsky-only version of the sample file that looks like Example A-8c. This is well-formed XML, which will also display in a browser.

Next, we want to remove all the HTML tags: <html>, <body>, <dl>, and so forth. Further, we want to change the abstract TEI <div> tags to the terms scholars use for the levels of division in the Rig Veda (book for div1, hymn for div2, and verse for div3) and label the individual unordered list segments (<ul>) with the more common term paada, meaning foot. In Sanskrit and Vedic, a foot of divine meter is considered a footstep of the gods in a sense, so the term applies. If the word seems familiar, the Western podiatrist, who treats feet, derives from the same root word.

Example A-8c. Resulting HTML file after "cleaning" input tags.
<?xml version="1.0" encoding="utf-8"?>
<html>
<body bgcolor="#ffffff">
<div class="Rgveda">
<div1 class="maNDala" id="1">
<dl>
<div2 class="hymn" id="1.1">
<div3 class="verse" id="1.1.1">
<dt>
1.1.1
</dt>
<dd>
<ul>
<li class="L">
agni;m ILe puro;hitam
</li>
<li class="L">
yajJa;sya deva;m Rtvi;jam /
</li>
</ul>
<ul>
<li class="L">
ho;tAram ratnadhA;tamam //
</li>
</ul>
</dd>
</div3>
<div3 class="verse" id="1.1.2">
<dt>
1.1.2
</dt>
<dd>
<ul>
<li class="L">
agni;H pU;rvebhiH R;SibhiH
</li>
<li class="L">
I;DyaH nU;tanaiH uta; /
</li>
</ul>
<ul>
<li class="L">
sa; devA;n A; iha; vakSati //
</li>
</ul>
</dd>
</div3>
</div2>
</dl>
</div1>
</div>
</body>
</html>

We also need to recreate the id attributes with an alphabetic prefix of rv at each level, using <xsl:attribute> and <xsl:text> to prepend rv to the <xsl:value-of> of the existing id attribute. In each case, literal result elements (LREs) named book, hymn, and verse replace div1, div2, and div3, removing the more abstract TEI element-type names for scholars unfamiliar with the otherwise versatile academic DTD. The <xsl:apply-templates> element processes the children of each matched element to the result tree. See Example A-9a.

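Now, we want to remove the <dt> elements that remain. We do this with an empty <xsl:template> that matches on them and puts nothing in their place. Following that, the <paada> LRE replaces the remaining HTML <ul> tags for the individual verse portions. The other HTML elements (<html>, <body>, <div>, <dl>, <dd>, and <li>) have no template of their own, so XSLT's built-in rules drop their tags while still passing their text through.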

The resulting output is shown in Example A-9b, now nearly ready for detailed research, with only the basic tags, so more complex XSLT stylesheets can be added (you can see other such stylesheets in a prior publication by one of the authors at http://www1.shore.net/~india/ejvs, http://www.asiatica.org/publications/ijts/default.asp, and http://nautilus.shore.net/~india/ejvs/ejvs0601/ejvs0601.html). This new stripped-down version is a much smaller and correspondingly speedier file to use. In case you're wondering at this point, agni is the Vedic word for fire, and this hymn is a famous praise of fire in the rituals. The first line says, “Agni I call upon, the priest.” The fire was considered a priest because, as the smoke rose to the sky, it “carried” the message of the ritual to the deities (see http://vedavid.org/diss/ for more).

Example A-9a. XHTML-to-XML conversion and calculation of id attributes with XSLT.
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                 version="1.0"
                 >
<xsl:output method="xml" indent="yes"/>
<xsl:template match="div1">
<book>
      <xsl:attribute name="id">
            <xsl:text>rv</xsl:text><xsl:value-of select="@id" />
      </xsl:attribute>
<xsl:apply-templates />
</book>
</xsl:template>
<xsl:template match="div2">
<hymn>
      <xsl:attribute name="id">
            <xsl:text>rv</xsl:text><xsl:value-of select="@id" />
      </xsl:attribute>
<xsl:apply-templates />
</hymn>
</xsl:template>
<xsl:template match="div3">
<verse>
      <xsl:attribute name="id">
            <xsl:text>rv</xsl:text><xsl:value-of select="@id" />
      </xsl:attribute>
<xsl:apply-templates select="*" />
</verse>
</xsl:template>
<xsl:template match="dt" />
<xsl:template match="ul">
<paada>
            <xsl:apply-templates />
</paada>
</xsl:template>
</xsl:stylesheet>
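As a more compact variation on Example A-9a, the same id prefixing can be sketched with attribute value templates (AVTs), where an expression in braces inside a literal attribute is evaluated as XPath. This is an equivalent rewrite offered for comparison, not the form used in the project.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- "rv{@id}" is an attribute value template: {@id} is replaced by the old id -->
  <xsl:template match="div1">
    <book id="rv{@id}"><xsl:apply-templates/></book>
  </xsl:template>

  <xsl:template match="div2">
    <hymn id="rv{@id}"><xsl:apply-templates/></hymn>
  </xsl:template>

  <xsl:template match="div3">
    <verse id="rv{@id}"><xsl:apply-templates select="*"/></verse>
  </xsl:template>

  <!-- Drop the verse-number labels -->
  <xsl:template match="dt"/>

  <!-- Rename each remaining <ul> to <paada> -->
  <xsl:template match="ul">
    <paada><xsl:apply-templates/></paada>
  </xsl:template>
</xsl:stylesheet>
```

The <xsl:attribute> form remains useful when the attribute value has to be computed by instructions, which an AVT cannot contain.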

Example A-9b. Resulting XML file.
<?xml version="1.0"?>
<book id="rv1">

<hymn id="rv1.1">
<verse id="rv1.1.1">
<paada>

agni;m ILe puro;hitam


yajJa;sya deva;m Rtvi;jam /

</paada>
<paada>

ho;tAram ratnadhA;tamam //


</paada>
</verse>
<verse id="rv1.1.2">
<paada>

agni;H pU;rvebhiH R;SibhiH


I;DyaH nU;tanaiH uta; /

</paada>
<paada>

sa; devA;n A; iha; vakSati //

</paada>
</verse>
</hymn>

</book>

Now, the only other thing needed to make this a more usable text is to mark the individual <paada> elements in more detail. In Vedic parlance, the groups of syllables forming a mantra are sequenced with a, b, c, d, and so on. These verses have sections a through d (some go up to g and h), and only every other one is marked, for example, a and c. To aid in identification for our new workhorse text of the Rig Veda, we're going to add id attributes to the <paada>s and format them with <xsl:number>. Example A-10a presents the stylesheet, with comments. As usual, the first template match on the root assures processing of the entire input XML document instance.

Example A-10a. Using XSLT to enhance data identification in XML: basic source copying.
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0"
                >
<xsl:output method="xml" indent="yes" />
<xsl:template match="/">
             <xsl:apply-templates />
</xsl:template>
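<!-- Copy each paada and give it an id built from the verse id plus a letter -->
<xsl:template match="paada">
<xsl:copy>
<xsl:attribute name="id">
      <!-- Base of the id: the id attribute of the parent verse -->
      <xsl:value-of select="../@id" />
      <!-- Letter suffix: positions 3, 5, ... give 1, 3, ..., formatted as a, c, ... -->
      <xsl:number format="a" value="position() - 2"
             letter-value="alphabetic" />
</xsl:attribute>
             <xsl:apply-templates />
</xsl:copy>
</xsl:template>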
<xsl:template match="*|@*|text()">
    <xsl:copy>
      <xsl:apply-templates select="*|@*|text()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>

We begin by matching on <paada>, which is copied with <xsl:copy>. Then, the <xsl:attribute> instruction adds an attribute named id. To get the proper verse id as the base of the id for each <paada>, we select its value (we could also use AVTs ({}) here, see Chapter 6, Section 6.6.1). The simple path that gets the <xsl:value-of> of the id attribute of the parent (..) furnishes this base. Next, we want to calculate sub-identifiers a, c, e, and so on for each <paada>, based on its position. The <xsl:number> instruction element allows us to format the number as a letter. The value is the current position minus two: the verse's id attribute and a whitespace text node come before the first <paada> in the list of nodes being processed, so the first <paada> is at position 3 and the second at position 5, which gives 1 (a) and 3 (c). This relies on the whitespace between elements being kept, so the stylesheet does not use <xsl:strip-space>. The children are processed to the output XML document instance with <xsl:apply-templates>.

The last template assures output of any unmatched elements, attributes, and text nodes.
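Because the position() arithmetic depends on how many attribute and whitespace nodes surround each <paada>, a sketch that counts preceding <paada> siblings instead may be more robust. It assumes, as in this sample, that each <paada> element stands for a step of two letters (a, c, e, and so on); that assumption is ours and would need checking against verses with a different layout.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="paada">
    <xsl:copy>
      <xsl:attribute name="id">
        <!-- Base of the id: the id of the parent verse -->
        <xsl:value-of select="../@id"/>
        <!-- 0 preceding paadas gives 1 (a), 1 gives 3 (c), 2 gives 5 (e) -->
        <xsl:number value="2 * count(preceding-sibling::paada) + 1" format="a"/>
      </xsl:attribute>
      <xsl:apply-templates/>
    </xsl:copy>
  </xsl:template>

  <!-- Identity rule: copy everything else unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

This version gives the same letters whether or not whitespace text nodes are stripped from the input.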

The resulting output XML document instance, ready for detailed book, hymn, verse, and now paada identification, is shown in Example A-10b.

Just to take this one step further, let's use XSLT to search this new text we've created. We can create a simple template to do this. With the standard <xsl:output> in place and whitespace removed with <xsl:strip-space>, we can match on the path to a <paada> (you could imagine replacing “verse” with an author, for instance, to get all <paada>s by that author) to start the template. Then, a simple <xsl:if> test with the contains() function checks whether the <paada> contains ILe. When it does, <xsl:copy-of> copies its ancestor <verse>, so we get the entire verse context for our match, as shown in Example A-11a.

Example A-11b is the output of the search. It is important to remember, however, that XSLT is not a proper query language, nor was it intended to be one. It works well for many query-like functions, but, as has been said, when all you have is a hammer, everything looks like a nail. At a certain point, querying with XSLT and XPath is going to run into intractable limits, including processor power. The reader might notice that, in effect, we're using XPath with XSLT here to “query” in a database sense. Future evolving standards from the W3C will weave a query language in XML, XQL, together with these standards. Until then, these kinds of content-based selections from a large resource are still quite efficient, depending on how much detail there is in the tagging of your source.

Example A-10b. Resulting XML document instance.
<book id="rv1">
<hymn id="rv1.1">
<verse id="rv1.1.1">
<paada id="rv1.1.1a">
agni;m ILe puro;hitam
yajJa;sya deva;m Rtvi;jam /
</paada>
<paada id="rv1.1.1c">
ho;tAram ratnadhA;tamam //
</paada>
</verse>
<verse id="rv1.1.2">
<paada id="rv1.1.2a">
agni;H pU;rvebhiH R;SibhiH
I;DyaH nU;tanaiH uta; /
</paada>
<paada id="rv1.1.2c">
sa; devA;n A; iha; vakSati //
</paada>
</verse>
</hymn>
</book>

Example A-11a. A simple content-based search query with XSLT.
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                 version="1.0">
<xsl:output method="xml" indent="yes"/>
<xsl:strip-space elements="*" />
<xsl:template match="verse/paada">
   <xsl:if test="contains(., 'ILe')">
          <xsl:copy-of select='ancestor::verse'/>
     </xsl:if>
</xsl:template>
</xsl:stylesheet>

Example A-11b. Resulting XML document instance from XSLT search query.
<?xml version="1.0" encoding="utf-8"?>
<verse id="rv1.1.1">
<paada id="rv1.1.1a">
agni;m ILe puro;hitam
yajJa;sya deva;m Rtvi;jam /
</paada>
<paada id="rv1.1.1c">
ho;tAram ratnadhA;tamam //
</paada>
</verse>
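A slightly more general form of the search in Example A-11a can be sketched by passing the search string as a stylesheet parameter and testing at the <verse> level. The parameter name term and the <results> wrapper element are hypothetical names chosen for this illustration. Testing on <verse> means a verse is copied only once even if several of its <paada>s match, and the wrapper keeps the output well-formed when more than one verse is found.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:strip-space elements="*"/>

  <!-- Hypothetical parameter; the default repeats the search in Example A-11a -->
  <xsl:param name="term" select="'ILe'"/>

  <!-- Wrap all hits in a single root element (name is an assumption) -->
  <xsl:template match="/">
    <results>
      <xsl:apply-templates select="//verse"/>
    </results>
  </xsl:template>

  <!-- Copy a verse when at least one of its paadas contains the term -->
  <xsl:template match="verse">
    <xsl:if test="paada[contains(., $term)]">
      <xsl:copy-of select="."/>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
```

The test sits in <xsl:if> rather than in the match pattern because XSLT 1.0 does not allow variable or parameter references inside match patterns. How the parameter is supplied at run time depends on the processor.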

Remember that, using XSLT, we can add more detailed categories, like who wrote a hymn, its meter, and other information. This makes it possible to further contextualize the search with XPath, such as requesting all <paada>s composed by Agastya, in the jagati meter, dedicated to Agni, containing the word tanuu. This and other plans are in the works from Harvard and Kyoto, including a use of Topic Maps and XLink.
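As a rough sketch of such a contextualized search, suppose each <hymn> carried author, meter, and deity attributes. Those attribute names, their values, and the <results> wrapper are assumptions made for this illustration; the project's actual tagging may well differ.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/">
    <!-- <results> and the hymn attributes below are hypothetical names -->
    <results>
      <!-- Each predicate narrows the hymns; the last step searches their paadas -->
      <xsl:copy-of select="//hymn[@author='Agastya'][@meter='jagati'][@deity='Agni']
                           //paada[contains(., 'tanuu')]"/>
    </results>
  </xsl:template>
</xsl:stylesheet>
```

The search string would have to be written in the same transliteration as the text itself for contains() to find it.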
