Searching in XML with GPath

When using XmlSlurper or XmlParser with Groovy (see the Reading XML using XmlSlurper and Reading XML using XmlParser recipes), the parsed result can be queried using GPath. GPath is a way to navigate nested data structures in Groovy. It is sometimes described as an expression language integrated into Groovy, but GPath has no separate compiler or interpreter: it is simply the way the Groovy language and its core classes are designed to make navigating and modifying data structures concise and easy to read.

GPath is similar in certain ways to XPath (http://www.w3.org/TR/xpath/), which is used for querying XML data. The main differences are that GPath uses dots instead of slashes to navigate the XML hierarchy, and that it can also navigate hierarchies of ordinary objects: Plain Old Java Objects (POJOs) and Plain Old Groovy Objects (POGOs).
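To make the point about plain objects concrete, here is a minimal sketch of GPath-style navigation over POGOs. The Movie and Director classes are hypothetical and exist only for this illustration; the data simply mirrors the movies used later in the recipe.

```groovy
// Hypothetical classes, defined only to illustrate GPath on plain objects
class Director { String name }
class Movie { String title; int year; Director director }

def movies = [
    new Movie(title: 'Groovy Days', year: 1996,
              director: new Director(name: 'Peter Bay')),
    new Movie(title: 'Cool and Groovy', year: 1956,
              director: new Director(name: 'Will Cowan'))
]

// Property access on a list is applied to every element in turn
def directorNames = movies.director.name
assert directorNames == ['Peter Bay', 'Will Cowan']

// Filtering and navigation combine in the same dotted style
def recentTitles = movies.findAll { it.year > 1990 }.title
assert recentTitles == ['Groovy Days']
```

The same dotted navigation is what the rest of this recipe applies to parsed XML.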

The GPath syntax closely resembles E4X (ECMAScript for XML), which is an ECMAScript extension for accessing XML content.

This recipe will show how to parse an XML document and query its values and attributes through GPath.

Getting ready

As usual, we start by defining an XML snippet to test our recipe. In this recipe, we deal with movies. IMDB (Internet Movie Database) at http://www.imdb.com is the de facto standard for everything you want to know about the world of cinema. IMDB stores information about pretty much every movie ever made, along with actors and production data. IMDB is also a good citizen of the Internet and exposes an API to retrieve, in XML, the same data available on the site. The API is not really documented, so we roll our own XML format based on the data we fetch from IMDB. Let's print the information about some movies that have the word groovy in the title (admittedly not many!).

def groovyMoviez = '''<?xml version="1.0" ?>
  <movie-result>
    <movie id="tt0116288">
      <title>Groovy Days</title>
      <year>1996</year>
      <director>Peter Bay</director>
      <country>Denmark</country>
      <stars>
        Ken Vedsegaard,
        Sofie Gråbøl,
        Martin Brygmann</stars>
    </movie>
    <movie id="tt1189088">
      <title>Cool and Groovy</title>
      <year>1956</year>
      <director>Will Cowan</director>
      <country>USA</country>
      <stars>Anita Day,
        Buddy De Franco and
        Buddy DeFranco Quartet</stars>
    </movie>
    <movie id="tt1492859">
      <title>Groovy: The Colors of Pacita Abad</title>
      <year>2005</year>
      <director>Milo Sogueco</director>
      <country>Philippines</country>
      <stars/>
    </movie>
  </movie-result>
'''

How to do it...

Let's see how we can navigate the previous XML with GPath.

  1. The first step is to process the XML using XmlSlurper so that we can access the GPath API:
    def results = new XmlSlurper().parseText(groovyMoviez)
  2. The rest is just Groovy's magic. The XML structure is converted into a navigable object structure based on the tags and attributes of the XML document:
    for (flick in results.movie) {
      println "Movie with id ${flick.@id} " +
              "is directed by ${flick.director}"
    }
  3. The code snippet yields the following output:
    Movie with id tt0116288 is directed by Peter Bay
    Movie with id tt1189088 is directed by Will Cowan
    Movie with id tt1492859 is directed by Milo Sogueco
    

    The code becomes extremely fluent and ceremony-free, just code and data.

  4. Want to see something more awesome? Check out what we can do with Groovy's spread-dot operator (*.):
    results.movie*.title.each { println "- ${it}" }
    results.movie.findAll {
      it.year.toInteger() > 1990
    }*.title.each {
      println "title: ${it}"
    }
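  5. The output will be as follows (the first statement lists every title; the second lists only movies released after 1990):
    - Groovy Days
    - Cool and Groovy
    - Groovy: The Colors of Pacita Abad
    title: Groovy Days
    title: Groovy: The Colors of Pacita Abad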
    
  6. Searching inside the XML document for specific content is also pretty simple:
    results.movie.findAll {
      it.director.text().contains('Milo')
    }.each {
      println "- ${it.title}"
    }

    The findAll method is applied to the movie nodes, and the closure passed to it checks whether the text of the director tag contains the word Milo.

  7. Should you need to search across all the nodes of the document (as opposed to a specific set of children, such as movie), it is possible to use the depthFirst method (or its shortcut **) on the root node:
    results.'**'.findAll {
      it.director.text().contains('Milo')
    }.each {
      println "- ${it.title}"
    }
  8. The output will be as follows:
    - Groovy: The Colors of Pacita Abad
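The traversal order of depthFirst can be sketched with a trimmed copy of the recipe's data. This is an illustration under stated assumptions rather than part of the recipe: the import assumes Groovy 3 or later, where XmlSlurper lives in the groovy.xml package (on Groovy 2.x the class is in groovy.util and the import line can be dropped).

```groovy
import groovy.xml.XmlSlurper // Groovy 3+; not needed on Groovy 2.x

// Trimmed copy of the recipe's XML, kept short for illustration
def xml = '''<movie-result>
  <movie id="tt0116288"><title>Groovy Days</title><year>1996</year></movie>
  <movie id="tt1492859"><title>Groovy: The Colors of Pacita Abad</title><year>2005</year></movie>
</movie-result>'''
def results = new XmlSlurper().parseText(xml)

// depthFirst() visits the root first, then each subtree in document order
def names = results.depthFirst().collect { it.name() }
assert names == ['movie-result', 'movie', 'title', 'year',
                 'movie', 'title', 'year']

// '**' is the shortcut for depthFirst(); here it finds every title element
def titles = results.'**'.findAll { it.name() == 'title' }.collect { it.text() }
assert titles == ['Groovy Days', 'Groovy: The Colors of Pacita Abad']
```

Because the traversal includes the root and every nested element, a closure passed to findAll should be written so that it behaves sensibly for nodes of any kind, not only movie nodes.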
    

How it works...

In the first step, we create a parser, an XmlSlurper in this case.

Note

Note that this time we use the parseText method to read a string containing a valid XML document.

Step 2 shows how we can easily access the attributes and values of a movie, while iterating on the results.

Groovy cannot possibly know anything in advance about the elements and attributes that are available in the XML document. The code compiles anyway, because property accesses such as flick.director are resolved at runtime. That's one capability that distinguishes a dynamic language.
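One consequence of this runtime lookup is worth sketching: asking XmlSlurper's result for an element that does not exist yields an empty result rather than an error. In the snippet below, producer is a hypothetical element name chosen because it is absent from the document, and the import assumes Groovy 3 or later (drop it on Groovy 2.x).

```groovy
import groovy.xml.XmlSlurper // Groovy 3+; not needed on Groovy 2.x

def results = new XmlSlurper().parseText('''<movie-result>
  <movie id="tt0116288"><title>Groovy Days</title></movie>
</movie-result>''')

// 'producer' is hypothetical: no such element exists in the XML above
def missing = results.movie.producer

// The lookup succeeds and returns an empty result
assert missing.isEmpty()
assert missing.size() == 0
assert missing.text() == ''

// An element that does exist behaves as expected
assert results.movie.title.text() == 'Groovy Days'
```

A misspelled element name therefore fails silently, so checking isEmpty() or size() is a reasonable safeguard when a query unexpectedly returns nothing.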

Step 4 introduces the spread-dot operator: a shortcut to the collect method of a collection. The result is a new collection containing the outcome of the operation applied to each member of the original collection.

There's more...

To iterate over collections and transform each element of the collection in Groovy, the collect method can be used. The transformation is defined as a closure passed to the method:

assert [0,2,4,6] == (0..3).collect { it * 2 }

By using a spread-dot operator, we can rewrite the previous snippet as follows:

assert [0,2,4,6] == (0..3)*.multiply(2)

In the context of a GPath query for XML, the spread-dot operator accesses a property of each node returned from the findAll method; in step 4, that is the title of each movie produced after the year 1990.
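The equivalence between spread-dot and collect can also be checked on the XML query itself. The following is a minimal sketch using a trimmed copy of the recipe's data; the variable names are illustrative, and the import assumes Groovy 3 or later (drop it on Groovy 2.x).

```groovy
import groovy.xml.XmlSlurper // Groovy 3+; not needed on Groovy 2.x

def results = new XmlSlurper().parseText('''<movie-result>
  <movie><title>Groovy Days</title><year>1996</year></movie>
  <movie><title>Cool and Groovy</title><year>1956</year></movie>
  <movie><title>Groovy: The Colors of Pacita Abad</title><year>2005</year></movie>
</movie-result>''')

// Movies produced after 1990, as in step 4
def recent = results.movie.findAll { it.year.toInteger() > 1990 }

// Spread-dot: read title on each node, then text() on each title
def viaSpread = recent*.title*.text()

// The same transformation written with collect
def viaCollect = recent.collect { it.title.text() }

assert viaSpread == viaCollect
assert viaSpread == ['Groovy Days', 'Groovy: The Colors of Pacita Abad']
```

Calling text() converts each node to a plain String, which makes the two lists directly comparable.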
