Reading an Article

In this example, we look at how we can extract and use information from an XML document using SAX. The particular documents our script works with are simple news articles, but we’ll see how to work with elements, attributes, and textual content.

Some of the trade-offs of using SAX depend on what you’re trying to accomplish, and how the XML is structured. SAX treats XML as a continuous stream, firing events to your handler as they happen. Example 3-1 shows article.xml.

Example 3-1. article.xml
<?xml version="1.0"?>
<webArticle category="news" subcategory="technical">
    <header title="NASA Builds Warp Drive"
           length="3k"
           author="Joe Reporter"
           distribution="all"/>
    <body>Seattle, WA - Today an anonymous individual
           announced that NASA has completed building a
           Warp Drive and has parked a ship that uses
           the drive in his back yard.  This individual
           claims that although he hasn't been contacted by
           NASA concerning the parked space vessel, he assumes
           that he will be launching it later this week to
           mount an expedition to the Andromeda Galaxy.
    </body>
</webArticle>

Example 3-1 carries its information in three different ways: attributes on the root webArticle element, attributes on an empty header element, and text content inside the body element. To parse a document such as article.xml, we must understand its structure before writing a handler, so the handler ends up tightly coupled to that structure.
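
To make the idea of a stream of events concrete, here is a minimal sketch that records the order in which a SAX parser reports events. The EventLogger class and the cut-down SAMPLE document are hypothetical names introduced only for this illustration, and the code is written to run under Python 3.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class EventLogger(ContentHandler):
    """Hypothetical helper: records each SAX event in the order it arrives."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        # Skip whitespace-only text so the trace stays short.
        if content.strip():
            self.events.append(("text", content.strip()))


# A cut-down stand-in for article.xml, not the full Example 3-1.
SAMPLE = (b'<webArticle category="news">'
          b'<header title="T"/><body>Hi</body></webArticle>')

logger = EventLogger()
xml.sax.parseString(SAMPLE, logger)
for event in logger.events:
    print(event)
```

The trace shows that the handler never sees the document as a whole; it only sees one event at a time, in document order, which is why the handler has to know what structure to expect.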

Writing a Simple Handler

You can write the ArticleHandler class to a new file, handlers.py; we’ll keep adding new handlers to this file throughout the chapter. Keep it simple at first, just to see how SAX works:

from xml.sax.handler import ContentHandler

class ArticleHandler(ContentHandler):
  """
  A handler to deal with articles in XML
  """
  def startElement(self, name, attrs):
    # Called once for each opening tag; name is the element name.
    print("Start element: %s" % name)

Now we need to create a script to instantiate the parser, assign the handler, and do the actual work.

Creating the Main Program

No matter how complex your handler objects become, there is rarely much code involved in setting up the parser. The file art.py, shown in Example 3-2, uses only the ArticleHandler class just created and parses whatever arrives on the standard input stream.

Example 3-2. art.py
#!/usr/bin/env python
# art.py

import sys

from xml.sax  import make_parser
from handlers import ArticleHandler

ch = ArticleHandler()
saxparser = make_parser()

saxparser.setContentHandler(ch)
saxparser.parse(sys.stdin)

Once you have created art.py, you can run it from the command line, using file redirection to populate standard input (this works on both Unix and Windows):

$> python art.py < article.xml

The output using the simple article handler appears as:

Start element: webArticle
Start element: header
Start element: body

The output reflects the simple rule in your ArticleHandler class, which just prints out the name of each tag it encounters. To really use the XML, you have to add more functionality to the handler class in the handlers.py file.
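
As a first step beyond element names, the sketch below collects the attributes that arrive with each start tag. AttributeLister and SAMPLE are hypothetical names used only for this illustration; the attrs object passed to startElement is the standard SAX attributes object, which offers mapping-style methods such as get and items. The code is written to run under Python 3.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class AttributeLister(ContentHandler):
    """Hypothetical variant of the simple handler that also collects attributes."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.seen = []  # (element name, {attribute: value}) pairs

    def startElement(self, name, attrs):
        # attrs behaves like a read-only mapping of attribute names to values.
        self.seen.append((name, dict(attrs.items())))


# A cut-down stand-in for article.xml, not the full Example 3-1.
SAMPLE = (b'<webArticle category="news" subcategory="technical">'
          b'<header title="NASA Builds Warp Drive"/></webArticle>')

lister = AttributeLister()
xml.sax.parseString(SAMPLE, lister)
for name, attributes in lister.seen:
    print("Start element: %s %s" % (name, sorted(attributes.items())))
```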

Adding Intelligence

XML allows information to be parsed for different purposes. If you create a news article in XML, one application can grab it and display it as HTML, while another can index it to a search database. It’s easy to imagine that a service might like to offer intelligent agents to scour Internet sources for news items, special offers, and other items of interest for you based on preferences that you set up. XML makes this process manageable, as opposed to the alternative of reliably parsing HTML for structured information, which is nearly impossible. HTML only communicates the appearance of a document and not its organizational structure. In HTML, two documents may look exactly alike in the browser, but use wildly different tags under the hood. Parsing the HTML for its information won’t work, unless of course the page designer had that goal in mind when setting out to create the page.

Your news agent is configured to go after technology stories, especially ones that relate to space travel. When it discovers such an article, it displays a message, the headline, and the first few words of the body text. You can add functionality to your handler class to support this.

Since SAX is stream-based, it's sometimes necessary to set flags so that you can track whether you are currently inside a given element. If you find that you're setting too many different flags, you might consider using a DOM approach as opposed to SAX. SAX is perfect when doing bulk operations on a lengthy XML stream. However, if you are trying to pull a complex data structure out of the document, you may be better off using the DOM.
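
For comparison, here is a rough sketch of the same lookup done with a DOM, using the standard library's xml.dom.minidom. The SAMPLE document is a shortened stand-in for article.xml, and the variable names are illustrative only. Because the whole tree is in memory, no flags are needed, at the cost of loading the entire document first. The code is written to run under Python 3.

```python
from xml.dom import minidom

# A shortened stand-in for article.xml, not the full Example 3-1.
SAMPLE = b"""<webArticle category="news" subcategory="technical">
    <header title="NASA Builds Warp Drive" author="Joe Reporter"/>
    <body>Seattle, WA - Today an anonymous individual announced a Warp Drive.</body>
</webArticle>"""

doc = minidom.parseString(SAMPLE)
article = doc.documentElement

title = ""
body = ""
if "tech" in article.getAttribute("subcategory"):
    header = article.getElementsByTagName("header")[0]
    title = header.getAttribute("title")
    body_node = article.getElementsByTagName("body")[0]
    # Join the text nodes that sit directly under <body>.
    body = "".join(node.data for node in body_node.childNodes
                   if node.nodeType == node.TEXT_NODE)

print(title)
print(body)
```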

To keep our example simple, set a few flags as the events are propagated, and go after the desired information. In the startElement method, check to see if you’re indeed inside a news article and if your article is indeed technical. If it satisfies both of these requirements, change a Boolean data member so that other methods start paying attention to the data they receive. Also set a property on the handler itself so that the main application knows the handler has found a technical article, as that was its assignment:

def startElement(self, name, attrs):
  if name == "webArticle":
    subcat = attrs.get("subcategory", "")
    if subcat.find("tech") > -1:
      self.inArticle = 1
      self.isMatch = 1

  elif self.inArticle:
    if name == "header":
      self.title = attrs.get("title", "")
    if name == "body":
      self.inBody = 1

The last conditional tests whether the parser has entered the body element of a relevant article. If so, the characters method knows to begin buffering data each time it is called:

def characters(self, characters):
  if self.inBody:
    if len(self.body) < 80:
      self.body += characters
    if len(self.body) > 80:
      self.body = self.body[:78] + "..."
      self.inBody = 0

Finally, look for the close of the body tag to indicate to the characters method that it no longer needs to pay attention to character data:

def endElement(self, name):
  if name == "body":
    self.inBody = 0
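
The characters method appends to a buffer rather than assigning to it because a SAX parser is free to deliver one run of text in several calls. The sketch below, which uses a hypothetical ChunkCounter class and a made-up SAMPLE string, counts the calls and shows that joining the chunks recovers the full text. How the text is split depends on the parser, so no particular count is assumed. The code is written to run under Python 3.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class ChunkCounter(ContentHandler):
    """Hypothetical handler: keeps every chunk passed to characters()."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


# Made-up text with line breaks and an entity reference.
SAMPLE = b"<body>line one\nline two &amp; more\nline three</body>"

counter = ChunkCounter()
xml.sax.parseString(SAMPLE, counter)

print("characters() was called %d time(s)" % len(counter.chunks))
# Joining the chunks recovers the full text regardless of how it was split.
print("".join(counter.chunks))
```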

Beyond implementing these three methods, the class is also modified to initialize data members, and to provide an isMatch data member to indicate to the main application whether this handler has found something worth keeping. The complete class (replacing the earlier class of the same name) is shown in Example 3-3.

Example 3-3. Enhanced ArticleHandler
from xml.sax.handler import ContentHandler

class ArticleHandler(ContentHandler):
  """
  A handler to deal with articles in XML
  """
  inArticle = 0
  inBody    = 0
  isMatch   = 0
  title     = ""
  body      = ""

  def startElement(self, name, attrs):
    if name == "webArticle":
      subcat = attrs.get("subcategory", "")
      if subcat.find("tech") > -1:
        self.inArticle = 1
        self.isMatch = 1

    elif self.inArticle:
      if name == "header":
        self.title = attrs.get("title", "")
      if name == "body":
        self.inBody = 1

  def characters(self, characters):
    if self.inBody:
      if len(self.body) < 80:
        self.body += characters
      if len(self.body) > 80:
        self.body = self.body[:78] + "..."
        self.inBody = 0

  def endElement(self, name):
    if name == "body":
      self.inBody = 0
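
One design note on Example 3-3: its defaults live on the class, and nothing clears the instance's state between parses, so a handler object that has matched once keeps reporting a match. The sketch below is one possible variant, not part of the original example, that resets its state in startDocument so a single handler can be reused. ResettingArticleHandler, TECH, and SPORTS are hypothetical names, and the 80-character truncation is left out to keep the sketch short. The code is written to run under Python 3.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class ResettingArticleHandler(ContentHandler):
    """Hypothetical variant of ArticleHandler that resets itself per document."""

    def startDocument(self):
        # Called once at the start of every parse, so stale state is cleared.
        self.inArticle = False
        self.inBody = False
        self.isMatch = False
        self.title = ""
        self.body = ""

    def startElement(self, name, attrs):
        if name == "webArticle":
            if "tech" in attrs.get("subcategory", ""):
                self.inArticle = True
                self.isMatch = True
        elif self.inArticle:
            if name == "header":
                self.title = attrs.get("title", "")
            elif name == "body":
                self.inBody = True

    def characters(self, content):
        if self.inBody:
            self.body += content

    def endElement(self, name):
        if name == "body":
            self.inBody = False


# Two made-up documents: one technical, one not.
TECH = (b'<webArticle subcategory="technical">'
        b'<header title="Warp"/><body>Text</body></webArticle>')
SPORTS = (b'<webArticle subcategory="sports">'
          b'<header title="Game"/><body>Score</body></webArticle>')

handler = ResettingArticleHandler()
xml.sax.parseString(TECH, handler)
first = (handler.isMatch, handler.title, handler.body)
xml.sax.parseString(SPORTS, handler)
second = (handler.isMatch, handler.title, handler.body)
print(first)
print(second)
```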

Using the Additional Information

Now that the handler has been modified to collect more information and determine if the article is interesting, we can add a little more code to art.py so that when an interesting article is found, it prints a report for the user and ignores everything else. To do this, we need only append this code to the end of art.py, which was originally shown in Example 3-2:

if ch.isMatch:
  # Report only when the handler flagged a technical article.
  print("News Item!")
  print("Title: %s" % ch.title)
  print("Body: %s" % ch.body)

With article.xml as input, you should see the following output:

$> python art.py < article.xml
News Item!
Title: NASA Builds Warp Drive
Body: Seattle, WA - Today an anonymous individual
                announced that NASA has completed building a...