Chapter 5. First-class functions in action with XML

This chapter covers

  • First-class and higher order functions

  • Python decorators

  • Reading and writing XML

  • An XML file format for MultiDoc

In the previous chapter, we started creating a multipage document editor called MultiDoc. The aim is to create a desktop application that can load and save multipage documents stored in a single file. In this chapter, we show the implementation of a core part of this specification: the ability to save and load multipage documents stored as XML. You’ll add new load and save commands to MultiDoc, using classes in the System.Xml assembly. In the process, we’ll explore an extensible approach to reading XML with IronPython.

One of the big differences between Python and traditional .NET languages is its support for first-class functions. We start this chapter by looking at what first-class functions are and how they can help you write shorter and more beautiful code. Next, we turn our attention to working with the .NET XML classes and putting what you’ve learned about functions to work.

First, on with the first-class functions.

First-class functions

In a programming language, functions are first class if they’re just another kind of object—you can create them at runtime and pass them around your code, including using them as arguments to functions or return values from functions.

First-class functions are a core part of the functional-programming style of programming. Functional programming breaks problems into a set of functions. Preferably, these functions should only have inputs and outputs, neither storing internal state nor having side effects.

To programmers used to working with object-oriented languages, this seems an odd set of constraints, almost a step backwards. But there are some advantages to this style. It encourages modularity, and its biggest advantages are elegance and expressiveness. It’s also theoretically possible to construct formal proofs of purely functional programs, which has attracted a lot of interest in academic circles.

One aspect of first-class functions that you met in the Python tutorial is inner functions. These are functions defined inside the body of another function or method. If you have code repeated inside a function, it’s often tidier to factor this out into another function. If the code uses several local variables, then it may require too many arguments if turned into a separate method. In this case, it can make sense for it to be an inner function that has access to any of the variables in the scope in which it’s defined.
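As a small illustration of that last point, here’s a minimal sketch of an inner function. The names describe_order and line are made up for this example; the point is only that line can read customer and currency from the enclosing scope instead of taking them as extra arguments.

```python
def describe_order(customer, items, currency):
    # line() is defined inside describe_order, so it can use `customer`
    # and `currency` directly from the enclosing scope.
    def line(label, amount):
        return "%s: %s %.2f %s" % (customer, label, amount, currency)

    total = sum(price for _, price in items)
    lines = [line(name, price) for name, price in items]
    lines.append(line("total", total))
    return lines


report = describe_order("Ada", [("tea", 2.5), ("cake", 4.0)], "GBP")
```

If line were a separate method, every call would have to pass customer and currency along as well.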

There’s another class of functions commonly used in functional programming: higher order functions.

Higher order functions

Functions that take functions as arguments or return functions are called higher order functions. Functions that work with functions can be extremely useful. They allow you to separate out parts of a program more easily; for example, a higher order function might provide a traversal of a data structure, where the function you pass in decides what to do for each item.
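A minimal sketch of the traversal idea might look like the following. The walk function and the nested-list data structure are assumptions chosen for illustration; walk knows how to traverse, and the function passed in decides what happens to each item.

```python
def walk(tree, visit):
    """Call visit(item) for every non-list item in a nested list."""
    for item in tree:
        if isinstance(item, list):
            walk(item, visit)   # recurse into nested lists
        else:
            visit(item)         # the caller decides what to do here


seen = []
walk([1, [2, 3], [[4], 5]], seen.append)
```

Passing a different function, such as one that writes each item to a log, changes the behavior without touching the traversal code.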

This technique is highly reminiscent of the strategy pattern from Design Patterns: Elements of Reusable Object-Oriented Software:[1]

  • Define a family of algorithms, encapsulate each one, and make them interchangeable. Strategy lets the algorithm vary independently from clients that use it.

You can implement this pattern very simply with higher order functions. Imagine you’re writing a banking application that handles many different transactions. You need to apply these transactions to accounts, rolling back the transactions in the event of an error (such as insufficient funds in one of the accounts). Transactions of many different types can be created, but the code that consumes the transactions should be able to apply them without knowing anything about how they’re implemented.

A function to create a transaction may look something like this:

def get_transfer(account1, account2, amount):
   def do_transfer():
      account1.withdraw(amount)
      account2.deposit(amount)

   return do_transfer

The do_transfer function is for transactions that transfer money from one account to another. do_transfer closes over the parameters to get_transfer and encapsulates them. The parameters will be used when do_transfer is applied.

The transaction is applied by apply_transaction.

def apply_transaction(transaction, connection):
   connection.begin_transaction()
   try:
      transaction()
   except TransactionError:
      connection.abort_transaction()
      return False
   else:
      connection.commit_transaction()
      return True

apply_transaction can apply any transaction. It handles errors, aborting the transaction in case of error. It returns True or False depending on whether the transaction succeeds or not. The advantage of this kind of structure is that you can change how the transactions are applied in a single place, and individual transactions can be modified without affecting how they’re used.
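To see the two functions working together, here’s a self-contained sketch. Account, RecordingConnection, and this definition of TransactionError are hypothetical stand-ins invented for the example; a real banking application would supply its own.

```python
class TransactionError(Exception):
    pass


class Account(object):
    # Hypothetical in-memory account, just enough to run the example.
    def __init__(self, balance):
        self.balance = balance

    def withdraw(self, amount):
        if amount > self.balance:
            raise TransactionError("insufficient funds")
        self.balance -= amount

    def deposit(self, amount):
        self.balance += amount


class RecordingConnection(object):
    # Hypothetical connection that only records which calls were made.
    def __init__(self):
        self.log = []

    def begin_transaction(self):
        self.log.append("begin")

    def abort_transaction(self):
        self.log.append("abort")

    def commit_transaction(self):
        self.log.append("commit")


def get_transfer(account1, account2, amount):
    def do_transfer():
        account1.withdraw(amount)
        account2.deposit(amount)
    return do_transfer


def apply_transaction(transaction, connection):
    connection.begin_transaction()
    try:
        transaction()
    except TransactionError:
        connection.abort_transaction()
        return False
    else:
        connection.commit_transaction()
        return True


savings, current = Account(100), Account(0)
connection = RecordingConnection()
ok = apply_transaction(get_transfer(savings, current, 60), connection)
failed = apply_transaction(get_transfer(savings, current, 60), connection)
```

The second transfer fails because savings only holds 40 by then, so apply_transaction aborts it and returns False.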

Another place where higher order functions commonly turn up in Python is in the form of decorators.

Python decorators

It turns out that writing a function that takes a function, and then wraps it, is a common idiom in Python. Python has syntax to make this easier, syntactic sugar called decorators.

First, we show you an example that doesn’t use the decorator syntax. We wrap a function so that it prints whenever it is called. In practice, you might put logging or timing code inside the wrapper function.
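A minimal sketch of such a wrapper might look like this. The inner function name, the printed message, and the body of somefunction are assumptions made for illustration.

```python
def wrapper(function):
    def inner(*args, **kwargs):
        # Report the call, then delegate to the wrapped function.
        print("Calling %s" % function.__name__)
        return function(*args, **kwargs)
    return inner


def somefunction(x):
    return x * 2


# Wrap the function by hand, without any special syntax.
wrapped = wrapper(somefunction)
result = wrapped(21)
```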

The decorator syntax allows a nicer way of expressing the line wrapped = wrapper (somefunction). You’ll especially appreciate it if you want to decorate all the methods in a class. Decorators use the @ symbol, along with the name of the decorator function, above the function definition.
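Here’s a sketch of the same wrapping done with the decorator syntax. As before, the body of wrapper and the inner name are illustrative assumptions.

```python
def wrapper(function):
    def inner(*args, **kwargs):
        print("Calling %s" % function.__name__)
        return function(*args, **kwargs)
    return inner


@wrapper                 # equivalent to: somefunction = wrapper(somefunction)
def somefunction(x):
    return x * 2


result = somefunction(21)
```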

When you decorate a function, the function name is automatically rebound to the wrapped function rather than the original one. So is this actually useful? Let’s look at how you can use this to automate a repetitive task like checking method arguments for null values.

A null-argument-checking decorator

Even with static typing, method parameters can still be None (or null, using .NET speak) in .NET languages such as C#. If your code requires that the argument is valid and can’t be null, then it’s common to have code like the following:

void Present(Region present, Region selection)
{
  if (present == null)
  { throw new ArgumentNullException("present"); }
  if (selection == null)
  { throw new ArgumentNullException("selection"); }

  // Actual code that we care about
}

In Python, you can write a decorator that checks function arguments and raises a TypeError if any of them are None. Any methods that you want to check in this way can then be decorated.

To do this, you need a decorator that returns a wrapped function. The wrapper should check all the arguments before calling the function or method that it’s wrapping. The wrapper function will need to call the wrappee with the same arguments it’s called with, and return the value that the wrapped function returns. This should do the trick:

def checkarguments(function):
   def decorated(*args):
      if None in args:
         raise TypeError("Invalid Argument")
      return function(*args)
   return decorated

Any arguments a wrapped function is called with are collected as a tuple (args). If any of these arguments are None (if None is in args), then a TypeError is raised. You use it like this:

class MyClass(object):

   @checkarguments
   def method(self, arg1, arg2):
      return arg1 + arg2

>>> instance = MyClass()
>>> instance.method(1, 2)
3
>>> instance.method(2, None)
Traceback (most recent call last):
TypeError: Invalid Argument

You should pass the parameter name when you raise the exception. You could do this by using introspection on the function object, but that’s another subject altogether.
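For the curious, here’s one possible sketch of that introspection. It assumes a modern Python where inspect.signature is available; older Pythons would need a different introspection call, such as inspect.getargspec. The error message wording is our own.

```python
import inspect


def checkarguments(function):
    # Look up the parameter names once, when the function is decorated.
    signature = inspect.signature(function)

    def decorated(*args, **kwargs):
        # Match the supplied values to parameter names.
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            if value is None:
                raise TypeError("Argument %r must not be None" % name)
        return function(*args, **kwargs)
    return decorated


class MyClass(object):

    @checkarguments
    def method(self, arg1, arg2):
        return arg1 + arg2


instance = MyClass()
total = instance.method(1, 2)
try:
    instance.method(2, None)
except TypeError as e:
    message = str(e)
```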

OK, we’ve had some fun with functions in Python. Now it’s time to put what you’ve learned to practical use in working with XML.

Representing documents with XML

XML is a text-based format that uses tags to structure data. XML is certainly no silver bullet when it comes to persisting data; it’s generally verbose and, for complex data structures, inefficient. On the other hand, because XML is text, it’s easier to see what’s going wrong when your program misbehaves than it would be with a binary format.

An XML document will look something like the following segment:

<?xml version="1.0" encoding="utf-8"?>
<rootelement>
   <childelement1 attribute="An attribute value.">An element value, with parsed
     character data.</childelement1>
   <childelement2><![CDATA[An element value, with unparsed character data.
     ]]></childelement2>
</rootelement>

The encoding declaration at the start is optional, but useful if the document is in an encoding other than ASCII or UTF-8. XML is a hierarchical format, so it must contain a root element, which contains the rest of the document. The root element can contain child elements, which themselves can contain child elements nested to whatever degree of horrific complexity you desire.

XML elements are simple, but wrap together several concepts. Figure 5.1 shows all the components of an element.

Figure 5.1. An XML element in all its glory

The text content contained between an element start tag and the end tag is restricted. Angle brackets in the text could be confused for the start of a new tag, unless they’re escaped. The XML spec stipulates five characters that should be escaped with entity references. Table 5.1 shows the characters that need to be escaped and their corresponding entity references.

Table 5.1. XML escaped characters and their corresponding entity references

Character   Entity reference   Name
<           &lt;               less than
>           &gt;               greater than
&           &amp;              ampersand
'           &apos;             apostrophe (quoting only mandatory in attribute values)
"           &quot;             quotation mark (quoting only mandatory in attribute values)

Good XML writing and parsing libraries will handle the escaping and un-escaping for you; but, if your text contains a lot of these characters, it can make the resulting XML much less human-readable. Plain text content, which needs escaping and un-escaping, is called parsed data. An alternative is to include the text content in unparsed character data blocks, like the one from the previous XML document:

<![CDATA[An element value, with unparsed character data. ]]>

But there’s a disadvantage to CDATA blocks. They must not contain ]]> inside the text, and there’s no way of escaping them if they do occur. If you need to include arbitrary text—sourced from a user, for example—then CDATA blocks may not be suitable.
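To make the trade-off concrete, here’s a small sketch using Python’s standard library rather than the .NET classes: escape and unescape from xml.sax.saxutils handle the angle brackets and the ampersand. The fits_in_cdata helper is a hypothetical name for the check described above.

```python
from xml.sax.saxutils import escape, unescape


def fits_in_cdata(text):
    # A CDATA block can't contain its own terminator.
    return "]]>" not in text


source = 'if a < b and b > c: print("R&D")'
escaped = escape(source)       # parsed data: <, > and & become entities
restored = unescape(escaped)   # reading it back reverses the escaping

safe = fits_in_cdata(source)
unsafe = fits_in_cdata("data[i[0]]>1")
```

The last line shows how easily arbitrary text, here a fragment of code, can contain the forbidden sequence.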

Back to the subject at hand: representing documents with XML. A document element will make an excellent root-level element. You also need an element to represent pages. Each page has a title and contents. It seems logical that the title should be an attribute of the page start tag, and that the page contents should be the text content between the start and end tags. This will leave you with documents that look like the following segment:

<?xml version="1.0" encoding="utf-8"?>
<document>
   <page title="The page title">This is the contents of the first page.</page>
   <page title="Page Two">This is another page, as beautiful as the first.</page>
</document>

Now that the structure for representing documents is decided, you need to write the code to turn the model objects into XML.

The .NET XmlWriter

The .NET XML support is enshrined in the System.Xml assembly. The System.Xml namespace has classes for reading and writing XML. There are further namespaces for working with XML schema, XSLT transformations, and XPath queries. Table 5.2 shows the major XML namespaces in .NET 2.0.

Table 5.2. The .NET XML namespaces

Namespace                  Purpose
System.Xml                 Provides standards-based XML support. Includes the XmlReader and XmlWriter classes.
System.Xml.Schema          Support for XML Schemas in the form of schema definition language (XSD) schemas.
System.Xml.Serialization   Classes to serialize objects in XML form.
System.Xml.XPath           Classes to work with the XQuery 1.0 and XPath 2.0 Data Model.
System.Xml.Xsl             Support for Extensible Stylesheet Language Transformation (XSLT) transforms.

The basic classes for reading and writing XML documents are the XmlReader and XmlWriter. Because you’ll be modifying the SaveCommand, we start with the XmlWriter.

XmlWriter is designed for creating conformant documents. The documents it creates will be valid XML, capable of being read by any standards-based reader. Along with the XmlWriter class, you use XmlWriterSettings. This is a class used for configuring an XmlWriter instance; you set attributes on the XmlWriterSettings instance to configure how the XML is written out.

Table 5.3 shows the different settings (properties) on XmlWriterSettings. The defaults are largely sensible, but we do like to change a couple. We like the XML tags to be indented with each level of nesting. This gives you a visual indication of the structure of the document (and we all know that indentation to indicate structure is a brilliant idea). The following segment creates an XmlWriterSettings instance, and sets the two properties required for indentation with four spaces. Because you haven’t yet used the System.Xml assembly, you first need to add a reference to it.

Table 5.3. The properties of XmlWriterSettings and the default values

Property              Initial value
CheckCharacters       True
CloseOutput           False
ConformanceLevel      Document
Encoding              Encoding.UTF8 (Encoding lives in the System.Text namespace, and is a useful class.)
Indent                False
IndentChars           Two spaces
NewLineChars          \r\n (carriage return, new line)
NewLineHandling       Replace
NewLineOnAttributes   False
OmitXmlDeclaration    False

>>> import clr
>>> clr.AddReference('System.Xml')
>>> from System.Xml import XmlWriter, XmlWriterSettings
>>> settings = XmlWriterSettings()
>>> settings.Indent = True
>>> settings.IndentChars = '    ' # four spaces

You don’t instantiate a new XmlWriter instance directly; instead, you call the static method Create, which returns a new instance of the correct type of writer for the settings passed in. There are various call signatures for Create; for example, you could pass in a filename and your settings object, and the writer would create the file for you. If you don’t want the writer to be responsible for creating the file, you can pass in an opened FileStream instead. In the segments that follow, you’ll pass in a StringBuilder instance. StringBuilder is in the System.Text namespace, and is a mutable string type—it allows strings to be built up incrementally.

An odd side effect of passing a StringBuilder to the XmlWriter is that it will refuse to write any encoding in the XML declaration other than UTF-16.[2] Because you’re likely to be happy with the default encoding of UTF-8, you set OmitXmlDeclaration to True.

>>> settings.OmitXmlDeclaration = True
>>> from System.Text import StringBuilder
>>> document = StringBuilder()
>>> writer = XmlWriter.Create(document, settings)
>>> writer.WriteStartDocument()
>>> writer.WriteStartElement("document")
>>> writer.WriteStartElement("page")
>>> writer.WriteAttributeString("title", "A page title")
>>> writer.WriteString("This is a page contents")
>>> writer.WriteEndElement()
>>> writer.WriteEndElement()
>>> writer.WriteEndDocument()
>>> writer.Flush()
>>> writer.Close()
>>> print document.ToString()
<document>
    <page title="A page title">This is a page contents</page>
</document>

This is great because (by happy coincidence) it’s exactly the kind of structure that we want for our documents. Having to remember to close all the elements in the right order is a nuisance, though. This is what happens if you get it wrong:

>>> document = StringBuilder()
>>> writer = XmlWriter.Create(document, settings)
>>> writer.WriteEndElement()
Traceback (most recent call last):
 File System.Xml, line unknown, in WriteEndElement
 File System.Xml, line unknown, in AdvanceState
 File System.Xml, line unknown, in ThrowInvalidStateTransition
SystemError: Token EndElement in state Start would result in an invalid XML document.
Make sure that the ConformanceLevel setting is set to ConformanceLevel.Fragment or
ConformanceLevel.Auto if you want to write an XML fragment.

Oops. A sensible way to avoid this is to make sure that your XML document structure and your program structure are as similar as possible. Ideally, a top-level method should create (and close) the root node, calling down to submethods to write out its child elements. Each method should only be responsible for creating and closing a single element, again calling submethods for their child elements. This way you know that, as long as the code runs to completion, every element will be closed and the result will be valid XML. A nice side effect is that this is also a good way to write modular and readable code—which is important because you’re writing Python. In the next section, we apply this strategy to MultiDoc.

A DocumentWriter Class

Writing out XML from a MultiDoc document is an important enough job that we can encapsulate it in its own class. This keeps the logic separate from the other machinations of the save commands, and easier to understand.

Note

By the end of chapter 4, we were still keeping the MultiDoc project as a single file, which was getting large and unwieldy. To make it easier to work with and provide a better application structure, you can break MultiDoc down into several modules containing the core classes. You can see this structure if you download the source code that accompanies this book. All the code from here on contains the appropriate imports to use our classes from the correct modules.

Listing 5.1 shows the DocumentWriter for MultiDoc; it uses XmlWriter and is instantiated with a filename. To write out the document, you must call write, passing in an instance of a MultiDoc document. The write method is responsible for creating the XmlWriter instance and opening and closing the root element of the XML. It calls down to writePage for each page in the document, creating the page element with title attribute.

Example 5.1. A DocumentWriter class that writes out MultiDoc documents as XML

This class needs to be saved as the file documentwriter.py. To plug this into MultiDoc, you need to modify the commands to use it. First, DocumentWriter needs to be imported inside the savecommands module.

from documentwriter import DocumentWriter

The SaveAsCommand inherits from SaveCommand, and the file writing is done in the saveFile method. You can get most of the way toward the changes you want by modifying saveFile as follows:

def saveFile(self, fileName, document):
   try:
      writer = DocumentWriter(fileName)
      writer.write(document)
      return True
   except IOError, e:
      ...

Previously the execute and promptAndSave methods of the commands only needed to pass some text to saveFile; now they need to pass the document instead. The document still needs to be updated before saving, so getText becomes getUpdatedDocument.

def getUpdatedDocument(self):
   self.tabController.updateDocument()
   return self.document

The execute method of both SaveCommand and SaveAsCommand must be modified to call getUpdatedDocument, and to pass the document through to saveFile and promptAndSave. These changes are simple, so rather than using space here, we leave you to figure them out. If you want to see the changes, they’re in the 5.3 folder of the sources that go with this book.

The DocumentWriter you’ve created follows the structure we suggested earlier, with a top-level method that writes out the top-level node (write). This calls down to writePage to write out the child page nodes. To do this, the XmlWriter has to be stored as state on the DocumentWriter, as the self.writer instance variable. Because you only have one root node, the structure is simple. You could avoid having to store state by refactoring to use an inner function.
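To make that structure concrete, here’s a minimal sketch in standard Python. It is not listing 5.1: XMLGenerator from xml.sax.saxutils stands in for XmlWriter, the class writes to an open stream rather than a filename so that it runs anywhere, and Page is a hypothetical stand-in for the MultiDoc model class.

```python
from io import StringIO
from xml.sax.saxutils import XMLGenerator


class Page(object):
    # Hypothetical stand-in for the MultiDoc page model.
    def __init__(self, title, text):
        self.title = title
        self.text = text


class DocumentWriterSketch(object):
    def __init__(self, stream):
        self.stream = stream

    def write(self, document):
        # The top-level method opens and closes the root element...
        self.writer = XMLGenerator(self.stream, encoding="utf-8")
        self.writer.startElement("document", {})
        for page in document:
            self.writePage(page)
        self.writer.endElement("document")

    def writePage(self, page):
        # ...and each submethod opens and closes exactly one child element.
        self.writer.startElement("page", {"title": page.title})
        self.writer.characters(page.text)
        self.writer.endElement("page")


stream = StringIO()
DocumentWriterSketch(stream).write([Page("One", "a < b"), Page("Two", "plain")])
output = stream.getvalue()
```

Note that the writer is kept on self.writer only so that writePage can reach it.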

An alternative with an inner function

The writePage method is simple. You can refactor this into an inner function that takes a page as its argument. It needs access to the writer, which can be a local variable in its enclosing scope, which is the body of the write method.

The refactored write method looks like listing 5.2.

Example 5.2. Implementation of DocumentWriter.write using an inner function

def write(self, document):
   settings = XmlWriterSettings()
   settings.Indent = True
   settings.IndentChars = '    '
   settings.OmitXmlDeclaration = True
   writer = XmlWriter.Create(self.fileName, settings)

   writer.WriteStartDocument()
   writer.WriteStartElement("document")

   # Inner function: uses writer from the enclosing scope
   def WritePage(page):
      writer.WriteStartElement("page")
      writer.WriteAttributeString("title", page.title)
      writer.WriteString(page.text)
      writer.WriteEndElement()

   for page in document:
      WritePage(page)

   writer.WriteEndElement()
   writer.WriteEndDocument()
   writer.Flush()
   writer.Close()

This version of write is still an acceptable length, and the number of occurrences of self has been dramatically reduced, making the code more readable!

Tip

There’s a performance implication when defining inner functions. The def statement is executed every time the method containing it is executed. It isn’t a high cost, but may become significant if it’s in a performance-sensitive part of your code. Inner functions close over the variables in their containing scope. They’re most useful when you need a new closure for each execution.
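The following sketch illustrates the point about closures. The make_counter function is a made-up example: each call executes the def statement again, producing a new function object with its own private state.

```python
def make_counter():
    count = [0]   # a one-item list, so the inner function can update it

    def increment():
        count[0] += 1
        return count[0]

    return increment


first = make_counter()
second = make_counter()

a = first()    # first counter advances
b = first()    # and again
c = second()   # second counter has its own, separate count
```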

Now that we’ve created a way of writing MultiDoc documents in XML formats, we ought to provide a way of reading them back in again.

Reading XML

The counterpart to XmlWriter is XmlReader. Although logically it’s the inverse of XmlWriter, XmlReader is slightly more complex; it has almost twice as many public properties and methods. A lot of these are to allow you to read typed data from an XML file.

XmlReader

XmlReader is usually instantiated in the same way as XmlWriter—through the static Create method. There’s a plethora of different overloads for creating new instances; you can supply a stream, TextReader, or a resource locator as a string. You can also optionally pass in XmlReaderSettings and an XmlParserContext, and just about any combination of these items.

Note

There are two common approaches to parsing XML. The first, perhaps more intuitive, is to read the whole document and access the Document Object Model in memory. This is known as DOM parsing, and it turns out to be very cumbersome in practice. This chapter uses event-driven parsing, which fires events as elements of the document are read in.

The resource locator doesn’t need to be a filename; it can also be a URI[3] so that the XML document can be fetched from the internet. The default XmlResolver used by XmlReader is an XmlUrlResolver, which supports URIs that use the http:// and file:// protocols. You can supply authentication credentials, or use a different resolver by setting it on the instance of XmlReaderSettings that you use to Create your XmlReader.

The most convenient way, especially for this use case, is to supply a filename and an XmlReaderSettings instance. Table 5.4 shows the configurable properties on XmlReaderSettings, along with their default values. If you were reading only a fragment of XML, you might want to set ConformanceLevel to ConformanceLevel.Fragment. If the XML could have processing instructions that you don’t want to handle, then you can set IgnoreProcessingInstructions to True.

Table 5.4. The properties of XmlReaderSettings and the default values

Property                       Initial value
CheckCharacters                True
ConformanceLevel               ConformanceLevel.Document
IgnoreComments                 False
IgnoreProcessingInstructions   False
IgnoreWhitespace               False
LineNumberOffset               0
LinePositionOffset             0
NameTable                      None
ProhibitDtd                    True
Schemas                        An empty XmlSchemaSet object
ValidationFlags                ProcessIdentityConstraints enabled
ValidationType                 ValidationType.None
XmlResolver                    A new XmlUrlResolver object

Curiously, the default is for the XmlReader to not ignore insignificant whitespace (for example, whitespace that indents elements rather than being part of an attribute value or element content). Because you probably won’t want to handle insignificant whitespace, the following code segment shows the pattern you’ll be using for creating and configuring an XmlReader:

from System.Xml import XmlReader, XmlReaderSettings
settings = XmlReaderSettings()
settings.IgnoreWhitespace = True
reader = XmlReader.Create(filename, settings)

Once you’ve opened the XML file with XmlReader, the most straightforward way to use your instance is to repeatedly call the Read method. This consumes the document, one node at a time, exposing information about each node.

On completion, you should call reader.Close(), which frees the file. Forgetting to do this will cause the file to be held open until the reader is garbage-collected.

If you take the straightforward approach to reading MultiDoc files, the code would be simple, but also tedious and trivial. To make this section more interesting (and possibly even useful), we look at a more general approach to reading XML documents.

An IronPython XmlDocumentReader

A useful pattern for implementing a Read loop is to establish handlers for the types of node that you expect. Read can be called in a simple loop, delegating to the handlers for each node you encounter.

Our documents have elements, with or without attributes and with or without contents. This doesn’t use the whole XML spec, not by a long stretch of the imagination. XML has many additional aspects to it, such as CDATA, namespaces, and processing instructions. You could write more handlers for these components we haven’t yet dealt with, but a significant proportion of XML documents are made up of nothing more than elements, attributes, and text.

The following code is a general XmlDocumentReader class. It’s adapted from a similar class used in Resolver One; many thanks to the kind Resolver Systems folk for letting us use and abuse the code. We’ve mangled it quite a bit from the original, so any bugs are entirely of our own devising. As well as being a general (and easily extensible) XML reading class, it’s another example of using Python functions as first-class objects. The node handlers are packaged as a dictionary of node types mapping to functions (or methods) that know how to process them.

Listing 5.3 shows the imports and the constructor for XmlDocumentReader. It’s instantiated with the node handling functions. These will be explained when we get to using them, and we’ll put together a concrete example for reading MultiDoc documents.

Example 5.3. Importing code and constructor for XmlDocumentReader

Listing 5.4 is the read method of XmlDocumentReader. The element handlers passed into the constructor will be called when the reader encounters different elements in the document. This happens within the onStartElement method, which is called whenever an element start tag is encountered. Because you’re just handling elements, attributes, and contents, you need only three general node handlers: element start tags, element end tags, and the text contents of elements. Element attributes will be dealt with inside the element start tag node handler (listing 5.4).

Example 5.4. XmlDocumentReader read method

nodeTypeHandlers is a dictionary mapping the different XmlNodeTypes to handler functions. The call to reader.Read() advances the reader to the next node. It returns True if the reader finds a node, or False when the end of the document is read—which ends the reading loop.

nodeTypeHandlers.get(reader.NodeType) looks up the current node in the dictionary of handlers. If the node type isn’t recognized, then get returns None and an XmlException[4] is raised. If the node type is recognized, then the handler is called and the reader passed in. Figure 5.2 shows how the read loop maps XML components it encounters into the handler method calls.

Figure 5.2. The mapping of an XML document to node handler calls

The finally block around the read loop ensures that the reader is closed, whatever error might occur.
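The dispatch pattern itself doesn’t depend on .NET. As a rough sketch under stated assumptions, here’s the same idea in standard Python, with xml.dom.pulldom standing in for XmlReader. The SketchReader class is hypothetical, ValueError stands in for XmlException, and the handlers receive the pulldom node rather than the reader.

```python
from xml.dom import pulldom
from xml.dom.pulldom import (START_DOCUMENT, END_DOCUMENT,
                             START_ELEMENT, END_ELEMENT, CHARACTERS)


class SketchReader(object):
    def __init__(self):
        self.events = []

    def onStartElement(self, node):
        attributes = dict(node.attributes.items())
        self.events.append(("start", node.tagName, attributes))

    def onText(self, node):
        if node.data.strip():   # skip whitespace-only text
            self.events.append(("text", node.data))

    def onEndElement(self, node):
        self.events.append(("end", node.tagName))

    def read(self, xml):
        ignore = lambda node: None
        # Map each node type to the function that knows how to handle it.
        nodeTypeHandlers = {
            START_DOCUMENT: ignore,
            END_DOCUMENT: ignore,
            START_ELEMENT: self.onStartElement,
            CHARACTERS: self.onText,
            END_ELEMENT: self.onEndElement,
        }
        for nodeType, node in pulldom.parseString(xml):
            handler = nodeTypeHandlers.get(nodeType)
            if handler is None:
                raise ValueError("Unexpected node type: %s" % nodeType)
            handler(node)


reader = SketchReader()
reader.read('<document><page title="One">Hello</page></document>')
events = reader.events
```

Supporting another kind of node is then a matter of adding one more entry to the dictionary.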

The first node handler is onStartElement. This will be called when the reader encounters a start element tag such as <document> or <page title="Page title">. onStartElement is shown in listing 5.5.

Example 5.5. Node handler for start element tags

onStartElement needs to extract the attributes from the element. These are collected in the attributes dictionary in another loop that calls reader.MoveToNextAttribute(). Next you check in the _elementHandlers to see if you have a handler for this type of element. If you do, the handler is called with the line number and attributes as arguments. Otherwise, an XmlException is raised.

_elementHandlers is a dictionary, and get returns a default value if the key (the element name) is missing. You use MISSING_HANDLERS as the default value (as defined in listing 5.3). Each element has three handler slots: one for the start tag, one for the contents, and one for the end tag.

Element handlers are passed into the constructor as a dictionary that maps element names to the three handler functions. Elements can be self-closing (like <element />), which means that they won’t have any contents or a separate end tag. The only required handler is the start tag.[5] The three handlers for each element type should be provided in a tuple (which will become clearer later when you use XmlDocumentReader to read MultiDoc documents). For handling the page element, you have these handlers:

elementHandlers = {
   'page': (handleStartPage, handlePageText, handleEndPage)
}

If you have elements that don’t have text contents or don’t need their end tag handling, then you can replace these handlers with None.

elementHandlers = {
   'someElement': (handleStartPage, None, None)
}

MISSING_HANDLERS is a tuple of three Nones. self._elementHandlers.get(name, MISSING_HANDLERS) will always return a tuple of three values. The start tag handler is the first value; if the element isn’t contained in the elementHandlers dictionary, then this value will be None and an exception will be raised.
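The lookup can be tried out on its own. In this sketch, handleStartPage is a placeholder handler and startHandlerFor is a hypothetical helper; LookupError stands in for the XmlException that the real reader raises.

```python
MISSING_HANDLERS = (None, None, None)


def handleStartPage(lineNumber, attributes):
    return "page started on line %s" % lineNumber


elementHandlers = {
    'page': (handleStartPage, None, None)
}


def startHandlerFor(name):
    # get always returns a three-item tuple, so indexing is safe.
    startHandler = elementHandlers.get(name, MISSING_HANDLERS)[0]
    if startHandler is None:
        raise LookupError("No start handler for element %r" % name)
    return startHandler


known = startHandlerFor('page')
try:
    startHandlerFor('chapter')
    unknown_raised = False
except LookupError:
    unknown_raised = True
```

Because the default is a tuple of the right shape, the code never has to distinguish between a missing element and a missing handler.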

The middle handler in the tuple, index position one, is the handler for the text contents of elements. Listing 5.6 shows the code for the onText method. It’s similar to onStartElement, but simpler, because it doesn’t need to collect attributes. Because the text handler is optional, it doesn’t throw an exception if the text handler is None. The text handler gets called with the same arguments as the start tag handler.

Example 5.6. Node handler for element text values

Tuples of element handlers have three members. The third member is the handler for end element tags. Listing 5.7 shows onEndElement.

Example 5.7. Node handler for end element tags

def onEndElement(self, reader):
   endHandler = self._elementHandlers.get(reader.Name,
                  MISSING_HANDLERS)[2]
   if endHandler:
      endHandler(reader.LineNumber)

The end tag handler is called with the line number.

In the first part of the chapter, we looked at first-class functions in Python. In this part, we’ve put together a general-purpose XML reader class that uses handler functions stored in a dictionary to process different nodes. As long as the handler functions all have the same signature (by taking the same arguments), you’re free to implement your handler functions how you want.

To illustrate our general purpose reader at work, let’s implement the handler functions needed for the MultiDoc XML save format.

Handler functions for MultiDoc XML

MultiDoc documents are represented programmatically by the classes Document and Page. To read them in, you can use XmlDocumentReader to read the XML and re-inflate the model classes.

The stages of reading a saved document are shown in figure 5.3.


Figure 5.3. Turning a MultiDoc XML document into model class instances

Steps 2 through 4 will obviously be repeated for every page in the document.

To use XmlDocumentReader, you need to provide handler functions for these steps; the handlers map to the steps as follows:

  1. Document start element handler

  2. Page start element handler

  3. Page text handler

  4. Page end element handler

  5. Document end element handler
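To see the order in which these five kinds of event arrive, you can trace a small document. The sketch below uses the xml.sax module from the Python standard library in place of the .NET XmlReader, so it illustrates the event order only, not XmlDocumentReader itself. The sample document and the TraceHandler class are assumptions made up for the illustration.

```python
import xml.sax

SAMPLE = ('<document>'
          '<page title="One">First page</page>'
          '<page title="Two">Second page</page>'
          '</document>')

class TraceHandler(xml.sax.ContentHandler):
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.events = []

    def startElement(self, name, attrs):
        self.events.append('start ' + name)

    def characters(self, content):
        # SAX may deliver one run of text in several chunks, so
        # consecutive text events are recorded only once.
        if content.strip() and self.events[-1] != 'text':
            self.events.append('text')

    def endElement(self, name):
        self.events.append('end ' + name)

trace = TraceHandler()
xml.sax.parseString(SAMPLE.encode('utf-8'), trace)
```

After parsing, trace.events holds the document start, then a start, text, and end event for each page, and finally the document end.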

Before reading the file, you need to set up some state that will be used in the reading process. You need to store the document when it’s first created, a list to read pages into, and the filename to read. You also need to keep track of the current page so that the text handler can attach the text to it.

The document reader needs access to the model classes and the XmlDocumentReader. It will also do some verifying of the document structure, so you should import XmlException. DocumentReader sounds like a reasonable name for a document reader class. Listing 5.8 shows the imports and constructor for DocumentReader.

Example 5.8. Initializing the DocumentReader

from model import Document, Page
from xmldocumentreader import XmlDocumentReader, XmlException

class DocumentReader(object):
   def __init__(self, fileName):
      self.fileName = fileName
      self.document = None
      self.currentPage = None
      self.pages = []

Because we spent some time creating and discussing XmlDocumentReader, it ought to be easy to use. Listing 5.9 demonstrates just how easy.

Example 5.9. Setting up the handlers and calling XmlDocumentReader


We’ve identified five handlers you need to pass in. Each element needs a tuple of three handlers: start tag handler, text handler, and end tag handler. MultiDoc documents have a humble two tags: document and page. The document element has child page tags, but no content; it doesn’t need a text handler, and the middle element can be None. The read method of DocumentReader sets up these handlers, using yet-to-be-defined methods. When XmlDocumentReader works its magic, the handlers are called—which builds the document.
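A read method along these lines might be sketched as follows. This is an illustration under stated assumptions, not the listing itself: StubXmlDocumentReader is a hypothetical stand-in that replays the handler calls for a one-page document instead of parsing a file, the read(fileName) method name on the reader is assumed, and the handlers only record that they were called.

```python
class StubXmlDocumentReader(object):
    # Hypothetical stand-in for XmlDocumentReader. Instead of parsing a
    # file, it replays the handler calls for a document with one page.
    def __init__(self, elementHandlers):
        self._elementHandlers = elementHandlers

    def read(self, fileName):
        startDocument, _, endDocument = self._elementHandlers['document']
        startPage, pageText, endPage = self._elementHandlers['page']
        startDocument(1, {})
        startPage(2, {'title': 'First'})
        pageText(2, 'Some text')
        endPage(2)
        endDocument(3)


class DocumentReaderSketch(object):
    def __init__(self, fileName):
        self.fileName = fileName
        self.calls = []

    def read(self):
        handlers = {
            # The document element has no text content, so no text handler.
            'document': (self.onStartDocument, None, self.onEndDocument),
            'page': (self.onStartPage, self.onPageText, self.onEndPage),
        }
        reader = StubXmlDocumentReader(handlers)
        reader.read(self.fileName)
        # The sketch returns the call log; the real method returns the document.
        return self.calls

    # Placeholder handlers that only record that they were called.
    def onStartDocument(self, lineNumber, attributes):
        self.calls.append('onStartDocument')

    def onEndDocument(self, lineNumber):
        self.calls.append('onEndDocument')

    def onStartPage(self, lineNumber, attributes):
        self.calls.append('onStartPage')

    def onPageText(self, lineNumber, value):
        self.calls.append('onPageText')

    def onEndPage(self, lineNumber):
        self.calls.append('onEndPage')


calls = DocumentReaderSketch('example.txt').read()
```

Note that the dictionary stores bound methods. Because functions are first class, self.onStartPage can be handed over as an ordinary value and called later by code that knows nothing about DocumentReaderSketch.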

The important pieces of the jigsaw puzzle are the handlers themselves. Listing 5.10 contains the document start and end tag handlers.

Example 5.10. Handlers for the document element

def onStartDocument(self, lineNumber, attributes):
   self.document = Document(self.fileName)

def onEndDocument(self, lineNumber):
   self.document.pages = self.pages

Both handlers are simple. When you start reading the document (or rather, when you encounter the document start tag), you create a new document with the right filename. By the time you encounter the document end tag, you should have read all the pages, and onEndDocument attaches the pages to the document. XmlDocumentReader can then finish, and read can return the document.

Reading pages requires three handlers; the code for these is in listing 5.11.

Example 5.11. Handlers for the page element

def onStartPage(self, lineNumber, attributes):
   title = attributes.get('title')
   if title is None:
      raise XmlException('Invalid data at line %d' %
                    lineNumber)
   self.currentPage = Page(title)

def onPageText(self, lineNumber, value):
   self.currentPage.text = value.replace('\n', '\r\n')

def onEndPage(self, lineNumber):
   self.pages.append(self.currentPage)

Pages need a title, which is extracted from the attributes in onStartPage. If the title is missing, then the document is invalid, and an exception is raised. If a title is present, then a new page is created and set as the current page.

When the page contents are read in, onPageText sets the text on the current page. XmlDocumentReader reads in files using an XmlReader returned by XmlReader.Create. XmlReader.Create returns an instance of XmlTextReader, which unsurprisingly opens files for reading in text mode. Python tries to do you a favor by converting line endings into \n when you read files in text mode, which normally enables you to ignore cross-platform differences when dealing with line endings. Unfortunately, this favor backfires when you need to set the text on a Windows Forms control, which doesn’t recognize \n as a line ending. To avoid this problem, onPageText converts \n into \r\n in the value passed to it.
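The conversion itself is a single string replacement, which you can try in isolation. The helper name toWindowsLineEndings is hypothetical; onPageText performs the replacement inline.

```python
def toWindowsLineEndings(value):
    # The same replacement onPageText performs: every bare newline
    # becomes a carriage return followed by a newline.
    return value.replace('\n', '\r\n')

converted = toWindowsLineEndings('first line\nsecond line')
```

This simple replacement assumes that the incoming text contains only bare \n characters. Text that already contained \r\n would end up with a doubled \r.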

When you reach the end page element, onEndPage adds the current page to the page list.

It may seem like there isn’t a lot of checking to ensure that the document is valid, beyond a cursory check that the title is present. In fact, there’s quite a lot of checking being done for you. Any unrecognized nodes or elements would cause an XmlException to be raised from inside XmlDocumentReader. If the XML document itself is invalid (due to missing or misplaced tags, for example), then the .NET XmlReader will raise an error. It will be important for you to catch these potential errors and alert the user when you read in documents.

DocumentReader is a concrete example of using XmlDocumentReader. DocumentReader is a simple class, only 44 lines of Python code, but extending it to read more complex documents and construct more intricate objects from them should be easy.

Now that you’ve created these classes to form MultiDoc document objects from saved files, you need to plug them into the user interface. To do this, you need an open command.

The Open command

The OpenCommand is going to provide the mirror functionality to the SaveCommand and will look similar.[6] Instead of the SaveFileDialog, it will use its close (but equally familiar) cousin, the OpenFileDialog (figure 5.4), so that the user can choose a file to open. The similarity even extends to the imports, except that, obviously, OpenFileDialog is imported instead of SaveFileDialog. OpenCommand also needs access to the DocumentReader class and XmlException: if a document turns out to be invalid while it’s being read, an XmlException is raised, and you need to import XmlException to catch it.


Figure 5.4. The OpenFileDialog in action

Listing 5.12 shows the imports and constructor for OpenCommand. As with SaveCommand, you have a title and a filter for use on the dialog control.

Example 5.12. Initializing the OpenCommand

from System.IO import Directory, Path
from System.Windows.Forms import (
   DialogResult, MessageBox,
   MessageBoxButtons, MessageBoxIcon,
   OpenFileDialog
)

from documentreader import DocumentReader, XmlException
from savecommands import filter

class OpenCommand(object):

   title = "Open Document"

   def __init__(self, mainForm):
      self.openFileDialog = OpenFileDialog()
      self.mainForm = mainForm

      self.openFileDialog.Filter = filter
      self.openFileDialog.Title = self.title

There are two things of note in this otherwise unremarkable code. The first is the reuse of the filter from the save commands. When you switched to an XML file format, you could have switched the save file extension too; personally, I (Michael) like .txt because it makes the files easier to open with a text editor. The second thing is that the OpenCommand constructor needs the MainForm passed in. When a new document is created, it needs to be set back on MainForm.

These commands need to implement an execute method. For OpenCommand, this is the method that asks the user to choose a file and attempts to open it as a MultiDoc file. You do the same manipulation with the filename (and directory) as you did for the save commands. If the current document already has a filename, then you set the initial directory and filename on the dialog box (listing 5.13).

Example 5.13. execute method of OpenCommand


If the call to openFileDialog.ShowDialog() returns DialogResult.OK, then the user has selected a file, and self.getDocument is called with the filename. Reading the document could result in an exception being thrown if the document is invalid. If an exception is raised, it’s trapped inside getDocument, and a MessageBox is displayed to the user before None is returned. If a real document is returned, then it’s set on the MainForm.
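The control flow described here can be sketched without Windows Forms. Everything named Fake or Sketch below is a hypothetical stand-in: DIALOG_OK replaces DialogResult.OK, os.path replaces the System.IO Path and Directory classes, and getDocument is passed in as a function so that the sketch runs on its own.

```python
import os

DIALOG_OK = 'OK'  # stand-in for DialogResult.OK

class FakeDocument(object):
    def __init__(self, fileName=None):
        self.fileName = fileName

class FakeMainForm(object):
    def __init__(self):
        self.document = FakeDocument()

class FakeOpenFileDialog(object):
    # Stand-in for OpenFileDialog: "selects" a preset file when shown.
    def __init__(self, chosenFile):
        self.InitialDirectory = None
        self.FileName = None
        self._chosenFile = chosenFile

    def ShowDialog(self):
        self.FileName = self._chosenFile
        return DIALOG_OK

class OpenCommandSketch(object):
    def __init__(self, mainForm, openFileDialog, getDocument):
        self.mainForm = mainForm
        self.openFileDialog = openFileDialog
        self.getDocument = getDocument

    def execute(self):
        # Start the dialog at the current document's location, if it has one.
        fileName = self.mainForm.document.fileName
        if fileName is not None:
            directory = os.path.dirname(fileName)
            if os.path.isdir(directory):
                self.openFileDialog.InitialDirectory = directory
                self.openFileDialog.FileName = fileName
        if self.openFileDialog.ShowDialog() == DIALOG_OK:
            document = self.getDocument(self.openFileDialog.FileName)
            # getDocument returns None when reading failed.
            if document is not None:
                self.mainForm.document = document

form = FakeMainForm()
command = OpenCommandSketch(form, FakeOpenFileDialog('notes.txt'), FakeDocument)
command.execute()
```

Passing getDocument in as an argument is a convenience for the sketch. In OpenCommand it’s an ordinary method, shown next.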

Listing 5.14 shows the implementation of getDocument.

Example 5.14. The getDocument method of OpenCommand

def getDocument(self, fileName):
   try:
      reader = DocumentReader(fileName)
      return reader.read()
   except (IOError, XmlException), e:
      name = Path.GetFileName(fileName)
      MessageBox.Show(
         'Could not read file "%s"\r\nThe error was:\r\n%s' %
         (name, str(e)),
         "Error Reading File",
         MessageBoxButtons.OK,
         MessageBoxIcon.Error
      )
      return None

getDocument is almost identical to the saveFile method of SaveCommand, except for the following three differences:

  • It uses DocumentReader instead of DocumentWriter.

  • It returns a new document object or None, instead of True or False.

  • It traps a tuple of exceptions (IOError, XmlException).

Two possible errors can occur. The first is that you fail to read the file from disk, perhaps due to a hard drive failure or the user whipping out a USB stick in between selecting the file with the dialog box and you actually managing to read it. This would cause an IOError. Alternatively, the document could be badly structured, resulting in an XmlException. If either situation happens, then you trap the error and alert the user with a message box.
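Trapping a tuple of exceptions can be tried in plain Python. In this minimal sketch, FakeXmlException is a stand-in for XmlException, and loadOrNone and rejectEverything are hypothetical helpers. It uses the "as e" spelling accepted by Python 2.6 and later, where the listing uses the older comma form.

```python
import os

class FakeXmlException(Exception):
    # Stand-in for XmlException so that the sketch runs without .NET.
    pass

errors = []

def loadOrNone(fileName, parse):
    try:
        with open(fileName) as handle:
            return parse(handle.read())
    except (IOError, FakeXmlException) as e:
        # Either kind of failure ends up here; the caller only sees None.
        errors.append(type(e).__name__)
        return None

def rejectEverything(text):
    raise FakeXmlException('badly structured document')

# A file that cannot be opened fails before parsing starts.
missing = loadOrNone(os.path.join('no-such-directory', 'missing.xml'), len)
# A file that opens but fails to parse fails inside the parser.
invalid = loadOrNone(os.devnull, rejectEverything)
```

Both calls return None, which is the signal that the caller checks before replacing the current document.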

At this point, we’d love to say that our work is done. Unfortunately, there’s a problem with this implementation of OpenCommand.

When the new document is returned from getDocument, it’s set on the MainForm. It is not just the MainForm that holds a reference to the current document, but also the TabController and the save commands. The TabController also needs to take action when a new document is loaded—it needs to update the tabs to reflect the new document.

To solve this problem, chapter 6 will set up a document observer system. But before we get to that, let’s review what we’ve done in this chapter.

Summary

We started the chapter by looking at first-class functions in Python. Being able to treat functions (or other callable objects) as ordinary objects enables some useful patterns. There’s a lot more to functional programming than we’ve covered here. Functional programming languages have been used mainly in academia, but they’ve been enjoying something of a renaissance recently. Languages like Haskell and Erlang are starting to gain popularity, but they’re still seen as harder to learn than imperative languages. Python supports functional programming, without limiting you to one particular style.

The rest of the chapter was concerned with XML reading and writing for MultiDoc. The approach we showed for reading should handle enough of the XML components for reading complex documents. It should also be easily extensible (simply add more node type handlers) for supporting parts of the XML spec that we haven’t covered. The XmlDocumentReader class demonstrates the power of IronPython. The .NET framework provides powerful and effective ways of dealing with XML that can be used from IronPython with much less code (which means more readable code) than from other .NET languages.

We did a small refactoring of the SaveCommand to use the XML writing capabilities, but we haven’t yet extended the MultiDoc user interface to incorporate the new OpenCommand.

The next chapter creates the additional features that MultiDoc is still missing, and integrates them into the user interface. This is the exciting part; you get to transform MultiDoc from the bare shell it is now (with exposed wires and plumbing—to return to our building analogy) into a functioning application. But you’ll need to get a little assistance from Visual Studio.



[1] The strategy pattern is the last pattern described in this 1995 book.

[2] Because the writer is writing into a string, which is still Unicode and has no encoding yet, it’s no wonder that it gets confused. The logic is possibly that the Windows internal UCS2 Unicode representation is most like UTF-16. Still, ignoring the explicit encoding on the XmlWriterSettings is a dubious practice.

[3] Uniform Resource Identifier, a term often used interchangeably with URL (Uniform Resource Locator), but supposedly more general.

[4] From the System.Xml namespace.

[5] In fact, for self-closing elements, the XmlReader doesn’t call the end tag handler.

[6] But, annoyingly, it’s different enough that little code can be shared.
