Case Study: Writing XML/HTML

NOTE

This case study is a more advanced demonstration of how a collection of cooperating classes (or class library) can simplify the generation of HTML. It isn't necessary to know HTML to follow the code, but it will help.


eXtensible Markup Language (XML) has become an important buzzword, and programs are increasingly being expected to be XML compliant. Markup languages should not be confused with programming languages. A markup language is, in fact, all form and very little content. It is a standard way of representing structured data with tags, but the meaning of those tags is not specified. It is more like an alphabet than a language. People get together to decide on a standard for a particular kind of knowledge, such as banking transactions, books, and class hierarchies, and then they give the tags specific meaning. For example, if you needed a more precise way to describe vegetables, you could create Vegetable Markup Language (VEGML), which might look like this:

<vegml>
  <root name="potato" size="15" color="brown">
    Boiled Fried Baked
  </root>
  <fruit name="tomato" size="8" color="red">
    Raw Boiled Fried
  </fruit>
  <nut name="acorn"/>
</vegml>

The items in <> are usually called tags or elements, and they can have attributes, which are name/value pairs. Between the open tag and the close tag there can be character data. If there is no character data, then you can abbreviate the final tag with a slash (/) (for example, <nut name="acorn"/>).

A very popular XML-style markup language is Hypertext Markup Language (HTML). Traditional HTML is not quite XML compatible, but it can be made so (for instance, by closing every paragraph tag <p> with </p>). There are two main differences between HTML and XML: XML tags are case-sensitive and HTML tags are not, and whitespace matters in XML but not in HTML. Here is a small HTML-style document of the kind this case study generates (note that standard HTML spells the bold and italic tags <b> and <i>):


<html>
<head>
  <title>This is the name of the page</title>
</head>
<body bgcolor="#FFFFFF">
<p>Normal text <a href="anewpage.htm">continues </a>as so.....</p>
<ul>
    <li>here is item 1</li>
    <li>another item</li>
</ul>
<p>Here is <bold>bold</bold> text.
 Here is <italic>italic </italic>text.</p>
</body>
</html>

This chapter can't discuss the tags in detail (the excellent NCSA HTML primer at archive.ncsa.uiuc.edu/General/Internet/WWW/HTMLPrimer.html is the place to start). In this case study, you will generate some HTML, and a good place to begin is to identify the objects involved: tags, attributes, and documents.

When designing a class library, it is useful to write out how you would imagine it being used. This is easier than first defining the interface for each class. Here is the kind of code I wanted to write: an HTMLDoc object is created, a heading “A heading” is written, followed by a new paragraph, and “and some text” in bold.

HTMLDoc d("first.xml");
d << head(1) << "A heading" << para << bold << "and some text";
d.close();

The idea, of course, is to make our class library work like the iostream library. Most of the iostream manipulators (such as width(); see Appendix B, “A Short Library Reference”) that you can put in streams affect only the next field. An alternative is something like bold(on) << "bold text!" << bold(off). But it would be tedious to have to close each tag manually, and that tedium would make it easier to generate badly formed documents. So somehow you have to keep tabs on the tags and close them automatically.

A useful way to think about this problem is to use stacks: You can keep a stack of tags. When a tag is pushed onto the stack, it is opened, and when it is popped, it is closed. For instance, opening the bold tag will put out “<bold>”, and closing it will put out “</bold>”. So the idea is to pop the stack after putting out text, which closes any pending tags. For instance, after “and some text”, the stack contains a bold tag, and popping it will put out “</bold>”.

This unfortunately works too well: After "A heading", it closes the heading tag, and it also closes the body and the HTML tags. The solution is to label formatting tags such as headings and bold as temporary tags, and make tags like <body> persistent. That is, only temporary tags are popped after an output item. You would finally pop all of the stack, temporary or not, when the document is closed.
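
The stack idea can be tried out in isolation before looking at the real classes. The following is only a minimal sketch: SimpleTag, TagStackSketch, and tag_stack_demo() are hypothetical names that do not appear in the chapter's library, and the output goes to a string rather than a file so that the result is easy to inspect.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for the chapter's Tag class.
struct SimpleTag {
    std::string name;
    bool temporary;
};

// Hypothetical miniature document that writes into a string, not a file.
class TagStackSketch {
    std::vector<SimpleTag> m_stack;
    std::ostringstream m_out;
public:
    // Opening a tag writes "<name>" and pushes the tag onto the stack.
    void open_tag(const std::string& name, bool temporary) {
        m_out << '<' << name << '>';
        m_stack.push_back(SimpleTag{name, temporary});
    }
    // Closing writes "</name>" for the tag on top of the stack and pops it.
    void close_top() {
        assert(!m_stack.empty());
        m_out << "</" << m_stack.back().name << '>';
        m_stack.pop_back();
    }
    // Text output closes only the temporary tags that are pending.
    void text(const std::string& s) {
        m_out << s;
        while (!m_stack.empty() && m_stack.back().temporary) close_top();
    }
    // Closing the document pops everything, temporary or not.
    void close_all() {
        while (!m_stack.empty()) close_top();
    }
    std::string str() const { return m_out.str(); }
};

// Builds a tiny document and returns the generated markup.
std::string tag_stack_demo() {
    TagStackSketch d;
    d.open_tag("body", false);   // persistent
    d.open_tag("h1", true);      // temporary
    d.text("A heading");
    d.open_tag("p", true);
    d.open_tag("bold", true);
    d.text("and some text");
    d.close_all();
    return d.str();
}
```

Calling tag_stack_demo() returns <body><h1>A heading</h1><p><bold>and some text</bold></p></body>: each piece of text closes the temporary tags opened before it, and the persistent body tag stays open until close_all() is called.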

Tag and Attrib classes both have name properties, so you can factor out that part of them as a NamedObject. In addition, an Attrib object has a value. A Tag object can contain a number of Attrib objects, but it usually contains none. Any general XML-generating code can be used to generate HTML, and it makes logical sense to consider HTMLDoc to be a specialized derived class of XMLDoc. XMLDoc must have a reference to some open filestream, and it must contain a stack of Tag objects.

The following code shows the interfaces for the Attrib and Tag classes (chap9\attrib.h and chap9\tag.h):

class Attrib: public NamedObject {
    string m_value;
public:
  Attrib (string nm="", string vl="")
  : NamedObject(nm), m_value(vl)
  { }
  string value() const {  return m_value; }
};

typedef std::list<Attrib *> AttribList;
typedef AttribList::iterator ALI;

// tag.h
#include "attrib.h"
using std::string;

class Tag: public NamedObject {
  bool m_open,m_temp;
  AttribList *m_attribs;
  mutable AttribList::iterator m_iter;
public:
  Tag(string name="", bool temp=true);
  Tag& set(bool on_off);
  void clear();
  bool closed() const       {  return ! m_open; }
  bool is_temporary() const {  return m_temp; }
  void add_attrib(string a, string v);
  void add_attrib(string a, int v);
  bool    has_attributes() const;
  Attrib *next_attribute() const;
};

typedef std::list<Tag> TagList;

Both Attrib and Tag are derived from NamedObject, which you saw in “A Hierarchy of Animals” in Chapter 8. Attrib is basically just a name plus a value. As you can see, Tag is not a complicated class; it has a name, can be closed or open, and may be temporary. Its main responsibility is looking after a list of attributes. It would have been easy to export attribute list iterators to anybody using this class, but how a Tag object organizes its business is its own affair. In particular, other code should not have to make assumptions about the attribute list. So Tag has a pair of functions for accessing attributes; the first method, has_attributes(), must be tested before you call next_attribute(), which returns non-null attribute pointers until there are no attributes left.
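
NamedObject itself is not repeated in this case study. As a reminder, here is a minimal sketch of what such a base class might look like. Its shape is an assumption inferred from how Tag and Attrib use it in this chapter (a constructor taking the name, a name() getter, and a name() setter); the Chapter 8 version may differ in detail. The renamed() function is a hypothetical helper that only exercises the class.

```cpp
#include <cassert>
#include <string>

// Assumed shape of NamedObject, inferred from its use by Tag and Attrib.
class NamedObject {
    std::string m_name;
public:
    NamedObject(std::string name = "") : m_name(name) { }
    std::string name() const { return m_name; }       // getter
    void name(std::string name) { m_name = name; }    // setter
};

// Hypothetical helper: creates an object, renames it, returns the new name.
std::string renamed(const std::string& before, const std::string& after) {
    NamedObject obj(before);
    assert(obj.name() == before);  // the constructor stored the first name
    obj.name(after);
    return obj.name();
}
```

The overloaded name() pair is what would allow a call such as head_tag.name("H1") to change a tag's name while tag.name() reads it.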

In this implementation, AttribList is a list of Attrib *, and you keep a pointer to AttribList, which is usually null. This is the most space-efficient way to implement tags, which often don't have attributes. But these are specific implementation decisions, which could change, and it is the job of Tag to keep them to itself. Here is the implementation of Tag, based on these assumptions:


Tag::Tag(string name, bool temp)
   : NamedObject(name),m_open(true),
     m_temp(temp),m_attribs(NULL)
  {  }

Tag& Tag::set(bool on_off)
 {  m_open = on_off; return *this; }

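void Tag::clear()
{
 // the list owns its Attrib objects, so free them before the list itself
 if (m_attribs != NULL)
   for (ALI i = m_attribs->begin(); i != m_attribs->end(); ++i)
     delete *i;
 delete m_attribs;
 m_attribs = NULL;
}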

void Tag::add_attrib(string a, string v)
{
    if (m_attribs == NULL) m_attribs = new AttribList;
    m_attribs->push_back(new Attrib(a,v));
}

void Tag::add_attrib(string a, int v)
{
     add_attrib(a,int2str(v));
}

bool Tag::has_attributes() const
{
    if (m_attribs == NULL) return false;
    m_iter = m_attribs->begin();
    return true;
}

Attrib *Tag::next_attribute() const
{
    if (m_iter == m_attribs->end()) return NULL;
    Attrib *a = *m_iter;
    ++m_iter;
    return a;
}

Note that next_attribute() is a const method of Tag, which it must be if it is to act on const Tag&. But ++m_iter modifies the object in order to get the next attribute; how can this be allowed? If you look at the declaration of m_iter, you will see the new qualifier mutable. What we are saying with mutable is that modifying the member variable m_iter does not really modify the object. It is an iterator to the attribute list, but the attribute list is not itself changed.
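
The same pattern can be seen in a smaller setting. The following sketch uses a hypothetical WordBag class (not part of the chapter's code) whose const next() method advances a mutable cursor, so that a function receiving only a const reference can still walk through the contents.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical class: a const cursor over data that the object owns.
class WordBag {
    std::vector<std::string> m_words;
    mutable std::size_t m_pos;   // iteration state, not part of the logical value
public:
    WordBag() : m_pos(0) { }
    void add(const std::string& w) { m_words.push_back(w); }
    void rewind() const { m_pos = 0; }
    // const, yet it advances m_pos; allowed only because m_pos is mutable
    const std::string* next() const {
        assert(m_pos <= m_words.size());
        if (m_pos == m_words.size()) return NULL;
        return &m_words[m_pos++];
    }
};

// Walks a const reference, much as XMLDoc walks a const Tag&.
std::size_t count_words(const WordBag& bag) {
    std::size_t n = 0;
    bag.rewind();
    while (bag.next() != NULL) ++n;
    return n;
}
```

If the mutable qualifier were removed from m_pos, neither rewind() nor next() would compile as const methods, and count_words() could not take its argument by const reference.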

You now need to think about the class that does the serious work of generating XML. A list of tags works well as a stack; in fact, most of what XMLDoc does is manage this stack, as you can see here:


class XMLDoc {
  TagList m_tstack;
  ofstream m_out;
public:
// Tag stack management
  void push_tag(const Tag& tag)  {  m_tstack.push_back(tag);  }
  void pop_tag()                 {  m_tstack.pop_back();      }
  bool empty()   const           {  return m_tstack.size()==0; }
  const Tag& current() const     {  return m_tstack.back();    }

  void push(const Tag& tag)
  {
     Attrib *a;
     m_out << '<' << tag.name();
     if (tag.closed()) m_out << '/';
     else if (tag.has_attributes()) {
        m_out << ' ';
        while ((a = tag.next_attribute()) != NULL)
          m_out << a->name() << '='
                << quotes(a->value()) << ' ';
     }
     m_out << '>';
     push_tag(tag);
   }

  bool pop()
  {
    if (empty()) return false;
    const Tag& tag = current();
    m_out << "</" << tag.name() <<"> ";
    pop_tag();
    return ! empty();
  }
 // streaming out and document management
  virtual void outs(char *str)
  {
    m_out << str;
    while (!empty() && current().is_temporary()) pop();
  }

  void outs(const string& s)
  {   outs(s.c_str()); }
 void open(const string& name)
 {  m_out.open(name.c_str()); }

 void close()
 {
 // close out ALL pending tags
  cout << "closing....\n";
  while (pop())  ;
  m_out.close();
 }

 ~XMLDoc()
 {  close(); }

}; // class XMLDoc

Notice that all references to m_tstack are in the first four methods, which effectively define XMLDoc as a stack-like class. The methods push() and pop() are where the XML specifics are. Because Tag is looking after the attributes, this code can run through them simply, without making any assumptions about how Tag stores the attributes, using Tag's has_attributes() and next_attribute() methods.

The outs() method is the gateway for all character data; after writing the text to the file stream, it closes any pending tags by popping any temporary tags off the stack. Finally, the document must be opened and closed; when it's closed, the tag stack must be completely emptied with the short and sweet statement while(pop());.

Note that XMLDoc is not derived from anything. You might be tempted by the thought “XMLDoc is a stack of Tag objects,” and try to inherit from some standard class. This is a bad idea because an XMLDoc object has a stack of Tag objects; it is not a stack of Tag objects. In particular, inheriting from list<Tag> would make XMLDoc export all kinds of things that have nothing to do with XMLDoc's job. Likewise, the idea of inheriting from ofstream is unwise; you want to force all text data through the narrow gate of the outs() method. In this case study, composition (that is, building a class out of other classes) makes more sense than inheritance. Inheritance should not be used just because it makes a program seem more object oriented.

To get the intended mode of use, you need to create a few more operator overloads. They may only be syntactic sugar, but these overloads make creating documents a lot easier. As with ostream, operator<< is overloaded to output character strings and int values; these all go through the outs() method. It is also overloaded for Tag arguments; the tag is pushed on the document's tag stack:

XMLDoc& operator<<(XMLDoc& doc, char *s)
{
  doc.outs(s);
  return doc;
}

XMLDoc& operator<<(XMLDoc& doc, const string& s)
{
  doc.outs(s);
  return doc;
}

XMLDoc& operator<<(XMLDoc& doc, int val)
{
  doc.outs(int2str(val));
  return doc;
}

XMLDoc& operator<<(XMLDoc& doc, const Tag& tag)
{
   doc.push(tag);
   return doc;
}
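
The listings in this case study call two helpers, int2str() and quotes(), whose definitions are not shown here. The following is a minimal sketch of what they might look like; both bodies are assumptions based only on how the functions are called, and format_attrib() is a hypothetical extra that mirrors the way XMLDoc::push() writes an attribute.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Assumed helper: converts an int to its decimal string form.
std::string int2str(int val) {
    std::ostringstream out;
    out << val;
    return out.str();
}

// Assumed helper: wraps a string in double quotes for an attribute value.
// This sketch does not escape quote characters inside s.
std::string quotes(const std::string& s) {
    return "\"" + s + "\"";
}

// Hypothetical: formats one attribute as name="value".
std::string format_attrib(const std::string& name, const std::string& value) {
    assert(!name.empty());   // an attribute needs a name
    return name + "=" + quotes(value);
}
```

With these definitions, format_attrib("size", int2str(15)) produces size="15", which matches the attribute syntax in the VEGML example at the start of this case study.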

Up to now you have not seen anything specifically about HTML. But because HTML is an XML-like language, it is not difficult to specialize XMLDoc by creating a derived class called HTMLDoc. You should note two things here. First, XMLDoc is now a useful part of your toolkit that is available to all other projects you're working on. (There is quite a bit of pressure to make all programs talk to each other in some XML-compatible language.)

Second, XMLDoc supplies a concise language that makes the actual job almost straightforward (I say almost because nothing in software is trivial). The open() method uses the tag-stack interface to generate the tedious bit at the front of all HTML documents. (All this code is found in chap9\html.cpp.)

void HTMLDoc::open(string name, string doc_title, string clr)
{
  if (name.find(".") == string::npos) name += ".htm";
  XMLDoc::open(name);
  if (doc_title=="") doc_title = name;

  push(Tag("HTML",false));
   push(Tag("HEAD"));
   push(Tag("TITLE"));
   XMLDoc::outs(doc_title);
   XMLDoc::outs("\n");

   Tag body_tag("BODY",false);
   body_tag.add_attrib("bgcolor",clr);
   push(body_tag);
   XMLDoc::outs("\n");
}

Note how it is necessary to use the fully qualified name to call the inherited XMLDoc::open() method. Here is an example of what HTMLDoc::open() generates at the front of the HTML document (shown in lowercase and with line breaks added; the code above writes the tag names in uppercase):

<html>
<head>
  <title>This is the name of the page</title>
</head>
<body bgcolor="#FFFFFF">

You also have to explicitly use XMLDoc::outs() because HTMLDoc is going to override it:

void HTMLDoc::outs(char *str)  // override
{
// calls original method to do output
   if (strchr(str,'\n') != NULL) {
     // assumes str is shorter than 256 characters
     char buff[256];
     strcpy(buff,str);
     for(char *t = strtok(buff,"\n");
         t != NULL; t = strtok(NULL,"\n"))
     {
       push(para);
       XMLDoc::outs(t);
       XMLDoc::outs("\n");
     }
   } else XMLDoc::outs(str);
}

Here is some old-fashioned C-style code for a change. HTML text runs together unless you put out paragraph tags. If the character string contains a newline character (\n), then it must be broken up and the pieces separated by using paragraph tags. The strchr() function returns a pointer to the first match of the specified character; if there is no match, it returns NULL.

strtok() is fairly eccentric: You give it a set of characters (in this case, "\n"), which is used to break up the string into tokens. strtok() is passed the character string for the first call; thereafter it is passed NULL. strtok() modifies the buffer we give it; hence the strcpy(). (See Appendix B, “A Short Library Reference,” for more information about strtok() and other C string functions.)
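
To see the strtok() calling pattern on its own, here is a small sketch that splits a string into lines. The function split_lines() is a hypothetical helper, not part of the chapter's code; it copies the text into a std::vector<char> rather than a fixed-size array, so the length of the input does not matter.

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <vector>

// Splits text on newlines using strtok(), working on a private copy
// because strtok() overwrites the delimiters in the buffer it is given.
std::vector<std::string> split_lines(const char* text) {
    assert(text != NULL);
    // copy the characters including the terminating '\0'
    std::vector<char> buff(text, text + std::strlen(text) + 1);
    std::vector<std::string> lines;
    // first call passes the buffer; later calls pass NULL to continue
    for (char* t = std::strtok(&buff[0], "\n"); t != NULL;
         t = std::strtok(NULL, "\n"))
        lines.push_back(t);
    return lines;
}
```

Note that strtok() treats a run of delimiters as a single separator, so split_lines("one\n\ntwo\n") yields just the two strings "one" and "two"; empty lines are skipped rather than returned as empty tokens.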

You still have to define some HTML-specific tags. Here are a few of the most important ones. Simple formatting tags work directly (for instance, <bold>Some text</bold>), but some tags take parameters. Instead of defining a number of heading tags (<H1>, <H2>, and so forth), I've defined a function head() that modifies the name of the global head_tag variable and returns a reference to it. The function link() is passed an Internet address and returns the global link_tag. The address is the value of the HREF attribute, which is added to link_tag.

Tag bold("BOLD");
Tag italic("ITALIC");
Tag link_tag("A");
Tag para("P");
Tag head_tag("");

Tag& head(int level = 0)
{
   head_tag.name("H" + int2str(level));
   return head_tag.set(level > 0);
}

Tag& link(string fname)
{
   link_tag.clear();
   link_tag.add_attrib("HREF",fname);
   return link_tag;
}

And finally, here is some exercise code for HTMLDoc:

void exercise()
{
  int n = 58;
  HTMLDoc doc("first","An introduction");
  doc << head(1) << "Here is a Title";
  doc << "There have been " << n << " visitors\n";
  doc << bold << "bold text\n";
  doc.close();
}
