Anatomy of a Docbase Record

Docbase records are semistructured. Each has parts that correspond roughly to the header and the body of an email message. The header fields of a record contain values that are typically compact, such as names or dates. These values often belong to controlled vocabularies, such as names of companies, products, or authors. Header fields provide the hooks we’ll use in Chapter 7 to build navigational indexes for docbases and in Chapter 8 to organize search results.

The body fields of a docbase record contain free-form text. They often exhibit patterns, such as URLs, that provide hooks for the kinds of instrumentation we saw in Chapter 5. Body fields are subject to full-text search, as are header fields. But unlike header fields, they don’t provide hooks for building navigational indexes.

In this chapter, we’ll focus on a Docbase instance called ProductAnalysis. Its records are reports, written by industry analysts, that assess high-tech products. The creation of a record is a shared responsibility. A manager assigns a report to an analyst, specifying some of the header fields. These manager-specified header fields are as follows.

  • Analyst’s name

  • Date of assignment

  • Due date for report

  • Name of company

  • Name of product (optional, may be supplied by analyst)

There are four analyst-supplied body fields: the report title, a summary of the report (a sentence or paragraph), the full report (many paragraphs), and a chunk of contact information (names, phone numbers, email and web addresses).
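
To make the split between header and body concrete, here is a minimal Python sketch of a ProductAnalysis record. It is only an illustration: the class name, the attribute names, and the sample values are assumptions, not the field names the actual record template uses.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ProductAnalysisRecord:
    """Hypothetical in-memory view of one ProductAnalysis record."""

    # Header fields: compact values, specified by the manager.
    analyst: str
    assigned: date
    due: date
    company: str
    product: Optional[str] = None  # optional; the analyst may supply it

    # Body fields: free-form text, supplied by the analyst.
    title: str = ""
    summary: str = ""
    full_report: str = ""
    contact: str = ""

    def header(self) -> dict:
        """Return the header fields, the ones suited to navigational indexes."""
        return {
            "analyst": self.analyst,
            "assigned": self.assigned,
            "due": self.due,
            "company": self.company,
            "product": self.product,
        }

    def body(self) -> dict:
        """Return the body fields, the ones reached by full-text search."""
        return {
            "title": self.title,
            "summary": self.summary,
            "full_report": self.full_report,
            "contact": self.contact,
        }


# Sample values are invented for illustration.
record = ProductAnalysisRecord(
    analyst="Pat Analyst",
    assigned=date(2024, 1, 8),
    due=date(2024, 1, 22),
    company="Example Corp",
)
```

A manager would fill in the header when assigning the report; the body fields stay empty until the analyst writes them.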

Data-collection Strategies

The header fields of a docbase record are conventionally database-like. The body fields are typically just blobs of text. Here are some guidelines to help you build and manage collections of semistructured records:

Not too many fields

When you can make the form short, you should. Users find large flocks of input fields intimidating. From the implementor’s perspective, each field involves one more bit of template, validation, and indexing overhead. In Figure 6.2 we’ll see how you can override input fields with preassigned values. That’s an excellent way to streamline and simplify a form.

Multivalued fields

Collecting contact information can be a chore. Email address, phone number (home? work? both?), fax number, postal address (U.S.? International?)—it can be a real headache to shoehorn all this data into fields. But do you really need to? For years the Virtual Press Room successfully used the method we’ll also see at work here: a single contact field, accommodating free-form text, coupled with backend validation that looks for patterns that signify required elements. It’s not hard to pick out web addresses, email addresses, and phone numbers.
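
As a rough illustration of that kind of backend validation, here is a Python sketch that scans a free-form contact field for web addresses, email addresses, and phone numbers. The patterns are deliberately loose (the phone pattern assumes U.S.-style numbers), and the function names and the choice of which elements are required are assumptions made for the example, not a description of how the Virtual Press Room did it.

```python
import re

# Loose, illustrative patterns. They aim to notice likely matches,
# not to validate addresses or numbers rigorously.
PATTERNS = {
    "web": re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE),
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "phone": re.compile(r"\(?\d{3}\)?[-. ]\d{3}[-. ]\d{4}"),
}


def find_contact_elements(text):
    """Return every match of each pattern found in a free-form contact field."""
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}


def missing_elements(text, required=("email", "phone")):
    """List the required elements (an assumed policy) that the text lacks."""
    found = find_contact_elements(text)
    return [name for name in required if not found[name]]


contact = """Jane Doe
(707) 555-0123
jane@example.com
http://www.example.com/press"""

problems = missing_elements(contact)  # an empty list means the field passes
```

If `missing_elements` returns anything, the form handler can bounce the submission back with a message that names what is missing, and the user still gets to type contact details in whatever layout comes naturally.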

What if you need to find a particular value that isn’t explicitly fielded? It’s true that you can’t issue an SQL query like select * from docbase where areacode like '707%'. However, the record template that governs the ProductAnalysis docbase yields XML-formatted records, so we can search for things inside fields that hold free-form text. In practice, that’s often good enough. If finding records where the area code is 707 is a frequent operation, you should make the phone number a distinct header field, then index the docbase on that field using methods detailed in the next two chapters. But if that kind of search only happens once in a blue moon, you might do better to spare yourself (and your users) the overhead of dealing with one more explicit field.
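
Here is a sketch of that once-in-a-blue-moon search, written in Python. It assumes each record is a small XML document with a `contact` element under the root; those element names, like the sample records, are invented for the example and are not the actual ProductAnalysis template.

```python
import re
import xml.etree.ElementTree as ET


def records_with_area_code(xml_records, area_code, field="contact"):
    """Return parsed records whose free-form field mentions the area code.

    xml_records is an iterable of XML strings; field is the (assumed)
    name of the element that holds the free-form contact text.
    """
    # Match the area code only where it starts a U.S.-style phone number.
    pattern = re.compile(
        r"\(?\b" + re.escape(area_code) + r"\)?[-. ]\d{3}[-. ]\d{4}"
    )
    hits = []
    for xml_text in xml_records:
        root = ET.fromstring(xml_text)
        node = root.find(field)
        text = node.text if node is not None and node.text else ""
        if pattern.search(text):
            hits.append(root)
    return hits


# Hypothetical records, for illustration only.
records = [
    "<record><company>Example Corp</company>"
    "<contact>Jane Doe, (707) 555-0123</contact></record>",
    "<record><company>Sample Inc</company>"
    "<contact>John Roe, 415-555-0199</contact></record>",
]

matches = records_with_area_code(records, "707")
```

This is a linear scan over every record, which is the price of not fielding the value. That cost is the reason to promote the phone number to a header field if the search becomes routine.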

Simple formatting

It’s tempting to invite users to format what they write using HTML tags. For this class of docbase, though, I think it’s wise to resist that temptation. The Docbase::Input module creates effective and readable reports using only the HTML styles that each page inherits from the record template. Within a field, it converts double newlines to paragraph (<p>) tags, then remaining newlines to <br> tags. When users supply paragraph-oriented rather than line-oriented input—as will happen when they paste in material from a word processor—this works out surprisingly well for paragraphs, subheads, and lists.
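
The newline handling is simple enough to sketch in a few lines. The Python below is a stand-in for the idea, not the Docbase::Input code itself: the function name is made up, and it wraps each paragraph in a `<p>...</p>` pair, which is one reasonable reading of “converts double newlines to paragraph tags.”

```python
import re


def format_field(text):
    """Turn plain text into paragraphs and line breaks.

    Runs of two or more newlines separate paragraphs; any newline left
    inside a paragraph becomes a <br> tag.
    """
    text = text.replace("\r\n", "\n").strip()
    if not text:
        return ""
    paragraphs = re.split(r"\n{2,}", text)
    return "\n".join(
        "<p>" + paragraph.replace("\n", "<br>\n") + "</p>"
        for paragraph in paragraphs
    )


sample = "Overview\n\nFirst point\nSecond point"
formatted = format_field(sample)
```

A short line standing alone between blank lines comes out as its own paragraph, which is why pasted subheads and lists survive the conversion reasonably well.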

Of course, you can do better than this. It’s easy, for example, to define a stripped-down markup language for subheads, lists, and emphasis. Or you can pass through only the HTML tags for those same elements while blocking all others. Or you can pass through all the HTML tags, on the theory that only those who know how to use them will even bother to try. If you find that any of these policies doesn’t work well, you can easily try another. But don’t try to outthink your users or overdesign the system. You can put a lot of effort into building features that deliver no real benefit or that make matters even worse by opening cans of worms best left tightly sealed. Measure the success of a docbase solely in terms of its ability to connect users and information. The Virtual Press Room ran for years, collecting documents from people who told me they found the application self-explanatory and easy to use. You can’t argue with simple solutions that work well.
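
The middle policy, passing a few tags and blocking the rest, might look like the following Python sketch. The allowed set is an assumption chosen to cover subheads, lists, and emphasis, and “blocking” here means that a disallowed tag is displayed as literal text rather than stripped. Tags that carry attributes are left escaped too, which keeps things like event handlers out.

```python
import html
import re

# Assumed allowlist: subheads, lists, and emphasis only.
ALLOWED_TAGS = {"h3", "ul", "ol", "li", "em", "strong"}

# Matches an escaped bare tag such as &lt;em&gt; or &lt;/ul&gt;.
ESCAPED_TAG = re.compile(r"&lt;(/?)([A-Za-z][A-Za-z0-9]*)&gt;")


def pass_allowed_tags(text, allowed=ALLOWED_TAGS):
    """Escape all markup, then restore only bare tags on the allowlist."""
    escaped = html.escape(text, quote=False)

    def restore(match):
        slash, name = match.group(1), match.group(2).lower()
        if name in allowed:
            return "<" + slash + name + ">"
        return match.group(0)  # leave disallowed tags escaped

    return ESCAPED_TAG.sub(restore, escaped)


cleaned = pass_allowed_tags("<em>Fast</em> and <script>alert(1)</script>")
```

Escaping first and restoring second means that anything the pattern fails to recognize stays harmless by default. Switching policies is then a matter of editing `ALLOWED_TAGS`.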
