Decomposing CyberCinema

Let's turn to our more complex example of the CyberCinema Web site. Let's revisit our skeletal requirements:

  • Reviewers must be able to write reviews.

  • Reviews must be searchable by movie.

  • Reviews must be searchable by director.

  • Reviews must be searchable by actor.

  • Reviews must be searchable by reviewer.

  • Movies must be searchable by director.

  • Movies must be searchable by actor.

Now refer to the XML DTD for CyberCinema that we built in Chapter 5 (see Appendix A).

Finally, we'll bring to bear our data model diagram (see Figure 6-2).

Figure 6-2. CyberCinema data model diagram


First, we determine the tables required for our database. From reviewing our data model in Figure 6-2, we find the minimal tables we need are a table for movie reviews (which will be the table containing our XML review instances), a table for movies, and a table for people (such as reviewers, actors, and directors). Because people can have different roles (that is, reviewers, actors, and directors) within our system, it makes sense to create a table for people and a separate table for roles people can play. That way, if we want to add a role (such as producer or screenwriter) and start tracking people in that role in addition to the three roles we've already identified, we don't have to go through SQL schema evolution.

From reviewing our requirements, we've got a list of requests with which we want to be able to query the database:

  • List the reviews of a given movie.

  • List the reviews by a given reviewer.

  • List the movies in which a given actor acted.

  • List the movies that a given director directed.

  • List the reviews that relate to a given actor.

  • List the reviews that relate to a given director.

Let's take a look at our person and role tables first because they're the simplest:

CREATE TABLE person (
    person_id      INTEGER(16)     NOT NULL     PRIMARY KEY,
    person_name    VARCHAR(255)    NOT NULL
);

CREATE TABLE role (
    role_id        INTEGER(16)     NOT NULL     PRIMARY KEY,
    role_name      VARCHAR(255)    NOT NULL
);

CREATE TABLE person_role (
    person_id   INTEGER(16) NOT NULL REFERENCES person(person_id),
    role_id     INTEGER(16) NOT NULL REFERENCES role(role_id),
    PRIMARY KEY (person_id, role_id)
);

Notice that, in addition to creating a person table and a role table, we've also created a bridging table, person_role. Now people in our database can have more than one role. We've already solved the “Woody Allen” bug.[1] We've also solved the “Roger Ebert” bug[2] (people can be movie reviewers and also actors). And by making the combination of the two columns the primary key, we've forced uniqueness on each (person_id, role_id) pair, so a given person can't appear in the table with the same role more than once (which wouldn't make sense). Already, our system is surging with power, and we haven't even injected our XML into the mixture!

[1] Woody Allen writes, directs, and stars in most of his own films.

[2] Roger Ebert, famous for being a movie reviewer, has also appeared in some films (usually playing himself).
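
To make this concrete, here's a minimal sketch of how the “Woody Allen” case might look in these tables (the ID values are invented for illustration):

-- One person, two roles.
INSERT INTO person (person_id, person_name) VALUES (1, 'Woody Allen');
INSERT INTO role (role_id, role_name) VALUES (1, 'Actor');
INSERT INTO role (role_id, role_name) VALUES (2, 'Director');

-- One row in the bridging table per role the person plays.
INSERT INTO person_role (person_id, role_id) VALUES (1, 1);
INSERT INTO person_role (person_id, role_id) VALUES (1, 2);

-- Inserting (1, 1) a second time would violate the primary key:
-- the same person can't hold the same role twice.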

Bridging Tables

“Bridging tables” is my name for tables that exist solely to bridge the gap between two different tables. They're sometimes called linking tables, but we're throwing around the word “link” in some other contexts here, so I'm sticking to “bridge.” Any time you want to create a relationship between two different tables, you should use a bridging table. You can simply include a reference in one table directly to the other table (as we saw in the book database example in Chapter 2), but doing so is actually bad practice for anything other than the simplest uses of SQL databases. Using bridging tables provides much more flexibility. In the book database example, for instance, we couldn't have stored any books with multiple authors the way I did it (with a direct reference from one table to another). With the addition of a bridging table that contains references to both the books table and the authors table, multiple authors per book are possible.
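
As a minimal sketch (the table and column names here are invented; the actual book database from Chapter 2 may differ), such a bridging table might look like this:

CREATE TABLE book_author (
    book_id      INTEGER(16)    NOT NULL    REFERENCES book(book_id),
    author_id    INTEGER(16)    NOT NULL    REFERENCES author(author_id),
    PRIMARY KEY (book_id, author_id)
);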


Now let's take a look at movies:

CREATE TABLE movie (
    movie_id      INTEGER(16)    NOT NULL     PRIMARY KEY,
    movie_name    VARCHAR(255)   NOT NULL,
    movie_date    DATE           NOT NULL
);

Name and date are probably all we need to identify a movie uniquely. You rarely see two movies come out with the same title in the same year—it tends to confuse the viewing public.

Dates

The reason I've chosen a SQL DATE data type for the previous example is that the database can now respond reliably to requests like “list the movies that were released between 1987 and 1989.” Even though this request isn't one of the ones we're presently making, it doesn't hurt to think ahead. The database can also return the date information to you in whatever format you need. Don't be afraid to take advantage of the features of your RDBMS.
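
For example, the date-range request might look something like this (exact date literal syntax varies from one RDBMS to another):

-- Movies released between 1987 and 1989, inclusive.
SELECT movie_name, movie_date
FROM movie
WHERE movie_date >= '1987-01-01'
AND movie_date <= '1989-12-31';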


For our system to be able to respond to the last two requests (listing reviews that relate to given actors and listing reviews that relate to given directors), we also have to build bridging tables to bridge people and movies. But wait a second—what are these bridging tables really bridging? Movies to people? Actually, they're bridging movies to people's roles. At least, they are if you want to be able to respond to the request “list the movies a given person acts in.”

So our movies to people's roles table looks like this:

CREATE TABLE movie_person_role (
    movie_id    INTEGER(16) NOT NULL REFERENCES movie(movie_id),
    person_id   INTEGER(16) NOT NULL REFERENCES person(person_id),
    role_id     INTEGER(16) NOT NULL REFERENCES role(role_id),
    PRIMARY KEY (person_id, role_id, movie_id)
);

Now we've got an interconnecting web of movies and people that, after it's populated with data, can give us instant filmographies on any given person and tell us who directed or acted in any given movie. SQL rocks!
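
For instance, a filmography request might be answered with a query along these lines (a sketch, assuming the role table contains a row named 'Actor'):

SELECT movie.movie_name, movie.movie_date
FROM movie, movie_person_role, person, role
WHERE movie.movie_id = movie_person_role.movie_id
AND person.person_id = movie_person_role.person_id
AND role.role_id = movie_person_role.role_id
AND role.role_name = 'Actor'
AND person.person_name = 'Tony Curtis';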

XML Nitro Injection: Adding Reviews

Now we're ready to add our review table. As we did earlier, let's start with a database schema that uses partial decomposition for maintaining data integrity. First we need a table to hold our reviews:

CREATE TABLE review (
    review_id    INTEGER(16)    NOT NULL     PRIMARY KEY,
    review_xml   CLOB           NOT NULL
);

Let's add the following bridging tables:

CREATE TABLE review_movie (
    review_id    INTEGER(16)    NOT NULL     REFERENCES review(review_id),
    movie_id     INTEGER(16)    NOT NULL     REFERENCES movie(movie_id),
    PRIMARY KEY (review_id, movie_id)
);
CREATE TABLE review_person (
    review_id    INTEGER(16)    NOT NULL     REFERENCES review(review_id),
    person_id    INTEGER(16)    NOT NULL     REFERENCES person(person_id),
    PRIMARY KEY (review_id, person_id)
);

Notice that we're not bridging to the person_role table for reviews as we do for movies. That's because the XML itself (as described in Chapter 5) doesn't include a role attribute for the person element, and it doesn't really need to. During decomposition, we won't know from the XML instance which role this particular person played in the film being reviewed, but we don't really need to know.

During XML decomposition, when you see the following:

<review id="123"> ... <person id="2345">Pauly Shore</person>'s movies are largely
misunderstood works of genius. ... </review>

your insert into the database looks something like this:

INSERT INTO review_person (review_id, person_id) VALUES (123, 2345);

And because the XML decomposition happens when the XML is inserted into the database, you end up building an accurate relational map of everything you've chosen to decompose.
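
That map is what lets the database answer requests like “list the reviews that relate to a given actor” directly. A minimal sketch:

-- All reviews that mention a given person.
SELECT review.review_id, review.review_xml
FROM review, review_person, person
WHERE review.review_id = review_person.review_id
AND person.person_id = review_person.person_id
AND person.person_name = 'Pauly Shore';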

Link Management

Managing the links between documents (especially large numbers of documents) can be an arduous task. For example, so-called “link rot” on the Web often turns Web surfing into a frustrating experience. You click on one link and then another, only to find that the links lead to dead ends: error (404) screens, error messages from your browser, and the like. This phenomenon is due to a lack of link management. When Webmasters decide to remove a page from a Web site, they have no reliable way to find out how many pages link to it, so they have no way to determine the extent of the link rot they're causing. Similarly, people hosting lists of links have no way of knowing that a particular page they're linked to has been removed. At best, they can go through their list of links on a regular basis and try to sweep out the rubbish, but it's a losing battle.

Rewriting URLs

The problem of bad links is pervasive, and instituting good link management at the content management layer is one solution. However, skillful Webmasters have other tools they can use to combat this problem. One is to rewrite URLs on the fly within the Web server. This technique can be especially useful when you're making major changes to your Web infrastructure (such as moving from one publishing system to another). Apache's mod_rewrite (if you're using Apache as a Web server, which I highly recommend) is the pinnacle of achievement in this area. It enables you to rewrite URLs automatically on a request-by-request basis, using rules that can be as complex as you want them to be.

For instance, say your URLs are all in the form http://www.torgo.com/news/1234.html, where “1234” is an article or object ID number. Because of a change in your site's architecture (for example, instituting an XML-based publishing system), you want to change all your URLs to the form http://www.torgo.com/articles.jhtml?article_id=1234. If you institute a mod_rewrite rule to transform the string “news/<number>.html” into “articles.jhtml?article_id=<number>”, you've preserved the bookmarks of everyone who saved your older articles, along with links from other sites you may not even know about. And that's only the bare minimum that the wonders of mod_rewrite can achieve.
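
As a sketch, such a rule might look something like the following (the exact pattern and flags depend on your server configuration):

# Rewrite old-style article URLs to the new publishing system:
# /news/1234.html  ->  /articles.jhtml?article_id=1234
RewriteEngine On
RewriteRule ^/?news/([0-9]+)\.html$ /articles.jhtml?article_id=$1 [L]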

At this writing, the URL for mod_rewrite is http://httpd.apache.org/docs/mod/mod_rewrite.html.


The problem of link rot on the Web is pervasive and nasty, and there's no ready solution for it. While it's annoying to get a 404 message in response to clicking a link, I can be forgiving in most cases. I'm astounded when I get an error clicking on a link to a page within the same site. That kind of nonsense is completely preventable, as you'll see.

In Chapter 5, we added a link syntax (XLink) to link reviews to movies, people, and other reviews (see the section Linking Up: XLink in Chapter 5). With all of these links, but especially in linking reviews to other reviews, the concept of link management is key.

Reviewing the DTD syntax we have for our links, we see that the sample review we created in Chapter 5 looks like this:

<?xml version="1.0"?>
<CYBERCINEMA_REVIEW ID="123">
<HEAD>
...header stuff we're not interested in...
</HEAD>
<BODY>
<!--  Body information goes here  -->
The film <MOVIE ID="12345">Spartacus</MOVIE> stars
<PERSON ID="932">Tony Curtis</PERSON> as a fun-loving slave. Often confused with
<MOVIE ID="12346">Ben Hur</MOVIE> (see our <REVIEW ID="876">review</REVIEW>),
this 1960s classic is...
</BODY>
</CYBERCINEMA_REVIEW>

The decomposition of the PERSON and MOVIE elements was described previously (see the section XML Nitro Injection: Adding Reviews, earlier in this chapter). The decomposition of the REVIEW elements is much the same, starting with a relational table:

CREATE TABLE review_review (
     review_from_id   INTEGER(16)    NOT NULL     REFERENCES review(review_id),
     review_to_id     INTEGER(16)    NOT NULL     REFERENCES review(review_id),
     PRIMARY KEY (review_from_id, review_to_id)
);

The difference is that we're relating from a table back into the same table, so to differentiate the two references, we've labeled them the “from” reference and the “to” reference. In the case of the preceding XML instance, the insert statement performed on decomposition looks like this:

INSERT INTO review_review (review_from_id, review_to_id) VALUES (123, 876);

When you're finished, you'll have an exact map of which reviews reference which other reviews. If you decide not to use a review because of a pending lawsuit, the system can quickly and easily tell you which reviews link to it.
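
For example, finding every review that links to review 876 is a single-table query:

SELECT review_from_id
FROM review_review
WHERE review_to_id = 876;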

Selecting What You Need

After you've inserted a few hundred reviews, you'll be able to get your database to respond to fairly specific requests, such as “list reviews by a certain reviewer of movies made between 1990 and 1995 in which a certain actor appeared”:

SELECT review.review_id, review.review_xml
FROM review, review_person, person AS reviewer,
  review_movie, movie, movie_person_role,
  role, person AS actor
WHERE review.review_id = review_person.review_id
AND reviewer.person_id = review_person.person_id
AND review_movie.review_id = review.review_id
AND movie.movie_id = review_movie.movie_id
AND movie.movie_id = movie_person_role.movie_id
AND role.role_id = movie_person_role.role_id
AND actor.person_id = movie_person_role.person_id
AND movie.movie_date > '1990-12-31'
AND movie.movie_date < '1995-01-01'
AND reviewer.person_name LIKE '%Ebert%'
AND role.role_name = 'Actor'
AND actor.person_name = 'Alec Baldwin';

The database can reliably answer this request; why you would want to ask it is another issue.

As a bonus, we've also addressed the dead link issue. Because all the links between your documents are mirrored in a database that enforces the integrity of that structure, it's much more difficult to have a bad link. If an author tries to insert a link to a nonexistent movie or review, the database rejects it because accepting it would breach referential integrity. Your authoring interface should give the author some kind of helpful error message, such as “the requested link target does not exist.” Again, this leverages the power of the relational database. You won't have to do as much link checking if you know that your database is humming away in the background, ensuring the integrity of your interdocument links.
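
For example, assuming no movie row with an ID of 99999 exists, an attempted decomposition of a link to it fails at the database level:

-- Rejected by the REFERENCES constraint on movie_id.
INSERT INTO review_movie (review_id, movie_id) VALUES (123, 99999);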

Using Link Management to Help Power Suggestions

So, after you've developed this elegant database schema around your data, what else can you do with it? Remember when I was spouting off about “emergent properties” of systems in the Introduction? A database schema based on the idea of partial decomposition can also exhibit emergent properties, developing smarts beyond the original design. A good example of this is in the field of suggestions. The complex maps you've built in the review_movie, review_review, and review_person tables can be even more useful, especially if you're interested in making suggestions to users, that is, providing information they might be interested in based on what they're looking at now.

The science of suggesting relevant information is nothing new. Many software systems do this through methods such as collaborative filtering or lexical analysis.

The idea behind lexical analysis systems is to break documents down to their lexical components—words or phrases—and compare new documents based on a statistical relevance ranking of these components. In theory, documents that have similar subject matter should rank closely together; in reality, lexical analysis often groups documents that don't necessarily go together.

For instance, lexical engines can readily group all of Shakespeare's works together, but a lexical system can't differentiate between a comedy and a tragedy. If you were reading Troilus and Cressida and you wanted the system to show you “more plays like this one,” it probably could manage to rule out Speed the Plow, but you might just as easily be shown As You Like It as Romeo and Juliet. The system needs more information than purely lexical similarity to guide it.

Likewise, collaborative filtering, when left to its own devices, can develop strange behaviors. Collaborative filtering attempts to suggest items to you based on matching your own likes and dislikes against those of all other users on the system. Collaborative filtering's answer to the previous user request involves a complex equation based on what other readers who had read Troilus and Cressida had read and enjoyed. This means that you might get a list of other Shakespearean tragedies as well as Greek tragedies and other similar fare. Collaborative filtering misses any advantage it could gain from the content itself, however. You might as well be reading Dick and Jane—or eating avocados for that matter. A collaborative filtering system knows only about user responses.

Both lexical analysis and collaborative filtering lack a means of receiving editorial input into the suggestion process. If the editorial notes for Troilus and Cressida include a discussion of Shakespeare's possible influences and similar works, and that discussion includes links to those other works, we now have something to work with. Instead of having to infer the relationships between documents by essentially throwing a dart at a board, the system has hard data about these relationships. It can use lexical analysis or collaborative filtering (or both) to supplement this hard data and give the user access to a richer experience.

In reality, especially in systems with lots of legacy information, hard data about the relationships between documents isn't always possible to get, but when this data is available, you should leverage it—and hard. If your database schema is built correctly and you've used the tools of link management and partial decomposition, you already have a mechanism to manage these types of relationships between documents and to use them to best effect. Your database now knows more about the relationships between your documents than you do because it knows all of the links between documents or from documents to other entities (movies, actors, directors, and so on). However, you have to manage the data input process very closely; otherwise, as they say in the biz, “garbage in, garbage out” will rule the day.

The key is in correctly building the input tools for your internal user base. For example, at CyberCinema, it might be useful to have two kinds of links between articles: one kind might be “related reviews based on topic” (such as “Academy Award nominees for 2001”), and the other might be “related reviews based on genre” (such as “science fiction” or “Chinese films”). In both cases, you're relating one film review to another but for different reasons. You want to be able to store those links, but you also want to store the accompanying data of why those links exist. What's the semantic value of that link?

An effort is currently underway within the hallowed halls of the World Wide Web Consortium (W3C) to build an XML-based language to address just this problem. The language is called RDF (Resource Description Framework), and its stated goal is to build a new “semantic Web,” a Web composed of intelligent links that encode their own raison d'être. It's a splendid goal, but reaching it is probably a long way off. To read more about RDF, visit W3C's RDF pages at http://www.w3.org/RDF/. Tim Bray has written an excellent primer at http://www.xml.com/pub/a/98/06/rdf.html.

In the meantime, you can build your own semantically encoded links using the tools of XLink and partial decomposition. First, let's take our XLink example from Chapter 5 and add optional attributes:

<!ELEMENT review ANY>
<!-- xlink:type is fixed as "locator" because the link points to an
     external resource. link_type is our own attribute (not part of
     XLink) encoding why the link exists. xlink:show says whether,
     when the link is actuated (such as with a click), the linked-to
     data should come up in a new window, be embedded in the current
     window, or replace the current content. xlink:actuate says how
     the link is activated; the default is on user request (for
     example, the user clicks on the link text). -->
<!ATTLIST review
    xlink:type       (simple|extended|locator|arc)    #FIXED "locator"
    xlink:href       CDATA                            #REQUIRED
    link_type        (topic | genre | other)          "other"
    xlink:show       (new | embed | replace)          "replace"
    xlink:actuate    (onRequest | onLoad)             "onRequest"
>

Now let's imagine the following partial decomposition tables in which to store this link data:

CREATE TABLE review_review (
     review_review_id      INTEGER(16)    NOT NULL    PRIMARY KEY,
     review_review_type    INTEGER(16)    NOT NULL
                           REFERENCES link_type(link_type_id),
     review_from_id        INTEGER(16)    NOT NULL
                           REFERENCES review(review_id),
     review_to_id          INTEGER(16)    NOT NULL
                           REFERENCES review(review_id)
);

CREATE TABLE link_type (
     link_type_id    INTEGER(16)    NOT NULL    PRIMARY KEY,
     link_type_name  VARCHAR(16)    NOT NULL
);

In this example, we're assuming the link_type table is populated with all possible reasons why something would link to something else. With this arrangement, the database can tell us which review-to-review links are included because of the film's genre and which are included because the films are related by topic. Simply query the database like so:

SELECT review_review_id
    FROM review_review
    WHERE review_review.review_review_type = 1
    AND review_review.review_from_id = 127;

Assuming the link_type_id of genre links is “1,” you've just selected all genre-type links to other reviews from review 127. You can do something fun with this data, such as formatting all these links in a separate box on the right side of your page (assuming you're formatting a page).
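
For completeness, here's a minimal sketch of the assumed link_type population (ID values invented to match the query above):

INSERT INTO link_type (link_type_id, link_type_name) VALUES (1, 'genre');
INSERT INTO link_type (link_type_id, link_type_name) VALUES (2, 'topic');
INSERT INTO link_type (link_type_id, link_type_name) VALUES (3, 'other');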

Link Semantics: Making It Easy for Content Creators

Adding semantic information to your links can be powerful, but remember, it requires human intervention to do it correctly. If your content creators aren't required to take the extra step to input this data, or if they don't see the benefit, they won't do it. Here are a few ideas for easing (and encouraging) them into adding semantic information:

  1. Make your content entry interface easy to use, which usually means using constrained interface elements (such as checkboxes, radio buttons, and so forth) rather than free-form text entry.

  2. Make it impossible for your content authors to skip steps. This means showing them a warning dialog box or a message stating that they have to fill in certain fields before the system will let them save their content. The disadvantage is that your content authors will find it annoying and will find ways to subvert it. For instance, if the system requires them to enter an expected publish date in order to “file” an article, they might enter any old date just to file the article. If the system is too restrictive, they just won't use it. You can build the most full-featured workflow system in the world, but if it isn't easy to use and your content authors don't think it makes their jobs easier, they will subvert it.

  3. The best way to encourage content authors to add semantic information is to make the benefit of extra content entry visible to the author, as in an earlier example where I recommended moving the links relating to genre into a separate box on the side of the article. Content authors usually care about these types of features because readers care about them. Content editors, or managers in charge of content, really care about them. If your content authoring interface enables authors to put their links into a special box, that's a feature, not an added annoyance.

Unfortunately, there's no good way to force content authors to input semantic data or metadata for which they don't see tangible benefit. This is where those poor interns come in handy.
