Chapter 22. Summarizing Web Pages with HTML::Summary

Tony Rose

Ave Wrigley

Canon, like many other large companies, is a multinational organization with multiple web sites, each managed by a different part of the company. This is a problem for the typical Canon customer, who knows nothing about Canon’s internal organization and simply wants to find information about their cameras or download a new printer driver. They need a single clear way to find what they want.

CS-Web: A Search Engine for Canon’s Web Space

Back in 1997, we wrote CS-Web, a set of Perl programs to collect information from all of Canon’s web sites, index it, and make it searchable from the web. We wrote our own solution because at the time the available products were either services designed for searching the entire web (such as AltaVista), or tools for indexing and searching a single web site.

CS-Web consists of a robot, a database, and a web interface (written in mod_perl). The robot traverses all of Canon’s web sites and stores a description of each page in the database. The search engine queries the database and gives you a list of candidate documents and their descriptions. You can try CS-Web for yourself: it is linked from the main “gateway” page for Canon (http://www.canon.com/). You can also access it directly at http://csweb.cre.canon.co.uk/.

CS-Web presented a variety of challenges, many of which would make suitable war stories for TPJ. However, for this article, we will focus on one crucial problem: generating the summary of an HTML document.

META Tags

Unlike some other search engines, CS-Web doesn’t index the full text of the document. Instead, it indexes document descriptions. When web page authors use the META tag’s NAME and CONTENT attributes to encode information about the document, CS-Web will use it. However, when no such information is conveniently provided by the author, CS-Web tries to boil down the text of the document into a description it can use.
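
For example, a robot can pull such a description out of a parsed page with HTML::TreeBuilder’s look_down method. The snippet below is only a sketch of the idea (the page it parses is invented); it is not the CS-Web robot’s actual code:

use HTML::TreeBuilder;

my $html_document = <<'HTML';
<html>
  <head>
    <meta name="description" content="Drivers and manuals for Canon printers.">
    <title>Canon printer support</title>
  </head>
  <body>...</body>
</html>
HTML

my $tree = new HTML::TreeBuilder;
$tree->parse( $html_document );

# find the META element whose NAME attribute is "description"
my $meta = $tree->look_down( _tag => 'meta', name => 'description' );
my $description = $meta ? $meta->attr( 'content' ) : undef;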

One important limitation on a document’s description is length; each web page corresponds to a row in a relational database table, and for performance reasons the size of each field in each row is fixed in advance. This is because with the database engine we were using at the time, MySQL, fixed-width fields were much quicker to search than variable-width fields. You’ll see later how this length constraint introduced its own problems.
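
For example, the table of page descriptions might have been declared along these lines. This is a hypothetical illustration (the table name, column sizes, and connection details are invented), not CS-Web’s actual schema:

use DBI;

# connection details are placeholders
my $dbh = DBI->connect( 'DBI:mysql:csweb', 'csweb', 'secret',
                        { RaiseError => 1 } );

# CHAR columns are fixed-width, which lets MySQL use its faster
# fixed-row-length table format
$dbh->do( q{
    CREATE TABLE page (
        url         CHAR(255) NOT NULL,
        title       CHAR(100),
        description CHAR(200),
        PRIMARY KEY ( url )
    )
} );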

Had we been deploying CS-Web across a lot of public web sites, we would quickly have found that very few web page authors consistently provide accurate metadata. In fact, unscrupulous authors often include deliberately misleading metadata to enhance a page’s prominence in search engine results.

However, the outlook for CS-Web was a little more promising. Since Canon’s webmasters are generally working together, we could expect a certain level of integrity, and assume that they were not trying to deceive the CS-Web robot. In turn, the CS-Web robot could acknowledge this trust: if it found a page description within a META tag, it would accept it as legitimate. However, the task of adding these META tags to existing pages can be a time-consuming process, and we couldn’t rely on their presence on all of the pages.

So how could we generate a text description from the raw HTML when no metadata are present? By using a combination of some natural language processing techniques and Perl. The result was the HTML::Summary module. We’ll explore it shortly, but before we do we’ll look at some basic summarization methods for text.

Basic Summarization Methods

The basic approach of most summarization systems is to examine each sentence in the original document, assess its importance (using one or more known heuristics) and then output a summary of the desired length by omitting the less important sentences. Obviously, our success relies on how well we can measure importance. Usually, a combination of the following six simple methods is used:

Location method

Sentences are scored according to their position or location within the document. For example, sentences occurring at the beginning or end of the first paragraph, or within a heading, are given a higher score than sentences in the middle of a paragraph.

Cue method

Certain words in the document indicate the presence of more (or less) important material. For example, strongly positive words like “best,” “significant,” and “greatest” increase the sentence score. By contrast, negative words like “impossible” or “hardly” decrease the sentence score.

Title-keyword method

The title of the document is assumed to be a reliable indication of the focus of its contents; sentences referring to those concepts are given a higher score. To help out, some pre-processing can be used; for example, a stemmer can conflate inflected terms to a single root (“runs” and “running” become “run”). Similarly, a stop list may be used to filter out stop words (“the,” “of,” “and,” and so on).

Frequency-keyword approach

The important concepts in a document will yield particular keywords that occur with a greater-than-expected frequency, so sentences containing these words are given a higher score. The keywords are usually identified by sorting word frequencies and removing the stop words. A slightly more sophisticated variant involves the use of “distinctiveness” rather than raw frequency—normalizing the frequency counts by a priori frequencies taken from an independent large text corpus.

Indicator phrase method

This method is similar to the cue method, except that in this case one looks for certain phrases rather than words. For example, “The aim of this paper is…” and “This document attempts to review…” both indicate that the important concept is about to be introduced, so documents containing such constructions should receive higher scores. There are obviously many different indicator phrases, but research suggests that these are usually derived from a small number of underlying templates.[2]

The syntactic method

Experiments dating back some thirty years have attempted to correlate sentence importance with syntactic structure, so far without conclusive results.[3]

Edmundson performed a comparative evaluation of the above six methods, and found the first four to be superior, in the order shown above. In addition, he evaluated their performance in combination, and found a linear combination of the first three (with an appropriate weighting given to the scores obtained from each method) to be even better.[4]
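
To make the idea concrete, here is a purely illustrative sketch of such a linear combination. The weights, cue words, and title words below are invented for the example; they are not taken from HTML::Summary or from Edmundson’s paper:

# invented weights and word lists, for illustration only
my %weight      = ( location => 0.5, cue => 0.3, title => 0.2 );
my %cue_words   = map { $_ => 1 } qw( best significant greatest );
my %title_words = map { $_ => 1 } qw( summary web page );

sub combined_score {
    my ( $sentence, $is_first_in_para ) = @_;

    my %score;
    $score{ location } = $is_first_in_para ? 1 : 0;
    $score{ cue }      = grep { $cue_words{ lc $_ } }   split /\W+/, $sentence;
    $score{ title }    = grep { $title_words{ lc $_ } } split /\W+/, $sentence;

    my $total = 0;
    $total += $weight{ $_ } * $score{ $_ } for keys %weight;
    return $total;
}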

HTML::Summary

HTML::Summary is available from CPAN; Version 0.016 is described in this article. This is how it is used.

First, you create an HTML::Summary object, using the new method. You can provide configuration parameters as arguments to new:

my $html_summarizer = new HTML::Summary LENGTH => 200;

The LENGTH parameter is the maximum length in bytes for the generated summary. Next, you need an HTML::Element object corresponding to the HTML page that you want to summarize. You can generate one with HTML::TreeBuilder:

my $html_tree = new HTML::TreeBuilder;
$html_tree->parse( $html_document );

$html_document is a string containing the HTML of the web page; this could have been read in from a file, or returned as the content of an HTTP response, such as one fetched with LWP::Simple’s get function.

Finally, you call the generate method of the HTML::Summary object, with the HTML::Element object as an argument, which returns the summary of the page as a string:

$html_summary = $html_summarizer->generate( $html_tree );
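
Putting the pieces together, a complete script looks something like this (a minimal sketch: the URL is just an example, and there is no real error handling):

use LWP::Simple qw( get );
use HTML::TreeBuilder;
use HTML::Summary;

my $html_document = get( 'http://www.canon.com/' )
    or die "couldn't fetch page\n";

my $html_tree = new HTML::TreeBuilder;
$html_tree->parse( $html_document );

my $html_summarizer = new HTML::Summary LENGTH => 200;
print $html_summarizer->generate( $html_tree ), "\n";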

That’s how you use it. But how does it work?

The Summarization Algorithm

One of the main tasks before us was generating a good fixed-length abstract from arbitrary text. As described above, this is known to be an important and difficult problem, and a quality solution requires sophisticated natural language techniques that can analyze the structure of the original, identify key phrases and concepts, and regenerate them in a more succinct format.

Luckily for us, there are some quick and dirty ways to generate summaries. We only needed to provide a gist of the original for someone browsing the CS-Web search results. In addition, for retrieval purposes, we wanted the summary to contain representative keywords.

One advantage that we had over people trying to generate summaries from plain text is that HTML pages contain markup information—the HTML tags. Markup tells us about the structure of the content, and often about its relative importance as well. For example, it is usually clear in HTML pages where paragraphs begin and end, and when important text is italicized, emboldened, or made into a heading.

The HTML::Summary module uses the location method of text summarization described above: it identifies important sentences (based primarily on their location in the text) and concatenates them to produce an abstract. A simple example of this would be to take the first sentence of every paragraph in an article and string them together. This can sometimes be surprisingly effective:

Canon, like many other large companies, is a multi-national
organization with multiple (26) web sites, each managed by a different
part of the company. In 1997 we wrote CS-Web, a set of Perl programs
to collect information from all of Canon's web sites, index it, and
make it searchable from the web. CS-Web consists of a robot, a
database, and a web interface (written in mod_perl). CS-Web presented
a variety of challenges, many of which would make suitable war stories
for TPJ.

The text summarization method used in HTML::Summary is an adaptation of the location method. It works as follows:

Split into sentences

First, the text is split into sentences. (More about this later.)

Score the sentences

The sentences are scored according to what HTML element they appear in, and whether or not they are the first sentence in that element. The algorithm here is pretty simple: each element has a score. The first sentence in that element gets this score; the rest of the sentences get nothing.

Sort the sentences by score

The sentences are stored in an array of hashes. Each hash corresponds to a sentence, and contains information about the text in the sentence, its length, the HTML element it appeared in, its score, and its original order in the text.

$summary[ scalar( @summary ) ] = {
    'text'          => $text,
    'length'        => length( $text ),
    'tag'           => $tag,
    'score'         => $score,
    'order'         => scalar( @summary ),
};

The scores, as described above, are based on the HTML element that the sentences appear in. These scores are stored in a global hash:

my %ELEMENT_SCORES = (
    'p'         => 100,
    'h1'        => 90,
    'h2'        => 80,
    'h3'        => 70,
);

These scores were arrived at by empirical investigation; we have no theoretical justification for them.

Truncate the list of sentences

Calculate how many sentences are needed before the requested summary length is met (or exceeded).

Sort the sentences by original order again

Having remembered the original sentence order in the text in the hash for that sentence, we can now re-sort the sentences in that order.

Concatenate the sentences to create the summary

Spaces are added between the sentences, since whitespace was stripped when the sentences were split.

Truncate the summary at the requested length

This last step assumes that if you want a summary of 200 characters, 201 characters are not acceptable—even if it means chopping the summary off midsentence. This is what we wanted in CS-Web. Maybe in other applications a less severe approach would be appropriate—it’s easy to add more options to HTML::Summary, so let us know what you think.
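
Putting the last few steps together, here is a condensed sketch (not the module’s actual code). It assumes that @summary already holds the sentence hashes shown earlier and that $LENGTH is the requested summary length in bytes:

# highest-scoring sentences first
my @by_score = sort { $b->{ score } <=> $a->{ score } } @summary;

# take sentences until the requested length is met or exceeded
my @chosen;
my $total = 0;
for my $sentence ( @by_score ) {
    last if $total >= $LENGTH;
    push @chosen, $sentence;
    $total += $sentence->{ length };
}

# restore the original document order, join with spaces, then truncate
my $summary = join ' ',
    map  { $_->{ text } }
    sort { $a->{ order } <=> $b->{ order } } @chosen;

$summary = substr( $summary, 0, $LENGTH );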

Sentence Splitting

Now for the nitty gritty. The remainder of this article focuses on just one aspect of HTML::Summary: splitting the element contents into sentences. Japanese character encodings were a particular problem for CS-Web; our approach is described in the section Afterword: Truncating Japanese Text.

The task of splitting text into sentences seemed like a more general problem than its application to text summarization, so this is contained in a separate module, Text::Sentence (also available from CPAN).

Text::Sentence is basically just a regex. It has a non-object-oriented interface that exports one function, split_sentences, which takes the text to be split into sentences as an argument and returns a list of the sentences.

sub split_sentences {
    my $text = shift;
    return () unless $text;

The function first checks if there really is any text to split into sentences; if not, it just returns an empty list.

# $capital_letter is a character set; to account for locale, this
# includes all letters for which lc is different from that letter.

my $capital_letter =
    '[' .
        join( '',
            grep { lc( $_ ) ne ( $_ ) }
            map { chr( $_ ) } ord( "A" ) .. ord( "\xff" )
        ) .
    ']'
;

Although it would be more efficient to compute this regex component once at the package level, doing it in split_sentences allows the user to change locales between calls.

The next few lines build up the components of the regex that splits the text into sentences. The first of these components is the capital letter found at the start of a sentence. Instead of using the character class [A-Z] as you normally would, Text::Sentence accounts for locale-specific capital letters. For example, in French, a capital A acute (Á) won’t be matched by [A-Z]. The method used in Text::Sentence makes use of the fact that lc is sensitive to locale settings and returns a lowercased version of its argument, so a set of locale-specific capital letters can be built up for the extended ASCII range by keeping just those characters that lc changes. For more information on how Perl handles locales, see the perllocale documentation bundled with Perl.

@PUNCTUATION = ( '\.', '\!', '\?' );

The @PUNCTUATION array is a global variable in Text::Sentence containing the punctuation used to indicate the end of a sentence. The fact that it’s a global means that you can change it (although the interface could be improved; an options hash passed to split_sentences, perhaps). For example, you might want to add locale-specific punctuation for the Spanish ¡:

push( @Text::Sentence::PUNCTUATION, chr( 161 ) );

Back to split_sentences:

# This needs to be alternation, not a character class,
# because of multibyte characters
my $punctuation = '(?:' . join( '|', @PUNCTUATION ) . ')';

As mentioned above, one of our concerns was dealing with multibyte character encodings (see Afterword: Truncating Japanese Text). Japanese punctuation characters may be more than one byte long, so we can’t use a character class for punctuation in the sentence splitting regex. For example, an exclamation point in the EUC Japanese encoding is “\xA1\xAA”.
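
Just as with the Spanish example earlier, a caller could add the EUC-encoded exclamation point to the punctuation list:

push( @Text::Sentence::PUNCTUATION, "\xa1\xaa" );   # EUC exclamation point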

# Return $text if there is no punctuation ...
return $text unless $text =~ /$punctuation/;

If there isn’t any end-of-sentence punctuation in the text, then we might as well return the text now.

my $opt_start_quote = q/['"]?/;
my $opt_close_quote = q/['"]?/;

# These are distinguished because (eventually!) I would like to do
# locale stuff on quote characters

my $opt_start_bracket = q/[[({]?/; # }{
my $opt_close_bracket = q/[])}]?/;

Sentences sometimes have quotation marks or parentheses that come before the capital letter at the beginning, or after the full stop (period, question mark, or exclamation point) at the end. For example, the following:

Larry said "let there be light!" (And there was.)

is two sentences; the first ends after the second double quote. However, this is one sentence:

Larry said "let there be light!" (and there was).

Here is the regex in all its glory:

my @sentences = $text =~ /
(
                            # Sentences start with ...
    $opt_start_quote        # an optional start quote
    $opt_start_bracket      # an optional start bracket
    $capital_letter         # a capital letter ...
    .+?                     # at least some (non-greedy) anything ...
    $punctuation            # ... followed by any one of !?.
    $opt_close_quote        # an optional close quote
    $opt_close_bracket      # and an optional close bracket
)
(?=                         # with lookahead that it is followed by ...
    (?:                     # either ...
        \s+                 # some whitespace ...
        $opt_start_quote    # an optional start quote
        $opt_start_bracket  # an optional start bracket
        $capital_letter     # an uppercase word character (for locale
                            # sensitive matching)
    |                       # or ...
        \n\n                # a couple (or more) of CRs (i.e. a new para)
    |                       # or ...
        \s*$                # optional whitespace, followed by end of string
    )
)
/gxs;

    return @sentences if @sentences;
    return ( $text );
}

This regex makes use of the lookahead feature of regular expressions. In this case, it allows us to specify not only that a sentence must start with a capital letter and end in a full stop, but also that the full stop must be followed by another capital letter. The only exceptions are when the sentence falls at the end of a paragraph or at the end of the string.

The lookahead accounts for the whitespace between sentences, so it’s not part of the matched patterns that end up in the @sentences array. That’s why concatenating the sentences won’t give you back the exact original text.

The main problem with trying to split text into sentences is that the period has several uses besides ending a sentence, such as marking abbreviations:

Dr. Livingstone, I presume.

This phrase counts as two sentences according to Text::Sentence—the first sentence is three characters long. The performance of Text::Sentence could be improved by taking into account special cases like honorifics (Mr., Mrs., Dr.), common abbreviations (e.g., etc., i.e.), and so on. However, as with many natural language problems, this obeys the law of diminishing returns; a little bit of effort will do a decent 90% job, but that last 10% is pretty difficult. For our purposes, the 90% is good enough.
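
For completeness, here is Text::Sentence in action on that example; given the behavior described above, this is the result we would expect:

use Text::Sentence qw( split_sentences );

my @sentences = split_sentences( 'Dr. Livingstone, I presume.' );
# @sentences is now ( 'Dr.', 'Livingstone, I presume.' )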

Conclusion

We chose to use Perl for CS-Web because of the obvious benefits: the LWP modules for web programming, DBD/DBI, mod_perl, and so on. We found that Perl is also a useful tool for doing natural language work. Its text processing features, rapid development cycle, and ability to generate complex data structures on the fly make it particularly appropriate.

A lot of interesting work in natural language research involves analyzing corpus data: collecting statistics about language use over large databases of typical usage. The web is an obviously rich source of this type of data, and in view of this, it is a little surprising how few Perl tools and modules appeared to be available for this field. Certainly, when we posted about Text::Sentence to a language processing mailing list, there seemed to be quite a lot of interest in what we were doing, as well as extensive Perl expertise in that community. Hopefully, natural language processing will become yet another nut for Perl to crack!

Afterword: Truncating Japanese Text

Canon is a Japanese company, with Japanese text on many of its web pages. Japanese text is usually encoded in one of several possible multibyte encoding schemes (not including Unicode!), and some of these schemes use variable numbers of bytes to represent single Japanese characters, or intermingle Japanese and regular ASCII characters. This was a problem.

The summaries generated by HTML::Summary are truncated at a fixed length, and this length is specified in bytes, not characters. If Japanese text is truncated at an arbitrary byte length, the cut may fall in the middle of a character.

Worse, our page abstracts can appear in result listings for keyword searches. If a page summary broken midcharacter is inserted into running text, the byte immediately following the summary could be interpreted as the next byte of the previously uncompleted Japanese character, upsetting the character boundaries for the rest of the text.

HTML::Summary therefore uses another module, Lingua::JA::Jtruncate (also available on CPAN), to address this problem. Lingua::JA::Jtruncate contains just one subroutine, jtruncate, used as follows:

use Lingua::JA::Jtruncate qw( jtruncate );
$truncated_jtext = jtruncate( $jtext, $length );

where $jtext is some Japanese text that you want to truncate, $length is the maximum truncation length, and $truncated_jtext is the result. Here’s how it works.

First, some regexes are defined that match characters in each of the three main Japanese coding schemes: EUC, Shift-JIS, and JIS.

%euc_code_set = (
    ASCII_JIS_ROMAN     => '[\x00-\x7f]',
    JIS_X_0208_1997     => '[\xa1-\xfe][\xa1-\xfe]',
    HALF_WIDTH_KATAKANA => '\x8e[\xa0-\xdf]',
    JIS_X_0212_1990     => '\x8f[\xa1-\xfe][\xa1-\xfe]',
);

%sjis_code_set = (
    ASCII_JIS_ROMAN     => '[\x21-\x7e]',
    HALF_WIDTH_KATAKANA => '[\xa1-\xdf]',
    TWO_BYTE_CHAR       => '[\x81-\x9f\xe0-\xef][\x40-\x7e\x80-\xfc]',
);

%jis_code_set = (
    TWO_BYTE_ESC        =>
        '(?:' .
        join( '|',
            '\x1b\x24\x40',
            '\x1b\x24\x42',
            '\x1b\x26\x40\x1b\x24\x42',
            '\x1b\x24\x28\x44',
        ) .
        ')'
    ,
    TWO_BYTE_CHAR       => '(?:[\x21-\x7e][\x21-\x7e])',
    ONE_BYTE_ESC        => '(?:\x1b\x28[\x4a\x48\x42\x49])',
    ONE_BYTE_CHAR       =>
        '(?:' .
        join( '|',
            '[\x21-\x5f]',                      # JIS7 half-width katakana
            '\x0f[\xa1-\xdf]*\x0e',             # JIS8 half-width katakana
            '[\x21-\x7e]',                      # ASCII / JIS-Roman
        ) .
        ')'
);

%char_re = (
    'euc'       => '(?:' . join( '|', values %euc_code_set ) . ')',
    'sjis'      => '(?:' . join( '|', values %sjis_code_set ) . ')',
    'jis'       => '(?:' . join( '|', values %jis_code_set ) . ')',
);

Each of the regexes in %char_re matches one character encoded in the scheme corresponding to the keys of the hash.
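
As a small illustration (and assuming the %char_re hash above is in scope), the EUC regex can be used to walk a string character by character; the sample string below is invented, containing two EUC-encoded characters followed by three ASCII characters:

my $euc_text = "\xa4\xb3\xa4\xf3abc";
my @chars    = $euc_text =~ /($char_re{ 'euc' })/g;
print scalar( @chars ), " characters\n";    # prints "5 characters"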

Now for the definition of the jtruncate subroutine; first, some fairly obvious sanity checks:

sub jtruncate
{
    my $text            = shift;
    my $length          = shift;

    # sanity checks: make sure $length is defined before using it

    return undef if not defined $length;
    return undef if $length < 0;
    return '' if $length == 0;
    return $text if length( $text ) <= $length;

Now we save the original text; this is used later if the truncation process fails for some reason.

my $orig_text = $text;

Now we use Lingua::JA::Jcode::getcode to detect which encoding the text uses. Lingua::JA::Jcode::getcode is a simple wrapper around the jcode.pl Perl library for Japanese character code conversion. Kazumasa Utashiro kindly agreed to let us distribute the code with HTML::Summary.

my $encoding = Lingua::JA::Jcode::getcode( $text );

If getcode returns undef, or a value other than euc, sjis, or jis, then it has either failed to detect the encoding, or detected that it is not one of those that we are interested in. We then take the brute force approach, using substr.

if ( not defined $encoding or $encoding !~ /^(?:euc|s?jis)$/ )
{
    return substr( $text, 0, $length );
}

The actual truncation of the string is done in chop_jchars—more on this subroutine in a bit.

$text = chop_jchars( $text, $length, $encoding );

chop_jchars returns undef on failure. If we have failed to truncate the Japanese text properly, we resort to substr again. We had to decide whether it was more important to meet the $length constraint or risk returning a Japanese string with broken character encoding. We chose the former:

return substr( $orig_text, 0, $length ) unless defined $text;

Next, a special case: JIS encoding uses escape sequences to shift in and out of single-byte and multibyte modes. If the truncation process leaves the text ending in multibyte mode, we need to append the single-byte escape sequence. Therefore, we truncate (at least) three more bytes from a JIS-encoded string, so that we have room to add the single-byte escape sequence without going over the $length limit.

if ( $encoding eq 'jis' and
    $text =~ /$jis_code_set{ TWO_BYTE_CHAR }$/ ) {
    $text = chop_jchars( $text, $length - 3, $encoding );
    return substr( $orig_text, 0, $length ) unless defined $text;
    $text .= "\x1b\x28\x42";
}

And we’re done!

  return $text;
}

Now for chop_jchars, which simply lops off Japanese characters from the end of the string until it is shorter than the requested length. It’s pretty ugly, and slow for large strings truncated to small values, but it does the job!

sub chop_jchars
{
    my $text = shift;
    my $length = shift;
    my $encoding = shift;

    while( length( $text ) > $length )
    {
        return undef unless $text =~ s!$char_re{ $encoding }$!!o;
    }

    return $text;
}


[2] Paice, C. “Constructing literature abstracts by computer.” Information Processing and Management, 26(1), 1990.

[3] Earl, L. L. “Experiments in automatic extracting and indexing.” Information Storage and Retrieval, 6:313–334, 1970.

[4] Edmundson, H. P. “New methods in automatic extracting.” Journal of the ACM, 16(2):264–285, 1969.
