Canon, like many other large companies, is a multinational organization with multiple web sites, each managed by a different part of the company. This is a problem for the typical Canon customer, who knows nothing about Canon’s internal organization and simply wants to find information about their cameras or download a new printer driver. They need a single clear way to find what they want.
Back in 1997, we wrote CS-Web, a set of Perl programs to collect information from all of Canon’s web sites, index it, and make it searchable from the web. We wrote our own solution because at the time the available products were either services designed for searching the entire web (such as AltaVista), or tools for indexing and searching a single web site.
CS-Web consists of a robot, a database, and a web interface (written in mod_perl). The robot traverses all of Canon's web sites and stores a description of each page in the database. The search engine queries the database and gives you a list of candidate documents and their descriptions. You can try CS-Web for yourself: it is linked from the main “gateway” page for Canon (http://www.canon.com/). You can also access it directly at http://csweb.cre.canon.co.uk/.
CS-Web presented a variety of challenges, many of which make suitable war stories for TPJ. However, for this article, we will focus on one crucial problem: generating the summary of an HTML document.
Unlike some other search engines, CS-Web doesn’t index the full text of the document. Instead, it indexes document descriptions. When web page authors use the META tag’s NAME and CONTENT attributes to encode information about the document, CS-Web will use it. However, when no such information is conveniently provided by the author, CS-Web tries to boil down the text of the document into a description it can use.
One important limitation on a document’s description is length; each web page corresponds to a row in a relational database table, and for performance reasons the size of each field in each row is fixed in advance. This is because with the database engine we were using at the time, MySQL, fixed-width fields were much quicker to search than variable-width fields. You’ll see later how this length constraint introduced its own problems.
If we had been deploying CS-Web across a large number of public web sites, we would quickly have found that very few web page authors consistently provide accurate metadata. In fact, unscrupulous authors often include deliberately misleading metadata to enhance a page’s prominence in search engine results.
However, the outlook for CS-Web was a little more promising. Since Canon’s webmasters are generally working together, we could expect a certain level of integrity, and assume that they were not trying to deceive the CS-Web robot. In turn, the CS-Web robot could acknowledge this trust: if it found a page description within a META tag, it would accept it as legitimate.
However, the task of adding these META tags to existing pages can be time-consuming, and we couldn’t rely on their presence on all of the pages.
So how could we generate a text description from the raw HTML when no metadata are present? By using a combination of some natural language processing techniques and Perl. The result was the HTML::Summary module. We’ll explore it shortly, but before we do we’ll look at some basic summarization methods for text.
The basic approach of most summarization systems is to examine each sentence in the original document, assess its importance (using one or more known heuristics) and then output a summary of the desired length by omitting the less important sentences. Obviously, our success relies on how well we can measure importance. Usually, a combination of the following six simple methods is used:
The location method: sentences are scored according to their position within the document. For example, sentences occurring at the beginning or end of the first paragraph, or within a heading, are given a higher score than sentences in the middle of a paragraph.
The cue method: certain words in the document indicate the presence of more (or less) important material. For example, strongly positive words like “best,” “significant,” and “greatest” increase the sentence score. By contrast, negative words like “impossible” or “hardly” decrease the sentence score.
The title method: the title of the document is assumed to be a reliable indication of the focus of its contents, so sentences referring to those concepts are given a higher score. To help out, some pre-processing can be used; for example, a stemmer can conflate inflected terms to a single root (“runs” and “running” become “run”). Similarly, a stop list may be used to filter out stop words (“the,” “of,” “and,” and so on).
The frequency keyword method: the important concepts in a document will yield particular keywords that occur with a greater-than-expected frequency, so sentences containing these words are given a higher score. The keywords are usually identified by sorting word frequencies and removing the stop words. A slightly more sophisticated variant involves the use of “distinctiveness” rather than raw frequency: normalizing the frequency counts by a priori frequencies taken from an independent large text corpus.
The indicator phrase method: this method is similar to the cue method, except that one looks for certain phrases rather than words. For example, “The aim of this paper is…” and “This document attempts to review…” both indicate that the important concept is about to be introduced, so sentences containing such constructions should receive higher scores. There are obviously many different indicator phrases, but research suggests that these are usually derived from a small number of underlying templates.[2]
The syntactic method: experiments dating back thirty years have attempted to correlate sentence importance with syntactic structure, so far without conclusive results.[3]
Edmundson performed a comparative evaluation of the above six methods, and found the first four to be superior, in the order shown above. In addition, he evaluated their performance in combination, and found a linear combination of the first three (with an appropriate weighting given to the scores obtained from each method) to be even better.[4]
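To make the combination concrete, here is a small sketch of an Edmundson-style weighted sum. The weights and the example scores below are made-up placeholders for illustration; Edmundson’s actual coefficients were derived from a training corpus.

```perl
use strict;
use warnings;

# Hypothetical weights for three of the methods -- placeholders only,
# not Edmundson's actual coefficients.
my %weight = ( location => 1.0, cue => 0.5, title => 0.5 );

# Combine the per-method scores for one sentence into a single value.
sub combined_score {
    my %score = @_;
    my $total = 0;
    $total += $weight{$_} * ( $score{$_} || 0 ) for keys %weight;
    return $total;
}

# e.g. a first-paragraph sentence with some cue words and a title match
print combined_score( location => 100, cue => 20, title => 10 ), "\n";   # 115
```

Sentences would then be ranked by this combined score rather than by any single method’s score.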
HTML::Summary is available from CPAN; version 0.016 is described in this article. This is how it is used.
First, you create an HTML::Summary object, using the new method. You can provide configuration parameters as arguments to new:
my $html_summarizer = new HTML::Summary LENGTH => 200;
The LENGTH parameter is the maximum length in bytes for the generated summary. Next, you need an HTML::Element object corresponding to the HTML page that you want to summarize. You can generate one with HTML::TreeBuilder:
my $html_tree = new HTML::TreeBuilder;
$html_tree->parse( $html_document );
$html_document is a string containing the HTML of the web page; this could have been read in from a file, or returned as the contents of an HTTP request, such as through LWP::Simple’s get function.
Finally, you call the generate method of the HTML::Summary object, with the HTML::Element object as an argument, which returns the summary of the page as a string:
$html_summary = $html_summarizer->generate( $html_tree );
That’s how you use it. But how does it work?
One of the main tasks before us was generating a good fixed-length abstract from arbitrary text. As described above, this is known to be an important and difficult problem, and a quality solution requires sophisticated natural language techniques that can analyze the structure of the original, identify key phrases and concepts, and regenerate them in a more succinct format.
Luckily for us, there are some quick and dirty ways to generate summaries. We only needed to provide a gist of the original for someone browsing the CS-Web search results. In addition, for retrieval purposes, we want the summary to contain representative keywords.
One advantage that we had over people trying to generate summaries from plain text is that HTML pages contain markup information—the HTML tags. Markup tells us about the structure of the content, and often about its relative importance as well. For example, it is usually clear in HTML pages where paragraphs begin and end, and when important text is italicized, emboldened, or made into a heading.
The HTML::Summary module uses the location method of text summarization described above: it identifies important sentences (based primarily on their location in the text) and concatenates them to produce an abstract. A simple example of this would be to take the first sentence of every paragraph in an article and string them together. This can sometimes be surprisingly effective:
Canon, like many other large companies, is a multi-national organization with multiple web sites, each managed by a different part of the company. In 1997 we wrote CS-Web, a set of Perl programs to collect information from all of Canon's web sites, index it, and make it searchable from the web. CS-Web consists of a robot, a database, and a web interface (written in mod_perl). CS-Web presented a variety of challenges, many of which would make suitable war stories for TPJ.
The text summarization method used in HTML::Summary is an adaptation of the location method. It works as follows:
First, the text is split into sentences. (More about this later.)
The sentences are scored according to what HTML element they appear in, and whether or not they are the first sentence in that element. The algorithm here is pretty simple: each element has a score. The first sentence in that element gets this score; the rest of the sentences get nothing.
The sentences are stored in an array of hashes. Each hash corresponds to a sentence, and contains information about the text in the sentence, its length, the HTML element it appeared in, its score, and its original order in the text.
$summary[ scalar( @summary ) ] = {
    'text'   => $text,
    'length' => length( $text ),
    'tag'    => $tag,
    'score'  => $score,
    'order'  => scalar( @summary ),
};
The scores, as described above, are based on the HTML element that the sentences appear in. These scores are stored in a global hash:
my %ELEMENT_SCORES = (
    'p'  => 100,
    'h1' => 90,
    'h2' => 80,
    'h3' => 70,
);
These scores were arrived at by empirical investigation; we have no theoretical justification for them.
Calculate how many sentences are needed before the requested summary length is met (or exceeded).
Having remembered the original sentence order in the text in the hash for that sentence, we can now re-sort the sentences in that order.
Spaces are added between the sentences, since whitespace was stripped when the sentences were split.
This last step assumes that if you want a summary of 200 characters, 201 characters are not acceptable—even if it means chopping the summary off midsentence. This is what we wanted in CS-Web. Maybe in other applications a less severe approach would be appropriate—it’s easy to add more options to HTML::Summary, so let us know what you think.
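The selection and re-ordering steps above can be sketched as follows. This is a simplified stand-in, not the actual HTML::Summary code; it assumes sentences stored as hashes with the text, length, score, and order keys shown earlier.

```perl
use strict;
use warnings;

# Simplified sketch of the selection step: take the highest-scoring
# sentences until the length limit is met, restore original document
# order, join with spaces, and truncate hard at $max_length bytes.
sub make_summary {
    my ( $max_length, @sentences ) = @_;    # list of sentence hashrefs

    my @by_score = sort { $b->{score} <=> $a->{score} } @sentences;
    my ( $length, @chosen ) = ( 0 );
    for my $s ( @by_score ) {
        last if $length >= $max_length;
        push @chosen, $s;
        $length += $s->{length} + 1;        # +1 for the joining space
    }

    my $summary = join ' ',
        map  { $_->{text} }
        sort { $a->{order} <=> $b->{order} } @chosen;
    return substr( $summary, 0, $max_length );
}

my @sentences = (
    { text => 'First point.',  length => 12, score => 100, order => 0 },
    { text => 'Filler.',       length => 7,  score => 0,   order => 1 },
    { text => 'Second point.', length => 13, score => 90,  order => 2 },
);
print make_summary( 26, @sentences ), "\n";   # "First point. Second point."
```

Note how the low-scoring filler sentence is dropped, but the two sentences that survive come out in their original order.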
Now for the nitty gritty. The remainder of this article focuses on just one aspect of HTML::Summary: splitting the element contents into sentences. Japanese character encodings were a particular problem for CS-Web; our approach is described in the section Afterword: Truncating Japanese Text.
The task of splitting text into sentences seemed like a more general problem than its application to text summarization, so this is contained in a separate module, Text::Sentence (also available from CPAN).
Text::Sentence is basically just a regex. It has a non-object-oriented interface that exports one function, split_sentences, which takes the text to be split into sentences as an argument and returns a list of the sentences.
sub split_sentences {
    my $text = shift;
    return () unless $text;
The function first checks if there really is any text to split into sentences; if not, it just returns an empty list.
    # $capital_letter is a character set; to account for locale, this
    # includes all letters for which lc is different from that letter.
    my $capital_letter =
        '[' .
        join( '',
            grep { lc( $_ ) ne ( $_ ) }
            map  { chr( $_ ) }
            ord( "A" ) .. ord( "\xff" )
        ) .
        ']'
    ;
Although it would be more efficient to compute this regex component once at the package level, doing it in split_sentences allows the user to change locales between calls.
The next few lines build up the components of the regex that splits the text into sentences. The first of these components is the capital letter found at the start of a sentence. Instead of using the character class [A-Z] as you would normally, Text::Sentence accounts for locale-specific capital letters. For example, in French, a capital A acute (Á) won’t be matched by [A-Z]. The method used in Text::Sentence makes use of the fact that lc is sensitive to locale settings, and returns a lowercase version of all capitalized characters. A set of locale-specific capital letters can be built up for the extended ASCII range by filtering any characters changed by lc. For more information on how Perl handles locales, see the perllocale documentation bundled with Perl.
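You can check the behavior of this class directly. Under the default C locale (no setlocale call), lc only folds A–Z, so the class reduces to [A-Z]; an accented capital would only join the class after switching to a suitable locale.

```perl
use strict;
use warnings;

# Build the locale-sensitive capital-letter class exactly as
# split_sentences does. In the default C locale this is just [A-Z].
my $capital_letter =
    '[' .
    join( '',
        grep { lc( $_ ) ne $_ }
        map  { chr( $_ ) }
        ord( "A" ) .. ord( "\xff" )
    ) .
    ']';

print 'T' =~ /$capital_letter/ ? "capital\n" : "not capital\n";   # capital
print 't' =~ /$capital_letter/ ? "capital\n" : "not capital\n";   # not capital
```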
@PUNCTUATION = ( '\.', '\!', '\?' );
The @PUNCTUATION array is a global variable in Text::Sentence containing any punctuation used to indicate the end of a sentence. The fact that it’s a global means that you’re able to change it (although the interface could be improved; an options hash passed to split_sentences, perhaps). For example, you might want to add locale-specific punctuation for the Spanish ¡:
push( @Text::Sentence::PUNCTUATION, chr( 161 ) );
Back to split_sentences:
    # This needs to be alternation, not a character class,
    # because of multibyte characters
    my $punctuation = '(?:' . join( '|', @PUNCTUATION ) . ')';
As mentioned above, one of our concerns was dealing with multibyte character encodings (see Afterword: Truncating Japanese Text). Japanese punctuation characters may be more than one byte long, so we can’t use a character class for punctuation in the sentence-splitting regex. For example, an exclamation point in the EUC Japanese encoding is “\xA1\xAA”.
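A short demonstration of why alternation is needed; the EUC-JP exclamation point below is hard-coded for illustration. In a character class its two bytes would be matched independently, but in an alternation they stay together as one unit.

```perl
use strict;
use warnings;

# Alternation keeps the two bytes of the EUC-JP "!" together; a
# character class would match either byte on its own.
my @PUNCTUATION = ( '\.', '\!', '\?', "\xA1\xAA" );
my $punctuation = '(?:' . join( '|', @PUNCTUATION ) . ')';

print "Sentence ends here\xA1\xAA" =~ /$punctuation$/
    ? "ok\n" : "not ok\n";   # ok
```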
    # Return $text if there is no punctuation ...
    return $text unless $text =~ /$punctuation/;
If there isn’t any end-of-sentence punctuation in the text, then we might as well return the text now.
    my $opt_start_quote = q/['"]?/;
    my $opt_close_quote = q/['"]?/;

    # These are distinguished because (eventually!) I would like to do
    # locale stuff on quote characters
    my $opt_start_bracket = q/[[({]?/; # }{
    my $opt_close_bracket = q/[])}]?/;
Sentences sometimes have quotation marks or parentheses that come before the capital letter at the beginning, or after the full stop (period, question mark, or exclamation point) at the end. For example, the following sentence:
Larry said "let there be light!" (And there was.)
is two sentences; the first ends after the second double quote. However, this is one sentence:
Larry said "let there be light!" (and there was).
Here is the regex in all its glory:
    my @sentences = $text =~ /
        (                       # Sentences start with ...
            $opt_start_quote    # an optional start quote
            $opt_start_bracket  # an optional start bracket
            $capital_letter     # a capital letter ...
            .+?                 # at least some (non-greedy) anything ...
            $punctuation        # ... followed by any one of !?.
            $opt_close_quote    # an optional close quote
            $opt_close_bracket  # and an optional close bracket
        )
        (?=                     # with lookahead that it is followed by ...
            (?:                 # either ...
                \s+                 # some whitespace ...
                $opt_start_quote    # an optional start quote
                $opt_start_bracket  # an optional start bracket
                $capital_letter     # an uppercase word character (for locale
                                    # sensitive matching)
            |                   # or ...
                \n\n+           # a couple (or more) of CRs (i.e. a new para)
            |                   # or ...
                \s*$            # optional whitespace, followed by end of string
            )
        )
    /gxs;
    return @sentences if @sentences;
    return ( $text );
}
This regex makes use of the lookahead feature of regular expressions. In this case, it allows us to specify that a sentence must not only start with a capital letter and end in a full stop, but must also be followed by another capital letter. The only exceptions are when the sentence is at the end of a paragraph, or at the end of the string.
The lookahead accounts for the whitespace between sentences, so it’s not part of the matched patterns that end up in the @sentences array. That’s why concatenating the sentences won’t give you back the exact original text.
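To see the lookahead in action, here is a stripped-down, self-contained version of the sentence-splitting regex. It is not the module’s actual regex: it matches ASCII capitals only and ignores quotes, brackets, and paragraph breaks.

```perl
use strict;
use warnings;

# A simplified Text::Sentence-style splitter: a sentence is a capital
# letter, some (non-greedy) anything, and end punctuation, followed
# (via lookahead) by the next capital or the end of the string.
sub split_sentences_simple {
    my $text = shift;
    my @sentences = $text =~ /
        ( [A-Z] .+? [.!?] )     # capital, anything, end punctuation
        (?=                     # followed by ...
            \s+ [A-Z]           # whitespace and the next capital,
        |   \s*$                # or the end of the string
        )
    /gxs;
    return @sentences ? @sentences : ( $text );
}

my @s = split_sentences_simple(
    'This is one sentence. Here is another! And a third?'
);
print scalar( @s ), "\n";   # 3
```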
The main problem with trying to split text into sentences is that there are several uses for periods, such as abbreviations.
Dr. Livingstone, I presume.
This phrase counts as two sentences according to Text::Sentence; the first sentence is three characters long. The performance of Text::Sentence could be improved by taking into account special cases like honorifics (Mr., Mrs., Dr.), common abbreviations (e.g., etc., i.e.), and so on. However, as with many natural language problems, this obeys the law of diminishing returns; a little bit of effort will do a decent 90% job, but that last 10% is pretty difficult. For our purposes, the 90% is good enough.
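One cheap improvement, shown here as a hypothetical pre-pass that is not part of Text::Sentence, is to shield the periods of known abbreviations behind a placeholder byte before splitting, and restore them in each sentence afterwards.

```perl
use strict;
use warnings;

# Hypothetical abbreviation handling: hide the period of a known
# abbreviation behind a NUL byte so a splitter won't break on it,
# then put the period back afterwards.
my @abbrevs   = ( 'Dr', 'Mr', 'Mrs', 'e\.g', 'i\.e', 'etc' );
my $abbrev_re = '\b(' . join( '|', @abbrevs ) . ')\.';

sub shield   { my $t = shift; $t =~ s/$abbrev_re/$1\x00/g; return $t }
sub unshield { my $t = shift; $t =~ s/\x00/./g;            return $t }

my $shielded = shield( 'Dr. Livingstone, I presume.' );
# The splitter would now see only one end-of-sentence period.
print unshield( $shielded ), "\n";   # round-trips to the original
```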
We chose to use Perl for CS-Web because of the obvious benefits: the LWP modules for web programming, DBD/DBI, mod_perl, and so on. We found that Perl is also a useful tool for doing natural language work. Its text processing features, rapid development cycle, and ability to generate complex data structures on the fly make it particularly appropriate.
A lot of interesting work in natural language research involves analyzing corpus data: collecting statistics about language use over large databases of typical usage. The web is an obviously rich source of this type of data, and in view of this, it is a little surprising how few tools and modules appeared to be available in Perl for this field. Certainly, when we posted about Text::Sentence to a language processing mailing list, there seemed to be quite a lot of interest in what we were doing, as well as extensive Perl expertise in that community. Hopefully, natural language processing will become yet another nut for Perl to crack!
Canon is a Japanese company, with Japanese text on many of its web pages. Japanese text is usually encoded in one of several possible multibyte encoding schemes (not including Unicode!), and some of these schemes use variable numbers of bytes to represent single Japanese characters, or intermingle Japanese and regular ASCII characters. This was a problem.
The summaries generated by HTML::Summary are truncated at a fixed length, and this length is specified in bytes, not characters. If Japanese text is truncated at an arbitrary byte length, this might mean truncation in the middle of a character.
Worse, our page abstracts can appear in result listings for keyword searches. If a page summary broken midcharacter is inserted into running text, the byte immediately following the summary could be interpreted as the next byte of the previously uncompleted Japanese character, upsetting the character boundaries for the rest of the text.
HTML::Summary therefore used another module, Lingua::JA::Jtruncate (also available on CPAN), which addresses this problem.
Lingua::JA::Jtruncate contains just one subroutine, jtruncate, used as follows:
use Lingua::JA::Jtruncate qw( jtruncate );
$truncated_jtext = jtruncate( $jtext, $length );
where $jtext is some Japanese text that you want to truncate, $length is the maximum truncation length, and $truncated_jtext is the result. Here’s how it works.
First, some regexes are defined that match characters in each of the three main Japanese coding schemes: EUC, Shift-JIS, and JIS.
%euc_code_set = (
    ASCII_JIS_ROMAN     => '[\x00-\x7f]',
    JIS_X_0208_1997     => '[\xa1-\xfe][\xa1-\xfe]',
    HALF_WIDTH_KATAKANA => '\x8e[\xa0-\xdf]',
    JIS_X_0212_1990     => '\x8f[\xa1-\xfe][\xa1-\xfe]',
);

%sjis_code_set = (
    ASCII_JIS_ROMAN     => '[\x21-\x7e]',
    HALF_WIDTH_KATAKANA => '[\xa1-\xdf]',
    TWO_BYTE_CHAR       => '[\x81-\x9f\xe0-\xef][\x40-\x7e\x80-\xfc]',
);

%jis_code_set = (
    TWO_BYTE_ESC  => '(?:' .
        join( '|',
            '\x1b\x24\x40',
            '\x1b\x24\x42',
            '\x1b\x26\x40\x1b\x24\x42',
            '\x1b\x24\x28\x44',
        ) .
        ')'
    ,
    TWO_BYTE_CHAR => '(?:[\x21-\x7e][\x21-\x7e])',
    ONE_BYTE_ESC  => '(?:\x1b\x28[\x4a\x48\x42\x49])',
    ONE_BYTE_CHAR => '(?:' .
        join( '|',
            '[\x21-\x5f]',          # JIS7 Half width katakana
            '\x0f[\xa1-\xdf]*\x0e', # JIS8 Half width katakana
            '[\x21-\x7e]',          # ASCII / JIS-Roman
        ) .
        ')'
);

%char_re = (
    'euc'  => '(?:' . join( '|', values %euc_code_set )  . ')',
    'sjis' => '(?:' . join( '|', values %sjis_code_set ) . ')',
    'jis'  => '(?:' . join( '|', values %jis_code_set )  . ')',
);
Each of the regexes in %char_re matches one character encoded in the scheme corresponding to the key of the hash.
Now for the definition of the jtruncate subroutine; first, some fairly obvious sanity checks:
sub jtruncate {
    my $text   = shift;
    my $length = shift;

    # sanity checks
    return ''    if $length == 0;
    return undef if not defined $length;
    return undef if $length < 0;
    return $text if length( $text ) <= $length;
Now we save the original text; this is used later if the truncation process fails for some reason.
my $orig_text = $text;
Now we use Lingua::JA::Jcode::getcode to detect which encoding the text uses. Lingua::JA::Jcode::getcode is a simple wrapper around the jcode.pl Perl library for Japanese character code conversion. Kazumasa Utashiro kindly agreed to let us distribute the code with HTML::Summary.
my $encoding = Lingua::JA::Jcode::getcode( $text );
If getcode returns undef, or a value other than euc, sjis, or jis, then it has either failed to detect the encoding, or detected that it is not one of those that we are interested in. We then take the brute force approach, using substr.
    if ( not defined $encoding or $encoding !~ /^(?:euc|s?jis)$/ ) {
        return substr( $text, 0, $length );
    }
The actual truncation of the string is done in chop_jchars; more on this subroutine in a bit.
$text = chop_jchars( $text, $length, $encoding );
chop_jchars returns undef on failure. If we have failed to truncate the Japanese text properly, we resort to substr again. We had to decide whether it was more important to meet the $length constraint or to risk returning a Japanese string with broken character encoding. We chose the former:
return substr( $orig_text, 0, $length ) unless defined $text;
Next, a special case: JIS encoding uses escape sequences to shift in and out of single-byte and multibyte modes. If the truncation process leaves the text ending in multibyte mode, we need to add the single-byte escape sequence. Therefore, we truncate (at least) three more bytes from a JIS encoded string, so we have room to add the single-byte escape sequence without going over the $length limit.
    if (
        $encoding eq 'jis' and
        $text =~ /$jis_code_set{ TWO_BYTE_CHAR }$/
    )
    {
        $text = chop_jchars( $text, $length - 3, $encoding );
        return substr( $orig_text, 0, $length ) unless defined $text;
        $text .= "\x1b\x28\x42";
    }
And we’re done!
    return $text;
}
Now for chop_jchars, which simply lops off Japanese characters from the end of the string until it is no longer than the requested length. It’s pretty ugly, and slow for large strings truncated to small values, but it does the job!
sub chop_jchars {
    my $text     = shift;
    my $length   = shift;
    my $encoding = shift;

    while ( length( $text ) > $length ) {
        return undef unless $text =~ s!$char_re{ $encoding }$!!o;
    }

    return $text;
}
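To see why whole characters must be stripped, here is a self-contained, EUC-JP-only version of the chop_jchars idea; the hiragana bytes below are hard-coded for illustration.

```perl
use strict;
use warnings;

# One EUC-JP character: either a single ASCII byte or a two-byte
# character with both bytes in the range \xa1-\xfe.
my $euc_char = '(?:[\x00-\x7f]|[\xa1-\xfe][\xa1-\xfe])';

# Strip whole characters from the end until the string fits.
sub chop_euc_chars {
    my ( $text, $length ) = @_;
    while ( length( $text ) > $length ) {
        return undef unless $text =~ s/$euc_char$//;
    }
    return $text;
}

my $text = "abc\xa4\xa2\xa4\xa4";   # 'abc' plus two EUC-JP hiragana
# Asking for 6 bytes removes the whole trailing two-byte character,
# leaving 5 bytes rather than a broken half-character at byte 6:
print length( chop_euc_chars( $text, 6 ) ), "\n";   # 5
```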
[2] Paice, C. “Constructing literature abstracts by computer.” Information Processing and Management, 26(1), 1990.
[3] Earl, L. L. “Experiments in automatic extracting and indexing.” Information Storage and Retrieval, 6:313–334, 1970.
[4] Edmundson, H. P. “New methods in automatic extracting.” Journal of the ACM, 16(2):264-285, 1969.