Chapter 7. Automatic Text Summarization

In an era of information overload, the objective of automatic text summarization is to write a program that can reduce the length of a text while preserving the main points of its meaning. The task is somewhat similar to the way an architect might create a scale model of a building. The scale model gives the viewer a sense of the important parts of the structure, but does so with a smaller footprint, fewer details, and without the same expense in time or materials.

Consider Reddit, a news-oriented social media site, with its thousands of news articles posted daily by users. Is it possible to generate a short summary of a news article that preserves the key facts and general meaning of the original story? A few Reddit users created summary bots to do exactly this. These so-called TLDR bots (too long; didn't read) post summaries of user-submitted news stories, usually including a link to the original story and statistics showing the percentage by which they reduced the text. One of these bots is named autotldr, which has its own Reddit user page at https://www.reddit.com/user/autotldr/. Created in 2011, autotldr follows links to news stories and summarizes them in a comment. It always announces itself before its summary, like this: "This is the best tl;dr I could make, original reduced by 73%. (I'm a bot)". Users seem to enjoy the autotldr bot, and its machine-generated news summaries have been up-voted 190,000 times.

So how does this kind of text summarization actually work?

In this chapter, we will learn:

  • What is automatic text summarization and why is it important?
  • How can we build a naive text summarization system from scratch?
  • How can we implement more sophisticated text summarizers and compare their effectiveness?

What is automatic text summarization?

In the academic literature, text summarization is often proposed as a solution to information overload, and we in the 21st century like to think that we are uniquely positioned in history in having to deal with this problem. However, even in the 1950s, when automatic text summarization techniques were in their infancy, the stated goal was similar. H.P. Luhn's 1958 paper, The automatic creation of literature abstracts, available in a number of places online, including at http://altaplana.com/ibm-luhn58-LiteratureAbstracts.pdf, describes a text summarization method intended to "save a prospective reader time and effort in finding useful information in a given article or report," and observes that "the problem of finding information is being aggravated by the ever-increasing output of technical literature."

Luhn proposed a text summarization method in which the computer would read each sentence in a paper, extract the frequently occurring words, which he called significant words, and then look for the sentences that contained the most of those significant words. This is an early example of an extractive method of text summarization. In an extractive summarization method, the summary is composed of words, phrases, or sentences that are drawn directly from the original text. Ideally, every text will have one or more main ideas or topic sentences that can serve as summaries of some portion of the text. An extractive summarization algorithm looks for these important sentences. As long as the extracted text is a shorter subset of the original text, this type of summarization achieves the goal of compressing the original text into a smaller size.
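
To make Luhn's idea concrete, here is a minimal sketch of such an extractive summarizer in Python. Note that this is our own illustration, not Luhn's exact algorithm: the function name, the tiny stopword list, and the parameters are all hypothetical, and Luhn's original significance measure also weighted how densely the significant words cluster within each sentence, which this sketch omits.

import re
from collections import Counter

# A tiny, illustrative stopword list; a real system would use a larger one.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is",
             "it", "that", "this", "for", "on", "as", "with", "by"}

def luhn_style_summary(text, num_sentences=2, top_n_words=10):
    # Split the text into sentences at terminal punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())

    # Count word frequencies, ignoring case and stopwords; the most
    # frequent remaining words play the role of "significant words".
    words = re.findall(r"[a-z']+", text.lower())
    frequencies = Counter(word for word in words if word not in STOPWORDS)
    significant = {word for word, _ in frequencies.most_common(top_n_words)}

    # Score each sentence by how many significant words it contains.
    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(1 for token in tokens if token in significant)

    # Keep the highest-scoring sentences, presented in original order.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]),
                    reverse=True)[:num_sentences]
    return " ".join(sentences[i] for i in sorted(ranked))

Because every sentence returned appears verbatim in the input, the output is guaranteed to be an extract of the original text.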

Alternatively, an abstractive summarization attempts to distill the key ideas in a text and repackage them into a human-readable synthesis. This task is similar to paraphrasing. However, since the goal is to create a summary, abstractive methods must also reduce the length of the text and not just be a restatement of it.
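
For contrast, the following sketch shows one way to experiment with abstractive summarization using a pre-trained model. It assumes the third-party Hugging Face transformers library (and a backend such as PyTorch) is installed; the sample article is invented, and the pipeline's default model generates new sentences rather than copying them from the input.

# Assumed dependencies: pip install transformers torch
from transformers import pipeline

# Load a pre-trained sequence-to-sequence summarization model.
summarizer = pipeline("summarization")

article = (
    "The city council voted on Tuesday to approve the new transit plan. "
    "The plan adds three bus routes and extends light rail service to "
    "the airport, with construction expected to begin next spring."
)

# min_length and max_length bound the generated summary, in tokens.
result = summarizer(article, min_length=10, max_length=40, do_sample=False)
print(result[0]["summary_text"])

Unlike the extractive sketch in the previous section, the printed summary may contain words and phrasings that never appear in the original article.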

For this chapter, we will focus on summarization techniques for text documents. Other researchers are also working on summarization algorithms designed for video, images, sound, and more. Some of these data types lend themselves better to extractive summarization than to abstractive summarization; for example, a video summary should probably consist of clips taken from the video itself. We will focus on single-document summaries in this chapter, but there are also summarization techniques designed to work with collections of documents. The idea behind multi-document summarization is that we can scan across a number of related documents, picking out the main ideas, while ensuring that the resulting summary is free of duplicates and is human-readable.

In the next section, we will review some of the currently available single-document text summarization libraries and applications.
