Tools for text summarization

Since our focus in this book is data mining with Python, we will concentrate on understanding some of the tools, libraries, and applications designed for text summarization in a Python environment. However, if you ever find yourself in a non-Python environment, or if you have a special case where you want to use an off-the-shelf or non-Python solution, you will be glad to know that there are dozens of text summarization tools for other programming environments, many of which require no programming at all. In fact, the autotldr bot we discussed at the beginning of this chapter uses a package called SMMRY, which has a REST API that returns JSON. You can read more about SMMRY at http://smmry.com/api.
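If you did want to call SMMRY from Python anyway, a REST request only takes a few lines. The sketch below uses the requests library; the endpoint and the SM_API_KEY, SM_URL, and SM_LENGTH parameter names are my recollection of SMMRY's documentation, so treat them as assumptions and confirm them against http://smmry.com/api before relying on them:

import requests

# Hypothetical request to the SMMRY REST API. The parameter names below
# (SM_API_KEY, SM_URL, SM_LENGTH) are assumptions based on SMMRY's public
# documentation and should be verified at http://smmry.com/api.
params = {
    'SM_API_KEY': 'your-api-key-here',
    'SM_URL': 'http://example.com/some-article.html',
    'SM_LENGTH': 7  # requested number of summary sentences
}
response = requests.get('http://api.smmry.com/', params=params)
print(response.json())  # SMMRY returns its summary and metadata as JSON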

Here we will discuss three Python solutions: a simple NLTK-based method, a Gensim-based method, and a Python summarization package called Sumy.

Naive text summarization using NLTK

So far in this book, we have used NLTK for a variety of tasks, including sentiment mining in Chapter 5, Sentiment Analysis in Text, and named entity recognition in Chapter 6, Named Entity Recognition in Text. For our purposes in this chapter, we can also use the tokenizers from NLTK to build a simple text summarizer based on the method Luhn described in The automatic creation of literature abstracts. This basic, extractive program will first tokenize the text sample into sentences and words, then identify the words that occur most frequently while excluding unimportant words, called stopwords, and finally find the sentence or sentences that contain the most important words.

Our program will include one other subtle effect. We will construct a score for each sentence based on the aggregated scores of the words inside it. For example, suppose the word cat appears in a text 10 times, and the word hat appears three times. The sentence The cat wears a hat will score a 13, or the score for cat plus the score for hat. The sentence His hat is different than her hat will only score 3, since the score for hat is counted only once. This scoring system has the advantage of privileging sentences that have a variety of important words in them, while minimizing the effect of sentences that have fewer important words, even if those words are repeated multiple times. The rationale for this scoring system is that sentences with a variety of important words are more likely to be topic sentences or main ideas, and topic sentences are more relevant for building a summary.
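Here is a minimal sketch of that scoring idea, separate from the full program we build below, assuming we already have each important word's score in a dictionary:

# Toy illustration of the sentence-scoring scheme described above.
word_scores = {'cat': 10, 'hat': 3}

def score_sentence(sentence, word_scores):
    # Each important word counts once per sentence, no matter how many
    # times it is repeated within that sentence.
    words_in_sentence = set(sentence.lower().split())
    return sum(score for word, score in word_scores.items()
               if word in words_in_sentence)

print(score_sentence('The cat wears a hat', word_scores))                # 13
print(score_sentence('His hat is different than her hat', word_scores))  # 3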

The code for this program, simpleTextSummaryNLTK.py, is available on the GitHub site for this book, at https://github.com/megansquire/masteringDM/tree/master/ch7. This GitHub site also contains the code samples for the other two summarizers used later in the chapter.

First we need to include a few libraries. We have used all of these in previous chapters, but if you skipped those, you can install libraries in Anaconda with conda install <packagename>:

from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.probability import FreqDist
from nltk.corpus import stopwords
from collections import OrderedDict
import pprint

Next, we need a sample of text to summarize. For the example, I decided to use the text from the first portion of this chapter. You can get the text from the GitHub site or paste in some other text sample of your choosing. Following the text variable creation, we instantiate several data structures to hold the sentences and counts:

text = '' # put text here
summary_sentences = []
candidate_sentences = {}
candidate_sentence_counts = {}

Now we should strip any carriage returns out of the text, for easier reading when the sentences are displayed. This code replaces carriage returns in the text sample with space characters:

striptext = text.replace('\n\n', ' ')
striptext = striptext.replace('\n', ' ')

Next we will get the list of the top 20 most frequent words in this text sample. We will tokenize the text sample into words, lowercase them, and make sure we throw out any stopwords and punctuation:

words = word_tokenize(striptext)
lowercase_words = [word.lower() for word in words
                  if word not in stopwords.words() and word.isalpha()]

We can use NLTK's FreqDist class to find the frequency distribution of the remaining words, and its most_common() method to pull out the top 20 words from this list. This list is fairly informative, so we can print it to the screen to look at it:

word_frequencies = FreqDist(lowercase_words)
most_frequent_words = word_frequencies.most_common(20)
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(most_frequent_words)

Here, we will take the stripped text sample and tokenize it into a list of sentences. For each sentence, we will create a dictionary with the sentence itself as the key and its lowercase equivalent as the value:

sentences = sent_tokenize(striptext)
for sentence in sentences:
   candidate_sentences[sentence] = sentence.lower()

Now it is time to determine which of these sentences has the most important words in it. We iterate through the dictionary of candidate sentences, searching the lowercase version of the sentence for the important words. If we find an important word, we increment the score for that sentence by the value of the word we found. We save the original mixed-case version of the sentence, and its score, to the candidate_sentence_counts dictionary:

for long, short in candidate_sentences.items():
   count = 0
   for freq_word, frequency_score in most_frequent_words:
       if freq_word in short:
           count += frequency_score
           candidate_sentence_counts[long] = count

Finally, we sort the sentences, placing the most important ones first. We retain the top four sentences and print them to the screen, along with their scores:

sorted_sentences = OrderedDict(sorted(
                    candidate_sentence_counts.items(),
                    key = lambda x: x[1],
                    reverse = True)[:4])
pp.pprint(sorted_sentences)

This program chooses the following words as the most significant from this chapter:

[   ('summarization', 15),
   ('text', 15),
   ('in', 4),
   ('extractive', 4),
   ('method', 4),
   ('summary', 4),
   ('words', 4),
   ('sentences', 4),
   ('original', 3),
   ('information', 3),
   ('documents', 3),
   ('ideas', 3),
   ('goal', 3),
   ('similar', 3),
   ('techniques', 3),
   ('literature', 3),
   ('early', 2),
   ('main', 2),
   ('this', 2),
   ('video', 2)]

Then, the program produces the following four-sentence summary:

In an extractive summarization method, the summary is comprised of words, phrases, or sentences that are drawn directly from the original text. With this early work, Luhn proposed a text summarization method where the computer would read each sentence in a paper, extract the frequently occurring words, which he calls significant words, and then look for the sentences that had the most examples of those significant words. In the academic literature, text summarization is often proposed as a solution to information overload, and we in the 21st century like to think that we are uniquely positioned in history in having to deal with this problem. Luhn's 1958 paper "The automatic creation of literature abstracts," describes a text summarization method that will "save a prospective reader time and effort in finding useful information in a given article or report" and that the problem of finding information "is being aggravated by the ever-increasing output of technical literature."

If we try to treat this as a standalone summary, like a literature abstract, it is not very effective. The summary lacks the cohesion that a human would bring to the summarization task. The sentences seem disjointed and lack a proper flow from one to the next. This method also seems to be biased towards longer sentences, which makes sense when we consider that longer sentences will end up with higher scores, if only because more word scores are aggregated within them. Finally, this method does not take into account the placement of a sentence within its paragraph. Some academic research has shown that the main ideas, or topic sentences, in a text are likely to appear either first or last in a paragraph. If we wanted to add any of these additional features to our program, we would need to decide how to adjust our scoring system accordingly.
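If we did decide to experiment with those adjustments, one possibility, sketched below with arbitrary weights, would be to normalize each score by sentence length and award a small bonus to sentences that open or close their paragraph. This is only an illustration of how the adjustment might look, not part of the chapter's program:

def adjusted_score(raw_score, sentence, position_in_paragraph, paragraph_length):
    # Normalize by sentence length so that long sentences are not rewarded
    # merely for containing more words.
    word_count = max(len(sentence.split()), 1)
    length_normalized = raw_score / word_count

    # Give a small, arbitrary bonus to the first and last sentences of a
    # paragraph, which are more likely to be topic sentences.
    is_first_or_last = position_in_paragraph in (0, paragraph_length - 1)
    position_bonus = 1.25 if is_first_or_last else 1.0
    return length_normalized * position_bonus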

Text summarization using Gensim

A slightly more sophisticated approach to text summarization is included in the Gensim topic modeling package. Chapter 8, Topic Modeling in Text, is devoted entirely to topic modeling, so we will not duplicate those tasks here. However, we can peek ahead just a little bit and use one tiny part of the Gensim library for our text summarization needs.

The Gensim approach to text summarization is quite different from the naive Luhn approach we used in the last section. To understand it, we need to recall some of the concepts from Chapter 4, Network Analysis, about graph theory and network analysis. The Gensim approach to finding important sentences in a text begins with building a weighted, undirected graph, where the nodes are the sentences and the links between them are a measure of how similar the sentences are to each other. This method is called TextRank, after the similar PageRank algorithm designed to find relevant web pages in search results. In TextRank, similarity is defined by how many lexical tokens the two sentences share. To avoid inadvertently privileging long sentences – which was one of the weaknesses of the naive Luhn approach we experimented with earlier – TextRank normalizes the similarity score by the lengths of the two sentences. After the graph is built, the sentences with the highest weights are extracted as representative summaries of the text as a whole.
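To make the normalization concrete, here is a small sketch of the similarity function as it appears in the original TextRank paper: the number of shared tokens divided by the sum of the logarithms of the two sentence lengths. Note that this is the paper's formula, not Gensim's internal implementation, which builds on the improved weighting described in the next paragraph:

import math

def textrank_similarity(sentence_a, sentence_b):
    # Count the tokens the two sentences share, then normalize by the
    # log of each sentence's length so long sentences are not favored.
    tokens_a = set(sentence_a.lower().split())
    tokens_b = set(sentence_b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    overlap = len(tokens_a & tokens_b)
    denominator = math.log(len(tokens_a)) + math.log(len(tokens_b))
    if overlap == 0 or denominator == 0:
        return 0.0
    return overlap / denominator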

This approach was first described in the 2004 paper TextRank: Bringing Order into Texts by Rada Mihalcea and Paul Tarau, available at https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf. This work was updated by Federico Barrios, Federico Lopez, Luis Argerich, and Rosita Wachenchauzer in 2015, with the results of their experiments in calculating better similarity scores, or weights, in the graph. The latter group has contributed to the Gensim code base, according to the Changelog at https://github.com/RaRe-Technologies/gensim/blob/develop/CHANGELOG.txt.

To get started with text summarization in Gensim, we will first import the library:

import gensim.summarization

If you get an error from Anaconda about not having Gensim installed, remember to use conda install gensim to get that library loaded into your system.

Next we will set up a text string that we want to summarize. I used the same text sample as in our naive Luhn summarizer. By using the same text string, we will more easily be able to compare the results of the two text summarizers to each other. We can also strip off end-of-line carriage returns since this text sample has sentences that extend over multiple lines:

text = '' # put text here
striptext = text.replace('\n\n', ' ')
striptext = striptext.replace('\n', ' ')

Next, we summarize and print the results:

summary = gensim.summarization.summarize(striptext, word_count=50) 
print(summary)

With the word_count parameter, we asked for a summary of about 50 words; since Gensim extracts whole sentences, the result can run somewhat longer than the requested count. The resulting summary is:

Luhn's 1958 paper "The automatic creation of literature abstracts," describes a text summarization method that will "save a prospective reader time and effort in finding useful information in a given article or report" and that the problem of finding information "is being aggravated by the ever-increasing output of technical literature." With this early work, Luhn proposed a text summarization method where the computer would read each sentence in a paper, extract the frequently-occurring words, which he calls significant words, and then look for the sentences that had the most examples of those significant words.

This is actually a pretty good pairing of sentences, and reads fairly well. To see how Gensim chose those sentences, we can ask Gensim to pick out the keywords from the text sample and print them:

keywords = gensim.summarization.keywords(striptext) 
print(keywords)

Gensim returns a list of the following 22 words:

documents
document
literature text summarization
abstracts
abstractive
ideas
idea
shorter
words
summary
summaries
information
reader
type
types
luhn
important
sentence
sentences
extract
extractive
extracted

Note

Notice that the Gensim keywords() function does not naturally combine words that have the same stem. In other words, the keywords extract, extractive, and extracted could be combined into a single concept but keywords() does not do this.
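If we wanted to collapse those variants ourselves, one option is to run the keyword list through an NLTK stemmer after the fact. This is a quick sketch of that idea, not something the keywords() function does for us:

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
grouped_keywords = {}
for keyword in keywords.split('\n'):
    # Group each keyword (or the last word of a phrase) under its stem, so
    # that extract, extractive, and extracted all land in the same bucket.
    stem = stemmer.stem(keyword.split()[-1])
    grouped_keywords.setdefault(stem, []).append(keyword)

for stem, variants in grouped_keywords.items():
    print(stem, variants)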

Additionally, with the Gensim summarizer, not all the keywords have to be a single word; multi-word phrases are supported. The result literature text summarization is an example of such a phrase. How did this rather awkward phrase get chosen? My first hypothesis was that perhaps Gensim was getting confused by the comma in the first sentence of the text sample. But when I removed the comma, it still returned literature text summarization as a keyword. I then tested the summarizer by rewording the beginning of the sentence as follows: Text summarization is often proposed... After doing this, the keywords() function returned a new list of 23 words. The lists are very similar, except that text summarization is now the first result, the word size was added, and the order of some of the words was slightly altered.

The Gensim keyword list has a few things in common with the Luhn-style list created earlier, namely the words text and summarization are in both sets, as are extractive, words, literature, ideas, information, summary, and documents. Between the two summarizers, that does seem to be a good list of words.
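Before we leave Gensim, it is worth noting that word_count is not the only way to control the output. The summarize() function also accepts a ratio parameter, which keeps a given fraction of the original sentences (0.2 by default), and a split flag that returns a Python list of sentences instead of a single string:

# Keep roughly the top tenth of the sentences, returned as a list of strings
summary_sentences = gensim.summarization.summarize(striptext,
                                                   ratio=0.1,
                                                   split=True)
for sentence in summary_sentences:
    print(sentence)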

Text summarization using Sumy

There are many more methods for text summarization than just our naive Luhn-style approach and Gensim's TextRank approach. If we are not sure which one to try, or perhaps we want to try even more exotic methods, we can turn to a full-fledged text summarization library that has multiple algorithms built in. Sumy is just such a library, and it is available at https://github.com/miso-belica/sumy. Because it is not part of the Anaconda distribution, we will have to install it manually by running the following command inside a terminal:

pip install sumy

Once we have Sumy installed, we can set up a simple text summarizer and try out the different algorithms that are built in. The following code shows four different text summarization algorithms implemented in Sumy, and we will walk through each of them in turn. Before doing so, we must first import a number of Sumy's summarizers, as well as its stemmer and stopword utilities. This code is available on the GitHub site for this book at https://github.com/megansquire/masteringDM/blob/master/ch7/sumySummarize.py:

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.summarizers.text_rank import TextRankSummarizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.summarizers.edmundson import EdmundsonSummarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words

Then we set up our language to be english, and our number of summary sentences to be four:

LANGUAGE = "english"
SENTENCES_COUNT = 4

Note

In addition to English, Sumy has stopword lists available for Czech, French, German, Portuguese, Slovak, and Spanish.

Next, we read in our sample file. Here Sumy will be directed to read sampleText.txt, a file that consists of the exact same text as we used in our earlier two examples:

parser = PlaintextParser.from_file("sampleText.txt", Tokenizer(LANGUAGE))
stemmer = Stemmer(LANGUAGE)

Here the last line creates a stemmer for our chosen language; we will pass it to each summarizer so that Sumy's built-in stemming takes care of word stems. Recall that stemming was one of the features we did not have in our previous two algorithms, but when we looked at the keyword lists and how repetitive they were, stemming seemed like a good idea.

Sumy's Luhn summarizer

Now we are ready to call four different summarization methods so that we can compare them. Each one prints a header line before its results so that we can tell the output of the four methods apart. Sumy's Luhn-based method is first:

print("
====== Luhn ======")
summarizerLuhn = LuhnSummarizer(stemmer)
summarizerLuhn.stop_words = get_stop_words(LANGUAGE)
for sentenceLuhn in summarizerLuhn(parser.document, SENTENCES_COUNT):
   print(sentenceLuhn, "
")

The LuhnSummarizer() constructor creates a summarizer that uses our stemmer. The summarizer also requires a list of stopwords, so we use Sumy's get_stop_words() function to supply an English-language list.

The four sentences chosen by the Luhn summarizer are shown here. I added letters (A, B, C, and so on) before each sentence so that we can compare them across summarizers and figure out which sentences are chosen more frequently:

====== Luhn ======
A. However, even in the 1950s when automatic text summarization techniques were in their infancy, the stated goal was similar.
B. Luhn's 1958 paper "The automatic creation of literature abstracts," describes a text summarization method that will "save a prospective reader time and effort in finding useful information in a given article or report" and that the problem of finding information "is being aggravated by the ever-increasing output of technical literature."
C. With this early work, Luhn proposed a text summarization method where the computer would read each sentence in a paper, extract the frequently-occurring words, which he calls significant words, and then look for the sentences that had the most examples of those significant words.
D. As long as the amount of text that is extracted is a subset of the original text, this type of summarization achieves the goal of compressing the original text into a shorter size.

Sumy's TextRank summarizer

Next, we set up a summarizer using TextRank. TextRank also requires a list of stopwords:

print("====== TextRank ======")
summarizerTR = TextRankSummarizer(stemmer)
summarizerTR.stop_words = get_stop_words(LANGUAGE)
for sentenceTR in summarizerTR(parser.document, SENTENCES_COUNT):
   print(sentenceTR, "
")

The four sentences chosen by Sumy's implementation of TextRank are shown next. One sentence, labeled D, was also found in the Luhn summary. The other three sentences labeled E, F, and G are new:

====== TextRank ======
E. With this early work, Luhn proposed a text summarization method where the computer would read each sentence in a paper, extract the frequently-occurring words, which he calls significant words, and then look for the sentences that had the most examples of those significant words.
F. In an extractive summarization method, the summary is comprised of words, phrases, or sentences that are drawn directly from the original text.
D. As long as the amount of text that is extracted is a subset of the original text, this type of summarization achieves the goal of compressing the original text into a shorter size.
G. In this chapter we will focus on summarization techniques for text documents, but researchers are also working on summarization algorithms designed for video, images, sound, and more.

Sumy's LSA summarizer

The next algorithm is based on Latent Semantic Analysis, or LSA. This approach was first introduced for text summarization by Yihong Gong and Xin Liu in a 2001 paper Generic Text Summarization Using Relevance Measure and Latent Semantic Analysis, available at http://www.cs.bham.ac.uk/~pxt/IDA/text_summary.pdf. This work was enhanced in the 2004 paper Using Latent Semantic Analysis in Text Summarization and Summary Evaluation by Josef Steinberger and Karel Ježek, available at http://www.kiv.zcu.cz/~jstein/publikace/isim2004.pdf. We will tackle more of the details behind latent semantic analysis in Chapter 8, Topic Modeling in Text, when we learn about topic modeling, but for now, the basic idea is that the LSA technique forms a matrix with terms on the rows and sentences on the columns, where the value at the intersection of a row and column is how many times that term appears in that sentence. The matrix is then reduced mathematically, using singular value decomposition, so that the terms and sentences are expressed in a smaller set of latent topics. In text summarization, the most important sentences are the ones that carry the most weight on the most significant of these latent topics (a standalone sketch of this matrix idea appears after the results below):

print("====== LSA ======")
summarizerLSA = LsaSummarizer(stemmer)
summarizerLSA.stop_words = get_stop_words(LANGUAGE)
for sentenceLSA in summarizerLSA(parser.document, SENTENCES_COUNT):
   print(sentenceLSA, "
")

The following results indicate that item B was previously found in Luhn, and item F was found in TextRank. Items H and I are unique to LSA:

====== LSA ======
B. Luhn's 1958 paper "The automatic creation of literature abstracts," describes a text summarization method that will "save a prospective reader time and effort in finding useful information in a given article or report" and that the problem of finding information "is being aggravated by the ever-increasing output of technical literature."
F. In an extractive summarization method, the summary is comprised of words, phrases, or sentences that are drawn directly from the original text.
H. Alternatively, an abstractive summarization attempts to distill the key ideas in a text and repackage them into a human-readable, and usually shorter, synthesis.
I. However, since the goal is to create a summary, abstractive methods must also reduce the length of the text while focusing on only retaining the most important concepts in it.
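As promised, here is a standalone sketch of the matrix idea, using numpy rather than Sumy. It builds the term-by-sentence count matrix, reduces it with singular value decomposition, and scores each sentence by its weight in the reduced space, roughly in the spirit of the Steinberger and Ježek measure; it is an illustration only, not Sumy's implementation:

import numpy as np

def lsa_sentence_scores(sentences, num_topics=2):
    # Build the term-by-sentence count matrix: one row per term,
    # one column per sentence.
    vocabulary = sorted({word for s in sentences for word in s.lower().split()})
    matrix = np.zeros((len(vocabulary), len(sentences)))
    for col, sentence in enumerate(sentences):
        for word in sentence.lower().split():
            matrix[vocabulary.index(word), col] += 1

    # Reduce the matrix with singular value decomposition; each column of vt
    # describes one sentence in terms of the latent topics.
    u, sigma, vt = np.linalg.svd(matrix, full_matrices=False)

    # Score each sentence by its weighted length across the top topics; the
    # highest-scoring sentences would be extracted as the summary.
    k = min(num_topics, len(sigma))
    return np.sqrt(np.sum((vt[:k, :].T * sigma[:k]) ** 2, axis=1))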

Sumy's Edmundson summarizer

Finally, we implement another, more complex algorithm, named after H.P. Edmundson and described in his 1969 paper New Methods in Automatic Extracting, which is available at http://courses.ischool.berkeley.edu/i256/f06/papers/edmonson69.pdf. The main difference between the Edmundson approach and, for example, the Luhn approach is that Edmundson allows the analyst to inject certain words as cues that are highly correlated with the importance of a sentence. This cue method is based on the hypothesis that the probable relevance of a sentence is affected by the presence of pragmatic words. Words that point to an important sentence are called bonus words. The opposite of these are stigma words, or those that negatively affect whether a sentence is important. Finally, he allows for null words, which are words that are neutral, or irrelevant, to the importance of a sentence.
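As a toy illustration of just the cue component (Edmundson's full method also weighs key, title, and location features, and this is not how Sumy implements it), a sentence's cue score could simply reward bonus words and penalize stigma words while ignoring null words:

def cue_score(sentence, bonus_words, stigma_words, null_words):
    # Toy cue-method scoring: bonus words raise the score, stigma words
    # lower it, and null words are skipped as irrelevant.
    score = 0
    for word in sentence.lower().split():
        if word in null_words:
            continue
        if word in bonus_words:
            score += 1
        elif word in stigma_words:
            score -= 1
    return score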

In the Sumy implementation of Edmundson, the user is required to configure lists of bonus, stigma, and null words. Edmundson advocates for using statistical analyses of similar documents to come up with these words. In the following example, we will initialize each of these lists with the throwaway token foo just to get the program running, and then we will investigate what happens when we alter the bonus, stigma, and null words:

print("====== Edmonson ======")
summarizerEd = EdmundsonSummarizer(stemmer)
summarizerEd.bonus_words = ('foo')
summarizerEd.stigma_words = ('foo')
summarizerEd.null_words = ('foo')
for sentenceEd in summarizerEd(parser.document, SENTENCES_COUNT):
   print(sentenceEd, "
")

The Edmundson results with foo as the cue words are shown here. Items A and B were found previously in the Luhn summarizer, but items J and K are new:

J. In the academic literature, text summarization is often proposed as a solution to information overload, and we in the 21st century like to think that we are uniquely positioned in history in having to deal with this problem.
A. However, even in the 1950s when automatic text summarization techniques were in their infancy, the stated goal was similar.
B. Luhn's 1958 paper "The automatic creation of literature abstracts," describes a text summarization method that will "save a prospective reader time and effort in finding useful information in a given article or report" and that the problem of finding information "is being aggravated by the ever-increasing output of technical literature."
K. In the next section we will review some of the currently available text summarization libraries and applications.

What happens to the Edmundson summarizer when we adjust the cue words? To perform this experiment, we will first read through the text sample and find words that seem to positively correlate to what we think the topic sentences are, then do the same thing for negative, or stigma, words and null words. We can adjust the code as follows:

print("====== Edmonson ======")
summarizerEd = EdmundsonSummarizer(stemmer)
summarizerEd.bonus_words = ('focus', 'proposed', 'method', 'describes')
summarizerEd.stigma_words = ('example')
summarizerEd.null_words = ('literature', 'however')
for sentenceEd in summarizerEd(parser.document, SENTENCES_COUNT):
   print(sentenceEd, "
")

The results are shown here. Items J and K are still in the output, just like before, but now the results show item I (also found by LSA) and item G (also found by TextRank):

J. In the academic literature, text summarization is often proposed as a solution to information overload, and we in the 21st century like to think that we are uniquely positioned in history in having to deal with this problem.
I. However, since the goal is to create a summary, abstractive methods must also reduce the length of the text while focusing on only retaining the most important concepts in it.
G. In this chapter we will focus on summarization techniques for text documents, but researchers are also working on summarization algorithms designed for video, images, sound, and more.
K. In the next section we will review some of the currently available text summarization libraries and applications.

The Edmundson technique is obviously very configurable, which could be a good thing if we had a list of cue words we felt confident about. However, if we are not really sure what to use as cue words, that same flexibility and configurability becomes a weakness.

Scanning across the four implementations, the sentences that were chosen the most are B, D, F, I, and G. Each of these sentences was chosen at least twice. We can manually organize these sentences into a single paragraph, and we get the following summary:

Luhn's 1958 paper "The automatic creation of literature abstracts," describes a text summarization method that will "save a prospective reader time and effort in finding useful information in a given article or report" and that the problem of finding information "is being aggravated by the ever-increasing output of technical literature." In an extractive summarization method, the summary is comprised of words, phrases, or sentences that are drawn directly from the original text. As long as the amount of text that is extracted is a subset of the original text, this type of summarization achieves the goal of compressing the original text into a shorter size. However, since the goal is to create a summary, abstractive methods must also reduce the length of the text while focusing on only retaining the most important concepts in it. In this chapter we will focus on summarization techniques for text documents, but researchers are also working on summarization algorithms designed for video, images, sound, and more.

This summary is a bit awkward in its style, but it does convey some of the main points of the introduction to this chapter. It is all the more impressive when we consider that it was automatically generated, based on very little domain knowledge or human intervention.
