Text summarization

The goal of text summarization is to generate a concise and coherent summary that captures the major information of the input. Most summarization systems perform three major steps, as follows:

  1. Build an intermediate representation that captures the key points of the input text.
  2. Score the sentences of the input based on that representation.
  3. Select the top-ranked sentences to assemble a final summary that represents the input documents.
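As a concrete illustration of these three steps, here is a minimal R sketch that uses a simple word-frequency representation to score and select sentences; the tokenization, the scoring rule, and the sample sentences are illustrative assumptions, not a prescribed implementation.

# A minimal sketch: frequency-based representation, sentence scoring,
# and top-N sentence selection.
summarize_extractive <- function(sentences, n = 2) {
  # Step 1: intermediate representation - word frequencies over the input
  tokens <- lapply(sentences, function(s) {
    strsplit(tolower(gsub("[[:punct:]]", "", s)), "\\s+")[[1]]
  })
  freq <- table(unlist(tokens))
  # Step 2: score each sentence by the average frequency of its words
  scores <- sapply(tokens, function(tok) mean(freq[tok]))
  # Step 3: keep the N highest-scoring sentences, in their original order
  chosen <- sort(head(order(scores, decreasing = TRUE), n))
  sentences[chosen]
}

sentences <- c("Text summarization produces a concise summary.",
               "A summary keeps the key information of the input.",
               "Unimportant clauses can be removed.")
summarize_extractive(sentences, n = 2)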

One popular strategy is to remove unimportant information, clauses, or sentences while building classifiers to ensure that the key information is not discarded; in other words, the relative importance of topics drives the summarization process. The final result is presented in a coherent way.

Summarization can also be a dynamic, ongoing process. First, summaries are built over the existing document collection (multidocument summarization); then, as new documents arrive, they are summarized with respect to the result of the first step.

Because generating new sentences for a summary is difficult, one solution is extractive summarization, that is, extracting the most relevant sentences from the document or document set. As the document set grows, the time evolution of topics also becomes an important issue, for which statistical topic models such as time-series-based algorithms have been proposed.

There are two popular choices for the intermediate representation: topic representation and indicator representation. Sentence scoring assigns each sentence a score based on a combination of features. Summary sentence selection then picks the N most important sentences.

Several characteristics describe a final summary: it can be indicative or informative, extractive or abstractive, generic or query-oriented, background or just-the-news, monolingual or cross-lingual, and based on a single document or multiple documents.

The benefit of text summarization is improved efficiency of document processing: the summary of a document helps the reader decide whether to analyze that document for a specific purpose. One example is summarizing a multilingual, very large, dynamic collection of documents from various sources, including the Web. Applications include summarization of medical articles, e-mail, web pages, speech, and so on.

Topic representation

Topic representations such as the topic signature play an important role in document-summarization systems. Various topic representations have been proposed, such as the topic signature, the enhanced topic signature, the thematic signature, and so on.

The topic signature is defined as a set of related terms: topic is the target concept, and signature is a list of terms, each related to the topic with a specific weight. Each term can be a stemmed content word, a bigram, or a trigram:

TS(topic) = {topic, signature} = {topic, <(t1, w1), ..., (tn, wn)>}

The topic term selection process is as follows. The input document set is divided into two sets: texts relevant to the topic (R) and nonrelevant texts (¬R). Two hypotheses are defined:

Hypothesis 1 (H1): P(R | t_i) = p = P(R | ¬t_i)
Hypothesis 2 (H2): P(R | t_i) = p1 ≠ p2 = P(R | ¬t_i)

The first hypothesis states that the relevance of a document is independent of the term, whereas the second hypothesis indicates strong relevance when the term is present, assuming p1 >> p2. The counts are arranged in the following 2 x 2 contingency table:

                 R        ¬R
  t_i            O11      O12
  ¬t_i           O21      O22

O11 is the frequency of the term t_i occurring in R, and O12 is the frequency of t_i occurring in ¬R; O21 is the frequency of terms other than t_i occurring in R, and O22 is the frequency of those terms occurring in ¬R.

The likelihood of both hypotheses is calculated as follows; here, b denotes the binomial distribution:

b(k; n, x) = C(n, k) * x^k * (1 - x)^(n - k)
L(H1) = b(O11; O11 + O12, p) * b(O21; O21 + O22, p)
L(H2) = b(O11; O11 + O12, p1) * b(O21; O21 + O22, p2)

Candidate terms are then ranked by the log-likelihood ratio statistic -2 log λ = -2 log [L(H1) / L(H2)], and the top-ranked terms form the signature.

The algorithm to create the topic signature for a given topic is illustrated as follows:

(Figure: the algorithm for constructing the topic signature of a given topic.)
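To make the preceding computation concrete, here is a small R sketch that evaluates the -2 log λ statistic from the contingency counts O11, O12, O21, and O22, which could be used to rank candidate terms; the example counts are assumptions made for illustration, not part of the original algorithm figure.

# Log-likelihood ratio statistic for one candidate term, computed from its
# 2 x 2 contingency counts against the relevant set R.
log_likelihood_ratio <- function(o11, o12, o21, o22) {
  n1 <- o11 + o12                 # total occurrences of the term
  n2 <- o21 + o22                 # total occurrences of all other terms
  p  <- (o11 + o21) / (n1 + n2)   # H1: P(R) is the same with or without the term
  p1 <- o11 / n1                  # H2: P(R | term present)
  p2 <- o21 / n2                  # H2: P(R | term absent)
  loglik <- function(k, n, x) dbinom(k, n, x, log = TRUE)
  log_lambda <- loglik(o11, n1, p) + loglik(o21, n2, p) -
                loglik(o11, n1, p1) - loglik(o21, n2, p2)
  -2 * log_lambda
}

# Example: the term occurs 30 times in R and 5 times in ¬R, while all other
# terms occur 970 and 995 times, respectively.
log_likelihood_ratio(30, 5, 970, 995)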

The multidocument summarization algorithm

Here is the Graph-Based Sub-topic Partition (GSPSummary) algorithm for multidocument summarization:

(Figure: the GSPSummary algorithm.)

The GSPRankMethod is defined as follows:

(Figure: the definition of GSPRankMethod.)

The Maximal Marginal Relevance algorithm

Maximal Marginal Relevance (MMR), which is particularly suited to query-based and multidocument summarization, selects the most important sentence in each iteration of sentence selection, while requiring that each newly selected sentence has minimal similarity to the sentences already selected.

MMR = argmax over D_i in R \ S of [ λ * Sim1(D_i, Q) - (1 - λ) * max over D_j in S of Sim2(D_i, D_j) ]

Here, Q is the query, R is the set of candidate sentences, S is the set of already-selected sentences, Sim1 and Sim2 are similarity measures, and λ (between 0 and 1) trades off relevance against redundancy.

The summarized algorithm of MMR is as follows:

(Figure: the MMR-based summarization algorithm.)
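Before looking at the bundled implementation, the following is a minimal R sketch of the MMR selection loop based on the formula above; the cosine similarity helper, the term-frequency matrix, and the λ value are illustrative assumptions, and this is not the ch_10_mmr.R code from the bundle.

# Sketch of MMR selection over a term-frequency matrix: rows of `mat` are
# sentences, `query` is a term vector, and lambda trades off relevance to
# the query against redundancy with already-selected sentences.
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

mmr_select <- function(mat, query, n = 2, lambda = 0.7) {
  selected <- integer(0)
  candidates <- seq_len(nrow(mat))
  while (length(selected) < n && length(candidates) > 0) {
    score <- sapply(candidates, function(i) {
      relevance  <- cosine(mat[i, ], query)
      redundancy <- if (length(selected) == 0) 0 else
        max(sapply(selected, function(j) cosine(mat[i, ], mat[j, ])))
      lambda * relevance - (1 - lambda) * redundancy
    })
    best <- candidates[which.max(score)]
    selected <- c(selected, best)
    candidates <- setdiff(candidates, best)
  }
  selected   # indices of the selected sentences, in selection order
}

mat <- rbind(c(1, 1, 0, 0), c(1, 0, 1, 0), c(0, 1, 1, 1))
query <- c(1, 1, 0, 0)
mmr_select(mat, query, n = 2, lambda = 0.7)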

The R implementation

Please take a look at the R code file ch_10_mmr.R from the code bundle for the preceding algorithms. The code can be tested with the following command:

> source("ch_10_mmr.R")