What makes SBD difficult?

Breaking text into sentences is difficult for a number of reasons:

  • Punctuation is frequently ambiguous
  • Abbreviations often contain periods
  • Sentences may be embedded within each other by the use of quotes
  • With more specialized text, such as tweets and chat sessions, we may need to consider the use of new lines or completion of clauses

Punctuation ambiguity is best illustrated by the period. It is frequently used to mark the end of a sentence. However, it is used in a number of other contexts as well, including abbreviations, numbers, e-mail addresses, and ellipses. Other punctuation characters, such as question and exclamation marks, also appear in embedded quotes and in specialized text, such as code, that may be part of a document.

Periods are used in a number of situations:

  • To terminate a sentence
  • To end an abbreviation
  • To end an abbreviation and terminate a sentence
  • For ellipses
  • For ellipses at the end of a sentence
  • Embedded in quotes or brackets

Most sentences we encounter end with a period, which makes them easy to identify. However, when a sentence contains or ends with an abbreviation, it is a bit more difficult to find the boundary. The following sentence contains abbreviations with periods:

"Mr. and Mrs. Smith went to the ball."

In the next two sentences, we have an abbreviation that occurs at the end of the sentence:

"He was an agent of the CIA."

"He was an agent of the C.I.A."

In the last sentence, each letter of the abbreviation is followed by a period. Although not common, this may occur and we cannot simply ignore it.
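To make the problem concrete, here is a minimal sketch of a naive splitter that breaks the text after every period followed by whitespace. The class name NaivePeriodSplitter and the regular expression are assumptions chosen for illustration, not a recommended approach.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: a deliberately naive splitter (hypothetical class name).
final class NaivePeriodSplitter {

    // Splits wherever a period is followed by one or more whitespace characters.
    static List<String> split(String text) {
        return Arrays.asList(text.trim().split("(?<=\\.)\\s+"));
    }

    public static void main(String[] args) {
        String text = "Mr. and Mrs. Smith went to the ball. "
                + "He was an agent of the C.I.A. They left early.";
        // Each fragment is printed in brackets so the split points are visible.
        for (String fragment : split(text)) {
            System.out.println("[" + fragment + "]");
        }
    }
}
```

The three sentences come back as five fragments, because "Mr." and "Mrs." are treated as sentence ends. The periods inside "C.I.A." do not cause a split only because no whitespace follows them.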

Another issue that makes SBD difficult is trying to determine whether or not a word is an abbreviation. We cannot simply treat all uppercase sequences as abbreviations. Perhaps the user typed in a word in all caps by accident or the text was preprocessed to convert all characters to lowercase. Also, some abbreviations consist of a sequence of uppercase and lowercase letters. To handle abbreviations, a list of valid abbreviations is sometimes used. However, the abbreviations are often domain-specific.
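The abbreviation-list idea can be sketched as follows. This is a simplified illustration under stated assumptions: the class name AbbreviationAwareSplitter and the four-entry list are hypothetical, tokens are separated by whitespace, and only the period is treated as a terminator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative only: a splitter that consults a small abbreviation list.
final class AbbreviationAwareSplitter {

    // Hypothetical, deliberately tiny list; real lists tend to be domain-specific.
    private static final Set<String> ABBREVIATIONS =
            Set.of("Mr.", "Mrs.", "Dr.", "C.I.A.");

    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String token : text.trim().split("\\s+")) {
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(token);
            // A token ending in a period closes the sentence unless it is a known abbreviation.
            if (token.endsWith(".") && !ABBREVIATIONS.contains(token)) {
                sentences.add(current.toString());
                current.setLength(0);
            }
        }
        // Keep any trailing text that did not end with a period.
        if (current.length() > 0) {
            sentences.add(current.toString());
        }
        return sentences;
    }
}
```

With this list, "Mr. and Mrs. Smith went to the ball. He left." is split correctly into two sentences. However, "He was an agent of the C.I.A. They left early." comes back as a single sentence, because the sketch cannot tell when an abbreviation also terminates a sentence.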

Ellipses can further complicate the problem. They may be found as a single character (0x85 in the Windows-1252 "extended ASCII" encoding, or the Unicode horizontal ellipsis, U+2026) or as a sequence of three periods. In addition, Unicode defines the vertical ellipsis (U+22EE) and the presentation form for the vertical horizontal ellipsis (U+FE19). Besides these, there are HTML encodings, such as &hellip;. In Java, such characters are written with Unicode escapes, for example \uFE19. These variations in encoding illustrate the need for good preprocessing of text before it is analyzed.
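One possible preprocessing step is to map every ellipsis variant to a single canonical character before looking for sentence boundaries. The sketch below assumes that U+2026 is the canonical form and that folding the vertical forms into it is acceptable for the application. The class name EllipsisNormalizer is hypothetical.

```java
// Illustrative only: folds several ellipsis encodings into one canonical character.
final class EllipsisNormalizer {

    // Assumed canonical form: the Unicode horizontal ellipsis.
    private static final String ELLIPSIS = "\u2026";

    static String normalize(String text) {
        return text
                .replace("&hellip;", ELLIPSIS)  // HTML named entity
                .replace("&#8230;", ELLIPSIS)   // HTML decimal entity for the same character
                .replace("\u22EE", ELLIPSIS)    // vertical ellipsis
                .replace("\uFE19", ELLIPSIS)    // presentation form for vertical horizontal ellipsis
                .replace("...", ELLIPSIS);      // exactly three periods; a fourth period is kept
    }

    public static void main(String[] args) {
        System.out.println(normalize("And then there was ... one."));
        System.out.println(normalize("And the list goes on and on and &hellip;"));
    }
}
```

Because only runs of exactly three periods are replaced, an ellipsis followed by a sentence-ending period ("....") keeps its final period, which preserves the distinction between an ellipsis in mid-sentence and one at the end of a sentence.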

The next two sentences illustrate possible uses of an ellipsis:

"And then there was … one."

"And the list goes on and on and …"

The second sentence was terminated by an ellipsis. In some situations, as suggested by the MLA handbook (http://www.mlahandbook.org/fragment/public_index), we can use brackets to distinguish ellipses that have been added from ellipses that were part of the original text, as shown here:

"The people […] used various forms of transportation […]" (Young 73).

We will also find sentences embedded in another sentence, such as:

The man said, "That's not right."

Exclamation marks and question marks present other problems, even though these characters occur in fewer contexts than the period. Exclamation marks can appear in places other than the end of a sentence. In the case of some words, such as Yahoo!, the exclamation mark is part of the word. In addition, multiple exclamation marks are used for emphasis, as in "Best wishes!!" This can lead to the identification of multiple sentences where they do not actually exist.
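The effect of repeated terminators can be shown with two regular expressions. This is a rough sketch: the class name TerminatorRuns and both patterns are assumptions made for illustration. The first pattern splits after every terminator, while the second treats a run of terminators as a single boundary.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: compares per-character splitting with run-aware splitting.
final class TerminatorRuns {

    // Splits after every '.', '!' or '?', so "!!" yields an extra fragment.
    static List<String> splitPerTerminator(String text) {
        return Arrays.asList(text.split("(?<=[.!?])"));
    }

    // Splits only after the last terminator of a run and drops the whitespace that follows.
    static List<String> splitPerRun(String text) {
        return Arrays.asList(text.split("(?<=[.!?])(?![.!?])\\s*"));
    }

    public static void main(String[] args) {
        System.out.println(splitPerTerminator("Best wishes!!").size());
        System.out.println(splitPerRun("Best wishes!!").size());
        System.out.println(splitPerRun("I searched Yahoo! for the answer."));
    }
}
```

For "Best wishes!!", the first method reports two fragments and the second reports one. Neither handles a word such as Yahoo!: the run-aware version still breaks "I searched Yahoo! for the answer." into two fragments, so a list of such words or a trained model would be needed on top of this.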
