Chapter 8. Combined Approaches

In this chapter, we will address several issues surrounding the use of combinations of techniques to solve NLP problems. We start with a brief introduction to the process of preparing data. This is followed by a discussion of pipelines and their construction. A pipeline is nothing more than a sequence of tasks integrated to solve some problem. The chief advantage of a pipeline is the ability to insert and remove elements in order to solve a problem in a slightly different manner.
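To make the idea concrete, the following is a minimal sketch of a pipeline as an ordered collection of named stages. The TextPipeline class, its methods, and the stage names are hypothetical and are not part of the Stanford or OpenNLP APIs; they only illustrate how stages can be added and removed:

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Hypothetical illustration: a pipeline as an ordered, editable set of named stages. */
final class TextPipeline {
    // LinkedHashMap preserves insertion order, so stages run in the order they were added
    private final Map<String, UnaryOperator<String>> stages = new LinkedHashMap<>();

    /** Appends a stage to the end of the pipeline. */
    TextPipeline add(String name, UnaryOperator<String> task) {
        stages.put(name, task);
        return this;
    }

    /** Removes the stage with the given name; returns true if it was present. */
    boolean remove(String name) {
        return stages.remove(name) != null;
    }

    /** Runs the stages in order, feeding each stage's output to the next. */
    String run(String input) {
        String result = input;
        for (UnaryOperator<String> task : stages.values()) {
            result = task.apply(result);
        }
        return result;
    }

    public static void main(String[] args) {
        TextPipeline pipeline = new TextPipeline()
            .add("trim", String::trim)
            .add("lowercase", s -> s.toLowerCase(Locale.ROOT))
            .add("collapse-whitespace", s -> s.replaceAll("\\s+", " "));

        // Prints: the quick brown fox
        System.out.println(pipeline.run("  The  Quick   Brown Fox "));

        // Removing one element solves the problem in a slightly different manner
        pipeline.remove("lowercase");
        // Prints: The Quick Brown Fox
        System.out.println(pipeline.run("  The  Quick   Brown Fox "));
    }
}
```

Real NLP pipelines pass richer objects than strings between stages, but the insert-and-remove principle is the same.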

The Stanford API supports a good pipeline architecture, which we have used repeatedly in this book. We will expand upon the details of this approach and then show how OpenNLP can be used to construct a pipeline.

Preparing data for processing is an important first step in solving many NLP problems. We introduced the data preparation process in Chapter 1, Introduction to NLP, and then discussed the normalization process in Chapter 2, Finding Parts of Text. In this chapter, we will focus on extracting text from different data sources, specifically HTML, Word, and PDF documents.

The Stanford StanfordCoreNLP class is a good example of a pipeline that is easily used. In a sense, it is preconstructed. The actual tasks performed are dependent on the annotations added. This works well for many types of problems.

However, other NLP APIs do not support a pipeline architecture as directly as the Stanford API does. Although pipelines built with these APIs are more difficult to construct, they can be more flexible for many applications. We demonstrate this construction process using OpenNLP.

Preparing data

Text extraction is an early step in most NLP tasks. Here, we will quickly cover how text extraction can be performed for HTML, Word, and PDF documents. Although there are several APIs that support these tasks, we will use Boilerpipe for HTML, Apache POI for Word, and Apache PDFBox for PDF documents.

Some APIs support the use of XML for input and output. For example, the Stanford XMLUtils class provides support for reading XML files and manipulating XML data. LingPipe's XMLParser class will parse XML text.

Organizations store their data in many forms and frequently it is not in simple text files. Presentations are stored in PowerPoint slides, specifications are created using Word documents, and companies provide marketing and other materials in PDF documents. Most organizations have an Internet presence, which means that much useful information is found in HTML documents. Due to the widespread nature of these data sources, we need to use tools to extract their text for processing.
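Since each document type calls for a different tool, an application typically decides which one to use from the file itself. The following is a minimal sketch, assuming the file extension is a reliable indicator of the format. The ExtractorChooser class and its libraryFor method are hypothetical; they simply name the library this chapter uses for each type:

```java
import java.util.Locale;

/** Hypothetical helper: picks an extraction library from a file name's extension. */
final class ExtractorChooser {
    /** Returns the name of the library used in this chapter for the given file. */
    static String libraryFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        // Files without an extension fall through to "unsupported"
        String ext = dot < 0
            ? ""
            : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        switch (ext) {
            case "html":
            case "htm":
                return "Boilerpipe";
            case "doc":
            case "docx":
                return "POI";
            case "pdf":
                return "PDFBox";
            default:
                return "unsupported";
        }
    }

    public static void main(String[] args) {
        System.out.println(libraryFor("TestDocument.docx")); // Prints: POI
        System.out.println(libraryFor("TestDocument.pdf"));  // Prints: PDFBox
    }
}
```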

Using Boilerpipe to extract text from HTML

There are several libraries available for extracting text from HTML documents. We will demonstrate how to use Boilerpipe (https://code.google.com/p/boilerpipe/) to perform this operation. This is a flexible API that not only extracts the entire text of an HTML document but can also extract selected parts of an HTML document such as its title and individual text blocks.

We will use the HTML page at http://en.wikipedia.org/wiki/Berlin to illustrate the use of Boilerpipe. Part of this page is shown in the following screenshot. In order to use Boilerpipe, you will need to download the binary for the Xerces Parser found at http://xerces.apache.org/index.html.


We start by creating a URL object that represents this page as shown here. The try-catch block handles exceptions:

try {
    URL url = new URL("http://en.wikipedia.org/wiki/Berlin");
    …
} catch (MalformedURLException ex) {
    // Handle exceptions
} catch (BoilerpipeProcessingException | SAXException 
        | IOException ex) {
    // Handle exceptions
}

We will use two classes to extract text. The first is the HTMLDocument class that represents the HTML document. The second is the TextDocument class that represents the text within an HTML document. It consists of one or more TextBlock objects that can be accessed individually if needed.

In the following sequence, an HTMLDocument instance is created for the Berlin page, and its toInputSource method supplies the input source that the BoilerpipeSAXInput class uses to create a TextDocument instance. The TextDocument class's getText method then retrieves the text. This method takes two arguments. The first specifies whether to include the TextBlock instances marked as content. The second specifies whether noncontent TextBlock instances should be included. In this example, both types of TextBlock instances are included:

HTMLDocument htmlDoc = HTMLFetcher.fetch(url);
InputSource is = htmlDoc.toInputSource();
TextDocument document = 
    new BoilerpipeSAXInput(is).getTextDocument();
System.out.println(document.getText(true, true));

The output of this sequence is quite large since the page is large. A partial listing of the output is as follows:

Berlin
From Wikipedia, the free encyclopedia
Jump to: navigation , search
This article is about the capital of Germany.  For other uses, see Berlin (disambiguation) .
...
Privacy policy
About Wikipedia
Disclaimers
Contact Wikipedia
Developers
Mobile view

The getTextBlocks method will return a list of TextBlock objects for the document. Various methods allow you to access the text and information about the text such as the number of words in a block.

Using POI to extract text from Word documents

The Apache POI Project (http://poi.apache.org/index.html) is an API used to extract information from Microsoft Office products. It is an extensive library that allows information extraction from Word documents and other office products, such as Excel and Outlook.

When downloading the API for POI, you will also need to use XMLBeans (http://xmlbeans.apache.org/), which supports POI. The binaries for XMLBeans can be downloaded from http://www.java2s.com/Code/Jar/x/Downloadxmlbeans230jar.htm.

Our interest is in demonstrating how to use POI to extract text from Word documents. To demonstrate this use, we will use a file called TestDocument.docx, as shown in the following screenshot:


There are several different file formats used by different versions of Word. To simplify the selection of which text extraction class to use, we will use the ExtractorFactory factory class.

Although POI's capabilities are considerable, the process of extracting text is simple. As shown here, a FileInputStream object representing the file, TestDocument.docx, is passed to the ExtractorFactory class's createExtractor method, which selects the appropriate POITextExtractor instance. POITextExtractor is the base class for several different extractors. The getText method is applied to the extractor to get the text:

try {
    FileInputStream fis = 
        new FileInputStream("TestDocument.docx");
    POITextExtractor textExtractor = 
        ExtractorFactory.createExtractor(fis);
    System.out.println(textExtractor.getText());
} catch (IOException ex) {
    // Handle exceptions
} catch (OpenXML4JException | XmlException ex) {
    // Handle exceptions
}

A part of the output is as follows:

Pirates
Pirates are people who use ships to rob other ships. At least this is a common definition. They have also been known as buccaneers, corsairs, and privateers. In
...
Our list includes:
Gan Ning
Awilda
...
Get caught
Walk the plank
This is not a recommended occupation.

It can be useful to know more about a Word document. POI supports a POIXMLPropertiesTextExtractor class that gives us access to core, extended, and custom properties of a document. There are two ways of readily getting a string containing many of these properties.

  • The first approach uses the getMetadataTextExtractor method and then the getText method, as shown here:
    POITextExtractor metaExtractor = 
        textExtractor.getMetadataTextExtractor();
    System.out.println(metaExtractor.getText());
  • The second approach creates an instance of the POIXMLPropertiesTextExtractor class using XWPFDocument representing the Word document, as illustrated here:
    fis = new FileInputStream("TestDocument.docx");
    POIXMLPropertiesTextExtractor properties = 
        new POIXMLPropertiesTextExtractor(new XWPFDocument(fis));
    System.out.println(properties.getText());

The output of either approach is shown here:

Created = Sat Jan 03 18:27:00 CST 2015
CreatedString = 2015-01-04T00:27:00Z
Creator = Richard
LastModifiedBy = Richard
LastPrinted = Sat Jan 03 18:27:00 CST 2015
LastPrintedString = 2015-01-04T00:27:00Z
Modified = Mon Jan 05 14:01:00 CST 2015
ModifiedString = 2015-01-05T20:01:00Z
Revision = 3
Application = Microsoft Office Word
AppVersion = 12.0000
Characters = 762
CharactersWithSpaces = 894
Company = 
HyperlinksChanged = false
Lines = 6
LinksUpToDate = false
Pages = 1
Paragraphs = 1
Template = Normal.dotm
TotalTime = 20

There is a CoreProperties class that holds a core set of properties for the document. The getCoreProperties method provides access to these properties:

CoreProperties coreProperties = properties.getCoreProperties();
System.out.println(properties.getCorePropertiesText());

These properties are listed here:

Created = Sat Jan 03 18:27:00 CST 2015
CreatedString = 2015-01-04T00:27:00Z
Creator = Richard
LastModifiedBy = Richard
LastPrinted = Sat Jan 03 18:27:00 CST 2015
LastPrintedString = 2015-01-04T00:27:00Z
Modified = Mon Jan 05 14:01:00 CST 2015
ModifiedString = 2015-01-05T20:01:00Z
Revision = 3

Individual methods, such as getCreator, getCreated, and getModified, give access to specific properties. Extended properties, represented by the ExtendedProperties class, are available through the getExtendedProperties method, as shown here:

ExtendedProperties extendedProperties = 
    properties.getExtendedProperties();
System.out.println(properties.getExtendedPropertiesText());

The output is as follows:

Application = Microsoft Office Word
AppVersion = 12.0000
Characters = 762
CharactersWithSpaces = 894
Company = 
HyperlinksChanged = false
Lines = 6
LinksUpToDate = false
Pages = 1
Paragraphs = 1
Template = Normal.dotm
TotalTime = 20

Methods such as getApplication, getAppVersion, and getPages give access to specific extended properties.

Using PDFBox to extract text from PDF documents

The Apache PDFBox (http://pdfbox.apache.org/) project is an API for processing PDF documents. It supports the extraction of text and other tasks such as document merging, form filling, and PDF creation. We will only illustrate the text extraction process.

To demonstrate the use of PDFBox, we will use a file called TestDocument.pdf. This file was created by saving the TestDocument.docx file, used in the Using POI to extract text from Word documents section, as a PDF document.

The process is straightforward. A File object is created for the PDF document. The PDDocument class represents the document and the PDFTextStripper class performs the actual text extraction using the getText method, as shown here:

File file = new File("TestDocument.pdf");
// try-with-resources closes the document even if extraction fails
try (PDDocument pdDocument = PDDocument.load(file)) {
    PDFTextStripper stripper = new PDFTextStripper();
    String text = stripper.getText(pdDocument);
    System.out.println(text);
} catch (IOException ex) {
    // Handle exceptions
}

Due to its length, only part of the output is shown here:

Pirates 
Pirates are people who use ships to rob other ships. At least this is a common definition. They have also been known as buccaneers, corsairs, and privateers. In
...
Our list includes: 
 Gan Ning 
 Awilda 
...
4. Get caught 
5. Walk the plank 
This is not a recommended occupation.

The extraction includes special characters for the bullets and numbers for the numbered sequences.
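If these markers are not wanted in later processing, they can be removed with a regular expression. The following is a minimal sketch that uses only the standard Java library. The ListMarkerCleaner class is hypothetical, and the set of bullet characters it recognizes is an assumption that may need adjusting for other documents:

```java
import java.util.regex.Pattern;

/** Hypothetical helper: strips a leading bullet glyph or list number from a line. */
final class ListMarkerCleaner {
    // Optional leading whitespace, then a bullet glyph (U+2022, U+00B7, or the
    // private-use U+F0B7 often produced by symbol fonts) or digits followed by
    // a period, then at least one whitespace character
    private static final Pattern MARKER =
        Pattern.compile("^\\s*(?:[\\u2022\\u00B7\\uF0B7]|\\d+\\.)\\s+");

    /** Returns the line without its list marker and without surrounding whitespace. */
    static String stripMarker(String line) {
        return MARKER.matcher(line).replaceFirst("").trim();
    }

    public static void main(String[] args) {
        System.out.println(stripMarker("\uF0B7 Gan Ning "));   // Prints: Gan Ning
        System.out.println(stripMarker("4. Get caught "));     // Prints: Get caught
        System.out.println(stripMarker("Pirates "));           // Prints: Pirates
    }
}
```

Because the pattern requires whitespace after the period, a line that starts with a decimal number such as 3.14 is left unchanged.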
