In this chapter, we address several issues surrounding the use of combinations of techniques to solve NLP problems. We start with a brief introduction to the process of preparing data, followed by a discussion of pipelines and their construction. A pipeline is nothing more than a sequence of tasks integrated to solve some problem. Its chief advantage is that elements can be inserted and removed to solve a problem in a slightly different manner.
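The idea of inserting and removing stages can be sketched in a few lines of plain Java. The TextPipeline class and its three stages are hypothetical names invented for this illustration. They do not come from the Stanford API or OpenNLP, and real NLP stages would normally pass richer objects than strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical illustration: a pipeline is an ordered list of stages.
final class TextPipeline {
    private final List<UnaryOperator<String>> stages = new ArrayList<>();

    // Appends a stage to the end of the pipeline.
    TextPipeline add(UnaryOperator<String> stage) {
        stages.add(stage);
        return this;
    }

    // Removes a previously added stage; returns true if it was present.
    boolean remove(UnaryOperator<String> stage) {
        return stages.remove(stage);
    }

    // Feeds the input through every stage in order.
    String run(String input) {
        String result = input;
        for (UnaryOperator<String> stage : stages) {
            result = stage.apply(result);
        }
        return result;
    }

    public static void main(String[] args) {
        UnaryOperator<String> trim = String::trim;
        UnaryOperator<String> lowerCase = String::toLowerCase;
        UnaryOperator<String> collapseSpaces = s -> s.replaceAll("\\s+", " ");

        TextPipeline pipeline = new TextPipeline()
                .add(trim).add(lowerCase).add(collapseSpaces);
        System.out.println(pipeline.run("  The   Quick Brown   Fox "));
        // prints "the quick brown fox"

        // Removing one stage changes the result without touching the others.
        pipeline.remove(lowerCase);
        System.out.println(pipeline.run("  The   Quick Brown   Fox "));
        // prints "The Quick Brown Fox"
    }
}
```

Keeping each stage as an independent object is what makes the second run possible: the lowercase step is dropped while trimming and space collapsing continue to work unchanged.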
The Stanford API supports a good pipeline architecture, which we have used repeatedly in this book. We will expand upon the details of this approach and then show how OpenNLP can be used to construct a pipeline.
Preparing data for processing is an important first step in solving many NLP problems. We introduced the data preparation process in Chapter 1, Introduction to NLP, and then discussed the normalization process in Chapter 2, Finding Parts of Text. In this chapter, we will focus on extracting text from different data sources, specifically HTML, Word, and PDF documents.
The Stanford StanfordCoreNLP class is a good example of a pipeline that is easily used. In a sense, it is preconstructed. The actual tasks performed depend on the annotators added. This works well for many types of problems.
However, other NLP APIs do not support a pipeline architecture as directly as the Stanford API does. Although pipelines built with these APIs are more difficult to construct, they can be more flexible for many applications. We demonstrate this construction process using OpenNLP.
Text extraction is an early step in most NLP tasks. Here, we will quickly cover how text extraction can be performed for HTML, Word, and PDF documents. Although there are several APIs that support these tasks, we will use Boilerpipe for HTML documents, Apache POI for Word documents, and Apache PDFBox for PDF documents.
Some APIs support the use of XML for input and output. For example, the Stanford XMLUtils class provides support for reading XML files and manipulating XML data. LingPipe's XMLParser class will parse XML text.
Organizations store their data in many forms and frequently it is not in simple text files. Presentations are stored in PowerPoint slides, specifications are created using Word documents, and companies provide marketing and other materials in PDF documents. Most organizations have an Internet presence, which means that much useful information is found in HTML documents. Due to the widespread nature of these data sources, we need to use tools to extract their text for processing.
There are several libraries available for extracting text from HTML documents. We will demonstrate how to use Boilerpipe (https://code.google.com/p/boilerpipe/) to perform this operation. This is a flexible API that not only extracts the entire text of an HTML document but can also extract selected parts of an HTML document such as its title and individual text blocks.
We will use the HTML page at http://en.wikipedia.org/wiki/Berlin to illustrate the use of Boilerpipe. Part of this page is shown in the following screenshot. In order to use Boilerpipe, you will need to download the binary for the Xerces Parser found at http://xerces.apache.org/index.html.
We start by creating a URL object that represents this page as shown here. The try-catch block handles exceptions:
try {
    URL url = new URL("http://en.wikipedia.org/wiki/Berlin");
    …
} catch (MalformedURLException ex) {
    // Handle exceptions
} catch (BoilerpipeProcessingException | SAXException | IOException ex) {
    // Handle exceptions
}
We will use two classes to extract text. The first is the HTMLDocument class that represents the HTML document. The second is the TextDocument class that represents the text within an HTML document. It consists of one or more TextBlock objects that can be accessed individually if needed.
In the following sequence, an HTMLDocument instance is created for the Berlin page by the HTMLFetcher class' fetch method. Its toInputSource method supplies an InputSource, which the BoilerpipeSAXInput class uses to create a TextDocument instance. The TextDocument class' getText method then retrieves the text. This method takes two arguments. The first argument specifies whether to include the TextBlock instances marked as content. The second argument specifies whether noncontent TextBlock instances should be included. In this example, both types of TextBlock instances are included:
HTMLDocument htmlDoc = HTMLFetcher.fetch(url);
InputSource is = htmlDoc.toInputSource();
TextDocument document = new BoilerpipeSAXInput(is).getTextDocument();
System.out.println(document.getText(true, true));
The output of this sequence is quite large since the page is large. A partial listing of the output is as follows:
Berlin From Wikipedia, the free encyclopedia Jump to: navigation , search This article is about the capital of Germany. For other uses, see Berlin (disambiguation) . ... Privacy policy About Wikipedia Disclaimers Contact Wikipedia Developers Mobile view
The getTextBlocks method will return a list of TextBlock objects for the document. Various methods allow you to access the text and information about the text such as the number of words in a block.
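To make the two getText flags and the per-block information more concrete, the following is a toy model written in plain Java. It is not Boilerpipe code: the Block class, its content flag, and the wordCount method are hypothetical stand-ins, and the sample strings are made up. It only illustrates how a pair of include flags can select blocks and how a word count per block might be derived.

```java
import java.util.Arrays;
import java.util.List;

final class BlockFilterDemo {
    // Toy stand-in for a text block; NOT Boilerpipe's TextBlock class.
    static final class Block {
        final String text;
        final boolean content;

        Block(String text, boolean content) {
            this.text = text;
            this.content = content;
        }

        // Counts whitespace-separated tokens in the block.
        int wordCount() {
            String trimmed = text.trim();
            return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        }
    }

    // Joins the blocks selected by the two flags, one block per line.
    static String getText(List<Block> blocks,
                          boolean includeContent, boolean includeNonContent) {
        StringBuilder sb = new StringBuilder();
        for (Block block : blocks) {
            boolean wanted = block.content ? includeContent : includeNonContent;
            if (wanted) {
                sb.append(block.text).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Block> blocks = Arrays.asList(
                new Block("Jump to: navigation, search", false),
                new Block("This article is about the capital of Germany.", true),
                new Block("Privacy policy", false));

        System.out.print(getText(blocks, true, true));   // all three blocks
        System.out.print(getText(blocks, true, false));  // only the article sentence
        for (Block block : blocks) {
            System.out.println(block.wordCount() + " words, content=" + block.content);
        }
    }
}
```

Passing true for both flags reproduces the "include everything" behavior used in the example above, while true and false would keep only the blocks flagged as content.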
The Apache POI Project (http://poi.apache.org/index.html) is an API used to extract information from Microsoft Office products. It is an extensive library that allows information extraction from Word documents and other office products, such as Excel and Outlook.
When downloading the API for POI, you will also need to use XMLBeans (http://xmlbeans.apache.org/), which supports POI. The binaries for XMLBeans can be downloaded from http://www.java2s.com/Code/Jar/x/Downloadxmlbeans230jar.htm.
Our interest is in demonstrating how to use POI to extract text from Word documents. For this purpose, we will use a file called TestDocument.docx, as shown in the following screenshot:
There are several different file formats used by different versions of Word. To simplify the selection of which text extraction class to use, we will use the ExtractorFactory factory class.
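One reason a factory can choose an extractor from the file alone is that the common formats begin with distinctive byte signatures: a .docx file is a ZIP container, the older .doc format is an OLE2 compound file, and a PDF starts with the characters %PDF. The sketch below checks those signatures in plain Java. The FormatSniffer class is a hypothetical illustration, and it makes no claim about how ExtractorFactory is actually implemented.

```java
import java.nio.charset.StandardCharsets;

final class FormatSniffer {
    enum Format { ZIP_CONTAINER, OLE2, PDF, UNKNOWN }

    // "PK\003\004": local file header of a ZIP archive (.docx, .xlsx, ...).
    private static final byte[] ZIP = {0x50, 0x4B, 0x03, 0x04};
    // Signature of an OLE2 compound document (.doc, .xls, ...).
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};
    // "%PDF" in ASCII.
    private static final byte[] PDF = {0x25, 0x50, 0x44, 0x46};

    // Classifies a file from its leading bytes (eight bytes are enough).
    static Format sniff(byte[] head) {
        if (startsWith(head, ZIP)) return Format.ZIP_CONTAINER;
        if (startsWith(head, OLE2)) return Format.OLE2;
        if (startsWith(head, PDF)) return Format.PDF;
        return Format.UNKNOWN;
    }

    // Returns true when data begins with every byte of signature.
    private static boolean startsWith(byte[] data, byte[] signature) {
        if (data.length < signature.length) {
            return false;
        }
        for (int i = 0; i < signature.length; i++) {
            if (data[i] != signature[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pdfHead = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        byte[] zipHead = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x06, 0x00};
        System.out.println(sniff(pdfHead));      // prints "PDF"
        System.out.println(sniff(zipHead));      // prints "ZIP_CONTAINER"
        System.out.println(sniff(new byte[0]));  // prints "UNKNOWN"
    }
}
```

A ZIP signature alone does not prove that a file is a Word document, since spreadsheets and ordinary archives share it. A real selector would have to look inside the container as well.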
Although POI's capabilities are considerable, the process of extracting text is simple. As shown here, a FileInputStream object representing the file, TestDocument.docx, is used by the ExtractorFactory class' createExtractor method to select the appropriate POITextExtractor instance. This is the base class for several different extractors. The getText method is applied to the extractor to get the text:
try {
    FileInputStream fis = new FileInputStream("TestDocument.docx");
    POITextExtractor textExtractor = ExtractorFactory.createExtractor(fis);
    System.out.println(textExtractor.getText());
} catch (IOException ex) {
    // Handle exceptions
} catch (OpenXML4JException | XmlException ex) {
    // Handle exceptions
}
A part of the output is as follows:
Pirates Pirates are people who use ships to rob other ships. At least this is a common definition. They have also been known as buccaneers, corsairs, and privateers. In ... Our list includes: Gan Ning Awilda ... Get caught Walk the plank This is not a recommended occupation.
It can be useful to know more about a Word document. POI supports a POIXMLPropertiesTextExtractor class that gives us access to the core, extended, and custom properties of a document. There are two ways of readily getting a string containing many of these properties.
The first approach uses the getMetadataTextExtractor method and then the getText method, as shown here:
POITextExtractor metaExtractor = textExtractor.getMetadataTextExtractor();
System.out.println(metaExtractor.getText());
The second approach creates an instance of the POIXMLPropertiesTextExtractor class using an XWPFDocument representing the Word document, as illustrated here:
fis = new FileInputStream("TestDocument.docx");
POIXMLPropertiesTextExtractor properties =
    new POIXMLPropertiesTextExtractor(new XWPFDocument(fis));
System.out.println(properties.getText());
The output of either approach is shown here:
Created = Sat Jan 03 18:27:00 CST 2015
CreatedString = 2015-01-04T00:27:00Z
Creator = Richard
LastModifiedBy = Richard
LastPrinted = Sat Jan 03 18:27:00 CST 2015
LastPrintedString = 2015-01-04T00:27:00Z
Modified = Mon Jan 05 14:01:00 CST 2015
ModifiedString = 2015-01-05T20:01:00Z
Revision = 3
Application = Microsoft Office Word
AppVersion = 12.0000
Characters = 762
CharactersWithSpaces = 894
Company =
HyperlinksChanged = false
Lines = 6
LinksUpToDate = false
Pages = 1
Paragraphs = 1
Template = Normal.dotm
TotalTime = 20
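If only this combined string is available, it can be turned into a lookup table with a few lines of plain Java. The following is a minimal sketch that assumes one "Name = Value" pair per line, as in the listing above. The PropertyTextParser class is a hypothetical helper, not part of POI.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class PropertyTextParser {
    // Splits each line at its first '=' and trims both sides.
    // Lines without '=' are skipped; insertion order is preserved.
    static Map<String, String> parse(String propertyText) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String line : propertyText.split("\\R")) {
            int eq = line.indexOf('=');
            if (eq < 0) {
                continue;
            }
            String name = line.substring(0, eq).trim();
            String value = line.substring(eq + 1).trim();
            result.put(name, value);
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Creator = Richard\nCompany = \nPages = 1\n";
        Map<String, String> properties = parse(text);
        System.out.println(properties.get("Creator"));            // prints "Richard"
        System.out.println(properties.get("Company").isEmpty());  // prints "true"
        System.out.println(properties.get("Pages"));              // prints "1"
    }
}
```

Splitting at the first equals sign keeps any later equals signs inside the value, and an empty value such as the Company entry is stored as an empty string rather than dropped.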
There is a CoreProperties class that holds a core set of properties for the document. The getCoreProperties method provides access to these properties:
CoreProperties coreProperties = properties.getCoreProperties();
System.out.println(properties.getCorePropertiesText());
These properties are listed here:
Created = Sat Jan 03 18:27:00 CST 2015
CreatedString = 2015-01-04T00:27:00Z
Creator = Richard
LastModifiedBy = Richard
LastPrinted = Sat Jan 03 18:27:00 CST 2015
LastPrintedString = 2015-01-04T00:27:00Z
Modified = Mon Jan 05 14:01:00 CST 2015
ModifiedString = 2015-01-05T20:01:00Z
Revision = 3
There are individual methods that give access to specific properties, such as the getCreator, getCreated, and getModified methods. Extended properties, represented by the ExtendedProperties class, are available through the getExtendedProperties method, as shown here:
ExtendedProperties extendedProperties = properties.getExtendedProperties();
System.out.println(properties.getExtendedPropertiesText());
The output is as follows:
Application = Microsoft Office Word
AppVersion = 12.0000
Characters = 762
CharactersWithSpaces = 894
Company =
HyperlinksChanged = false
Lines = 6
LinksUpToDate = false
Pages = 1
Paragraphs = 1
Template = Normal.dotm
TotalTime = 20
Methods such as getApplication, getAppVersion, and getPages give access to specific extended properties.
The Apache PDFBox (http://pdfbox.apache.org/) project is an API for processing PDF documents. It supports the extraction of text and other tasks such as document merging, form filling, and PDF creation. We will only illustrate the text extraction process.
To demonstrate the use of PDFBox, we will use a file called TestDocument.pdf. This file was saved as a PDF document using the TestDocument.docx file as shown in the Using POI to extract text from Word documents section.
The process is straightforward. A File object is created for the PDF document. The PDDocument class represents the document and the PDFTextStripper class performs the actual text extraction using the getText method, as shown here:
File file = new File("TestDocument.pdf");
// try-with-resources closes the document even if extraction fails
try (PDDocument pdDocument = PDDocument.load(file)) {
    PDFTextStripper stripper = new PDFTextStripper();
    String text = stripper.getText(pdDocument);
    System.out.println(text);
} catch (IOException ex) {
    // Handle exceptions
}
Due to its length, only part of the output is shown here:
Pirates Pirates are people who use ships to rob other ships. At least this is a common definition. They have also been known as buccaneers, corsairs, and privateers. In ... Our list includes: Gan Ning Awilda ... 4. Get caught 5. Walk the plank This is not a recommended occupation.
The extraction includes special characters for the bullets and numbers for the numbered sequences.
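If these list markers get in the way of later processing, they can be stripped with a regular expression. The following is a minimal sketch in plain Java, assuming that markers appear at the start of a line and that removing them is actually desired. The ListMarkerCleaner class and the set of bullet characters it recognizes are assumptions made for illustration, not part of PDFBox.

```java
import java.util.regex.Pattern;

final class ListMarkerCleaner {
    // Matches, at the start of a line, optional indentation followed by either
    // a bullet-like symbol (U+2022, U+00B7, U+25AA, or the private-use U+F0B7)
    // or a number ending in '.' or ')', and then at least one space or tab.
    private static final Pattern LIST_MARKER = Pattern.compile(
            "^[ \\t]*(?:[\\u2022\\u00B7\\u25AA\\uF0B7]|\\d+[.)])[ \\t]+",
            Pattern.MULTILINE);

    // Removes the leading list marker from every line that has one.
    static String stripListMarkers(String text) {
        return LIST_MARKER.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String extracted = "\u2022 Gan Ning\n"
                + "4. Get caught\n"
                + "5. Walk the plank\n"
                + "This is not a recommended occupation.";
        System.out.println(stripListMarkers(extracted));
    }
}
```

Running the example prints the four lines without the bullet and without the numbers 4 and 5. The pattern is deliberately narrow, but it would also remove a number that legitimately starts a line and is followed by a period and a space, so it should be checked against real output before use.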