Chapter 6. Indexing Data Using Apache Tika

In previous chapters, we saw how to use Solr's data import handler to index data from various data sources (JDBC and file data sources). In this chapter, we'll see how to index data from various file formats, such as MS Word, Excel, PDF, and many more. We'll cover the following topics:

Introducing Apache Tika
Configuring Apache Tika in Solr
Indexing PDF and Word documents

Introducing Apache Tika

Apache Tika is an open source library for document type detection and content extraction across many file formats. It relies on existing document parsers and type detection techniques to detect and extract data. Using Tika, we can build a universal type detector and content extractor that pulls both structured text and metadata from different types of documents, such as spreadsheets, text documents, images, PDFs, and even multimedia formats. Apache Tika provides a single API for parsing different file formats: the existing parser libraries are encapsulated under a single interface, called the Parser interface.
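To make the idea of "detect the type, then parse through one interface" concrete, here is a minimal sketch in plain Java. It does not use Tika's API: the names UniformParserSketch, DocumentParser, detectType, and parse are hypothetical, the detection covers only two magic-byte signatures, and the PDF parser is a placeholder. It only illustrates the design in which callers hand over bytes and never pick a format-specific parser themselves.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class UniformParserSketch {

    /** Hypothetical single parsing interface; NOT the signature of Tika's real Parser. */
    public interface DocumentParser {
        String extractText(byte[] content);
    }

    /** Registry mapping a detected media type to the parser that handles it. */
    static final Map<String, DocumentParser> PARSERS = new LinkedHashMap<>();
    static {
        // Plain text: decode as UTF-8 and trim surrounding whitespace.
        PARSERS.put("text/plain", c -> new String(c, StandardCharsets.UTF_8).trim());
        // Placeholder only: a real system would delegate to a PDF parsing library here.
        PARSERS.put("application/pdf", c -> "[PDF content placeholder]");
    }

    /** Detects a media type from leading "magic" bytes; falls back to text/plain. */
    public static String detectType(byte[] content) {
        if (startsWith(content, "%PDF-".getBytes(StandardCharsets.US_ASCII))) {
            return "application/pdf";
        }
        if (startsWith(content, new byte[] {'P', 'K', 3, 4})) {
            return "application/zip"; // ZIP container, also used by .docx and .xlsx
        }
        return "text/plain";
    }

    /** Returns true if data begins with the given prefix bytes. */
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /** Single entry point: detect the type, then dispatch to the registered parser. */
    public static String parse(byte[] content) {
        String type = detectType(content);
        DocumentParser parser = PARSERS.get(type);
        if (parser == null) {
            throw new IllegalArgumentException("No parser registered for " + type);
        }
        return parser.extractText(content);
    }

    public static void main(String[] args) {
        byte[] text = "  hello Solr  ".getBytes(StandardCharsets.UTF_8);
        System.out.println(detectType(text) + " -> " + parse(text));
    }
}
```

Adding support for another format in this sketch means registering one more DocumentParser; the calling code stays unchanged, which is the benefit the single-interface design is after.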
