Chapter 2. Indexing with Local PDF Files

In this chapter we will index and query some local PDF files (a few examples are provided for your tests) as a first use case, even if you do not yet have any knowledge of Solr.
We will get hands-on with both cURL and the browser. We will see how an index is built and how to interact with it in various ways, and we will introduce the web user interface. We will also describe the main concepts behind an index and a core, which will be useful for the examples covered in the subsequent chapters.

Understanding and using an index

The main component in Solr is the Lucene library, a full-text search library written in Java. Since Solr hides the Lucene layer from us, we don't have to study how Lucene works in detail now; you can study it in depth later. Yet it is important to have an idea of what a Lucene index is, and how it's made. Lucene's core concepts are as follows:

Document: This is the main structure used both for searching and for indexing. A document is an in-memory representation of the data values we need for our searches. Every document consists of a collection of fields, which are the simplest data structure.
Field: This has its own name and value, and the value consists of at least one term. So every document can be seen as nothing more than a list of very simple (field name, term value) pairs. If a field is designed to be multivalued, we can save as many values as we want under the same key; otherwise, if we enter a new value, it will simply overwrite the previous one.
Term: This is the basic unit for indexing. For simplicity let's imagine a single word, although a term can also consist of a string of words, depending on configuration details.
Index: This is the in-memory structure on which Lucene (and Solr) perform searches. We can then think of a document as a single record in the index. From an abstract logical point of view, we can easily imagine a data structure as shown in the following figure:

The best way to understand how a generic query works is to focus on documents and imagine how we would search for them. Suppose we search for the string Solr Book in the field title. If the index has been created and the fields in our query exist, we expect Lucene to look for matches of the name-value pair title:'Solr Book', iterating over all the documents currently added to the index.
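
To make this logical view concrete, here is a minimal Python sketch of it. It is only a toy model of the document-oriented picture described above, not how Lucene or Solr are implemented or called: the Document class, the add and search helpers, and the sample titles and author names are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Document:
    """Toy document: maps each field name to a list of values (hypothetical model)."""
    fields: Dict[str, List[str]] = field(default_factory=dict)

    def add(self, name: str, value: str, multivalued: bool = False) -> None:
        if multivalued:
            # A multivalued field keeps every value under the same key.
            self.fields.setdefault(name, []).append(value)
        else:
            # A single-valued field keeps only the last value entered.
            self.fields[name] = [value]


def search(index: List[Document], name: str, value: str) -> List[Document]:
    """Scan every document and keep those whose field `name` holds `value`.

    Documents that lack the field simply never match.
    """
    return [doc for doc in index if value in doc.fields.get(name, [])]


# Hypothetical sample data, invented for this sketch.
book = Document()
book.add("title", "Draft Title")
book.add("title", "Solr Book")  # overwrites "Draft Title"
book.add("author", "Author A", multivalued=True)
book.add("author", "Author B", multivalued=True)

other = Document()
other.add("title", "Another Book")

index = [book, other]  # the "index" here is just a list of records
matches = search(index, "title", "Solr Book")
print(len(matches), matches[0].fields)
```

The search helper visits every document in turn, which mirrors the intuitive picture of the query title:'Solr Book' and nothing more. A query on a field that no document contains returns an empty list in this sketch.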

This kind of document-oriented representation is often useful, as it is a common way of representing data that many people already use. However, the real internal structure adopted for storing index data (and the actual process of searching over it) is less intuitive, and we will cover it later in this chapter.
