Chapter 2. Understanding Document Analysis and Creating Mappings

Search is hard, and it becomes harder when both speed and relevancy are required together. Elasticsearch provides many configurable options out of the box that let you take control before you start putting data into it. Elasticsearch is often described as schemaless. As the previous chapter briefly explained, it is not completely schemaless: right after indexing the very first document, it creates a schema for all the fields in that document. The schema matters a lot for better and more relevant search. Equally important is understanding the theory behind the phases of document indexing and search.

In this chapter, we will cover the following topics:

  • Full text search and inverted indices
  • Document analysis
  • Introducing Lucene analyzers
  • Creating custom analyzers
  • Elasticsearch mappings

Text search

Searching is broadly divided into two types: exact term search and full text search. An exact term search looks for the exact terms, for example, a named entity such as the name of a person, location, or organization, or a date. These searches are easier to perform because the search engine simply checks whether each document matches (yes or no) and returns the documents that do.

Full text search, however, is different and more challenging. It refers to searching within text fields, where the text can be structured or unstructured. The text can be written in any human language, and natural language is very hard for a machine to understand well enough to return relevant results. The following are some examples of full text searches:

  • Find all the documents with the term search in the title or content fields, and give a higher score to the results that match in the title
  • Find all the tweets in which people are talking about terrorism and killing and return the results sorted by the tweet creation time

While doing these kinds of searches, we not only want relevant results but also expect that the search for a keyword matches all of its synonyms, root words, and spelling mistakes. For example, terrorism should match terorism and terror, while killing should match kills, kill, and killed.

To serve all these queries, the text-based fields go through an analysis phase before indexing, and based on this analysis, inverted indexes are built. At the time of querying, the same analysis process is applied to the terms that are sent within the queries to match those terms stored in the inverted indexes.
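The idea of running the same analysis at index time and at query time can be sketched in a few lines of Python. This is only a toy illustration: the analyze function and its crude suffix stripping are assumptions made for this example, not the analyzers that Elasticsearch or Lucene actually ship.

```python
import re

def analyze(text):
    """Toy analyzer (hypothetical): lowercase, tokenize, strip a few suffixes."""
    tokens = re.findall(r"[a-z]+", text.lower())
    terms = []
    for token in tokens:
        for suffix in ("ing", "ed", "s"):
            # Strip the suffix only when a stem of at least 3 letters remains.
            if token.endswith(suffix) and len(token) - len(suffix) >= 3:
                token = token[: -len(suffix)]
                break
        terms.append(token)
    return terms

# Index time: the document text is analyzed before its terms are stored.
indexed_terms = analyze("He kills and was killing")

# Query time: the query text goes through the same function.
query_terms = analyze("Killed")

# Both sides were reduced to the same root form, so the query term matches.
print(query_terms[0] in indexed_terms)
```

Because kills, killing, and killed are all reduced to kill, a query for any one of them finds documents containing the others. Real analyzers use far more careful stemming rules than this suffix list.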

TF-IDF

TF-IDF stands for term frequency-inverse document frequency. It is an important factor in Lucene's standard similarity algorithm, which is based on the Vector Space Model (VSM). The TF-IDF weight is a statistical measure of how important a word is to a document in a collection of documents.

Let's see how a TF-IDF weight is calculated to find our term's relevancy:

  • TF (term): (The number of times a term appears in a document) / (The total number of terms in the document)
  • IDF (term): log_e ((The total number of documents) / (The number of documents containing the term))

    Note

    IDF weighs down terms such as the, that, and is, which appear in almost every document, while increasing the importance of rare terms. The log is taken to dampen this ratio so that very rare terms do not dominate the weight.

The TF-IDF weight of a term is the product TF(term) * IDF(term).

In information retrieval, one of the simplest relevancy ranking functions is implemented by summing the TF-IDF weight for each query term. Based on the combined weights for all the terms appearing in a single query, a score is calculated that is used to return the results in a sorted order.
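The formulas above, and the simple ranking function that sums them, can be written out as a minimal Python sketch. The three-document corpus and the function names are made up for illustration, and this is the textbook form only; Lucene's actual scoring applies additional normalization factors that are not shown here.

```python
import math

def tf(term, doc_terms):
    # (times the term appears in the document) / (total terms in the document)
    return doc_terms.count(term) / len(doc_terms)

def idf(term, corpus):
    # log_e(total documents / documents containing the term)
    containing = sum(1 for doc_terms in corpus if term in doc_terms)
    if containing == 0:
        return 0.0  # the term is absent from the whole corpus
    return math.log(len(corpus) / containing)

def score(query_terms, doc_terms, corpus):
    # Simplest ranking function: sum the TF-IDF weight of each query term.
    return sum(tf(t, doc_terms) * idf(t, corpus) for t in query_terms)

# Hypothetical corpus, already tokenized into lowercase terms.
corpus = [
    "the spider sat on the wall".split(),
    "the cat sat on the mat".split(),
    "the dog barked".split(),
]
query = ["spider", "wall"]

# Sort document numbers by descending score for the query.
ranked = sorted(range(len(corpus)),
                key=lambda i: score(query, corpus[i], corpus),
                reverse=True)

print(idf("the", corpus))  # 0.0, because "the" occurs in every document
print(ranked[0])           # 0, the only document containing the query terms
```

Note how the term the, which occurs in every document, ends up with an IDF of zero and therefore contributes nothing to any score.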

Inverted indexes

The inverted index is the heart of a search engine. The primary goal of a search engine is to provide speedy searches while finding the documents in which our search terms occur. Relevancy comes second.

Let's see with an example how inverted indexes are created and why they are so fast. In this example, we have two documents whose content fields contain the following texts:

  • I hate when spiders sit on the wall and act like they pay rent
  • I hate when spider just sit there

While indexing, these texts are tokenized into separate terms, and every unique term is stored in the index along with information such as which documents the term appears in and its position in each of those documents.

The inverted index built with the preceding document texts looks like this:

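Term       Document:Position
I          1:1, 2:1
Hate       1:2, 2:2
When       1:3, 2:3
Spiders    1:4
Sit        1:5, 2:6
On         1:6
Wall       1:8
Spider     2:4
Just       2:5
There      2:7

For brevity, the table omits the remaining terms of the first document (the, and, act, like, they, pay, and rent).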
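Building such an index can be sketched in Python as follows. This is a simplified illustration under stated assumptions: tokens are produced by lowercasing and splitting on whitespace, every token (including the) is counted when numbering positions from 1, and the build_inverted_index function is a name invented for this example rather than anything from Lucene.

```python
from collections import defaultdict

docs = {
    1: "I hate when spiders sit on the wall and act like they pay rent",
    2: "I hate when spider just sit there",
}

def build_inverted_index(docs):
    """Map each unique term to a list of (document id, position) pairs."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        # Positions are 1-based and count every whitespace-separated token.
        for position, token in enumerate(text.lower().split(), start=1):
            index[token].append((doc_id, position))
    return dict(index)

index = build_inverted_index(docs)

print(index["sit"])      # [(1, 5), (2, 6)]
print(index["spider"])   # [(2, 4)]
print(index["spiders"])  # [(1, 4)]
```

A lookup is a single dictionary access on the term, which is why finding the matching documents is fast no matter how long each document is.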

When you search for spider OR spiders, the query is executed against the inverted index, the terms are looked up, and the documents in which they appear are quickly identified. If you search for spider AND spiders, however, you will not get any results, because an AND query requires every term to be present in the same document, and the search engine treats spiders and spider as different terms unless they are normalized into their root forms. For such term normalization, Elasticsearch has a document analysis phase, which we will see in the upcoming sections.
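The difference between the two queries comes down to set operations on the document lists stored for each term. The following Python sketch assumes a tiny hand-written index for the two example documents; search_or and search_and are hypothetical helper names, not Elasticsearch APIs.

```python
# Hypothetical index: each term maps to the set of document ids containing it.
index = {
    "spiders": {1},
    "spider": {2},
    "sit": {1, 2},
    "wall": {1},
}

def search_or(index, terms):
    # OR: a document matches if it contains any of the terms (set union).
    result = set()
    for term in terms:
        result |= index.get(term, set())
    return result

def search_and(index, terms):
    # AND: a document matches only if it contains every term (set intersection).
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()

print(sorted(search_or(index, ["spider", "spiders"])))   # [1, 2]
print(sorted(search_and(index, ["spider", "spiders"])))  # []
print(sorted(search_and(index, ["sit", "spiders"])))     # [1]
```

If both spider and spiders had been normalized to the same root at index time, they would share one entry containing documents 1 and 2, and the AND query would no longer come back empty.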
