How text analysis works

We have seen an overview of Solr's text analysis; now let's look at how Solr actually implements the analysis process. Here are the common steps Solr typically performs during analysis:

  • Removing stop words:
    • Stop words (common words) such as a, an, the, at, to, for, and so on are removed from the text string, so Solr does not index or match on them. These words are configured in a text file (for example, stopwords_en.txt), and this file needs to be referenced in the analysis configuration (a sample file is shown after this list).
  • Adding synonyms:
    • Solr reads synonyms from a text file (for example, synonyms.txt) and adds them to the token stream. All synonyms need to be preconfigured in the text file, such as Football and Soccer, Country and Nation, and so on (see the sample synonyms file after this list).
  • Stemming the words:
    • Solr transforms words into a base form using language-specific rules
    • It transforms removing → remove (removes ing)
    • It transforms searches → search (converts to singular)
  • Setting all to lowercase:
    • Solr converts all tokens to lowercase
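
A minimal sketch of what these two configuration files may contain (the file names stopwords_en.txt and synonyms.txt match the configuration used later in this section; the entries are illustrative):

# stopwords_en.txt -- one stop word per line
a
an
the
at
to
for
of

# synonyms.txt -- comma-separated terms are treated as equivalent
Football, Soccer
Country, Nation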

These are the common steps that Solr performs in normal scenarios. But in real life, things are rarely that simple: we cannot predict which search patterns, or even which languages, end users will use. To meet such varied requirements, Solr relies on three powerful tools:

  • Analyzer: The analyzer examines the text of a field and generates the token stream accordingly
  • Tokenizer: The tokenizer splits the input string at some delimiter and generates a token stream; for example, Mastering Apache Solr becomes Mastering, Apache, and Solr when splitting at white spaces
  • Filter: The filter performs one of these tasks:
    • Adding: Adds new tokens to the stream, such as adding synonyms
    • Removing: Removes tokens from the stream, such as stop words
    • Conversion: Converts tokens from one form to another, such as uppercase to lowercase

We will explore each one of these in detail later in this chapter. By using these three tools, Solr becomes a powerful search engine to meet any complex search requirement.
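
As a quick sketch of how these three pieces fit together in a schema (the field type name text_simple is only an illustrative placeholder; the tokenizer and filter factories are standard Solr classes):

<fieldType name="text_simple" class="solr.TextField">
  <analyzer>
    <!-- Tokenizer: splits "Mastering Apache Solr" into Mastering, Apache, Solr -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Removing filter: drops stop words such as a, an, the from the stream -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
    <!-- Conversion filter: lowercases every token -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>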

Solr provides an Admin Console UI that helps us understand the analysis process, querying, and so on. There we can see exactly what is happening and which steps are executed during index-time and query-time analysis. To access the admin console, navigate to http://localhost:8983/solr.

In the admin console, go to Dashboard | Core Selector, select your configured example, and click on Analysis. Here we are using Solr's built-in example, techproducts.
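
The Analysis screen is backed by Solr's field analysis handler. As a sketch, assuming the default /analysis/field handler is available for the techproducts core, the same analysis chains can be exercised directly with a request such as the following (the sample values mirror the example used below):

curl "http://localhost:8983/solr/techproducts/analysis/field?analysis.fieldtype=text_en&analysis.fieldvalue=The%20Nation%20of%20soccer&analysis.query=Famous%20country%20for%20football"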

We have configured the text_en field type as follows in the managed-schema.xml file:

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
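
To put this field type to use, a field in the schema must reference it; as a minimal sketch (the field name description is just an illustrative placeholder):

<field name="description" type="text_en" indexed="true" stored="true"/>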

Now let's see the full chain, including the lower case filter (LCF), applied during index time and query time:

Text analysis applied during index time on the input string:

Input: The Nation of soccer

Output: nation and soccer

Analysis applied:

  • It is split at white spaces
  • Stop words (The, of) removed
  • All set to lowercase

Text analysis applied during query time on the input string:

Input: Famous country for football

Output: famous, nation, country, soccer, and football

Analysis applied:

  • Split at white spaces
  • Stop words (for) removed
  • Synonyms added (nation for country and soccer for football)
  • All set to lowercase

As we know, analysis takes place in both phases (index time and query time), so we have configured two <analyzer> elements distinguished by the type attribute value (index for index time and query for query time), each with its own tokenizer and set of filters. The configuration for each phase may vary based on requirements. It is also possible to define a single <analyzer> element without the type attribute and configure the tokenizer and filters inside it; this applies the same configuration to both phases, which is useful for cases where we want perfect string matching. We will discuss this later when we explore the analyzer in detail.
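
As a minimal sketch of that single-analyzer style (the field type name text_exact is only an illustrative placeholder), the same chain runs at both index time and query time; here solr.KeywordTokenizerFactory keeps the whole input as a single token, which is one common way to get case-insensitive exact matching:

<fieldType name="text_exact" class="solr.TextField">
  <analyzer>
    <!-- No type attribute: applied at both index time and query time -->
    <!-- Keeps the entire input string as a single token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Lowercases the token so matching is case-insensitive -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>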

Now consider the preceding analysis example. During the index and query phases, the configured tokenizers and filters are executed and a final token stream is generated for each phase. The index-time stream (nation and soccer) is stored in the index. At query time, the final token stream of the query phase (famous, nation, country, soccer, and football) is matched against the final token stream of the index phase (nation and soccer), and we can see that there are matching tokens (nation and soccer) in both streams. So both The Nation of soccer and Famous country for football lead to the same results. This is how the text analysis process works. Here we have only provided an overview of the analyzer, tokenizer, and filters; we will look at each of these analysis tools in detail later in this chapter.
