How an analyzer works

Text analysis takes place in two phases: at index time and at query time. We can therefore configure an <analyzer> for each phase, distinguished by the type attribute. In the preceding example, we configured a single <analyzer> element, with its tokenizer and filters, and did not specify the type attribute, so Solr applies the same configuration to both phases (index and query). This type of configuration suits some scenarios, for example when we want to match strings exactly.

It is generally advisable to define a separate <analyzer> for each phase, distinguished by the type attribute. Doing so is required in scenarios where we want to apply some steps at query time but not at index time. For example:

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />
<filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>

Here, the synonym filter (SynonymGraphFilterFactory) is applied at query time but not at index time. If we applied this filter at index time, we would need to reindex the documents every time a new synonym is added to the synonyms.txt file. The synonyms would also be written to the index as additional terms, which increases the index size.

When configuring separate <analyzer> elements for index and query time, we also need to bear in mind that the two configurations must be compatible with each other. For example, we have used LowerCaseFilterFactory in both <analyzer> definitions in the preceding configuration. If we defined the lowercase filter only in the indexing phase and not in the querying phase, a query for Soccer would never match the indexed term soccer.
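The compatibility problem can be reproduced outside Solr with a few lines of Python. This is only a sketch: analyze() is a hypothetical stand-in for an analyzer chain (a whitespace split plus an optional lowercase step), not Solr's actual implementation.

```python
# Hypothetical stand-in for an analyzer chain; this is not Solr code.
def analyze(text, lowercase):
    tokens = text.split()  # simplified whitespace tokenizer
    return [t.lower() for t in tokens] if lowercase else tokens

# Index time: the lowercase step is applied, so the index holds "soccer".
indexed_terms = set(analyze("Soccer World Cup", lowercase=True))

# Incompatible query chain: no lowercase step, so "Soccer" is looked up as-is.
mismatched = [t for t in analyze("Soccer", lowercase=False) if t in indexed_terms]

# Compatible query chain: the same lowercase step is applied on both sides.
matched = [t for t in analyze("Soccer", lowercase=True) if t in indexed_terms]

print(mismatched)  # []
print(matched)     # ['soccer']
```

The query only finds the document when both chains normalize the case in the same way.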

So far we have learned the following:

  • Solr performs text analysis in both phases: index and query time.
  • Solr uses analyzers, tokenizers, and filters for text analysis.
  • Using the analyzer, tokenizer, and filters, Solr examines the input string at index time, normalizes it, and generates a token stream. Solr builds the index based on this token stream.
  • Using the query-time analyzer, tokenizer, and filters, Solr examines the query string, normalizes it, and generates a token stream. This token stream is compared with the terms indexed at index time, and the matching documents are returned.
  • The configuration of an analyzer depends on the search requirements.
  • A single analyzer defined without the type attribute is used as the analysis configuration for both phases.
  • The advisable approach is to define a separate analyzer for each phase so that we can apply the configuration best suited to each phase.
  • When defining the tokenizer and filters, the ordering should be logical.
  • The configurations of the two phases should be compatible with each other.

Now we can apply text analysis to slightly more complex search strings. Let's apply more tokenizers and filters and understand their behavior. Previously, we mentioned the example of searching for The Host Country of Soccer World Cup 2018 and searching for The Host Nation of Football world cup 2018; in both cases, the result should be Russia. Let's see how Solr analyzes this example. The analyzer configuration from the managed-schema.xml file follows. There are two separate <analyzer> elements for index time and query time, distinguished by the type attribute:

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />
<filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>

Text analysis by configuring various tokenizers and filters during index time and query time:

In the preceding screen, we have provided the string The Host Country of Soccer World Cup 2018 at index time and The Host Nation of Football world cup 2018 at query time. 

Analyzer: Index time

String: The Host Country of Soccer World Cup 2018

Solr executes, in sequence, the tokenizer and filters configured inside the <analyzer type="index"> definition and generates a token stream accordingly. It builds the index using this token stream.

StandardTokenizerFactory (ST): Splits the string into tokens at word boundaries (for this input, at white spaces):

<tokenizer class="solr.StandardTokenizerFactory"/>

Input: The Host Country of Soccer World Cup 2018

Output: The, Host, Country, of, Soccer, World, Cup, 2018

The analyzer passes the input text to the first element in the list; here, it is StandardTokenizerFactory. The standard tokenizer splits text at word boundaries, which for this string means at the white spaces. The output of this tokenizer becomes the input to its immediate successor in the chain; here, it is StopFilterFactory.

StopFilterFactory: Removes all stop words (common words) listed in the stopwords_en.txt file:

<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />

Input: The, Host, Country, of, Soccer, World, Cup, 2018 (output of immediate predecessor)

Output: Host, Country, Soccer, World, Cup, 2018

Here all stop words (The, of, and so on) are removed from the stream. Stop words are simply the common words (a, an, the, of, at, for, in, and so on) listed in the stopwords_en.txt file. Once these stop words are removed, they are not indexed, so Solr will not return matches for them. The output of this filter becomes the input to its immediate successor in the chain; here, it is LowerCaseFilterFactory.

The attribute ignoreCase="true" (default false) will ignore case when testing for stop words. If it is true, the stop list should contain lowercase words.
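The effect of ignoreCase can be illustrated with a simplified model of a stop filter. The stop_filter() function and the two-word stop list below are assumptions made for illustration; the real stopwords_en.txt is longer and Solr's filter is implemented differently.

```python
def stop_filter(tokens, stop_words, ignore_case=False):
    """Drop tokens found in stop_words.

    With ignore_case=True, each token is lowercased before the lookup,
    which is why the stop list itself is expected to be lowercase.
    """
    kept = []
    for token in tokens:
        candidate = token.lower() if ignore_case else token
        if candidate not in stop_words:
            kept.append(token)  # surviving tokens keep their original casing
    return kept

stop_words = {"the", "of"}  # assumed excerpt of a stop word list
tokens = ["The", "Host", "Country", "of", "Soccer", "World", "Cup", "2018"]

case_sensitive = stop_filter(tokens, stop_words)
case_insensitive = stop_filter(tokens, stop_words, ignore_case=True)

print(case_sensitive)    # ['The', 'Host', 'Country', 'Soccer', 'World', 'Cup', '2018']
print(case_insensitive)  # ['Host', 'Country', 'Soccer', 'World', 'Cup', '2018']
```

Without case-insensitive matching, the capitalized The survives because only the lowercase form is in the list.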

LowerCaseFilterFactory: Converts all tokens to lowercase:

<filter class="solr.LowerCaseFilterFactory"/> 

Input: Host, Country, Soccer, World, Cup, 2018

Output: host, country, soccer, world, cup, 2018

All the incoming tokens are converted to lowercase.

Now all three components (StandardTokenizerFactory, StopFilterFactory, and LowerCaseFilterFactory) have executed their job sequentially and generated the final token stream:

  • Final token stream: host, country, soccer, world, cup, 2018
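The index-time steps above can be sketched as a small pipeline in Python. The function names are hypothetical, the regular expression only approximates the standard tokenizer, and the stop list is an assumed two-word excerpt; the point is how each component's output feeds the next.

```python
import re

def standard_tokenize(text):
    # Rough approximation: runs of letters and digits become tokens.
    # The real tokenizer follows Unicode word-boundary rules.
    return re.findall(r"\w+", text)

def stop_filter(tokens, stop_words=frozenset({"the", "of"})):
    # Case-insensitive removal, mirroring ignoreCase="true".
    return [t for t in tokens if t.lower() not in stop_words]

def lowercase_filter(tokens):
    return [t.lower() for t in tokens]

def run_chain(text, tokenizer, filters):
    """Tokenize once, then pass the stream through each filter in order."""
    tokens = tokenizer(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens

index_stream = run_chain(
    "The Host Country of Soccer World Cup 2018",
    standard_tokenize,
    [stop_filter, lowercase_filter],
)
print(index_stream)  # ['host', 'country', 'soccer', 'world', 'cup', '2018']
```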

Next, Solr builds the index based on this final token stream.

Analyzer: Query time

String: The Host Nation of Football world cup 2018

At query time, Solr executes, in sequence, the tokenizer and filters configured inside the <analyzer type="query"> definition and generates a token stream accordingly. This token stream is compared with the terms indexed at index time, and the matching result is returned as output.

StandardTokenizerFactory: Splits the string into tokens at word boundaries (for this input, at white spaces):

<tokenizer class="solr.StandardTokenizerFactory"/>

Input: The Host Nation of Football world cup 2018

Output: The, Host, Nation, of, Football, world, cup, 2018

The analyzer passes the input text to the first element in the list; here, it is StandardTokenizerFactory. The standard tokenizer splits text at word boundaries, which for this string means at the white spaces. The output of this tokenizer becomes the input to its immediate successor in the chain; here, it is StopFilterFactory.

StopFilterFactory: Removes all stop words (common words) listed in the stopwords_en.txt file:

<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />

Input: The, Host, Nation, of, Football, world, cup, 2018 (output of immediate predecessor)

Output: Host, Nation, Football, world, cup, 2018

Here all stop words are removed from the stream. The output of this filter will be the input to its immediate successor in the chain, SynonymGraphFilterFactory.

The attribute ignoreCase="true" (default: false) will ignore casing when testing for stop words. If it is true, the stop list should contain lowercase words.

SynonymGraphFilterFactory: Adds synonyms to the token stream:

<filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>

Input: Host, Nation, Football, world, cup, 2018

Output: Host, country, Nation, soccer, Football, world, cup, 2018

The synonyms (country for Nation and soccer for Football) are added to the stream. All synonyms are configured in the synonyms.txt file, as follows. We will look at this filter in detail later in this chapter.

synonyms.txt file: Synonym mapping examples. Blank lines and lines that start with # are ignored:

football,soccer
dumb,stupid,dull
country,nation
smart,clever,bright => intelligent,genius

Here, SynonymGraphFilterFactory is configured at query time but not at index time. We saw the reason for this difference between the index-time and query-time configurations earlier.

The output will be passed to the next filter, LowerCaseFilterFactory.

LowerCaseFilterFactory: Converts all tokens to lowercase:

<filter class="solr.LowerCaseFilterFactory"/>

Input: Host, country, Nation, soccer, Football, world, cup, 2018

Output: host, country, nation, soccer, football, world, cup, 2018

All the incoming tokens are converted to lowercase.

Solr has examined the query string and produced the final token stream using four components (StandardTokenizerFactory, StopFilterFactory, SynonymGraphFilterFactory, and LowerCaseFilterFactory).

Now, these are the final streams of both phases:

  • Indexed stream: host, country, soccer, world, cup, 2018
  • Query stream: host, country, nation, soccer, football, world, cup, 2018

We can see that many tokens from the query stream match tokens from the indexed stream, including country and soccer, which were added as synonyms. In this way, Solr returns the same result for the search strings The Host Country of Soccer World Cup 2018 and The Host Nation of Football world cup 2018.
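The comparison of the two final streams can be checked with a short Python sketch. The token lists are copied from the walkthrough; treating a match as simple membership in the set of indexed terms is a simplification of how Solr actually looks up and scores terms.

```python
# Final token streams taken from the walkthrough.
index_stream = ["host", "country", "soccer", "world", "cup", "2018"]
query_stream = ["host", "country", "nation", "soccer", "football", "world", "cup", "2018"]

indexed_terms = set(index_stream)

# Query tokens that are present in the index, and those that are not.
matching = [t for t in query_stream if t in indexed_terms]
unmatched = [t for t in query_stream if t not in indexed_terms]

print(matching)   # ['host', 'country', 'soccer', 'world', 'cup', '2018']
print(unmatched)  # ['nation', 'football']
```

The original query words nation and football find nothing in the index; it is their synonyms, country and soccer, that connect the query to the indexed document.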

Thus, we have covered:

  • The Solr text analysis mechanism
  • The tasks of the analyzer, tokenizer, and filters
  • Defining a single <analyzer> or multiple <analyzer> elements distinguished by the type attribute, based on search requirements
  • Ordering the configuration steps logically, such as placing LowerCaseFilterFactory after the stop filter and adding the synonym filter at query time only (we explained both examples previously)

Now we are familiar with the Solr text analysis mechanism. We have explained it using a slightly complex example string, but we cannot predict what kind of input end users will enter when searching. Providing accurate results for any pattern of input string is the main aim of any search engine. Solr comes with a number of tokenizers and filters to handle a wide range of input patterns, and it also supports searching in multiple languages. We have covered only a few of the most common tokenizers and filters here; many more are available. Let's understand the behavior of tokenizers in the next section.
