Chapter 3. Analyzing Your Text Data

In this chapter, we will cover the following topics:

  • Using the enumeration type
  • Removing HTML tags during indexing
  • Storing data outside of Solr index
  • Using synonyms
  • Stemming different languages
  • Using nonaggressive stemmers
  • Using the n-gram approach to do performant trailing wildcard searches
  • Using position increment to divide sentences
  • Using patterns to replace tokens

Introduction

Data indexing can be divided into several parts, and data analysis is one of the most crucial. Analysis defines how the text in your data is divided into terms and what form those terms take. In Solr, this behavior is defined by field types. A type's behavior can be defined for the indexing process, the query process, or both. A type definition is composed of a tokenizer (optionally a different one for indexing than for querying) and filters (both token filters and character filters). The analyzer as a whole operates on the entire value sent to the field, and the tokenizer within it specifies how that value is split up. Each analyzer can have only one tokenizer. The result of the tokenizer is a stream of objects called tokens.

Next in the analysis chain are the filters. They operate on the tokens in the token stream and can change tokens, remove them, or lowercase them, to name just a few examples. Types can have multiple filters, which are run one after another in the order they are defined.
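The tokenizer-then-filters flow described above can be sketched in a few lines of Python. This is a conceptual illustration only, not Solr code: the names `whitespace_tokenizer`, `lowercase_filter`, `stop_filter`, and `analyze` are hypothetical, and the stop word list is an arbitrary assumption made for the example.

```python
from typing import Callable, Iterable, Iterator, List

# Hypothetical stop word list, chosen only for this illustration.
STOP_WORDS = frozenset({"the", "a", "of"})


def whitespace_tokenizer(text: str) -> Iterator[str]:
    """Split the raw field value into a stream of tokens on whitespace."""
    yield from text.split()


def lowercase_filter(tokens: Iterable[str]) -> Iterator[str]:
    """Token filter that changes each token by lowercasing it."""
    for token in tokens:
        yield token.lower()


def stop_filter(tokens: Iterable[str]) -> Iterator[str]:
    """Token filter that removes tokens found in STOP_WORDS (exact match)."""
    for token in tokens:
        if token not in STOP_WORDS:
            yield token


def analyze(
    text: str,
    tokenizer: Callable[[str], Iterator[str]],
    filters: List[Callable[[Iterable[str]], Iterator[str]]],
) -> List[str]:
    """Run exactly one tokenizer, then each filter in the order given."""
    stream: Iterable[str] = tokenizer(text)
    for token_filter in filters:
        stream = token_filter(stream)
    return list(stream)


result = analyze("The Art of Solr", whitespace_tokenizer,
                 [lowercase_filter, stop_filter])
print(result)  # ['art', 'solr']
```

Filter order matters in this sketch: with `stop_filter` placed before `lowercase_filter`, the capitalized "The" would not match the lowercase stop word list and would survive as "the".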

One additional type of filter is the character filter. It does not operate on tokens from the token stream. Instead, it operates on the raw, non-tokenized data and is invoked before that data is sent to the tokenizer. This chapter will focus on data analysis and how to handle day-to-day analysis questions and problems. You'll see how to use char filters, tokenizers, and of course, the filters.
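To show why a character filter has to see the text before it is split into tokens, here is a minimal Python sketch. It is a conceptual illustration, not Solr code: `strip_tags_char_filter` and `whitespace_tokenizer` are hypothetical names, and the regular expression is a deliberately naive stand-in for real HTML stripping.

```python
import re
from typing import List


def strip_tags_char_filter(raw: str) -> str:
    """Character filter: rewrite the raw text, replacing anything that
    looks like a tag with a space. Naive by design; assumes simple markup."""
    return re.sub(r"<[^>]+>", " ", raw)


def whitespace_tokenizer(text: str) -> List[str]:
    """Tokenizer: split the (possibly char-filtered) text on whitespace."""
    return text.split()


raw = "<p>Solr <b>analysis</b></p>"

# Without the character filter, markup is glued onto the tokens.
unfiltered = whitespace_tokenizer(raw)

# With the character filter applied first, only the words remain.
filtered = whitespace_tokenizer(strip_tags_char_filter(raw))

print(unfiltered)  # ['<p>Solr', '<b>analysis</b></p>']
print(filtered)    # ['Solr', 'analysis']
```

Cleaning the markup after tokenization would be harder, because the tags have already been fused with the neighboring words into single tokens.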
