Chapter 3. Indexing Data

In the previous chapter, we saw the various analyzers, tokenizers, and filters provided by Solr that help us select the most important data from a given document. In this chapter, we'll see how Solr provides us a way to index this data so that we can run queries on top of it. We'll cover the following topics in this chapter:

  • Defining field types in Solr
  • Creating a custom musicCatalogue example
  • Facet searching

The Solr indexing process can be broken down into two major parts:

  • Converting the document from its native format to XML or JSON, both of which are supported by Solr
  • Adding the documents to the Solr datastore using an API or HTTP POST
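
As a rough illustration of the first part, a song converted to Solr's XML update format might look like the following. The field names anticipate the schema built later in this chapter, and the values are made-up placeholders rather than real catalogue data:

```xml
<!-- Sketch of an XML update message; all values are placeholders -->
<add>
  <doc>
    <field name="songId">1</field>
    <field name="songName">Example Song</field>
    <field name="artistName">Example Artist</field>
    <field name="albumName">Example Album</field>
  </doc>
</add>
```

A message of this shape is what gets sent to Solr in the second part, typically by POSTing it to the core's update handler.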

To better understand these two parts, we'll build an example music catalogue that holds metadata about songs. This metadata can later be used to retrieve important information about a song.

We'll also see how Solr provides various ways of feeding this information into it and how we can retrieve it.

Indexing data in Solr

Data is indexed in Solr in the form of documents. A document is made up of fields, each of which holds a piece of information that Solr uses to provide better search results.

The musicCatalogue core that we'll build will contain fields representing information related to a song. For example, an artistName field will contain the name of the artist who sang the song, and a field such as duration can contain the length of the song. A field also has a data type, which describes the kind of data it can hold. For example, artistName can be described as a text field, whereas the duration field can be of type float or double. The fieldType element in the schema specifies the kind of field to be used by Solr.

Note

More information about floating-point numbers can be found at https://en.wikipedia.org/wiki/Floating_point.

So let's go ahead and create a schema for our musicCatalogue example. As we progress through this chapter, we'll see what Solr provides for each part of that schema.

To create our example, we'll use the default installation of Solr that we set up in Chapter 1, Getting Started, and create a new core in it for our musicCatalogue.

We'll create a directory called musicCatalog in SOLR_HOME//solr. Inside it, we'll create a directory named conf to hold schema.xml and solrconfig.xml, which will be used by Solr:

musicCatalog
---conf
        schema.xml
        solrconfig.xml

The schema.xml file will define the fields that are necessary for our musicCatalogue example. To keep the music catalogue simple, we're going to use the following fields only:

Field name      Data type    Solr field type
songId          Long         solr.TrieLongField
songName        String       solr.StrField
artistName      String       solr.StrField
albumArtist     String       solr.StrField
albumName       String       solr.StrField
songDuration    Double       solr.TrieDoubleField
composer        String       solr.StrField
rating          Float        solr.TrieFloatField
year            Integer      solr.TrieIntField
genre           String       solr.StrField

So let's see what our schema.xml file should look like and the various fields and field types that we can use while building this schema.

A basic schema.xml file looks like this:

<?xml version="1.0" encoding="UTF-8" ?>
<schema name="uniqueSchemaName" version="1.5">
<!--More elements go here -->
</schema>

As we can see from the preceding XML, every schema has a name that distinguishes it from others.

For this example, we'll name it musicCatalogue. A schema can contain the following elements within it:

  • Field types
  • Fields, copy fields, and dynamic fields
  • A unique key
  • A Solr query parser
  • Similarity
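
Put together, these elements sit inside the schema roughly as sketched below. The definitions are abbreviated placeholders, and the flat layout assumes a Solr release that accepts fieldType and field directly under schema; older releases expect them wrapped in types and fields elements:

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="musicCatalogue" version="1.5">
  <!-- Field types -->
  <fieldType name="string" class="solr.StrField"/>
  <fieldType name="long" class="solr.TrieLongField"/>

  <!-- Fields and dynamic fields -->
  <field name="songId" type="long" indexed="true" stored="true" required="true"/>
  <field name="songName" type="string" indexed="true" stored="true"/>
  <field name="tmpField" type="string" indexed="true" stored="true" multiValued="true"/>
  <dynamicField name="*_txt" type="string" indexed="true" stored="true"/>

  <!-- Unique key -->
  <uniqueKey>songId</uniqueKey>

  <!-- Copy fields -->
  <copyField source="songName" dest="tmpField"/>

  <!-- Query parser and similarity settings, if any, also go here -->
</schema>
```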

Introducing field types

A field type in Solr defines how data should be interpreted and how Solr can use it to index the data. A field type can contain an analyzer made up of a tokenizer and filters, which we saw in Chapter 2, Understanding Analyzers, Tokenizers, and Filters. These help Solr refine the data that it needs to index.
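
For comparison, a field type that does carry an analyzer could be declared as follows. This is only a sketch: the name text_lower is hypothetical, and the tokenizer and filter are standard Solr factories chosen purely for illustration:

```xml
<fieldType name="text_lower" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Split the input into word tokens -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Lowercase every token so that matching ignores case -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```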

To keep things simple, we won't be using any filters or analyzers in our example. We've defined the following field types:

    <fieldType name="int" class="solr.TrieIntField"/>
    <fieldType name="float" class="solr.TrieFloatField"/>
    <fieldType name="string" class="solr.StrField"/>
    <fieldType name="double" class="solr.TrieDoubleField"/>
    <fieldType name="long" class="solr.TrieLongField"/>

As we can see, the fieldType element contains the following main attributes:

  • name: This attribute contains the name of the field type, which is referenced later in the type attribute of a field element
  • class: This is the Solr class that implements the data type

By default, Solr supports various data types that we can use when creating a schema.

Note

A list of these field types can be found in the Solr documentation at https://cwiki.apache.org/confluence/display/solr/Field+Types+Included+with+Solr.

After creating the field type element within our schema, let's create the major fields that are necessary for storing information related to the music catalogue.

Defining fields

Fields are the main part of the Solr schema; they tell Solr what information each document carries and how to index it. For our musicCatalogue example, we'll define the following fields:

  <!-- Unique SongID -->
  <field name="songId" type="long" indexed="true" stored="true" required="true" multiValued="false"/>

  <!-- Song name -->
  <field name="songName" type="string" indexed="true" stored="true" required="true" multiValued="false" /> 

  <!-- Artist name -->
  <field name="artistName" type="string" indexed="true" stored="true" required="true" multiValued="false"/>

  <!-- Album Artist -->
  <field name="albumArtist" type="string" indexed="true" stored="true" required="false" multiValued="false"/>

  <!-- Album name -->
  <field name="albumName" type="string" indexed="true" stored="true" required="true" multiValued="false"/>

  <!-- Duration of the Song -->
  <field name="songDuration" type="double" indexed="true" stored="true" required="false" multiValued="false"/>

  <!-- Composer of the song -->
  <field name="composer" type="string" indexed="true" stored="true" required="false" multiValued="false"/>

  <!-- Song rating -->
  <field name="rating" type="float" indexed="true" stored="true" required="false" default="0.0" multiValued="false"/>

  <!-- Year in which the song was published -->
  <field name="year" type="int" indexed="true" stored="true" required="false" multiValued="false"/>

  <!-- Genre of the song (e.g. rock, pop, indie) -->
  <field name="genre" type="string" indexed="true" stored="true" required="false" multiValued="false"/>

  <!-- Temporary field for storing all the information -->
  <field name="tmpField" type="string" indexed="true" stored="true" required="false" multiValued="true"/>

The fields that we have defined use the following attributes:

  • name: This denotes the name of the field
  • type: This is the name of a field type that we set up in the previous section
  • indexed: If this is set to true, Solr indexes the field so that it can be searched
  • stored: If this is set to true, Solr stores the original value of the field so that it can be returned in search results
  • required: This tells Solr that the field is mandatory in every document that we index
  • multiValued: If this is set to true, a document can contain multiple values of the same field
  • default: This is the default value that should be used if there is no value available in the document
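
To see how indexed and stored combine, consider two extra fields. Both are hypothetical and are not part of the musicCatalogue schema; they only illustrate the attribute combinations:

```xml
<!-- Searchable but not returned in results: indexed only (hypothetical field) -->
<field name="searchKeywords" type="string" indexed="true" stored="false" multiValued="true"/>

<!-- Returned in results but not searchable: stored only (hypothetical field) -->
<field name="coverArtUrl" type="string" indexed="false" stored="true"/>
```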

Defining a unique key

A unique key element helps Solr maintain documents in a consistent way. This element is optional and can be used as per the indexing requirements. If we are sure that the data we feed into Solr will never contain duplicate documents, we can leave it out.

The uniqueKey element lets Solr recognize a document it has already indexed, so that a resubmitted document replaces the existing one instead of creating a duplicate. The format is as follows:

<uniqueKey>songId</uniqueKey>

Here, songId is the uniqueKey; Solr treats it much like a primary key in a database.
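
The effect can be sketched with an update message in which two documents share the same songId. All values are placeholders. With songId declared as the uniqueKey, Solr keeps only the later document instead of storing both:

```xml
<add>
  <doc>
    <field name="songId">1</field>
    <field name="songName">Example Song</field>
    <field name="artistName">Example Artist</field>
    <field name="albumName">Example Album</field>
  </doc>
  <!-- Same songId as above: this document replaces the earlier one -->
  <doc>
    <field name="songId">1</field>
    <field name="songName">Example Song (Remastered)</field>
    <field name="artistName">Example Artist</field>
    <field name="albumName">Example Album</field>
  </doc>
</add>
```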

Copy fields and dynamic fields

A copy field tells Solr to copy the value of a source field into a destination field. This feature of Solr can be useful if we want to merge inputs of different elements into a single field. The following is the format by which we can introduce a copyField element in the schema:

<copyField source="sourceElement" dest="destinationElement" />

We can also merge all input fields into a single field using the * symbol in the source attribute. Because several values then end up in the same destination, the destination field should be multiValued. For example:

<copyField source="*" dest="result"/>
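
Applied to the music catalogue, one possible use (an assumption on our part, not something the schema requires) is to gather a few descriptive fields into the multiValued tmpField defined earlier, so that a single field can be searched:

```xml
<!-- Sketch: collect several source fields into the multiValued tmpField -->
<copyField source="songName" dest="tmpField"/>
<copyField source="artistName" dest="tmpField"/>
<copyField source="albumName" dest="tmpField"/>
```

Each copyField line copies one source; tmpField can receive all three values because it is declared with multiValued="true".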

Dynamic copy fields can also be created in Solr using the following format:

<copyField source="*_t" dest="*_copyField"/>

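So, suppose we're sending a field called album_name_t. The asterisk in dest is replaced by whatever the asterisk in source matched, so Solr copies the value into a field called album_name_copyField. That destination name must itself be covered by a field or dynamicField definition in the schema.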
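
A wildcard destination only works if the resulting field name is known to the schema, so a dynamic copy field is normally paired with dynamicField declarations (described next). The following is a minimal sketch; the string type and the attribute choices are assumptions made for illustration:

```xml
<!-- Source side: accepts any incoming field whose name ends in _t -->
<dynamicField name="*_t" type="string" indexed="true" stored="true"/>

<!-- Destination side: lets the copied field names resolve to a definition -->
<dynamicField name="*_copyField" type="string" indexed="true" stored="false"/>

<!-- Copy every *_t field into its *_copyField counterpart -->
<copyField source="*_t" dest="*_copyField"/>
```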

Dynamic fields let us avoid defining every field individually in schema.xml. For example, the following line tells Solr to treat any field whose name ends with _txt as a field of the string type:

<dynamicField name="*_txt" type="string" indexed="true" stored="true"/>
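
With such a rule in place, a document can carry fields that were never declared individually. In the sketch below, mood_txt is a hypothetical field name and all values are placeholders; Solr would accept the field because its name matches the *_txt pattern:

```xml
<add>
  <doc>
    <field name="songId">2</field>
    <field name="songName">Another Example Song</field>
    <field name="artistName">Example Artist</field>
    <field name="albumName">Example Album</field>
    <!-- Not declared in schema.xml; matched by the *_txt dynamic field -->
    <field name="mood_txt">calm</field>
  </doc>
</add>
```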