Time for action – writing a simple schema.xml file

We can define the structure of a Solr document by writing the schema.xml file (by defining its fields); we can also define data manipulation strategies, such as tokenizing text so that we work on single words instead of full phrases. The steps for writing a simple schema.xml file are as follows:

  1. The focus here is on how to model the data on which we will perform searches and navigation for our specific domain of interest. We are only shaping metadata, or let's say a 'projection' of the data, that is useful for our searches. A typical structure for the schema.xml file will involve the following elements:
    <schema name='simple' version='1.1'>
      <types> … </types>
      <fields> … </fields>
      <uniqueKey> … </uniqueKey>
          …
    </schema>
  2. We can write a simple schema and save it as /SolrStarterbook/solr-app/chp02/conf/schema.xml:
    <?xml version='1.0' ?>
    <schema name='simple' version='1.1'>
      <types>
        <fieldType name='string' class='solr.StrField' />
        <fieldType name='long' class='solr.TrieLongField' />
      </types>
    
      <fields>
        <field name='id' type='long' required='true' />
        <field name='author' type='string' multiValued='true' />
        <field name='title' type='string' />
        <field name='text' type='string' />
        <dynamicField name='*_string' type='string' 
          multiValued='true' indexed='true' stored='true' />
        <copyField source='*' dest='fullText' />
        <field name='fullText' type='string' multiValued='true' indexed='true' />
      </fields>
    
      <uniqueKey>id</uniqueKey>
      <defaultSearchField>fullText</defaultSearchField>
      <solrQueryParser defaultOperator='OR' />
    
    </schema>

Note how we have defined different fields for handling different types of data.
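To try the schema out, we could post a small test document whose fields match the ones defined previously (this uses the standard Solr XML update message format; the field values here are invented purely for illustration):

```xml
<add>
  <doc>
    <field name="id">1</field>
    <field name="author">Herman Melville</field>
    <field name="title">Moby-Dick</field>
    <field name="text">Call me Ishmael.</field>
    <!-- these two values match the *_string dynamicField pattern -->
    <field name="tags_string">novel</field>
    <field name="tags_string">classic</field>
  </doc>
</add>
```

Because of the copyField directive with the * wildcard, each of these values is also copied into the fullText field, so a query without an explicit field name will match against any of them.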

What just happened?

It is important to underline the difference between storing the actual data and creating an index over the metadata manipulated and derived (extracted, projected, filtered, and so on) from it. With Solr we usually deal with the second case, even though in special cases we may also be interested in storing the actual data, using Solr as a NoSQL database, as we will see later.

We may not necessarily be interested in describing all the data we have (for example, what we have in our databases), but only what can be relevant in the search and navigation context.

At first glance this can look like a duplication of functionality between Solr and relational database technology, but it is not. Solr is not designed to replace traditional relational databases. In most cases Solr is used in parallel with the relational database, to expose a simple and efficient full-text API over the DBMS data. As we will see later, the data can not only be indexed but also stored in Solr, so it is even possible to adopt it as a NoSQL store in certain cases.

You should easily recognize the essential parts of this file as follows:

  • Types: This is used for defining a list of different data types for values. We can define strings, numeric types, or new types as we like. It's very common to have two or three different data types for handling text values shaped for different purposes, but for the moment we need to focus on the main concepts.

    Tip

    If you are not a programmer or you are not familiar with data types, I suggest you start by using the basic string type. When you have something working, you can move on to more advanced features specific to a certain data type. Dates are a good example: if they are saved using the specific date type, range queries over a certain period of time can be optimized.

  • Fields: These are an essential part of this file. Every field should declare a unique name and associate it with one of the types defined previously. It's important to understand that not every instance of a Solr document must have a value for every field; when mandatory, a field can be simply marked as required. This approach is very flexible; we index only the actual data values without introducing dummy empty fields when a value is not present.
  • copyField: This is used when the content of a source field needs to be copied and indexed into some other destination field (usually as a melting pot for very general searches). The idea behind it is that we want to be able to search in all the fields of a document at the same time (the source will be defined by the wildcard *). The simplest way to do this is by copying the values into a default field where we will perform the actual searches. This field will also have its own type and analysis defined.
  • dynamicField: By using this element we can start indexing some data without having to define the name of the field in advance. The name will be defined by a wildcard, and it's possible to use prefixes and postfixes so that the actual name of a field is accepted at runtime, while the type must still be defined. For example, when writing <dynamicField name='*_s' type='string' /> we can post new documents containing string values such as firstName_s='Alfredo' and surname_s='Serafini'. This is ideal for prototypes, as we can work with the Solr API without defining a final schema for our data.
  • uniqueKey: This is used to give a unique identity to a specific Solr document. It is a concept similar to a primary key for DBMS.
  • defaultSearchField: This field is used when a query does not specify a field to search on. The best configuration for this is generally the field containing all the full-text tokens (for example, the destination of the copyField definition seen earlier).
  • defaultOperator: This is used to choose a default behavior when handling multiple tokens in a search. A query that uses AND between the various words of a search is intuitively narrowed to a smaller set of documents. So in most cases you will use the OR operator instead, which is less restrictive and more natural for common queries. The AND approach is generally useful, for example, when working with navigation filters or conducting an advanced search on large datasets.
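Following the earlier tip about specific data types, a date type could be declared and used as follows (solr.TrieDateField is one of the stock Solr field classes; the field name published is just an illustration, not part of our example schema):

```xml
<fieldType name='date' class='solr.TrieDateField' />
<field name='published' type='date' />
```

With such a field in place, a range query such as published:[2010-01-01T00:00:00Z TO NOW] can be answered efficiently, which would not be possible if the dates were indexed as plain strings.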

Every field can define the following three important attributes:

  • multiValued: If the value of this attribute is true, a Solr document can contain more than one instance of values for the field. The default value is false.
  • indexed: If it is true, the field is used in the index. Generally we will use only indexed fields, but it can be interesting to leave a field not indexed in certain cases, for example, if we want to save a value without using it for searches.
  • stored: This is used to permanently save the original data value for a field, whether it is indexed (and used for searches) or not. During the indexing phase a field is analyzed, as defined by its type in the schema.xml file, in order to update the index; but its original value is not saved unless we decide to store it.
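These attributes can be combined in a field definition; for instance (the field names here are hypothetical, chosen only to show the combinations):

```xml
<!-- indexed and stored: searchable, and returned in results -->
<field name='title' type='string' indexed='true' stored='true' />
<!-- stored only: kept verbatim in results, but not searchable -->
<field name='internalNote' type='string' indexed='false' stored='true' />
<!-- indexed only: searchable, but never returned in results -->
<field name='fullText' type='string' indexed='true' stored='false' multiValued='true' />
```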

Imagine indexing several different synonyms of the same word using a multivalued word_synonym field, while storing only the original word in a word_original field. When a user searches for the word or one of its synonyms, all the documents produced as output will only contain the word_original field, which is the only one stored.
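A sketch of such a pair of field definitions (the names follow the example above; how the synonym values are produced, for example by an analysis chain, is outside the scope of this fragment):

```xml
<!-- stored but not indexed: only this value appears in the results -->
<field name='word_original' type='string' indexed='false' stored='true' />
<!-- indexed but not stored: searchable by any of its synonym values -->
<field name='word_synonym' type='string' indexed='true' stored='false' multiValued='true' />
```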
