The schema.xml file

Let's finally explore a Solr schema.

Before we continue, find a schema.xml file to follow along. This file belongs in the conf directory for a Solr Core instance configuration. For simple single-core Solr setups, this is the same as a Solr home directory. We suggest looking at configsets/mbtype/conf/schema.xml in the example code distributed with the book, available online. If you are working off of the Solr distribution, you'll find it in example/solr/collection1/conf/schema.xml. The example schema.xml is loaded with useful field types, documentation, and field definitions used for the sample data that comes with Solr.

Tip

We prefer to initialize a Solr configuration by copying the example Solr home directory and liberally modifying it as needed, ripping out or commenting out what we don't need (which is often a lot). This is halfway between starting with nothing and starting with the example and making only essential modifications. If you do start with Solr's example configuration, be sure to revisit your configuration at some point to clean out what you aren't using. It's also tempting to keep the existing documentation comments, but you can always refer back to what comes with Solr as needed, so keep your config file clean.

At the start of the file is the schema opening element:

<schema name="musicbrainz" version="1.5">

We've set the name of this schema to musicbrainz, the name of our application. If we used several schema files, we would give each one a different name to tell them apart.

Field definitions

The definitions of the fields in the schema are located within the <fields/> element. There are many attributes that can be added to configure them, but here are the most important ones:

  • name (required): This uniquely identifies the field. There aren't any restrictions on the characters used nor any words to avoid, except for score.
  • type (required): This is a reference to one of the field types defined in the schema.
  • indexed: This indicates that this field can be searched, sorted, and used in a variety of other Solr features. It defaults to true since the only thing you can do with a nonindexed field is return it in search results, assuming it's marked as stored.
  • stored: This indicates that the field's value will be stored in Solr so that it can be returned in search results verbatim or highlighted for matching query text. By default, fields are stored. Sometimes the same data is copied into multiple fields that are indexed differently (which you'll begin to understand soon), and so the redundant fields should not be marked as stored. As of Solr 4.1, the stored data is internally compressed to save space, and perhaps surprisingly, to improve search performance too.
  • multiValued: Enable this if a field can contain more than one value. Order is maintained from that supplied at index-time. It's sloppy to have this enabled if the field never has multiple values as some aspects of Solr like faceting are forced to choose less efficient algorithms unnecessarily.
  • default: This is the default value, if an input document doesn't specify it. A common use of this is to timestamp documents: <field name="indexedAt" type="tdate" default="NOW/SECOND" />. For information on specifying dates, see the Date math section in Chapter 5, Searching.
  • required: Set this to true if you want Solr to fail to index a document that does not have a value for this field.
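
As a rough sketch of how these attributes combine, consider the following declarations. The field names (sku, tags, body_search) are hypothetical and exist only for illustration; the string and text_general field types are assumed to be defined as in Solr's example schema.

```xml
<!-- Hypothetical fields for illustration only -->
<!-- Required identifier: indexing fails if a document omits it -->
<field name="sku" type="string" indexed="true" stored="true" required="true" />
<!-- May hold several values per document; the supplied order is kept -->
<field name="tags" type="string" indexed="true" stored="true" multiValued="true" />
<!-- Searchable but never returned, for example a redundant copy indexed differently -->
<field name="body_search" type="text_general" indexed="true" stored="false" />
```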

There are other attributes too that are more advanced; we'll get to them in a bit rather than distract you with them now.

Dynamic field definitions

The very notion of a dynamic field definition highlights the flexibility of Lucene's index, as compared to typical relational database technology. Not only can you explicitly name fields in the schema, but you can also have some defined on the fly based on the name supplied for indexing. Solr's example schema contains some examples of this, as follows:

<dynamicField name="*_dt" type="date" indexed="true"  stored="true"/>

If at index time a document contains a field that isn't matched by an explicit field definition, but does have a name matching this pattern (that is, ends with _dt, such as updated_dt), then it gets processed according to that definition. A dynamic field is declared just like a regular field in the same section. However, the element is named dynamicField, and it has a name attribute that must either start or end with an asterisk (the wildcard). It can also be just *, which is the final fallback.

Tip

The * fallback is most useful if you decide that indexing should succeed for every field a document supplies, even fields you didn't know about when you designed the schema. It's also useful if you decide that such unknown fields should simply be ignored (that is, not indexed and not stored) rather than treated as an error.
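
One way to get the ignore behavior is sketched below. It is modeled on what Solr's example schema ships with (where the dynamicField line is commented out); treat the exact attribute choices as an assumption to verify against your Solr version.

```xml
<!-- Field types section: a type that neither indexes nor stores anything -->
<fieldType name="ignored" class="solr.StrField" indexed="false" stored="false" multiValued="true" />

<!-- Fields section: final fallback, so any field not matched by another
     definition is silently dropped instead of causing an error -->
<dynamicField name="*" type="ignored" multiValued="true" />
```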

In the end, a field is a field, whether explicitly defined or defined dynamically according to a name pattern. Dynamic field definitions are just a convenience that makes defining schemas easier. There are no performance implications of using dynamic field definitions.

Advanced field options for indexed fields

There are additional attributes that can be added to fields marked as indexed to further configure them. These options are all set to false by default:

  • sortMissingFirst, sortMissingLast: Sorting on a field with one of these set to true indicates on which side of the search results to put documents that have no data for the specified field, regardless of the sort direction. The default behavior for such documents is to appear first for ascending and last for descending.
  • omitNorms: Enable this if you don't want the length of a field to affect its scoring (see Chapter 6, Search Relevancy), or if the field isn't used in the score in any way (such as a field used only for faceting), provided that you aren't doing index-time document boosting (see Chapter 4, Indexing Data). Aside from its effect on scores, it saves a little memory too. It defaults to true for primitive field types, such as int, float, boolean, string, and so on.
  • omitPositions: This omits the term position information from the index to save a little space. Phrase queries won't work anymore.
  • omitTermFreqAndPositions: This omits term frequency and term positions from the index to save a little space. Phrase queries won't work and scores will be less effective.
  • termVectors: This will tell Lucene to store information that is used in a few cases to improve search performance. If a field is to be used by the MoreLikeThis feature, or for highlighting of a large text field, then try enabling this. It can substantially increase the index size and indexing time, so do a before-and-after measurement. There are two more options, which add more data to term vectors: termPositions and termOffsets. The FastVectorHighlighter class requires these.
  • positionIncrementGap: For a multiValued field, this is the number of (virtual) nonexistent words between each value to prevent inadvertent phrase queries matching across field values. For example, if A and B are given as two values for a field, positionIncrementGap of more than 1 prevents the phrase query "A B" from matching.

    Tip

    There is a helpful table on Solr's wiki at http://wiki.apache.org/solr/FieldOptionsByUseCase, which shows most of the options with some use cases that need them.
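
A few of these options in use might look like the following sketch. The field names a_bio, a_tagline, and a_rank are hypothetical, and the text_general and tint types are assumed to be defined as in Solr's example schema.

```xml
<!-- Hypothetical fields for illustration only -->
<!-- Large text field: term vectors with positions and offsets, as needed
     by FastVectorHighlighter and helpful for MoreLikeThis -->
<field name="a_bio" type="text_general" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true" />

<!-- Short text whose length should not influence the score -->
<field name="a_tagline" type="text_general" indexed="true" stored="true"
       omitNorms="true" />

<!-- Documents with no rank appear last for either sort direction -->
<field name="a_rank" type="tint" indexed="true" stored="true"
       sortMissingLast="true" />
```

Since term vectors can grow the index noticeably, it is worth measuring index size before and after enabling them on a field like a_bio.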

Solr 4.2 introduced a new advanced schema option called DocValues, which is enabled with the docValues attribute.

The docValues attribute is a Boolean that, when enabled, causes Lucene to store the values for this field in an additional way that can be initialized faster than un-inverting indexed data. This helps when the field is used for match-only semantics such as term, wildcard, and range queries, and also for faceting, sorting, and other use cases.

DocValues help optimize Solr for meeting real-time search requirements. Unless you have such requirements or you know what you're doing, don't enable DocValues as it uses more disk and the features that use it tend to work slower than without it after it's initialized (as of Solr 4.2). For more information on DocValues, read https://cwiki.apache.org/confluence/display/solr/DocValues.
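
If you do decide to try DocValues, enabling it is a one-attribute change, as in this sketch. The field name a_popularity is hypothetical, and the int type is assumed to come from Solr's example schema; the default value is included only so that every document has a value for the field.

```xml
<!-- Hypothetical field for illustration only -->
<!-- docValues="true" asks Lucene to keep an extra per-document copy of the
     values; default="0" gives documents without a value a fallback -->
<field name="a_popularity" type="int" indexed="true" stored="true"
       docValues="true" default="0" />
```

Changing this attribute on an existing field requires re-indexing, so measure disk use and query speed on a test index first.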

The unique key

After the <fields> declarations in the schema, we can have the <uniqueKey> declaration specifying which field uniquely identifies each document, if any. This is what we have in our MusicBrainz schema:

<uniqueKey>id</uniqueKey>

Although it is technically not always required, you should define a unique ID field. In our MusicBrainz schema, the ID is a string that includes an entity type prefix so that it's unique across the whole corpus, spanning multiple Solr Cores, for example, Artist:11650.

If your source data does not have an ID field that you can propagate, then you may want to consider using a Universally Unique Identifier (UUID), according to RFC-4122. Simply have a field with a field type for the class solr.UUIDField and either provide a UUID to Solr or use UUIDUpdateProcessorFactory, an update processor that adds a newly generated UUID value to any document being added that does not already have a value in the specified field. Solr's UUID support is based on java.util.UUID.
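
A minimal sketch of the UUID approach follows. It spans two files, and the chain name uuid is an arbitrary choice; how the chain gets selected (for example, through the update.chain request parameter) depends on your solrconfig.xml and is assumed here.

```xml
<!-- schema.xml: a UUID field type and a unique key field that uses it -->
<fieldType name="uuid" class="solr.UUIDField" indexed="true" />
<field name="id" type="uuid" indexed="true" stored="true" required="true" />
<uniqueKey>id</uniqueKey>

<!-- solrconfig.xml: generate a UUID for documents that arrive without an id -->
<updateRequestProcessorChain name="uuid">
  <processor class="solr.UUIDUpdateProcessorFactory">
    <str name="fieldName">id</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```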

The default search field and query operator

There are a couple of schema configuration elements pertaining to search defaults when interpreting a query string:

<!-- <defaultSearchField>text</defaultSearchField>
<solrQueryParser defaultOperator="AND"/> -->

The defaultSearchField element declares the field that will be searched for queries that don't explicitly reference one. The solrQueryParser element has a defaultOperator attribute, which lets you specify the default search operator here in the schema (that is, AND or OR; it is OR if unspecified). These are essentially defaults for searches that are processed by Solr request handlers defined in solrconfig.xml.

Tip

We strongly recommend that you leave these commented out in the schema, which is how they come in the example. It's tempting to set them, but doing so further disperses the configuration relevant to interpreting a query, which is already split between the URL and the request handler definition. Their effects are global and may have unintended consequences on queries you didn't have in mind, such as a delete query. Instead, configure the query parser's defaults in a request handler in solrconfig.xml, as documented in Chapter 5, Searching.
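
For comparison, the equivalent defaults set on a request handler might look like this sketch. The df and q.op request parameters are Solr's per-request counterparts of the two schema settings; the handler name /select and the field name text are assumptions borrowed from Solr's example configuration.

```xml
<!-- solrconfig.xml: search defaults scoped to one request handler -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- default search field, instead of <defaultSearchField> -->
    <str name="df">text</str>
    <!-- default operator, instead of <solrQueryParser defaultOperator="..."/> -->
    <str name="q.op">AND</str>
  </lst>
</requestHandler>
```

Because these are only defaults, an individual request can still override either one in its URL parameters.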

Copying fields

Closely related to the field definitions are copyField directives. A copyField directive copies one or more input field values to another during indexing. A copyField directive looks like this:

<copyField source="r_name" dest="r_name_sort" maxChars="20" />

This directive is useful when a value needs to be copied to additional field(s) to be indexed differently. For example, sorting and faceting require a single indexed value. Another is a common technique in search systems in which many fields are copied to a common field that is indexed without norms and not stored. This permits searches, which would otherwise search many fields, to search one instead, thereby drastically improving performance at the expense of reducing score quality. This technique is usually complemented by searching some additional fields with higher boosts. The dismax/edismax query parser, which is described in Chapter 5, Searching, makes this easy.

At index-time, each supplied field of input data has its name compared against the source attribute of all copyField directives. The source attribute might include an * wildcard, so it's possible that the input might match more than one copyField. If a wildcard is used in the destination, then it must refer to a dynamic field, and furthermore the source must include a wildcard too; otherwise, a wildcard in the destination is an error. A match against a copyField directive has the effect of the input value being duplicated, but using the field name of the dest attribute of the directive. If maxChars is specified, the copy is truncated to that many characters. The duplicate does not replace any existing values that might be going to the field, so be sure to mark the destination field as multiValued, if needed.
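
The catch-all technique and the wildcard rules can be sketched as follows. The destination field text and the *_prefix dynamic field are hypothetical additions for illustration, and the text_general and string types are assumed to be defined as in Solr's example schema.

```xml
<!-- Hypothetical catch-all field: indexed without norms, not stored, and
     multiValued because several source fields feed into it -->
<field name="text" type="text_general" indexed="true" stored="false"
       omitNorms="true" multiValued="true" />

<!-- Wildcard source, fixed destination: a_name, r_name, t_name, and so on -->
<copyField source="*_name" dest="text" />

<!-- Wildcard destination: allowed only because the source has a wildcard too
     and the destination pattern matches a dynamic field -->
<dynamicField name="*_prefix" type="string" indexed="true" stored="false" />
<copyField source="*_name" dest="*_prefix" maxChars="20" />
```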

Note

<copyField> is a fundamental and very powerful concept of Solr, which is used more often than not to ensure that data is indexed into several fields based on the type of processing required on them at search time, without needing to include the data in the update command multiple times.

Our MusicBrainz field definitions

What follows is a first cut of our MusicBrainz schema definition. There are additional fields that will be added in other chapters to explore other search features. This is a combined schema defining all core entity types: artists, releases (also known as albums), and tracks. This approach was described earlier in the chapter. Notice that we chose to prefix each field name with a character representing the entity type it belongs to (a_, r_, t_), to avoid overloading the use of any field across entity types.

We also used this abbreviation when we denormalized relationships such as in r_a_name (a release's artist's name).

<!-- COMMON TO ALL TYPES: -->
<field name="id" type="string" required="true" /><!-- Artist:11650 -->
<field name="type" type="string" required="true" /><!-- Artist | Release | Label -->
<field name="indexedAt" type="tdate" default="NOW/SECOND" />


<!-- ARTIST: -->
<field name="a_name" type="title" /><!-- The Smashing Pumpkins -->
<field name="a_name_sort" type="string" stored="false" /><!-- Smashing Pumpkins, The -->
<field name="a_alias" type="title" stored="false" multiValued="true" />
<field name="a_type" type="string" /><!-- group | person -->
<field name="a_begin_date" type="tdate" />
<field name="a_end_date" type="tdate" />
<field name="a_member_name" type="title" multiValued="true" /><!-- Billy Corgan -->
<field name="a_member_id" type="long" multiValued="true" /><!-- 102693 -->

<!-- RELEASE -->
<field name="r_name" type="title" /><!-- Siamese Dream -->
<field name="r_name_sort" type="string" stored="false" /><!-- Siamese Dream -->
<field name="r_a_name" type="title" /><!-- The Smashing Pumpkins -->
<field name="r_a_id" type="long" /><!-- 11650 -->
<field name="r_attributes" type="int" indexed="false" multiValued="true" /><!-- ex: 0, 1, 100 -->
<field name="r_type" type="rType" stored="false" multiValued="true" /><!-- Album | Single | EP |... etc. -->
<field name="r_official" type="rOfficial" stored="false" multiValued="true" /><!-- Official | Bootleg | Promotional -->
<field name="r_lang" type="string" indexed="false" /><!-- eng / latn -->
<field name="r_tracks" type="int" indexed="false" />
<field name="r_event_country" type="string" multiValued="true" /><!-- us -->
<field name="r_event_date" type="tdate" multiValued="true" />



<!-- TRACK -->
<field name="t_name" type="title" /><!-- Cherub Rock -->
<field name="t_num" type="int" indexed="false" /><!-- 1 -->
<field name="t_duration" type="int"/><!-- 298133 -->
<field name="t_a_id" type="long" /><!-- 11650 -->
<field name="t_a_name" type="title" /><!-- The Smashing Pumpkins -->
<field name="t_r_name" type="title" /><!-- Siamese Dream -->
<field name="t_r_tracks" type="int" indexed="false" /><!-- 13 -->

Tip

Put some sample data in your schema comments

You'll find the sample data helpful and anyone else working on your project will thank you for it! In the preceding examples, we sometimes use actual values, and on other occasions, we list several possible values separated by |, if there is a predefined list.

Also, note that the only fields that we can mark as required are those common to all entity types, which are id and type, because we're doing a combined schema approach.

In our schema, we're choosing to index most of the fields, even though MusicBrainz's search doesn't require more than the name of each entity type. We're doing this to make the schema more interesting and to demonstrate more of Solr's capabilities. As it turns out, some of the other information in MusicBrainz's query results actually is queryable if you use the advanced search form, check use advanced query syntax, and reference those fields in your query (for example, artist:"Smashing Pumpkins").

Tip

At the time of this writing, MusicBrainz used Lucene for its text search, and so it uses Lucene's query syntax; see http://wiki.musicbrainz.org/TextSearchSyntax. You'll learn more about the syntax in Chapter 5, Searching.

Defining field types

The latter half of the schema is the definition of field types. This section is enclosed in the <types/> element and will consume much of the file's content. The field types declare the types of fields, such as booleans, numbers, dates, and various text flavors. They are referenced by the field definitions under the <fields/> element. Here is the field type for a Boolean:

<fieldType name="boolean" class="solr.BoolField" sortMissingLast="true" />

A field type has a unique name and is implemented by a Java class specified by the class attribute.

Tip

A fully qualified classname in Java looks like org.apache.solr.schema.BoolField. The last piece is the simple name of the class, and the part preceding it is called the package name. In order to make configuration files in Solr more concise, the package name is abbreviated to just solr for most of Solr's packages.

Attributes other than name and class represent configuration options; most are applicable to all types, such as those listed earlier, and some are specific to the implementing class. They can usually be overridden at the field declaration too. In addition to these attributes, there is also the text analysis configuration, which is only applicable to text fields. This will be covered in the next chapter.
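
As a small sketch of such an override, the string type below sets sortMissingLast for every field that uses it, while one field opts out. The a_type field comes from our schema, but overriding the option on it is purely illustrative.

```xml
<!-- Field type: option set once for all fields of this type -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true" />

<!-- Field: overrides the option inherited from its type (illustrative only) -->
<field name="a_type" type="string" sortMissingLast="false" />
```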

Note

Starting from Solr 4.8, both <fields> and <types> tags have been deprecated and they might be removed completely in the future versions. These tags can be safely removed from the schema file, which allows intermixing of <fieldType>, <field>, and <copyField> definitions, if desired.

Built-in field type classes

There are a number of built-in field types, and nearly all are present and documented to some extent in Solr's example schema. We're not going to enumerate all of them here; instead, we will highlight some of the ones that are worthy of more explanation.

Numbers and dates

There are no fewer than five different field types to use to store an integer, perhaps six if you want to count string! It's about the same for float, double, long, and date. And to think that you probably initially thought this technology only did text! We'll explain when to use which, using integers as an example; most have an analogously named counterpart for the other numeric and date types. The field types with names starting with "Trie" should serve 95 percent of your needs. To clean up your schema, consider deleting the others. The following is the list of integer field types that you can use:

  • TrieIntField (with precisionStep = 0), commonly named int. This is a good default field suitable for most uses, such as an ID field.
  • TrieIntField (with precisionStep > 0), commonly named tint. If you expect to do numeric range queries (which include faceted ranges) over many values, then this field type has unbeatable performance at query time at the expense of a little more disk space and indexing time. The default value configured in Solr's example schema is 8 for numeric types and 6 for dates; we recommend keeping these defaults. Smaller numbers (but > 0) will increase index size and indexing time in exchange for better range query performance, although the gains rapidly diminish with each step.
  • IntField, commonly named pint. A legacy implementation that encodes integer values as simple strings. The values are evaluated in Unicode string order instead of numeric order. This field type will be removed in future versions; use TrieIntField instead.
  • SortableIntField, commonly named sint. DateField doesn't follow this naming convention but it also qualifies here. Both SortableIntField and DateField will be removed in future versions; use TrieIntField and TrieDateField instead.

All of these numeric types sort in their natural numeric order instead of lexicographically.
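
For reference, the Trie-based declarations look roughly like this in Solr's example schema. Check the copy that ships with your Solr version, as the exact attributes may differ slightly.

```xml
<!-- precisionStep="0": a single indexed term per value; good general default -->
<fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0" />

<!-- precisionStep="8": extra terms per value for faster range queries -->
<fieldType name="tint" class="solr.TrieIntField" precisionStep="8" positionIncrementGap="0" />

<!-- Dates use a precisionStep of 6 in the example schema -->
<fieldType name="tdate" class="solr.TrieDateField" precisionStep="6" positionIncrementGap="0" />
```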

Some other field types

Solr's geospatial support spans multiple parts of Solr from field types to query parsers, to function queries. Instead of having you read relevant parts of three chapters, we've consolidated it into the last part of Chapter 5, Searching.

CurrencyField (commonly named currency) lets Solr perform currency conversions at query time using configured exchange rates, and we can also plug in custom implementations of exchange rate providers. The following is the default configuration for the currency fieldType, with defaultCurrency="USD" and currencyConfig="currency.xml", the file in which we list the currencies and their exchange rates.

<fieldType name="currency" class="solr.CurrencyField" precisionStep="8" defaultCurrency="USD" currencyConfig="currency.xml" />

We will discuss this field in more detail in Chapter 4, Indexing Data, and the query-time features supported by currency fields in Chapter 5, Searching. Alternatively, you can read more on the Solr wiki at http://wiki.apache.org/solr/CurrencyField.
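
To give a feel for the file referenced by currencyConfig, here is a sketch of a currency.xml in the layout used by Solr's example configuration. The rates are made-up placeholder values, not real quotes.

```xml
<!-- conf/currency.xml: exchange rates consulted at query time -->
<currencyConfig version="1.0">
  <rates>
    <!-- Placeholder rates for illustration only -->
    <rate from="USD" to="EUR" rate="0.90" comment="placeholder, not a real quote" />
    <rate from="USD" to="GBP" rate="0.80" comment="placeholder, not a real quote" />
  </rates>
</currencyConfig>
```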

ExternalFileField (advanced) reads its float values from a plain text file instead of the index. It was designed for sorting or influencing scores of documents based on data that might change quickly (for example, a rating or click-through) without having to re-index a document. Remember that Lucene fundamentally cannot update just a single field; entire documents need to be re-indexed. This field type is a workaround for this limitation for the aforementioned use cases. This is discussed further in Chapter 6, Search Relevancy.

EnumField allows us to define a field with a closed set of values whose sort order is predetermined rather than lexicographic. Along with the name and class parameters, which are common to all field types, we need to provide two additional parameters:

  • enumsConfig: This is the name of the configuration file (in XML format). By default, the file is located in the conf directory for the collection. The file should contain the <enum/> list of field values. You can always add new values to the end of the list, but you can't change the order or the existing values in the enum without re-indexing.
  • enumName: This is the specific enumeration in the enumsConfig file to use for the field type.

You can read more about the EnumField at https://cwiki.apache.org/confluence/display/solr/Working+with+Enum+Fields.
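
Putting the two parameters together, a configuration might look like the following sketch. The type name releaseStatus, the enumeration name status, and the filename enumsConfig.xml are all hypothetical choices for illustration.

```xml
<!-- schema.xml: field type bound to one enumeration in the config file -->
<fieldType name="releaseStatus" class="solr.EnumField"
           enumsConfig="enumsConfig.xml" enumName="status" />

<!-- conf/enumsConfig.xml: values sort in the order they are listed here -->
<enumsConfig>
  <enum name="status">
    <value>Official</value>
    <value>Promotional</value>
    <value>Bootleg</value>
  </enum>
</enumsConfig>
```

With this ordering, a sort on a field of this type would place Official before Promotional and Bootleg, regardless of alphabetical order.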

Note

Starting from Solr 4.3, we can configure the ManagedIndexSchemaFactory in the solrconfig.xml file, which enables schema modifications through a REST interface, also known as the Schema API. You can read more on the wiki page at https://cwiki.apache.org/confluence/display/solr/Schema+API.
