One combined index or separate indices

The following discussion concerns how to manage searching over different types of data, such as artists and releases from MusicBrainz. In the MusicBrainz configuration example, each type of document gets its own index, but all of the indices share the same configuration. Although we wouldn't generally recommend it, we took this approach for convenience and to reduce complexity for this book, at the expense of a one-size-fits-all schema and configuration.

Tip

A Solr server hosts one or more Solr Cores. A Solr Core is an instance of Solr that includes its own configuration and index; the word "core" is sometimes used synonymously with "index". Even if your application has only one type of data to search, you might still use multiple cores (with the same configuration) and shard the data for scaling. Managing Solr Cores is discussed further in Chapter 11, Deployment.

One combined index

A combined index might also be called an aggregate index. As mentioned in the first chapter, an index is conceptually like a single-table relational database schema, and in that respect it resembles some NoSQL (non-relational) databases. In spite of this limitation, there is nothing to stop you from putting different types of data (say, artists and releases from MusicBrainz) into a single index. All you have to do is use different fields for the different document types, and use a field to discriminate between the types. The identifier field needs to be unique across all documents in this index, no matter what the type is. You could easily achieve this by concatenating the document type and the entity's identifier. This may appear ugly from a relational database design standpoint, but this isn't a database! More importantly, unlike a database, there is no overhead whatsoever for documents that do not populate some fields. This is where the spreadsheet metaphor can break down, because a blank cell in a spreadsheet takes up space, but an unpopulated field in Solr's index does not.

Here's a sample schema.xml snippet of the fields for a single combined index approach:

<field name="id" ... /><!-- example:  "artist:534445"  -->
<field name="type" ... /><!-- example: "artist", "track", "release",... -->
<field name="name" ... /><!-- (common to various types) -->

<!-- track fields: -->
<field name="PUID" ... />
<field name="num" ... /><!-- i.e. the track # on the release -->
<!-- etc. -->
<!-- artist fields: -->
<field name="startDate" ... /><!-- date of first release -->
<field name="endDate" ... /><!-- date of last release -->
<field name="homeCountry" ... />
<!-- etc. -->
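
To make the identifier and type conventions concrete, here is a minimal Python sketch of how documents for such a combined index might be assembled before being sent to Solr. The helper names (make_id, make_doc) and the sample values are hypothetical and are not part of the book's example code; only the field names come from the snippet above.

```python
# Sketch only: helper names and sample values are illustrative assumptions.

def make_id(doc_type, entity_id):
    """Build an index-wide unique ID by concatenating type and entity ID."""
    return f"{doc_type}:{entity_id}"


def make_doc(doc_type, entity_id, name, **type_fields):
    """Return a dict holding the common fields plus any type-specific fields."""
    doc = {"id": make_id(doc_type, entity_id), "type": doc_type, "name": name}
    doc.update(type_fields)  # fields that do not apply are simply left out
    return doc


# Two entities that happen to share the same source identifier.
artist = make_doc("artist", 534445, "Example Artist", homeCountry="GB")
track = make_doc("track", 534445, "Example Track", num=3)

print(artist["id"])
print(track["id"])
```

Because the type is part of the ID, the artist and the track do not collide even though their source identifiers are equal, and each document carries only the fields that apply to its type.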

Tip

A combined index has the advantage of being easier to maintain, since it is just one configuration. It is also easier to do a search over multiple document types at once, since this will naturally occur, assuming you search on all the relevant fields. For these reasons, it is a good approach to start off with. However, consider the shortcomings to be described shortly.

For the book, we've taken a hybrid approach in which there are separate Solr Cores (indices) for each MusicBrainz data type, but they all share the same configuration, including the schema.

Problems with using a single combined index

Although a combined index is more convenient to set up, there are some problems that you may face:

  • There may be namespace collision problems unless you prefix the field names by type such as: artist_startDate and track_PUID. In the example that we just saw, most entity types have a name. Therefore, it's straightforward for all of them to have this common field. If the type of the fields were different, then you would be forced to name them differently.
  • If you share the same field for different entities such as the name field in the example that we just saw, then there are some problems that can occur when using that field in a query and while filtering documents by document type. These caveats do not apply when searching across all documents.
  • You will get scores that are of lesser quality due to suboptimal document frequency and total document count values, both of which are components of the IDF part of the score. The document frequency is simply the number of documents in which a queried term exists for a specific field. If you put different types of things into the same field, then a word that is rare among track names might not be rare among artist names. The total document count ends up being inflated instead of being limited to a specific document type (although this problem isn't as bad as the suboptimal document frequency). Scoring is described further in Chapter 6, Search Relevancy.
  • Prefix, wildcard, and fuzzy queries will take longer. If you share a field with different types of documents, then the total number of terms to be searched is going to be larger, which takes longer for these query types.
  • For a large number of documents, a strategy using multiple indices will prove to be more scalable. Only testing will indicate what "large" is for your data and your queries, but an index of fewer than a million documents is unlikely to benefit from being split into multiple indices. Once you have tens of millions of documents, you should consider multiple indices. There are many factors involved, so take these numbers as rough guidelines.
  • Committing changes to a Solr index invalidates the caches used to speed up querying, and these get rebuilt during the warming phase of a commit. If this happens often, and the changes are usually due to one type of entity in the index, then you will get better performance by using separate indices.
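
The scoring problem in the list above can be illustrated with a little arithmetic. The following Python sketch assumes Lucene's classic TF-IDF formula, idf = 1 + ln(numDocs / (docFreq + 1)), and uses made-up document counts; it only shows the direction of the effect, not real MusicBrainz numbers.

```python
import math


def classic_idf(doc_freq, num_docs):
    """IDF as in Lucene's classic TF-IDF similarity (assumed here)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical counts: the word is rare in track names, common in artist names.
NUM_TRACKS = 1_000_000
NUM_ARTISTS = 100_000
df_in_tracks = 50
df_in_artists = 2_000

# Separate track index: only track documents contribute to the statistics.
separate_idf = classic_idf(df_in_tracks, NUM_TRACKS)

# Combined index with a shared "name" field: both types contribute.
combined_idf = classic_idf(df_in_tracks + df_in_artists,
                           NUM_TRACKS + NUM_ARTISTS)

print(f"separate index idf: {separate_idf:.2f}")
print(f"combined index idf: {combined_idf:.2f}")
```

With these assumed numbers, the shared field makes the word look far more common than it is among tracks, so a track search on it receives a lower IDF than it would in a track-only index.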

Separate indices

For separate indices, you simply develop your schemas independently. Alternatively, you can use a combined schema as previously described, and use it for all of your cores so that you don't have to manage them separately. It's not an approach for the purist, but it is convenient, and it is also what we've done for the book's example code. The rest of the discussion here assumes that the schemas are independent.

Tip

To share the same schema field type definitions (described in the following sections) across your schemas without having to keep them in sync, use the XInclude feature. XInclude is described in Chapter 11, Deployment.

If you do develop separate schemas and need to search across your indices in one search, then you must perform a distributed search, described in Chapter 10, Scaling Solr. A distributed search is usually a feature employed for a large corpus, but it applies here too. Be sure to read more about it before using it, as there are some limitations. As with the combined schema, you will need an ID that is unique across all documents, and you will want a type field to differentiate documents in your search results. You don't need commonly named fields to search on, since the query will be processed at each core using the configuration there to determine, for example, what the default search field is.
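
As a rough illustration of what such a request looks like, the following Python sketch builds a distributed search URL using Solr's shards request parameter, which takes a comma-separated list of host:port/path entries. The host, port, core names, and the helper function are assumptions made for this example; check them against your own deployment.

```python
from urllib.parse import urlencode


def distributed_query_url(core_url, shard_urls, query, rows=10):
    """Build a select URL that asks one core to fan the query out to shards."""
    params = {
        "q": query,
        "shards": ",".join(shard_urls),  # entries carry no http:// prefix
        "rows": rows,
    }
    return f"{core_url}/select?{urlencode(params)}"


# Hypothetical cores, one per MusicBrainz document type.
shards = [
    "localhost:8983/solr/artists",
    "localhost:8983/solr/releases",
]
url = distributed_query_url(
    "http://localhost:8983/solr/artists", shards, "smashing"
)
print(url)
```

The query here names no field, so each core falls back to its own default search field, which is why commonly named fields are not required.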

Tip

You can't go wrong with multiple indices (Solr Cores); it's just a bit more to manage. And just because you have multiple indices doesn't preclude sharing as much of the configuration (including the schema) as you want to among the cores. Chapter 11, Deployment, will discuss configuring the cores including sharing them and parameterizing them.
