The following discussion concerns how to manage searching over different types of data, such as artists and releases from MusicBrainz. In the MusicBrainz configuration example, each document type gets its own index, but all of the indices share the same configuration. Although we wouldn't generally recommend it, we took this approach for convenience and to reduce complexity for this book, at the expense of a one-size-fits-all schema and configuration.
A Solr server hosts one or more Solr Cores. A Solr Core is an instance of Solr that includes its own configuration and index; the word "core" is sometimes used synonymously with "index". Even if you have only one type of data to search in an application, you might still use multiple cores (with the same configuration) and shard the data for scaling. Managing Solr Cores is discussed further in Chapter 11, Deployment.
A combined index might also be called an aggregate index. As mentioned in the first chapter, an index is conceptually like a single-table relational database schema, thus sharing similarities with some NoSQL (non-relational) databases. In spite of this limitation, there is nothing to stop you from putting different types of data (say, artists and releases from MusicBrainz) into a single index. All you have to do is use different fields for the different document types, and use a field to discriminate between the types. The identifier field must be unique across all documents in this index, no matter what the type is. You could easily ensure this by concatenating the document's type and the entity's identifier. This may appear ugly from a relational database design standpoint, but this isn't a database! More importantly, unlike a database, there is no overhead whatsoever for some documents to not populate some fields. This is where the spreadsheet metaphor can break down, because a blank cell in a spreadsheet takes up space, but not in Solr's index.
Here's a sample schema.xml snippet of the fields for a single combined index approach:
<field name="id" ... /><!-- example: "artist:534445" -->
<field name="type" ... /><!-- example: "artist", "track", "release",... -->
<field name="name" ... /><!-- (common to various types) -->
<!-- track fields: -->
<field name="PUID" ... />
<field name="num" ... /><!-- i.e. the track # on the release -->
<!-- etc. -->
<!-- artist fields: -->
<field name="startDate" ... /><!-- date of first release -->
<field name="endDate" ... /><!-- date of last release -->
<field name="homeCountry" ... />
<!-- etc. -->
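The "type:identifier" convention shown in the id comment above could be produced at indexing time along the following lines. This is only a minimal sketch: the helper names make_doc_id and split_doc_id and the sample documents are hypothetical, and the colon separator is simply the one used in the example id.

```python
def make_doc_id(doc_type, entity_id):
    """Build an index-wide unique ID such as "artist:534445"."""
    if ":" in doc_type:
        # Keep the separator unambiguous so the ID can be split again later.
        raise ValueError("doc_type must not contain ':'")
    return f"{doc_type}:{entity_id}"


def split_doc_id(doc_id):
    """Recover (doc_type, entity_id) from a composite ID."""
    doc_type, _, entity_id = doc_id.partition(":")
    return doc_type, entity_id


# Hypothetical documents: the same source ID in two tables no longer collides.
artist_doc = {"id": make_doc_id("artist", 534445), "type": "artist", "name": "Example Artist"}
track_doc = {"id": make_doc_id("track", 534445), "type": "track", "name": "Example Track"}
```

Storing the type both inside the id and in its own type field is redundant on purpose: the id guarantees uniqueness, while the type field is what queries would filter on.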
A combined index has the advantage of being easier to maintain, since it is just one configuration. It is also easier to do a search over multiple document types at once, since this will naturally occur, assuming you search on all the relevant fields. For these reasons, it is a good approach to start off with. However, consider the shortcomings to be described shortly.
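As a rough illustration of querying a combined index, the sketch below builds the request parameters for a search across all types and for a search restricted to one type. It assumes the discriminator field is named type, as in the sample schema, and uses Solr's standard q and fq parameters; the function name combined_query is hypothetical.

```python
from urllib.parse import urlencode


def combined_query(text, doc_type=None):
    """Return a URL-encoded query string for a combined index.

    With doc_type=None the search spans every document type;
    otherwise a filter query on the assumed "type" field narrows it.
    """
    params = [("q", text)]
    if doc_type is not None:
        params.append(("fq", f"type:{doc_type}"))
    return urlencode(params)


all_types = combined_query("smashing pumpkins")
artists_only = combined_query("smashing pumpkins", "artist")
```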
For the book, we've taken a hybrid approach in which there are separate Solr Cores (indices) for each MusicBrainz data type, but they all share the same configuration, including the schema.
Although a combined index is more convenient to set up, there are some problems that you may face:
Field names for different types can collide unless you prefix them by type, for example artist_startDate and track_PUID. In the example that we just saw, most entity types have a name. Therefore, it's straightforward for all of them to have this common field. If the type of the fields were different, then you would be forced to name them differently.
If you share a field across different entity types, such as the name field in the example that we just saw, then there are some problems that can occur when using that field in a query and while filtering documents by document type. These caveats do not apply when searching across all documents.
For separate indices, you simply develop your schemas independently. Alternatively, you can use a combined schema as previously described for all of your cores so that you don't have to manage them separately. It's not an approach for the purist, but it is convenient, and it is also what we've done for the book's example code. The rest of the discussion here assumes that the schemas are independent.
To share the same schema field type definitions (described in the following sections) across your schemas without having to keep them in sync, use the XInclude feature. XInclude is described in Chapter 11, Deployment.
If you do develop separate schemas and need to search across your indices in one search, then you must perform a distributed search, described in Chapter 10, Scaling Solr. A distributed search is usually a feature employed for a large corpus, but it applies here too. Be sure to read more about it before using it, as there are some limitations. As with the combined schema, you will need a unique ID across all documents, and you will want a type field to differentiate documents in your search results. You don't need commonly named fields to search on, since the query will be processed at each core using the configuration there to determine, for example, what the default search field is.
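A distributed search across separate cores is requested by passing Solr's shards parameter, a comma-separated list of core addresses given without the URL scheme. The sketch below builds such a request URL and groups returned documents by their type field. The host, port, and core names (artists, releases) are assumptions for illustration, as are the function names.

```python
from urllib.parse import urlencode


def distributed_query_url(base_url, shards, text):
    """Build a select URL that fans the query out to every listed shard."""
    params = {"q": text, "shards": ",".join(shards)}
    # Leave ':', '/' and ',' unescaped so the shard list stays readable.
    return f"{base_url}/select?{urlencode(params, safe=':/,')}"


def group_by_type(docs):
    """Group result documents by their "type" field."""
    grouped = {}
    for doc in docs:
        grouped.setdefault(doc["type"], []).append(doc)
    return grouped


# Hypothetical cores on a local Solr server.
url = distributed_query_url(
    "http://localhost:8983/solr/artists",
    ["localhost:8983/solr/artists", "localhost:8983/solr/releases"],
    "pumpkins",
)
```

The core that receives the request merges the per-shard results, which is why the unique ID must hold across all cores and not just within each one.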
You can't go wrong with multiple indices (Solr Cores); it's just a bit more to manage. And just because you have multiple indices doesn't preclude sharing as much of the configuration (including the schema) as you want to among the cores. Chapter 11, Deployment, will discuss configuring the cores including sharing them and parameterizing them.