The SpellCheck component

One of the best ways to enhance the search experience is by offering spelling corrections. This is sometimes presented at the top of search results with such text as "Did you mean ...". Solr supports this with the SpellCheckComponent.

Note

A related technique is to use fuzzy queries using the tilde syntax. However, fuzzy queries don't tell you what alternative spellings were used; the case is similar for phonetic matching.

For spelling corrections to work, Solr must have a corpus of words (a dictionary) from which to suggest alternatives to those in the user's query. "Dictionary" is meant loosely as a collection of words, not their definitions. Typically, you either configure an appropriately indexed field as the dictionary or supply a plain text file. Solr can be configured to use one or more of the following spellcheckers:

  • DirectSolrSpellChecker: This uses terms from a field in the index. It computes suggestions by working directly off the main index. A configurable distance-measure computes similarities between words. By working off of the main index, this choice is very convenient, especially when getting started. For more performance or more options, choose another.
  • IndexBasedSpellChecker: This uses terms from a field in the index. It builds a parallel index and computes suggestions from that. A configurable distance-measure computes similarities between words.
  • FileBasedSpellChecker: This uses a simple text file of words. It's otherwise essentially the same as IndexBasedSpellChecker.
  • WordBreakSolrSpellChecker: This detects when a user has inadvertently omitted a space between words or added a space within a word. It computes suggestions directly off the main index. It's often used in conjunction with one of the other SpellChecker components.

There is also a Suggester SpellChecker that implements auto-suggest / query completion. That choice is deprecated as of Solr 4.7, which introduced a dedicated SearchComponent for suggestions. We'll describe that feature later in this chapter.

The notion of a parallel index, also known as a side-car index, is simply an additional internal working index for a dedicated purpose. These must be 'built', which takes time, and they can get out of sync with the main index.

Note

Before reading on about configuring spell checking in solrconfig.xml, you may want to jump ahead and take a quick peek at an example towards the end of this section, and then come back.

The schema configuration

Assuming your dictionary is going to be based on indexed content instead of a file, a field should be set aside exclusively for this purpose. This is so that it can be analyzed appropriately and so that other fields can be copied into it, as the spellcheckers use just one field. Most Solr setups would have one field; our MusicBrainz searches, on the other hand, are segmented by the data type (artists, releases, and tracks), and so one for each would be best. For the purposes of demonstrating this feature, we will only do it for artists.

In schema.xml, we need to define the field type for spellchecking. This particular configuration is one we recommend for most scenarios:

<fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100" stored="false" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true"
            expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

A field type for spellchecking is not marked as stored because the spellcheck component only uses the indexed terms. The important thing is to ensure that the text analysis does not perform stemming, as the corrections presented would suggest the stems, which would look very odd to the user for most stemmer algorithms. It's also hard to imagine a use case that doesn't apply lowercasing.

Now, we need to create a field for this data:

<field name="a_spell" type="textSpell" />

And we need to get data into it with some copyField directives:

<copyField source="a_name" dest="a_spell" />
<copyField source="a_alias" dest="a_spell" />

Arguably, a_member_name may be an additional choice to copy as well, as the dismax search we configured (seen in the following code) searches it too, albeit at a reduced score. This, as well as many decisions with search configuration, is subjective.

Configuration in solrconfig.xml

To use any search component, it needs to be in the components list of a request handler. The spellcheck component is not in the standard list, so it needs to be added:

<requestHandler name="/mb_artists" class="solr.SearchHandler">
  <!-- default values for query parameters -->
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">a_name a_alias^0.8 a_member_name^0.4</str>
    <!-- etc. -->
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>

This component should already be defined in solrconfig.xml. Within the spellchecker search component, there are one or more XML blocks named spellchecker, so that different dictionaries and other options can be configured. These might also be loosely referred to as the dictionaries, because the parameter that refers to this choice is named that way (more on that later). We have two spellcheckers configured as follows:

  • a_spell: This is an index-based spellchecker that is a typical recommended configuration using DirectSolrSpellChecker on the a_spell field.
  • file: This is a sample configuration where the input dictionary comes from a file (not included).

A complete MusicBrainz implementation would have a different spellchecker for each MB data type, with all of them configured similarly.

The following excerpt is an example configuration showing the key options available in the spellchecker component:

<searchComponent name="spellcheck"
    class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">textSpell</str><!-- 'q' only -->

  <lst name="spellchecker">
    <str name="name">a_spell</str>
    <str name="field">a_spell</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <str name="distanceMeasure">internal</str>
    <float name="accuracy">0.5</float>
    <int name="maxEdits">1</int>
    <int name="minPrefix">1</int>
    <int name="maxInspections">5</int>
    <int name="minQueryLength">4</int>
    <float name="maxQueryFrequency">0.01</float>
    <float name="thresholdTokenFrequency">.01</float>
  </lst>
  <!-- just an example -->
  <lst name="spellchecker">
    <str name="name">file</str>
    <str name="classname">solr.FileBasedSpellChecker</str>
    <str name="sourceLocation">spellings.txt</str>
    <str name="characterEncoding">UTF-8</str>
  </lst>
</searchComponent>

Configuring spellcheckers – dictionaries

The double layer of spellchecker configuration is perhaps a little confusing. The outer one just names the search component—it's just a container for configuration(s). The inner ones are distinct configurations to choose at search time.

The following options are common to all spellcheckers, unless otherwise specified:

  • name: This refers to the name of the spellcheck configuration. It defaults to default. Be sure not to have more than one configuration with the same name.
  • classname: This refers to the implementation of the spellchecker. It's optional but you should be explicit. The choices are solr.DirectSolrSpellChecker, solr.IndexBasedSpellChecker, solr.WordBreakSolrSpellChecker, and solr.FileBasedSpellChecker. Further information on these is just ahead.
  • accuracy: This sets the minimum spelling correction accuracy to act as a threshold. It falls between 0 and 1 with a default of 0.5. The higher this number is, the simpler the corrections are. The accuracy is computed by the distanceMeasure. This option doesn't apply to WordBreakSolrSpellChecker.
  • distanceMeasure: This Java class computes how similar a possible misspelling and a candidate correction are. It defaults to org.apache.lucene.search.spell.LevensteinDistance, which is the same algorithm used in fuzzy query matching. Alternatively, org.apache.lucene.search.spell.JaroWinklerDistance works quite well. This option doesn't apply to WordBreakSolrSpellChecker.
  • field: This refers to the name of the field within the index that contains the dictionary. It's mandatory except when using FileBasedSpellChecker, where it's not applicable, since its data comes from a file, not an index. The field must be indexed, as the data is taken from the indexed terms and not from the stored content, which is ignored. Generally, this field exists expressly for spell correction purposes and other fields are copied into it.
  • fieldType: This is a reference to a field type in schema.xml to perform text-analysis on words to be spellchecked by the spellcheck.q parameter (not q). If this isn't specified, then it defaults to the field type of the field parameter, and if not specified, it defaults to a simple whitespace delimiter, which most likely would be a misconfiguration. When using the file-based spellchecker with spellcheck.q, be sure to specify this.

Technically, buildOnCommit and buildOnOptimize should be in the preceding list, but they are only worthwhile for the index- or file-based spellcheckers, since those maintain a parallel index.

DirectSolrSpellChecker options

The DirectSolrSpellChecker component generates suggestions directly from the Solr index, so there is no parallel index to maintain that might get out of sync. It's a great choice to start with.

  • maxEdits: This is the number of changes to allow for each term; the default value is 2. Since most spelling mistakes are only one letter off, setting this to 1 will reduce the number of possible suggestions.
  • minPrefix: This refers to the minimum number of characters that the terms should share. If you want the spelling suggestions to start with the same letter, set this value as 1.
  • maxInspections: This defines the maximum number of possible matches to review before returning the results; the default is 5.
  • minQueryLength: This specifies how many characters must be in the input query before suggestions are returned; the default is 4.
  • maxQueryFrequency: This is the maximum threshold for the number of documents a term must appear in before being considered as a suggestion. This can be a fraction (such as .01 for 1 percent) or an absolute value (such as 2). A lower threshold is better for small indexes.
  • thresholdTokenFrequency: This specifies a document frequency threshold, which will exclude words that don't occur sufficiently often. This can be expressed as a fraction in the range 0-1, defaulting to 0, which effectively disables the threshold, letting all words through. It can also be expressed as an absolute value.

    Tip

    If there is a lot of data and lots of common words, as opposed to proper nouns, then this threshold should be effective. If testing shows spelling candidates including strange fluke words found in the index, then introduce a threshold that is high enough to weed out such outliers. The threshold will probably be less than 0.01—one percent of documents.

IndexBasedSpellChecker options

The IndexBasedSpellChecker component gets the dictionary from the indexed content of a field in a Lucene/Solr index, and it loads it into its own private parallel index to perform spellcheck searches on. The options are explained as follows:

  • buildOnCommit and buildOnOptimize: These Boolean options (defaulting to false) enable the spellchecker's internal index to be built automatically when either Solr performs a commit or optimize. This can make keeping the spellchecker in sync easier than building manually, but beware that commits or optimizes will subsequently be hit with a long delay.
  • spellcheckIndexDir: This is the directory where the spellchecker's internal dictionary is built, not its source. It is relative to Solr's data directory. This option is optional; if it is omitted, the dictionary is held in memory.

    Note

    For a high-load Solr server, an in-memory index is appealing. Until SOLR-780 is addressed, you'll have to take care to tell Solr to build the dictionary whenever the Solr core gets loaded. This happens at startup or if you tell Solr to reload a core.

  • sourceLocation: If specified, it refers to a directory containing Lucene index files, such as a Solr data directory. This is an unusual expert choice, but shows that the source dictionary does not need to come from Solr's main index; it could be from another location, perhaps from another Solr core. If you are doing this, then you'll probably also need to use the spellcheck.reload command mentioned later.

    Warning

    This option name is common to both IndexBasedSpellChecker and FileBasedSpellChecker but is defined differently.

  • thresholdTokenFrequency: This has the same definition as in DirectSolrSpellChecker.
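
As a rough sketch of how these options fit together, an additional spellchecker block inside the spellcheck search component might look like the following. The name a_spell_index and the directory spellchecker_a_spell are hypothetical choices made for illustration, and the threshold value is a placeholder to be tuned against real data:

```xml
<!-- Illustrative only: an index-based spellchecker on the a_spell field. -->
<lst name="spellchecker">
  <!-- hypothetical name; must be unique within the component -->
  <str name="name">a_spell_index</str>
  <str name="classname">solr.IndexBasedSpellChecker</str>
  <str name="field">a_spell</str>
  <!-- hypothetical directory, relative to Solr's data directory;
       omit it for an in-memory dictionary -->
  <str name="spellcheckIndexDir">spellchecker_a_spell</str>
  <!-- rebuild the parallel index on optimize, but not on every commit -->
  <str name="buildOnCommit">false</str>
  <str name="buildOnOptimize">true</str>
  <!-- placeholder threshold; tune against your own data -->
  <float name="thresholdTokenFrequency">0.005</float>
</lst>
```

Because this spellchecker maintains a parallel index, it still has to be built before it offers any corrections.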

FileBasedSpellChecker options

The FileBasedSpellChecker component is very similar to IndexBasedSpellChecker, except that it gets the dictionary from a plain text file instead of the index. It maintains its own private parallel index to perform spellcheck searches on. This can be useful if you are using Solr as a spelling server, or if you don't need spelling suggestions to be based on actual terms in the index. The file format is one word per line. You can find an example file (spellings.txt) in the conf directory.

  • buildOnCommit, buildOnOptimize, and spellcheckIndexDir: For more on these, see the IndexBasedSpellChecker options section.
  • sourceLocation: This is mandatory and references a plain text file with each word on its own line. Note that an option by the same name but different meaning exists for IndexBasedSpellChecker.

    Tip

    For a freely available English word list, check out Spell Checker Oriented Word Lists (SCOWL) at http://wordlist.sourceforge.net. In addition, see the dictionary files for OpenOffice, which supports many languages at http://wiki.services.openoffice.org/wiki/Dictionaries.

  • characterEncoding: This is optional, but should be set. It is the character encoding of sourceLocation, defaulting to UTF-8.

WordBreakSolrSpellChecker options

The WordBreakSolrSpellChecker component offers suggestions by combining adjacent query terms and/or breaking terms into multiple words from the Solr index. It can detect spelling errors resulting from misplaced whitespace without the use of shingle-based dictionaries and provides collation support for word-break errors, including cases where the user has a mix of single-word spelling errors and word-break errors in the same query. The following are options specific to this spellchecker:

  • combineWords: This defines whether words should be combined in a dictionary search; default is true
  • breakWords: This defines whether words should be broken during a dictionary search; default is true
  • maxChanges: This defines how many times the spell checker should check collation possibilities against the index; default is 10 (can be any integer)

For more advanced options, see the Javadocs at http://lucene.apache.org/solr/4_8_0/solr-core/org/apache/solr/spelling/WordBreakSolrSpellChecker.html.

Tip

If you use this spellchecker, you'll probably want to combine its suggestions with one of the other spellcheckers. All you need to do is reference multiple dictionaries at search time (more on that later) and Solr will merge them. Pretty cool!

You can find an example of this spellchecker configuration in Solr's example solrconfig.xml.
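
A configuration along the following lines is one plausible starting point. It is written against the a_spell field from this chapter rather than copied from Solr's example, the name wordbreak is an arbitrary choice, and the values are simply the defaults described above:

```xml
<!-- Sketch: a word-break spellchecker working off the a_spell field. -->
<lst name="spellchecker">
  <!-- arbitrary name, referenced at search time via spellcheck.dictionary -->
  <str name="name">wordbreak</str>
  <str name="classname">solr.WordBreakSolrSpellChecker</str>
  <str name="field">a_spell</str>
  <str name="combineWords">true</str>
  <str name="breakWords">true</str>
  <int name="maxChanges">10</int>
</lst>
```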

Processing the q parameter

We've not yet discussed the parameters of a search with the spellchecker component enabled. But at this point of the configuration discussion, understand that you have the choice of just letting the user query q get processed or you can use spellcheck.q.

When a user query (q parameter) is processed by the spellcheck component to look for spelling errors, Solr needs to determine what words are to be examined. This is a two-step process. The first step is to pull out the queried words from the query string, ignoring any syntax, such as AND. The next step is to process the words with an analyzer so that, among other things, lowercasing is performed.

The analyzer is chosen through a field type specified directly within the search component configuration with queryAnalyzerFieldType. It is optional, but it really should be specified: if left unspecified, there is no text analysis, which would in all likelihood be a misconfiguration.

Note

This algorithm is implemented by a spellcheck query converter—a Solr extension point. The default query converter, known as SpellingQueryConverter, is probably fine.
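
Should you ever need to declare the converter explicitly, it is registered as a top-level element in solrconfig.xml. The following line is a sketch of declaring the default one; swapping in your own class is where the extension point would come into play:

```xml
<!-- Declares the default converter explicitly; normally this can be omitted. -->
<queryConverter name="queryConverter" class="solr.SpellingQueryConverter"/>
```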

Processing the spellcheck.q parameter

If the spellcheck.q parameter is given (which really isn't a query per se), then the string is processed with the text analysis referenced by the fieldType option of the spellchecker being used. If a file-based spellchecker is being used, then you should set this explicitly. Index-based spellcheckers will sensibly use the field type of the referenced indexed spelling field.
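
For instance, the file-based spellchecker shown earlier could name its analysis explicitly. This is a sketch that reuses the textSpell field type from the schema section; whether that analysis suits your own word list is an assumption to verify:

```xml
<lst name="spellchecker">
  <str name="name">file</str>
  <str name="classname">solr.FileBasedSpellChecker</str>
  <str name="sourceLocation">spellings.txt</str>
  <str name="characterEncoding">UTF-8</str>
  <!-- analysis applied to spellcheck.q; there is no field to infer it from -->
  <str name="fieldType">textSpell</str>
</lst>
```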

Note

The dichotomy of the ways in which the analyzer is configured between both q and spellcheck.q arguably needs improvement.

Building index- and file-based spellcheckers

If the spellchecker you are using is IndexBasedSpellChecker or FileBasedSpellChecker (or, technically, Suggester), then it needs to be built, which is the process in which the dictionary is read and built into the index at spellcheckIndexDir. If it isn't built, then no corrections will be offered, and you'll probably be very confused. You'll be even more confused when troubleshooting the results if it was built once but is far out of date and so needs to be built again.

Generally, building is required if it has never been built before, and it should be built periodically when the dictionary changes. It need not be rebuilt for every change, but until it is rebuilt, it won't reflect those modifications.

Tip

Using buildOnOptimize or buildOnCommit is a low-hassle way to keep the spellcheck index up to date. However, most apps never optimize or optimize too infrequently to make use of this, or they commit too frequently. So instead (or in addition to buildOnOptimize), issue build commands manually at a suitable interval and/or at the end of your data loading scripts. Furthermore, setting spellcheckIndexDir will ensure the built spellcheck index is persisted between Solr restarts.

In order to perform the build of a spellchecker, simply enable the component with spellcheck=true, add a special parameter called spellcheck.build, and set it to true, as in the following URL:

http://localhost:8983/solr/mbartists/select?qt=mb_artists&rows=0&spellcheck=true&spellcheck.build=true&spellcheck.dictionary=a_spell

The other spellcheck parameters will be explained shortly. There is an additional related option similar to spellcheck.build called spellcheck.reload. This doesn't rebuild the index, but it basically re-establishes connections with the index—both sourceLocation for index-based spellcheckers and spellcheckIndexDir for all types. If you've decided to have some external process build the dictionary or simply share built indexes between spellcheckers, then Solr needs to know to reload it to see the changes—a quick operation.

Issuing spellcheck requests

At this point, we've covered how to configure a spellchecker but not how to issue requests that actually use it. In summary, all that you are required to do is add spellcheck=true to a standard search request, but it is more likely that you will set other options, once you start experimenting.

It's important to be aware that there are effectively three mutually exclusive internal modes that this component places itself in:

  • The default mode is only to offer suggestions for query terms that find no results. This is intuitive, but sometimes a term that finds results was an indexed typo.
  • If spellcheck.onlyMorePopular=true, then the spellcheck component will not only try to offer suggestions for query terms that find no results, it will also do so for the other terms, provided it can offer a suggestion that occurs more frequently in the index. Now Solr is working harder and intuitively, this should help fix cases when the query is an indexed typo. However, the erroneous query term might not be an indexed typo (for example, June versus Jane); can Solr still try harder? Yes…
  • If spellcheck.alternativeTermCount is set, then it will try to find suggestions for all terms, and the suggestions need not occur more frequently.

Despite these progressively aggressive spellcheck modes, there might still be no suggestions or fewer than the number asked for if it simply can't find anything suitable.

Let's now explore the various request parameters recognized by the spellchecker component:

  • spellcheck: This refers to a Boolean switch that must be set to true to enable the component in order to see suggested spelling corrections.
  • spellcheck.dictionary: This is the named reference to a dictionary (spellchecker) to use configured in solrconfig.xml. It defaults to default. This can be set multiple times and Solr will merge the suggestions.
  • spellcheck.q or q: The string containing words to be processed by this component can be specified as the spellcheck.q parameter, and if not present, then the q parameter. Please look for the information presented earlier on how these are processed.

    Tip

    Which should you use: spellcheck.q or q?

    Assuming you're handling user queries for Solr that might contain some query syntax, then the default q is right, as Solr will then know to filter out possible uses of Lucene/Solr's syntax, such as AND, OR, fieldname:word, and so on. If not, then spellcheck.q is preferred, as it won't go through that unnecessary processing. This also allows its parsing to be different on a spellchecker-by-spellchecker basis, which we'll leverage in our example.

  • spellcheck.count: This refers to the maximum number of corrections to offer per word. The default is 1. Corrections are ordered by those closest to the original, as determined by the distanceMeasure algorithm.

    Tip

    Although counter-intuitive, raising this number affects the suggestion ordering—the results get better! The internal algorithm sees ~10 times as many as this number and then it orders them by closest match. Consequently, use a number between 5 and 10 or so to get quality results.

  • spellcheck.extendedResults: This is a Boolean switch that adds frequency information, both for the original word and for the suggestions. It's helpful when debugging.
  • spellcheck.collate: This is a Boolean switch that adds a revised query string to the output that alters the original query (from spellcheck.q or q) to use the top recommendation for each suggested word. It's smart enough to leave any other query syntax in place. The following are some additional options for use when collation is enabled:
    • spellcheck.maxCollations: This specifies the maximum number of collations to return, defaulting to 1.
    • spellcheck.maxCollationTries: This specifies the maximum number of collations to try (verify it yields results), defaulting to 5. If this is non-zero, then the spellchecker will not return collations that yield no results.
    • spellcheck.maxCollationEvaluations: This specifies the maximum number of word correction combinations to rank before the top candidates are tried (verified). Without this limit, queries with many misspelled words could yield a combinatoric explosion of possibilities. The default is 10000, which should be fine.
    • spellcheck.collateExtendedResults: This is a Boolean switch that adds more details to the collation response. It adds the collation hits (number of documents found) and a mapping of misspelled words to corrected words.
    • spellcheck.collateParam.xx: This allows a parameter override, where xx is the parameter you want to override. For example, you might override mm from a low value to a high value so that the spellchecker is truly verifying that the replacement (collation) terms exist together in the same document. This is similar to local-params, but is applied to the collated query string verification when maxCollationTries is used.

      Tip

      Enable spellcheck.collate, since a user interface will most likely want to present a convenient link to use the spelling suggestions. Furthermore, ensure the collation is verified to return results by setting spellcheck.maxCollationTries to a small non-zero number, perhaps 5.

  • spellcheck.onlyMorePopular: This is a Boolean switch that will offer spelling suggestions for queried terms that were found in the index, provided that the suggestions occur more often. This is in addition to the normal behavior of only offering suggestions for queried terms not found in the index. To detect when this happens, enable extendedResults and look for origFreq being greater than 0. This is disabled by default.
  • spellcheck.alternativeTermCount: This specifies the maximum number of suggestions to return for each query term that already exists in the index/dictionary. Normally, the spellchecker doesn't offer suggestions for such query terms, and so setting this triggers the spellchecker to try to find suggestions for all query terms. The configured number essentially overrides spellcheck.count for such terms, giving the opportunity to use a more conservative (lower) number, since it's less likely one of these query terms was actually misspelled.
  • spellcheck.maxResultsForSuggest: This specifies the maximum number of results the request can return in order to both generate spelling suggestions and set the correctlySpelled element to false. This acts as an early short-circuit rule in the spellchecker if you set it, otherwise there is no rule. This option is only applicable when spellcheck.onlyMorePopular is true or spellcheck.alternativeTermCount is set, because only those two options can trigger suggestions for queries that return results.

    Tip

    We recommend that you experiment with various options provided by the SpellChecker component, with the real data that you are indexing so that you can find out what options work best for your requirements.
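
Once you settle on a set of options, one way to avoid repeating them on every request is to place them in the request handler's defaults. The following is a sketch that builds on the /mb_artists handler from earlier; the specific values are illustrative starting points, not recommendations derived from testing:

```xml
<requestHandler name="/mb_artists" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">a_name a_alias^0.8 a_member_name^0.4</str>
    <!-- spellcheck defaults; any of these can be overridden per request -->
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">a_spell</str>
    <str name="spellcheck.count">5</str>
    <!-- also consider terms that are in the index, but conservatively -->
    <str name="spellcheck.alternativeTermCount">2</str>
    <!-- skip suggestions when the query already finds more than 5 results -->
    <str name="spellcheck.maxResultsForSuggest">5</str>
    <!-- return one collation, verified to yield results -->
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.maxCollations">1</str>
    <str name="spellcheck.maxCollationTries">5</str>
    <!-- require all terms to match when verifying a collation -->
    <str name="spellcheck.collateParam.mm">100%</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
```

With defaults like these in place, a plain search request to this handler would carry spellcheck output without any extra parameters.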

Example usage for a misspelled query

We'll try out a typical spellcheck configuration that we've named a_spell. We've disabled showing the query results with rows=0 because the actual query results aren't the point of these examples. In this example, it is imagined that the user is searching for the band Smashing Pumpkins, but with a misspelling.

Here are the search results for Smashg Pumpkins, using the a_spell dictionary:

<?xml version="1.0"?>
<response>
<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">124</int>
  <lst name="params">
    <str name="spellcheck">true</str>
    <str name="indent">on</str>
    <str name="spellcheck.extendedResults">true</str>
    <str name="spellcheck.collateExtendedResults">true</str>
    <str name="spellcheck.maxCollationTries">5</str>
    <str name="spellcheck.collate">true</str>
    <str name="rows">0</str>
    <str name="echoParams">explicit</str>
    <str name="q">Smashg Pumpkins</str>
    <str name="spellcheck.dictionary">a_spell</str>
    <str name="spellcheck.count">5</str>
    <str name="qt">/mb_artists</str>
  </lst>
</lst>
<result name="response" numFound="0" start="0"/>
<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="smashg">
      <int name="numFound">5</int>
      <int name="startOffset">0</int>
      <int name="endOffset">6</int>
      <int name="origFreq">0</int>
      <arr name="suggestion">
        <lst>
          <str name="word">smash</str>
          <int name="freq">36</int>
        </lst>
        <lst>
          <str name="word">smashing</str>
          <int name="freq">4</int>
        </lst>
        <lst>
          <str name="word">smashign</str>
          <int name="freq">1</int>
        </lst>
        <lst>
          <str name="word">smashed</str>
          <int name="freq">5</int>
        </lst>
        <lst>
          <str name="word">smasher</str>
          <int name="freq">2</int>
        </lst>
      </arr>
    </lst>
    <bool name="correctlySpelled">false</bool>
    <lst name="collation">
      <str name="collationQuery">smashing Pumpkins</str>
      <int name="hits">1</int>
      <lst name="misspellingsAndCorrections">
        <str name="smashg">smashing</str>
      </lst>
    </lst>
  </lst>
</lst>
</response>

In this scenario, we intentionally chose a misspelling that is closer to another word: "smash". Were it not for maxCollationTries, the suggested collation would be "smash Pumpkins", which would return no results. There are a few things we want to point out regarding the spellchecker response:

  • Applications consuming this data will probably only use the collation query, despite the presence of a lot of other information.
  • The suggestions are ordered by the so-called edit-distance score (closest match), which is not displayed. It may seem here that it is ordered by frequency, which is a coincidence.

    Note

    There is an extension point to the spellchecker to customize the ordering—search Solr's wiki on comparatorClass for further information. You could write one that orders results based on a formula, fusing both the suggestion score and document frequency.

  • startOffset and endOffset are the character offsets of the spellchecked word within the query string. This information can be used by the client to display the query differently, perhaps displaying the corrected words in bold.
  • numFound is always the number of suggested words returned, not the total number that would be available if spellcheck.count were raised.
  • correctlySpelled is intuitively true or false, depending on whether all of the query words were found in the dictionary or not.