The MoreLikeThis component

Have you ever searched for something and found a link that wasn't quite what you were looking for but was reasonably close? If you were using an Internet search engine such as Google, then you may have tried the "more like this…" link next to a search result. Some sites use other language, such as "find similar…" or "related documents…". As these links suggest, they show you pages similar to another page. Solr supports more like this (MLT) too.

The MLT capability in Solr can be used in the following four ways:

  • As a search component: The MLT search component performs MLT analysis on each document returned in a search. This is not usually desired and so it is rarely used.
  • As a request handler: The MLT request handler gives MLT results that are based on a specific indexed document. This is commonly used in reaction to a user clicking a "more like this" link on existing search results. The key input to this option is a reference to the indexed document that you want similar results for.
  • As a request handler with externally supplied text: The MLT request handler can give MLT results based on text posted to the request handler. For example, if you were to send a text file to the request handler, then it would return the documents in the index that are most similar to it. This is atypical, but an interesting option nonetheless.
  • As a query parser: Solr 5 includes a query parser named mlt that can more easily be combined with other queries or relevancy boosting than the other options. See the Solr Reference Guide for further information.
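As a rough illustration of the query parser option, the following sketch assembles a query string in Solr's local-params style, with the document's unique key as the value. The helper name mlt_parser_query and the field and ID values are hypothetical; check the Solr Reference Guide for the options your Solr version accepts.

```python
def mlt_parser_query(doc_id, qf, **options):
    """Build a query string for the mlt query parser.

    doc_id is the unique key of the indexed document to find matches for,
    qf names the field(s) to compare, and options are extra local params
    such as mintf or mindf (hypothetical usage, verify against your version).
    """
    local_params = " ".join(
        [f"qf={qf}"] + [f"{name}={value}" for name, value in options.items()]
    )
    return f"{{!mlt {local_params}}}{doc_id}"


# Example: documents similar to the one whose unique key is 12345
print(mlt_parser_query("12345", "t_name", mintf=1, mindf=2))
```

Because the result is an ordinary query string, it can be passed as q or combined with other clauses, which is the advantage this option has over the others.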

The internal workings of MLT are essentially as follows:

  1. Gather all of the terms with frequency information from the input document:
    • If the input document is a reference to a document within the index, then loop over the fields listed in mlt.fl. If a field has termVectors enabled, then the term information needed is readily available. Otherwise, get the stored text and reanalyze it to derive the terms (slower).
    • If the input document is posted as text to the request handler, then analyze it to derive the terms. The analysis used is that configured for the first field listed in mlt.fl.
  2. Filter the terms based on configured thresholds. What remains are only the interesting terms.
  3. Construct a Boolean OR query with these interesting terms across all of the fields listed in mlt.fl.
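The three steps above can be sketched in a few lines of Python. This is a toy model and not Lucene's implementation: the function names are hypothetical, document frequencies are passed in as a plain dictionary, and ranking terms by term frequency divided by document frequency is an assumption standing in for the real scoring.

```python
from collections import Counter


def interesting_terms(doc_terms, doc_freq, min_tf=2, min_df=5, max_df=None,
                      min_wl=0, max_wl=0, max_qt=25):
    """Steps 1 and 2: count the input terms, then apply the thresholds."""
    term_freq = Counter(doc_terms)  # step 1: term frequencies of the input
    kept = []
    for term, tf in term_freq.items():
        df = doc_freq.get(term, 0)
        if tf < min_tf or df < min_df:
            continue  # too rare in the document or in the index
        if max_df is not None and df > max_df:
            continue  # too common in the index
        if min_wl and len(term) < min_wl:
            continue  # too short (0 disables the check)
        if max_wl and len(term) > max_wl:
            continue  # too long (0 disables the check)
        kept.append((term, tf, df))
    # Assumption: rank by tf/df so that rarer terms come first
    kept.sort(key=lambda item: (-item[1] / max(item[2], 1), item[0]))
    return [term for term, _, _ in kept[:max_qt]]


def build_or_query(terms, fields):
    """Step 3: OR every interesting term across every field."""
    return " OR ".join(f"{field}:{term}" for field in fields for term in terms)


terms = "the end is the beginning is the end".split()
doc_freq = {"the": 900, "end": 40, "is": 700, "beginning": 30}
print(build_or_query(interesting_terms(terms, doc_freq), ["t_name"]))
```

The default thresholds mirror the defaults of the corresponding mlt.* parameters described in this section.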

Configuration parameters

In the following configuration options, the input document is either each search result returned if MLT is used as a component, or it is the first document returned from a query to the MLT request handler, or it is the plain text sent to the request handler. It simply depends on how you use it.

Parameters specific to the MLT search component

Using the MLT search component adorns an existing search with MLT results for each document returned.

  • mlt: You must set this to true to enable MLT when using it as a search component. It defaults to false.
  • mlt.count: This refers to the number of MLT results to be returned for each document returned in the main query. It defaults to 5.
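For illustration, a request that enables the MLT search component might be assembled as follows. This is a minimal sketch: the base URL, core name, and helper name are hypothetical, and it assumes the handler at /select includes the MLT component.

```python
from urllib.parse import urlencode


def mlt_component_url(base_url, query, fields, count=5):
    """Build a search URL that asks for MLT results per returned document."""
    params = {
        "q": query,
        "mlt": "true",               # enable the MLT component
        "mlt.fl": ",".join(fields),  # fields to base similarity on
        "mlt.count": count,          # MLT results per document
    }
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical host and core name
print(mlt_component_url("http://localhost:8983/solr/mbtracks", "t_name:end", ["t_name"]))
```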

Parameters specific to the MLT request handler

Using the MLT request handler is more like a regular search, except that the results are documents similar to the input document. Additionally, filters (the fq parameter) are applied.

  • q, start, and rows: The MLT request handler uses the same standard parameters for the query, start offset, and row count as used for querying. But in this case, start and rows are for paging into the MLT results instead of the results of the query. The query is typically one that simply references one document, such as id:12345 (if your unique field looks like this). start defaults to 0 and rows to 10.
  • mlt.match.offset: This parameter is the offset into the results of q to pick which document is the input document. It defaults to 0 so that the first result from q is chosen. As q will typically search for one document, this is rarely modified.
  • mlt.match.include: The input document is normally included in the response if it is in the index (see the match element in the output of the example) because this parameter defaults to true. Set this parameter to false to exclude it, if that information isn't needed.
  • mlt.interestingTerms: If this is set to list or details, then the so-called interesting terms that MLT uses for the similarity query are returned with the results in an interestingTerms element. If you enable mlt.boost, then specifying details will additionally return the query boost value used for each term. The default, none, disables this. Aside from diagnostic purposes, it might be useful to display these in the user interface, either listed out or in a tag cloud.

    Tip

    Use mlt.interestingTerms while experimenting with the results to get an insight into why the MLT results matched the documents it did.

  • facet, ...: The MLT request handler supports faceting the MLT results. See the previous chapter on how to use faceting.

    Note

    Additionally, remember to configure the MLT request handler in solrconfig.xml. An example of this is shown later in the chapter.
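A request to the MLT request handler could be put together along these lines. Treat it as a sketch under assumptions: the base URL, the handler path passed in, and the helper name are hypothetical, and the handler must already be registered in solrconfig.xml under that path.

```python
from urllib.parse import urlencode


def mlt_handler_url(base_url, handler, doc_query, rows=10, start=0,
                    include_match=True, interesting_terms="none", filters=()):
    """Build a URL for an MLT request handler registered at `handler`."""
    params = [
        ("q", doc_query),  # selects the input document
        ("start", start),  # paging into the MLT results
        ("rows", rows),
        ("mlt.match.include", "true" if include_match else "false"),
        ("mlt.interestingTerms", interesting_terms),  # none, list, or details
    ]
    params += [("fq", fq) for fq in filters]  # filters on the MLT results
    return f"{base_url}{handler}?{urlencode(params)}"


# Hypothetical host, core, and handler path
print(mlt_handler_url("http://localhost:8983/solr/mbtracks", "/mlt_tracks",
                      "id:12345", rows=5, include_match=False,
                      interesting_terms="details"))
```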

Common MLT parameters

These parameters are common to both the search component and request handler. Some of the thresholds here are to tune which terms are interesting to MLT. In general, expanding thresholds (that is, lowering minimums and increasing maximums) will yield more useful MLT results at the expense of performance. The parameters are explained as follows:

  • mlt.fl: This provides a comma- or space-separated list of fields to consider in MLT. The interesting terms are searched within these fields only. These field(s) must be indexed. Furthermore, assuming the input document is in the index instead of supplied externally (as is typical), then each field should ideally have termVectors set to true in the schema (best for query performance, although index size is larger). If that isn't done, then the field must be stored so that MLT can re-analyze the text at runtime to derive the term vector information. It isn't necessary to use the same strategy for each field.
  • mlt.qf: Different field boosts can optionally be specified with this parameter. This uses the same syntax as the qf parameter that is used by the DisMax query parser (for example: field1^2.0 field2^0.5). The fields referenced should also be listed in mlt.fl. If there is a title or similar identifying field, then this field should probably be boosted higher.
  • mlt.mintf: This parameter specifies the minimum number of times (frequency) a term must be used within a document (across those fields in mlt.fl anyway) for it to be an interesting term. The default is 2. For small documents, such as in the case of our MusicBrainz dataset, try lowering this to 1.
  • mlt.mindf: This specifies the minimum number of documents that a term must be used in for it to be an interesting term. It defaults to 5, which is fairly reasonable. For very small indexes, a value as low as 2 is plausible, and a larger value may suit large multi-million document indexes with common words.
  • mlt.maxdf: This specifies the maximum number of documents that a term may be used in for it to be an interesting term. There is no limit by default.
  • mlt.minwl: This is used to specify the minimum number of characters in an interesting term. It defaults to 0, effectively disabling the threshold. Consider raising this to 2 or 3.
  • mlt.maxwl: This parameter specifies the maximum number of characters in an interesting term. It defaults to 0, which disables the threshold. Some really long terms might be flukes in input data and are out of your control, but most likely this threshold can be skipped.
  • mlt.maxqt: This specifies the maximum number of interesting terms that will be used in an MLT query. It is limited to 25 by default, which is plenty.
  • mlt.maxntp: Fields without termVectors enabled take longer for MLT to analyze. This parameter sets a threshold to limit the number of terms to consider in an input field to further limit the performance impact. It defaults to 5000.
  • mlt.boost: This Boolean toggles whether or not to boost each interesting term used in the MLT query differently, depending on how interesting the MLT module deems it to be. It defaults to false, but try setting it to true and evaluating the results.

    Tip

    Usage advice

    For ideal query performance, ensure that termVectors is enabled for the field(s) referenced in mlt.fl, particularly in the larger fields. In order to further increase performance, use fewer fields, perhaps just one that is dedicated for use with MLT. Using the copyField directive in the schema makes this easy. The disadvantage is that the source fields cannot be boosted differently with mlt.qf. However, you might have two fields for MLT as a compromise. Use a typical full complement of text analysis, including lowercasing, synonyms, a stop list (such as StopFilterFactory), and aggressive stemming, in order to normalize the terms as much as possible. The field needn't be stored if its data is copied from some other field that is stored. During an experimentation period, look for interesting terms that are not so interesting and add them to the stop word list. Lastly, some of the configuration thresholds that scope the interesting terms can be adjusted based on experimentation.
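Since the fields boosted in mlt.qf should also appear in mlt.fl, a small client-side check can catch mismatches before a request is sent. The helpers below are hypothetical and only model the two syntaxes described above (a comma- or space-separated field list, and field^boost pairs); Solr itself does not require them.

```python
import re


def parse_field_list(fl):
    """Split a comma- or space-separated mlt.fl value into field names."""
    return [name for name in re.split(r"[,\s]+", fl.strip()) if name]


def parse_boosts(qf):
    """Parse 'field1^2.0 field2^0.5' into {'field1': 2.0, 'field2': 0.5}."""
    boosts = {}
    for part in qf.split():
        field, _, boost = part.partition("^")
        boosts[field] = float(boost) if boost else 1.0  # no boost means 1.0
    return boosts


def unlisted_boost_fields(fl, qf):
    """Return the fields boosted in mlt.qf that are missing from mlt.fl."""
    listed = set(parse_field_list(fl))
    return sorted(field for field in parse_boosts(qf) if field not in listed)


print(unlisted_boost_fields("t_name", "t_name^2.0 t_a_name^0.5"))
```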

The MLT results example

Firstly, an important disclaimer on this example is in order.

Note

The MusicBrainz dataset is not conducive to applying the MLT feature, because it doesn't have any descriptive text. If there was perhaps an artist description and/or widespread use of user-supplied tags, then there might be sufficient information to make MLT useful. However, to provide an example of the input and output of MLT, we will use MLT with MusicBrainz anyway.

We'll be using the request handler method, which is the recommended approach. The MLT request handler needs to be configured in solrconfig.xml. The important bit is the reference to the class; the rest of it is our prerogative.

<requestHandler name="/mlt_tracks" class="solr.MoreLikeThisHandler">
  <lst name="defaults">
    <str name="mlt.fl">t_name</str>
    <str name="mlt.mintf">1</str>
    <str name="mlt.mindf">2</str>
    <str name="mlt.boost">true</str>
  </lst>
</requestHandler>

This configuration shows that we're basing the MLT on just track names. Let's now try a query for tracks similar to the song "The End is the Beginning is the End" by The Smashing Pumpkins. The query was performed with echoParams to clearly show the options used:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">2</int>
  <lst name="params">
    <str name="mlt.mintf">1</str>
    <str name="mlt.mindf">2</str>
    <str name="mlt.boost">true</str>
    <str name="mlt.fl">t_name</str>
    <str name="rows">5</str>
    <str name="mlt.interestingTerms">details</str>
    <str name="indent">on</str>
    <str name="echoParams">all</str>
    <str name="fl">t_a_name,t_name,score</str>
    <str name="q">id:"Track:1810669"</str>
  </lst>
</lst>
<result name="match" numFound="1" start="0"
    maxScore="16.06509">
  <doc>
    <float name="score">16.06509</float>
    <str name="t_a_name">The Smashing Pumpkins</str>
    <str name="t_name">The End Is the Beginning Is the End</str>
  </doc>
</result>
<result name="response" numFound="855211" start="0"
    maxScore="6.3063927">
     <doc>
       <str name="t_name">End Is the Beginning</str>
       <str name="t_a_name">In Grey</str>
       <float name="score">6.3063927</float></doc>
     <doc>
       <str name="t_name">Is the End the Beginning</str>
       <str name="t_a_name">Mangala Vallis</str>
       <float name="score">5.6426353</float></doc>
     <doc>
       <str name="t_name">The End Is the Beginning</str>
       <str name="t_a_name">Royal Anguish</str>
       <float name="score">5.6426353</float></doc>
     <doc>
       <str name="t_name">The End Is the Beginning</str>
       <str name="t_a_name">Ape Face</str>
       <float name="score">5.6426353</float></doc>
     <doc>
       <str name="t_name">The End Is the Beginning Is the End</str>
       <str name="t_a_name">The Smashing Pumpkins</str>
       <float name="score">5.0179915</float></doc>
</result>
<lst name="interestingTerms">
  <float name="t_name:end">1.0</float>
  <float name="t_name:is">0.7513826</float>
  <float name="t_name:the">0.6768603</float>
  <float name="t_name:beginning">0.62302685</float>
</lst>
</response>

The <result name="match"> element is there due to mlt.match.include defaulting to true. The <result name="response" …> element has the main MLT search results. The fact that so many documents were found is not material to any MLT response; all it takes is one interesting term in common. The interesting terms were deliberately requested so that we can get an insight into the basis of the similarity. The fact that "is" and "the" were included shows that we don't have a stop list for this field, which is an obvious thing to fix to improve the results. Nearly any stop list is going to have such words.
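A client consuming this XML might pull out the three parts discussed above roughly as follows. This is a sketch using Python's standard library, with a hypothetical function name; it assumes mlt.interestingTerms=details, since the list setting returns the terms in a different shape.

```python
import xml.etree.ElementTree as ET


def parse_mlt_response(xml_text):
    """Extract the matched document, similar documents, and term boosts."""
    root = ET.fromstring(xml_text)

    def docs(result_name):
        result = root.find(f"./result[@name='{result_name}']")
        if result is None:  # for example, mlt.match.include=false
            return []
        # Field values are kept as the strings found in the XML
        return [{field.get("name"): field.text for field in doc}
                for doc in result.findall("doc")]

    terms = {}
    terms_el = root.find("./lst[@name='interestingTerms']")
    if terms_el is not None:
        terms = {term.get("name"): float(term.text) for term in terms_el}

    return {"match": docs("match"), "similar": docs("response"), "terms": terms}
```

Calling parse_mlt_response on the response text gives a dictionary whose similar entry holds the MLT hits in ranked order and whose terms entry maps each field:term pair to its boost.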

Tip

For further diagnostic information on the score computation, set debugQuery to true. This is a highly advanced method, but it exposes information that is invaluable for understanding the scores. Doing so in our example shows that the top hit ranked first not only because it contained all of the interesting terms, as did the others in the top 5, but also because it is the shortest in length (a high fieldNorm).
