Finding similar documents

Imagine that you want to show documents similar to those returned by Solr. For example, let's assume that we run an e-commerce bookstore and want to show users books similar to the ones they found while using our application. Of course, we could use machine learning and one of the collaborative filtering algorithms, but we can also use Solr for that. This recipe will show you how to do this.

How to do it...

  1. Let's start with the following index structure (just add this to your schema.xml file):
    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="name" type="text_general" indexed="true" stored="true" termVectors="true" />
  2. Next, let's index the following test data:
    <add>
     <doc>
      <field name="id">1</field>
      <field name="name">Solr Cookbook first edition</field>
     </doc>
     <doc>
      <field name="id">2</field>
      <field name="name">Solr Cookbook second edition</field>
     </doc>
     <doc>
      <field name="id">3</field>
      <field name="name">Solr by example first edition</field>
     </doc>
     <doc>
      <field name="id">4</field>
      <field name="name">My book second edition</field>
     </doc>
    </add>
  3. Now let's assume that our hypothetical user wants to find books that have cookbook and second in their names. However, in addition to the results for the user query, we would also like to show books that are similar to the query. To do this, we will send the following query:
    http://localhost:8983/solr/cookbook/select?q=cookbook+second&mm=2&qf=name&defType=edismax&mlt=true&mlt.fl=name&mlt.mintf=1&mlt.mindf=1

    The results returned by Solr for the preceding query are as follows:

    <?xml version="1.0" encoding="UTF-8"?>
    <response>
     <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">3</int>
      <lst name="params">
       <str name="mm">2</str>
       <str name="q">cookbook second</str>
       <str name="defType">edismax</str>
       <str name="mlt">true</str>
       <str name="qf">name</str>
       <str name="mlt.fl">name</str>
       <str name="mlt.mintf">1</str>
       <str name="mlt.mindf">1</str>
      </lst>
     </lst>
     <result name="response" numFound="1" start="0">
      <doc>
       <str name="id">2</str>
       <str name="name">Solr Cookbook second edition</str>
       <long name="_version_">1481427182978859008</long></doc>
     </result>
     <lst name="moreLikeThis">
      <result name="2" numFound="3" start="0">
       <doc>
        <str name="id">1</str>
        <str name="name">Solr Cookbook first edition</str>
        <long name="_version_">1481427182903361536</long></doc>
       <doc>
        <str name="id">4</str>
        <str name="name">My book second edition</str>
        <long name="_version_">1481427182979907585</long></doc>
       <doc>
        <str name="id">3</str>
        <str name="name">Solr by example first edition</str>
        <long name="_version_">1481427182979907584</long></doc>
      </result>
     </lst>
    </response>
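
The query from step 3 can also be assembled in code instead of being typed by hand. The following is a minimal Python sketch, assuming that Solr runs locally and the collection is called cookbook, as in the URL shown earlier. The build_mlt_url function name is hypothetical and only serves the illustration; the parameters are the ones used in this recipe.

```python
from urllib.parse import urlencode

# Assumption: local Solr instance and a collection named "cookbook",
# taken from the URL used in this recipe.
SOLR_SELECT_URL = "http://localhost:8983/solr/cookbook/select"


def build_mlt_url(user_query, field="name"):
    """Build a select URL that also asks for MoreLikeThis results.

    build_mlt_url is a hypothetical helper, not part of Solr.
    """
    params = {
        "q": user_query,       # the user query
        "mm": 2,               # both words must match
        "qf": field,           # field searched by edismax
        "defType": "edismax",  # query parser
        "mlt": "true",         # enable the MoreLikeThis component
        "mlt.fl": field,       # field used to compute similarity
        "mlt.mintf": 1,        # minimum term frequency in the source document
        "mlt.mindf": 1,        # minimum document frequency in the index
    }
    # urlencode escapes the values and encodes spaces as "+"
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = build_mlt_url("cookbook second")
print(url)
```

The printed URL can be pasted into a browser or passed to any HTTP client; building it with urlencode avoids escaping mistakes when the user query contains special characters.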

Now let's see how it works.

How it works...

As you can see, the index structure and data are really simple. One thing worth noticing is the termVectors attribute, which is set to true in the name field definition. To achieve the best results with the MoreLikeThis component, we should always enable term vectors for the fields that we plan to use with this functionality. Term vectors allow a more detailed computation of similarity between documents, which results in better matches. If term vectors are not present, Solr will use data from the stored fields.

Now let's take a look at the query. As you can see, we have added some parameters besides the standard q one (and the ones such as mm and defType, which specify how our query should be handled). The mlt=true parameter tells Solr that we want to add the MoreLikeThis component to the result processing. The mlt.fl parameter specifies which fields we want to use with the MoreLikeThis component. In our case, we will use the name field. The mlt.mintf parameter tells Solr to ignore terms from the source document (the ones from the original result list) whose term frequency is below the given value. In our case, we don't want to include terms with a frequency lower than 1. The last parameter, mlt.mindf, tells Solr to ignore words that appear in fewer documents than the given value. In our case, we want to consider words that appear in at least one document.
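
To make the two thresholds more concrete, here is a toy Python sketch of the filtering they describe, applied to the four book names from this recipe. It is only an illustration under simplifying assumptions: the tokenizer just lowercases and splits on whitespace, and the candidate_terms function is hypothetical. It is not how Solr implements MoreLikeThis internally.

```python
from collections import Counter

# The test data from this recipe: document id -> name field
docs = {
    "1": "Solr Cookbook first edition",
    "2": "Solr Cookbook second edition",
    "3": "Solr by example first edition",
    "4": "My book second edition",
}


def tokenize(text):
    # Simplification: lowercase and split on whitespace;
    # a real Solr analyzer does more than this.
    return text.lower().split()


def candidate_terms(source_id, min_tf=1, min_df=1):
    """Return terms of the source document that pass both thresholds."""
    # Term frequency: how often each term occurs in the source document
    tf = Counter(tokenize(docs[source_id]))
    # Document frequency: in how many documents each term occurs
    df = Counter(term for text in docs.values() for term in set(tokenize(text)))
    return sorted(t for t, n in tf.items() if n >= min_tf and df[t] >= min_df)


print(candidate_terms("2"))            # ['cookbook', 'edition', 'second', 'solr']
print(candidate_terms("2", min_df=3))  # ['edition', 'solr']
```

With both thresholds set to 1, as in our query, every term of the source document is kept. Raising min_df to 3 drops cookbook and second, because each of them appears in only two of the four documents.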

The last thing is the search results. As you can see, there is an additional section (<lst name="moreLikeThis">), which holds the MoreLikeThis component results. For each document in the main results, one result section with similar documents is added to the response. In our case, Solr added a section for the document with the unique identifier 2 (<result name="2" numFound="3" start="0">), and three similar documents were found. The value of the name attribute is the unique identifier of the document that the similar documents are calculated for.
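
An application usually needs to read this section programmatically. The following Python sketch, which uses only the standard library, extracts the identifiers of similar documents from an XML response. The SAMPLE_RESPONSE string is a trimmed copy of the response shown earlier (the header and the _version_ fields are left out), and the similar_ids function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the response from this recipe (header and _version_ omitted)
SAMPLE_RESPONSE = """<response>
 <result name="response" numFound="1" start="0">
  <doc><str name="id">2</str><str name="name">Solr Cookbook second edition</str></doc>
 </result>
 <lst name="moreLikeThis">
  <result name="2" numFound="3" start="0">
   <doc><str name="id">1</str><str name="name">Solr Cookbook first edition</str></doc>
   <doc><str name="id">4</str><str name="name">My book second edition</str></doc>
   <doc><str name="id">3</str><str name="name">Solr by example first edition</str></doc>
  </result>
 </lst>
</response>"""


def similar_ids(xml_text):
    """Map each source document id to the ids of its similar documents."""
    root = ET.fromstring(xml_text)
    mlt = root.find("./lst[@name='moreLikeThis']")
    similar = {}
    if mlt is None:
        # The MoreLikeThis component was not enabled for this request
        return similar
    for result in mlt.findall("result"):
        # The name attribute holds the id of the source document
        ids = [doc.findtext("str[@name='id']") for doc in result.findall("doc")]
        similar[result.get("name")] = ids
    return similar


print(similar_ids(SAMPLE_RESPONSE))  # {'2': ['1', '4', '3']}
```

The returned dictionary is keyed by the name attribute of each result element, so it can be joined directly with the documents from the main result list.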
