Time for action – having a look at the term vectors

If we want to look directly at term vectors, we can enable a simple search component that returns them in the results. For this to work, we first need to make sure that term vectors are enabled for the field in schema.xml (for example, with the termVectors="true" attribute); the remaining configuration is shown in the following steps:

  1. Then we need to enable a specific component and a handler for the term vectors:
    <searchComponent name="termVector" class="solr.TermVectorComponent" />
    <requestHandler name="/tvc" class="solr.SearchHandler" startup="lazy">
      <lst name="defaults">
        <bool name="tv">true</bool>
      </lst>
      <arr name="last-components">
        <str>termVector</str>
      </arr>
    </requestHandler>
  2. Once we have enabled this new handler, we are able to search something (for example, the term vermeer in the abstract field) and look at the results that will contain the vector parts:
    >> curl -X GET 'http://localhost:8983/solr/paintings_terms/select?q=abstract:vermeer&start=0&rows=2&fl=abstract,fullText&qt=/tvc&tv.all=true&tv.fl=fullText&wt=json&json.nl=map&omitHeader=true'
    

    Here again, for simplicity, we use the JSON format for output and omit the header.

What just happened?

We can look at the results directly. Picking out some interesting bits, we can see the uniqueKey value, which is reported so that we can relate each term vector entry to its document. It's important to remember that term vector results act as metadata in our response. If we use the rows=0 parameter, we will not obtain any vectors either, because they are returned only for the documents actually included in the results; they are, in fact, the data behind those results and are used in the calculation of the matching score. The result is shown in the following screenshot:
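To make the relation between vector metadata and documents more concrete, here is a minimal Python sketch that walks a term vector section and groups the terms by document key. The sample dictionary is hand-written for illustration: it only assumes the general nesting we get with json.nl=map (a termVectors object with one entry per document, each carrying a uniqueKey and one object per field). The identifiers and numbers are hypothetical and are not actual output from the paintings_terms core.

```python
# Hypothetical, hand-written sample: the real output depends on the Solr
# version, the schema, and the tv.* parameters used in the request.
sample_response = {
    "termVectors": {
        "uniqueKeyFieldName": "uri",
        "doc-1": {
            "uniqueKey": "http://example.org/painting/1",
            "fullText": {
                "vermeer": {"tf": 2, "df": 4, "tf-idf": 0.5},
                "lace": {"tf": 1, "df": 10, "tf-idf": 0.1},
            },
        },
    }
}


def terms_by_document(response):
    """Map each document's uniqueKey to its {field: {term: stats}} entries."""
    result = {}
    for key, entry in response.get("termVectors", {}).items():
        # Skip scalar metadata such as "uniqueKeyFieldName".
        if not isinstance(entry, dict):
            continue
        doc_id = entry.get("uniqueKey", key)
        # Keep only the per-field objects, dropping the uniqueKey string.
        fields = {name: terms for name, terms in entry.items()
                  if isinstance(terms, dict)}
        result[doc_id] = fields
    return result


vectors = terms_by_document(sample_response)
for doc_id, fields in vectors.items():
    for field, terms in fields.items():
        for term, stats in sorted(terms.items()):
            print(doc_id, field, term, stats["tf"], stats["df"])
```

Keying the result by uniqueKey is what lets us join the vector metadata back to the documents listed in the main part of the response.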

The tf-idf value is a simple approximation used when calculating the score of a document. It relates term frequency (how many times a term occurs in a given document) to document frequency (how many documents in the index contain the term). It is usually calculated as the product tf*idf, where tf is the term frequency and idf is the inverse document frequency. Because Solr returns the term frequency (tf) and document frequency (df) values instead, we can simply use the ratio tf/df to calculate it.
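The arithmetic can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are made up for this example, the sample numbers are hypothetical, and the logarithmic variant is the common textbook formulation rather than something taken from the Solr output.

```python
import math


def ratio_tf_idf(tf, df):
    """Simple ratio tf/df, computed from the tf and df values of a term."""
    if df <= 0:
        raise ValueError("df must be positive")
    return tf / df


def textbook_tf_idf(tf, df, num_docs):
    """Common textbook variant: tf * log(N / df), with N documents in total."""
    if df <= 0 or num_docs <= 0:
        raise ValueError("df and num_docs must be positive")
    return tf * math.log(num_docs / df)


# Hypothetical values: a term occurring twice in a document, found in 4 documents.
print(ratio_tf_idf(2, 4))          # 0.5
print(textbook_tf_idf(2, 4, 100))  # larger when the term is rare in the index
```

In both variants the value grows with the number of occurrences in the document and shrinks as more documents contain the term, which is why it only makes sense when compared across terms or documents.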

Just to give you an idea, if we search for the values related to the term vermeer itself in the results, we will find that there is a score for the "lacemaker" resource, as shown in the following screenshot:

Note that when taken outside its context, the score value by itself gives us no information. We have to compare this value with other values to give it some meaning.

Reading about functions

Solr provides several built-in functions, and you can extend them with functions of your own. The built-in functions can play an important role in tasks such as ranking, because they can be used as local parameters to boost the score of a document.

You will find an almost complete (and growing, I suppose) list of functions here:

http://wiki.apache.org/solr/FunctionQuery#Available_Functions

Here, you will mostly find scalar functions, as the functions that return vectors are experimental. Most of the functions related to vectors return simple numeric scalar results calculated over term vectors, such as docfreq, which returns the number of documents that contain the term in the field (this is a constant, as it takes the same value for all documents in the index). Other notable examples are the maximum and minimum functions, some mathematical functions, and the Euclidean distance between two vectors, which can again be seen as a kind of surrogate for a similarity measure.
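To give an idea of why a distance can stand in for a similarity measure, here is a minimal plain-Python sketch of the Euclidean distance between two vectors. It is not Solr's implementation, and the vectors are hypothetical term weights chosen only for illustration.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical weights for the same three terms in three documents.
doc_a = [0.5, 0.1, 0.0]
doc_b = [0.5, 0.2, 0.0]
doc_c = [0.0, 0.9, 0.7]

# The smaller the distance, the more similar the two vectors are.
print(euclidean_distance(doc_a, doc_b) < euclidean_distance(doc_a, doc_c))
```

Here the first two vectors are closer to each other than to the third, so a ranking by increasing distance would behave like a ranking by decreasing similarity.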

A very simple way to play with functions is to use one directly as the query, passing the term through a separate parameter, for example:

...&defType=func&q=docfreq(text,$myterm)&myterm=dali
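If we prefer to build such a request from a script, the following Python sketch assembles the same parameters into a URL. The base URL is an assumption (it reuses the local paintings_terms core seen earlier), the helper name is made up for this example, and wt=json is added only for readability of the output; the field text and the term dali come from the example above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed endpoint: adjust host, port, and core name to your own installation.
BASE_URL = "http://localhost:8983/solr/paintings_terms/select"


def function_query_url(field, term, base_url=BASE_URL):
    """Build a function query URL that asks for docfreq of a term in a field."""
    params = {
        "defType": "func",
        # $myterm is resolved by Solr from the separate "myterm" parameter.
        "q": "docfreq({0},$myterm)".format(field),
        "myterm": term,
        "wt": "json",
    }
    # urlencode percent-encodes characters such as "$", "(" and ",".
    return base_url + "?" + urlencode(params)


url = function_query_url("text", "dali")
print(url)
```

Keeping the term in its own parameter means that the function expression stays fixed while we only change the value of myterm from one request to the next.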