The highlight component

You are probably most familiar with search highlighting from Internet search engines such as Google. Most search results come back with a snippet of text from the site in which the word(s) you searched for are highlighted. Solr can do the same thing. In the following screenshot, Google highlights (in bold) the terms of a search for Solr and search:

The highlight component

To conserve screen space, you might even use this feature to simply tell the user that there was a match in certain fields without showing a highlighted value. This could make sense if there are many metadata fields. Nevertheless, you would still likely highlight some of them.

A highlighting example

Admittedly, the MusicBrainz dataset is not an ideal example for showing off highlighting because it has no substantial text, but it can still be useful.

The following is a sample use of highlighting on a search for Corgan in the MusicBrainz artist dataset. Recall that the /mb_artists request handler is configured to search against the artist's name, alias, and members fields: http://localhost:8983/solr/mbartists/mb_artists?indent=on&q=corgan&rows=3&hl=true.

And here is the result of that search:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">89</int>
</lst>
<result name="response" numFound="5" start="0">
    <doc>
        <date name="a_begin_date">1967-03-17T05:00:00Z</date>
        <str name="a_name">Billy Corgan</str>
        <date name="a_release_date_latest">2005-06-21T04:00:00Z</date>
        <str name="a_type">1</str>
        <str name="id">Artist:102693</str>
        <str name="type">Artist</str>
    </doc>
    <doc>
        <str name="a_name">Billy Corgan &amp; Mike Garson</str>
        <str name="a_type">2</str>
        <str name="id">Artist:84909</str>
        <str name="type">Artist</str>
    </doc>
    <doc>
        <arr name="a_member_id"><str>102693</str></arr>
        <arr name="a_member_name"><str>Billy Corgan</str></arr>
        <str name="a_name">Starchildren</str>
        <str name="id">Artist:35656</str>
        <str name="type">Artist</str>
    </doc>
</result>
<lst name="highlighting">
    <lst name="Artist:102693">
        <arr name="a_name">
        <str>Billy &lt;em&gt;Corgan&lt;/em&gt;</str>
        </arr>
    </lst>
    <lst name="Artist:84909">
        <arr name="a_name">
        <str>Billy &lt;em&gt;Corgan&lt;/em&gt; &amp; Mike Garson</str>
        </arr>
    </lst>
    <lst name="Artist:35656">
        <arr name="a_member_name">
        <str>Billy &lt;em&gt;Corgan&lt;/em&gt;</str>
        </arr>
    </lst>
</lst>
</response>

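Note how the highlighting results appear in the response data: they are not embedded in the documents themselves but are returned in a separate highlighting list, keyed by each document's id and then by field name. Also note that not all of the results were highlighted against the same field; the first two matched on a_name, whereas the third matched on a_member_name.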

Note

It is possible to enable highlighting and discover that some of the results are not highlighted. Sometimes this can be due to complex text analysis; although more likely, it could simply be that there is a mismatch between the fields searched and those highlighted.

Choose the Standard, FastVector, or Postings highlighter

Before jumping into the highlighting parameters and configuration, it's important to be aware that Lucene has three highlighter implementations, all of which are exposed through Solr. All of them require that the field you highlight on be marked as stored in the schema, for obvious reasons. If you've at least done that, then you can skip choosing among them for early prototyping/experimentation and proceed to the next section using the venerable standard highlighter. The primary reason there are multiple implementations is performance—particularly for lengthy text. The faster ones make trade-offs either in features or additional index size. Many (but not all) highlighting request parameters apply to all highlighters, but frustratingly, most of the solrconfig.xml based settings vary between the highlighters.

The Standard (default) highlighter

Lucene's original highlighter was simply called the highlighter, but it's now referred to as either the default or standard highlighter. This is the one you get if you take no action to choose the others. This highlighter has the fewest index requirements—simply make sure that the field is stored. For lengthier text fields, it's the slowest since it re-analyzes the text, and if you want phrase queries to highlight correctly (hl.usePhraseHighlighter), then it will index it in-memory on the fly for that feature. But this highlighter is the most accurate, particularly if you are using SpanQueries. The ComplexPhrase and Surround query parsers are the only out-of-the-box query parsers that can produce such queries, but plenty of Solr users write their own that make use of SpanQueries.

Tip

The performance is much faster if you index term vectors, which spares the need to re-analyze the text and to index on the fly for phrase queries. Use the same schema options as required for the FastVector highlighter, which will be described next. If you are using Solr 5 in particular, then the difference can be dramatic.

Unlike the other two highlighters, this one's snippet fragmenting options do not include one based on Java's BreakIterator. BreakIterator has better internationalization support and some overall nice features versus using a regular expression.

The FastVector highlighter

The FastVector highlighter (FVH) was the second highlighter to come about and is fundamentally based on term vector information in the index, which isn't there unless you enable it. Term vectors are hefty, usually consuming almost as much space on disk as the stored content does, which is the biggest part of the index. The FVH is fairly accurate, with the exception of SpanQueries, as mentioned previously. Also, this highlighter has the unique feature of being able to highlight each query term with different markup, such as a distinct color.

The schema field requirements are indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true". To tell Solr to use the FVH, set hl.useFastVectorHighlighter=true in your request parameters.
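
As a minimal sketch of those schema requirements, a field definition might look like the following. The field name description and the field type text_general are assumptions made for illustration; substitute whatever long text field and analyzed type your own schema defines.

```xml
<!-- schema.xml: hypothetical long text field prepared for the FastVector highlighter. -->
<!-- "description" and "text_general" are illustrative names, not from the MusicBrainz schema. -->
<field name="description" type="text_general"
       indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>
```

Changing these attributes on an existing field only affects documents indexed afterwards, so a re-index is normally needed before the term vectors become available.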

Note

If a field to highlight doesn't have term vectors enabled, the standard/default highlighter will be used even if this FVH request parameter has been set. This is a good thing as it allows you to use term vectors where they have the most benefit: on long text fields, not short ones.

The Postings highlighter

This newer highlighter was introduced in Solr 4.1 and uses Lucene's ability to store offset information with the postings data in the index. This extra information takes up much less space than term vectors do. The Postings highlighter was also written to be as fast as possible, at the cost of accuracy for phrase queries and any other position-sensitive query. In other words, if the query is a quoted phrase, the highlighter will not honor the adjacency requirement; all words in the phrase will be highlighted, no matter where they lie. And even though it's generally the fastest highlighter, it's markedly slower with wildcard, fuzzy, and other so-called multiterm queries. So, with these points in mind, choose this highlighter when there is a lot of text to be highlighted and speed/efficiency takes priority over accuracy.

The schema field requirements are indexed="true" stored="true" storeOffsetsWithPositions="true". If you attempt to use this highlighter on a field that doesn't meet these requirements, Solr will respond with an error. Unlike the standard and FastVector highlighters, you must also modify solrconfig.xml to register a HighlightComponent configured to use this highlighter, like so:

<searchComponent class="solr.HighlightComponent" name="highlight">
  <highlighting class="org.apache.solr.highlight.PostingsSolrHighlighter"/>
</searchComponent>
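
On the schema side, a field that satisfies the requirements stated above could be declared roughly as follows. This is only a sketch: the field name description and the field type text_general are assumed for illustration.

```xml
<!-- schema.xml: hypothetical long text field prepared for the Postings highlighter. -->
<!-- Offsets are stored with the postings instead of enabling term vectors. -->
<field name="description" type="text_general"
       indexed="true" stored="true"
       storeOffsetsWithPositions="true"/>
```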

Only one search component can be registered with a specific name: highlight in this case. If you want to also highlight with the standard and FVH highlighters for different search requests, then you can register both under separate names and configure separate request handlers to use each.
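
One way this might be arranged is sketched below: the Postings highlighter is registered under its own name, leaving the name highlight to the default component, and a dedicated request handler lists its search components explicitly. The component name postingsHighlight, the handler name /search_long_text, and the field name description are all hypothetical, and the component list shown is an assumption about which standard components you want to keep.

```xml
<!-- solrconfig.xml: Postings highlighter under a separate, hypothetical name. -->
<searchComponent class="solr.HighlightComponent" name="postingsHighlight">
  <highlighting class="org.apache.solr.highlight.PostingsSolrHighlighter"/>
</searchComponent>

<!-- Hypothetical handler that highlights with the component above. -->
<requestHandler name="/search_long_text" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="hl">true</str>
    <str name="hl.fl">description</str>
  </lst>
  <!-- An explicit "components" list replaces the handler's default list, -->
  <!-- so every component that is still wanted must be named here. -->
  <arr name="components">
    <str>query</str>
    <str>facet</str>
    <str>postingsHighlight</str>
    <str>debug</str>
  </arr>
</requestHandler>
```

Other request handlers that do not override their component list would continue to use whatever is registered as highlight.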

A final caveat to this highlighter is that it may work incorrectly when the index analysis configuration has token filters that emit tokens in the wrong order with respect to the offsets. The other highlighters have workarounds, but not the postings highlighter. This used to pose more problems in earlier 4.x releases, but they are rarer now, so you might just accept this as a low risk.

Highlighting configuration

Highlighting, like most parts of Solr searching, is largely configured through request parameters. The standard and FastVector highlighters also contain configuration options in solrconfig.xml, while the postings highlighter was designed to be completely configured via request parameters. You can specify these in the URL, but it is more appropriate to specify the majority of these in your application's request handler in solrconfig.xml because they are unlikely to change between requests. Furthermore, it can be convenient to tweak/tune settings on the Solr end versus your application for most of these parameters, since most wouldn't require a change in processing by the application.
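
For instance, highlighting defaults could be placed in a request handler along these lines. This is a sketch rather than the actual /mb_artists configuration: the handler name /mb_artists_hl is hypothetical, and only the two fields visible in the earlier response (a_name and a_member_name) are listed.

```xml
<!-- solrconfig.xml: hypothetical handler with highlighting enabled by default. -->
<requestHandler name="/mb_artists_hl" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Turn highlighting on so that clients need not pass hl=true. -->
    <str name="hl">true</str>
    <!-- Fields to highlight, taken from the sample response shown earlier. -->
    <str name="hl.fl">a_name a_member_name</str>
  </lst>
</requestHandler>
```

Because these are defaults, an individual request can still override any of them by passing the same parameter in the URL.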

What follows are common parameters observed by the highlighter search component. Understand that, like faceting, nearly all highlighter parameters can be overridden on a per-field basis. The syntax looks like f.fieldName.paramName=value, where paramName is the full parameter name including its hl. prefix; for example, f.allText.hl.snippets=0.

  • hl: This is set to true in order to enable search highlighting. Without this, the other parameters are ignored, and highlighting is effectively disabled.
  • hl.fl: This is a comma- or space-separated list of fields to highlight. A field must be marked as stored in the schema in order to be highlighted. Sometimes, this parameter can be omitted, but the highlighter often has difficulty ascertaining which fields are in the query, so you are advised to just set it. You may use an asterisk wildcard, such as * or r_*, to conveniently highlight on all of the text fields. If you use a wildcard, then consider enabling the hl.requireFieldMatch option.
  • hl.requireFieldMatch: If set to true, a field will not be highlighted for a result unless the query also matched against that field. This is set to false by default, meaning that it's possible to query one field and highlight another and get highlights back, as long as the terms searched for are found within the highlighted field. If you use a wildcard in hl.fl, then you will probably enable this. However, if you query against an all-text catch-all field (probably using copy-field directives), then leave this as false so that the search results can indicate in which field the query text was found. The postings highlighter doesn't support this option; it always behaves as if it were set to true.
  • hl.snippets: This is the maximum number of highlighted snippets (also known as fragments) that will be generated per field. It defaults to 1, which you will probably not change. By setting this to 0 for a particular field (for example, f.allText.hl.snippets=0), you can effectively disable highlighting for that field. You might do that if you used a wildcard for hl.fl and want to make an exception.
  • hl.fragsize: This is the maximum size of each snippet (fragment), measured in characters. The default is 100. If 0 is specified, then the field is not fragmented and whole field values are returned. Obviously, be wary of doing this for large text fields.
  • hl.mergeContiguous: If set to true, then overlapping snippets are merged. The merged fragment size is not limited by hl.fragsize. The default is false, but you will probably set this to true when hl.snippets is greater than zero and hl.fragsize is non-zero.

    Note

    In this edition of the book, we only document some common parameters. See the Solr Reference Guide for definitive information on all of the rest (there are a lot more) at https://cwiki.apache.org/confluence/display/solr/Highlighting.
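
The parameters listed above can be combined in a request handler's defaults, including a per-field override. The following is a sketch under stated assumptions: the handler name /search_hl is hypothetical, allText stands for the catch-all field used as an example earlier, and the specific values are illustrative rather than recommended.

```xml
<!-- solrconfig.xml: hypothetical handler combining the common highlighting parameters. -->
<requestHandler name="/search_hl" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="hl">true</str>
    <!-- Wildcard: consider every stored field for highlighting. -->
    <str name="hl.fl">*</str>
    <!-- With a wildcard, only highlight fields the query actually matched. -->
    <str name="hl.requireFieldMatch">true</str>
    <!-- Up to two snippets per field, each at most 100 characters. -->
    <str name="hl.snippets">2</str>
    <str name="hl.fragsize">100</str>
    <!-- Merge overlapping snippets into one. -->
    <str name="hl.mergeContiguous">true</str>
    <!-- Per-field override: exclude the catch-all field from highlighting. -->
    <str name="f.allText.hl.snippets">0</str>
  </lst>
</requestHandler>
```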
