The Stats component

The stats component calculates basic mathematical statistics over fields in the index. The main requirement is that the field be indexed. The following statistics are computed over the non-null values (except missing, which counts the nulls):

  • min: The smallest value
  • max: The largest value
  • sum: The sum
  • count: The quantity of non-null values accumulated in these statistics
  • missing: The quantity of records skipped due to missing values
  • sumOfSquares: The sum of the square of each value; this is probably the least useful and is used internally to compute stddev efficiently
  • mean: The average value
  • stddev: The standard deviation of the values
  • distinctValues: A list of all distinct (non-duplicating) values
  • countDistinct: The size of distinctValues

If you calculate stats on a string or date field, then only min, max, count, missing, distinctValues, and countDistinct are calculated. The distinctValues and countDistinct are only present if stats.calcdistinct is enabled.
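To illustrate why sumOfSquares is enough to derive stddev without a second pass over the data, here is a minimal Python sketch using the sum, count, and sumOfSquares figures from the track-duration response shown later in this section. It assumes the sample form of the standard deviation (n - 1 in the denominator); with roughly seven million values, the population form gives practically the same number.

```python
import math

# Values copied from the t_duration stats response in this section.
count = 6977765
total = 1.543289275e9             # the "sum" statistic
sum_of_squares = 5.21546498201e11

# The mean needs only the running sum and the count.
mean = total / count

# Assumption: sample standard deviation (n - 1 denominator).
# Only the three accumulated totals are needed, not the raw values.
variance = (count * sum_of_squares - total ** 2) / (count * (count - 1))
stddev = math.sqrt(variance)

print(f"mean={mean:.4f} stddev={stddev:.4f}")
```

Because all three inputs are simple running totals, they can be accumulated in a single pass and merged across partial results.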

Configuring the stats component

This component is simple to configure, using the following request parameters:

  • stats: Set this to true in order to enable the component. It defaults to false.
  • stats.field: Set this to the name of the indexed field to calculate statistics on. It is required. This field must be indexed or preferably have DocValues. This parameter can be added multiple times in order to calculate statistics on more than one field. And like facet.field, it can be preceded with a filter query exclusion in local-params syntax; for example, &stats.field={!ex=t_duration}t_duration&fq={!tag=t_duration}t_duration:1000.
  • stats.calcdistinct: A Boolean option to include a list of all distinct (non-duplicating) values for this field. Be judicious about using this! Using it on some fields could trigger an OutOfMemoryError easily. Solr 5.2 has a scalable option to provide an estimated count.
  • stats.facet: Optionally, set this to the name of the field that you want to facet the statistics over. Instead of the results having just one set of stats (assuming one stats.field), there will be a set for each value in this field, and those statistics will be based on that corresponding subset of data. This is analogous to the GROUP BY syntax in SQL. This parameter can be specified multiple times to compute the statistics over multiple fields' values. In addition, you can use the field-specific parameter name syntax when you are computing stats on different fields and want a different facet field for each one. For example, you can specify f.t_duration.stats.facet=tracktype, assuming a hypothetical field tracktype, to categorize the t_duration statistics. The facet field should be indexed or have DocValues, and it should not be tokenized.

    Note

    Due to the bug SOLR-1782, a stats.facet field should not be multivalued, and it should be limited to a string. If you don't heed this advice, then the results are in question and you may get an error!
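As a rough sketch of how these parameters combine in a request, the following Python snippet builds a query string with the standard library. The host, core, and handler path are taken from the example URL used in this chapter, and tracktype is the same hypothetical facet field mentioned above; adjust both for a real setup.

```python
from urllib.parse import urlencode, parse_qsl

# Assumption: the request handler path used by this chapter's examples.
base_url = "http://localhost:8983/solr/mbtracks/mb_tracks"

# A list of pairs (not a dict) so that a parameter such as stats.field
# could be repeated to request statistics on more than one field.
params = [
    ("rows", "0"),
    ("indent", "on"),
    ("stats", "true"),
    # Exclude the tagged filter query when computing the statistics.
    ("stats.field", "{!ex=t_duration}t_duration"),
    ("fq", "{!tag=t_duration}t_duration:1000"),
    # Field-specific facet; "tracktype" is a hypothetical field.
    ("f.t_duration.stats.facet", "tracktype"),
]

# urlencode percent-escapes the local-params braces and other symbols.
url = base_url + "?" + urlencode(params)
print(url)
```

Letting urlencode handle the escaping avoids hand-encoding the local-params syntax, which is easy to get wrong.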

Statistics on track durations

Let's look at some statistics for the duration of tracks in MusicBrainz at http://localhost:8983/solr/mbtracks/mb_tracks?rows=0&indent=on&stats=true&stats.field=t_duration.

And here are the results:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">5202</int>
</lst>
<result name="response" numFound="6977765" start="0"/>
<lst name="stats">
  <lst name="stats_fields">
    <lst name="t_duration">
      <double name="min">0.0</double>
      <double name="max">36059.0</double>
      <double name="sum">1.543289275E9</double>
      <long name="count">6977765</long>
      <long name="missing">0</long>
      <double name="sumOfSquares">5.21546498201E11</double>
      <double name="mean">221.1724348699046</double>
      <double name="stddev">160.70724790290328</double>
    </lst>
  </lst>
</lst>
</response>

This query shows that on average, a song is 221 seconds (or 3 minutes 41 seconds) in length. An example using stats.facet would produce a much longer result, which won't be given here in order to leave space for other components. However, there is an example at http://wiki.apache.org/solr/StatsComponent.
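To read these numbers programmatically, a client can walk the nested lst elements by their name attributes. The following is a minimal Python sketch that parses a trimmed copy of the response above with the standard library and converts the mean into minutes and seconds; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the stats response shown above.
response_xml = """
<response>
<lst name="stats">
  <lst name="stats_fields">
    <lst name="t_duration">
      <double name="min">0.0</double>
      <double name="max">36059.0</double>
      <long name="count">6977765</long>
      <long name="missing">0</long>
      <double name="mean">221.1724348699046</double>
      <double name="stddev">160.70724790290328</double>
    </lst>
  </lst>
</lst>
</response>
"""

root = ET.fromstring(response_xml)

# Locate the per-field statistics by the name attribute of each <lst>.
field = root.find(
    "./lst[@name='stats']/lst[@name='stats_fields']/lst[@name='t_duration']"
)

# Map each statistic name to its numeric value.
stats = {child.get("name"): float(child.text) for child in field}

# Convert the mean duration in seconds to whole minutes and seconds.
minutes, seconds = divmod(round(stats["mean"]), 60)
print(f"average track length: {minutes} min {seconds} s")
```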
