Spatial search

Note

This section was written by David Smiley, a committer on Lucene/Solr specializing in spatial search.

Spatial search is the ability to find geometric information in a multidimensional space. Most information retrieval systems that support spatial data, including Solr, are limited to a two-dimensional Cartesian plane, with additional support for geospatial search in which two dimensions reference the location on the surface of a sphere.

That description is a bit abstract, so let's now review some common spatial requirements of an application. If your Solr documents represent businesses and you know where the business resides in terms of a latitude and a longitude, then you probably want to show search results (businesses) filtered to the vicinity of where the user is looking. The user interface might have a map centered at a region of interest, and/or you know approximately where the user is from the GPS of their mobile device, or you might even have a GeoIP database at your disposal to map their IP address to an approximate location. Beyond filtering, you then might want to sort or relevancy-boost by the distance between the center of where the user's search area is and where the business is. Indexing points, filtering them by a rectangle or circle, getting the distance to sort or boost, and displaying that distance to the user, are the most common user requirements.

Spatial in Solr is confusing because there are basically two implementations to pick from: LatLonType (since Solr 3) and SpatialRecursivePrefixTreeFieldType also known as RPT (since Solr 4), and they are quite different. I'll describe the internal workings of both, and then how to use them, pointing out differences along the way.

Spatial in Solr 3 – LatLonType and friends

Solr 3 was the first release to have spatial support, and its implementation is still appropriate for common requirements. It is principally comprised of the LatLonType and PointType field types, the geofilt and bbox query parsers, and the geodist, dist, and sqedist functions. There are some other lesser used functions too.

LatLonType is geospatially oriented: latitude and longitude with geodetic math—notably the Haversine distance formula. PointType holds x and y coordinates on a classic 2D Cartesian plane with faster and simpler Euclidean geometry calculations, such as the Pythagorean Theorem for distance. PointType uniquely supports a variable number of dimensions; it's an obscure feature.

The underlying implementation resides in Solr itself, not Lucene-spatial. The overall approach is straightforward, notwithstanding some optimization tricks. The latitude and longitude (or x and y) are internally indexed into separate numeric fields. This approach doesn't support multivalued data, such as modeling businesses with multiple locations. If the query is just a rectangular filter, then it's pretty fast since it just needs to do a couple simple numeric range queries. If, on the other hand, the query shape is a point-distance (circle) shape, then the distance is calculated to potentially all points, depending on the circumstances; consequently, it isn't very scalable. Another thing to be aware of is that the FieldCache is used whenever the distance is required, which is for a point-distance query shape, and/or to sort by distance. The FieldCache holds all coordinate values in memory.

Some of the key characteristics have been summarized here:

  • Single-valued point indexed field
  • Point-radius (circle) query shape, or a rectangle
  • Good implementation for small datasets; bbox is scalable but geofilt (circle) is not

Configuration

Configuring LatLonType and PointType is easy. In the following excerpt taken from Solr's example schema, given the field name store, there will be two automatically generated fields named store_0_coordinate and store_1_coordinate, which you'll see in Solr's schema browser:

<field name="store" type="location" indexed="true"
  stored="true"/>
<fieldType name="location" class="solr.LatLonType"
  subFieldSuffix="_coordinate"/>
<dynamicField name="*_coordinate" type="tdouble"
  indexed="true" stored="false"/>

Tip

Use floats instead of doubles

Change the *_coordinate dynamic field to use type tfloat to use half the memory. Using a 32-bit float for latitude and longitude has precision no worse than 2.37 meters—plenty accurate for most use cases.

We'll show you how to index and search for LatLonType further in this chapter.

Spatial in Solr 4 – SpatialRecursivePrefixTreeFieldType

Solr 4 introduced SpatialRecursivePrefixTreeFieldType, referred to as RPT because it's a mouthful. This one field type can be configured for either geodetic or Euclidean math. The Solr code is not much more than an adapter to the technology that mostly lives in the Lucene spatial module, plus a dependency on Spatial4j for the shape implementations. Additionally, the third-party JTS Topology Suite is required for some shapes, such as polygons. The implementation scheme is based on a variable-depth hierarchical grid in which the world is decomposed into grid cells, which are in turn recursively decomposed into smaller grid cells, until the desired precision is reached.

Indexed shapes of basically any kind are represented completely on the grid, and there are scalable algorithms to find them in relation to a query shape at search time. It's more powerful and often faster at filtering than the comparatively simple LatLonType. For distance sorting/boosting, it includes a custom point cache, but this feature should be avoided as it doesn't scale well.

Spatial in Solr 4 – SpatialRecursivePrefixTreeFieldType

This picture shows how a polygon of France is decomposed into geohash grid cells of varying sizes. This was easily generated using a utility in the web demo of Spatial Solr Sandbox on GitHub that generates a KML file that Google Earth can render.

Its key characteristics can be summarized as follows:

  • Indexes basically any shape, not just points:
    • Shapes are approximated to a grid of configurable precision
    • Multivalued fields
  • Query by basically any shape, and with configurable precision—fast!
  • Query by Intersects, Contains, and IsWithin predicates
  • Multivalued point cache for distance sorting and relevancy
    • The implementation is not scalable

Tip

If you need the distance and you have more than perhaps a million documents, or if you have real-time search requirements, then you should use both spatial implementations: RPT for its fast filtering and LatLonType to get the distance. On the other hand, if RPT's features aren't useful to you and you have a million or less documents, then I recommend using LatLonType alone. In a future Solr release, see LUCENE-4698.

Configuration – basic

The RPT field type has many configuration options; we're going to stick to the ones needed for basic geospatial search requirements right now and address other options later. A basic configuration exists in Solr's example schema:

<fieldType name="location_rpt"
  class="solr.SpatialRecursivePrefixTreeFieldType"
  geo="true" distErrPct="0.025" maxDistErr="0.000009"
  units="degrees" />

To use this field type, you need to declare a field that references this field type.

  • geo: This is a Boolean that specifies whether this spatial mathematical model is based on a spherical earth model using latitude and longitude coordinates, or a flat 2D plane.
  • units: Ignore this; it must be set to degrees. In Solr 5, it will be renamed to distanceUnits and support values like kilometers. In Solr 4, it has no effect. When you see a reference to a measurement in degrees (such as maxDistErr below), know that it refers to 1/360th of the circumference of a sphere when geo=true. Assuming the earth is that sphere, this works out to 111.2 kilometers per degree (2πR/360, R=6,353km).

    Note

    The distance (d) parameter used by the geofilt and bbox parsers and the distances returned by the geodist function remain kilometers even if it's used with this field. Those parsers will be described shortly.

  • maxDistErr: This refers to the precision of the spatial data in terms of the maximum distance error, as measured in degrees. For example, if the greatest precision you care about is a meter (0.001km), set it as measured in degrees (0.001/111.2 = 0.000009). The actual maximum distance error will be smaller than this, since it's used to choose a grid level that meets or exceeds this precision.
  • distErrPct: If this number is non-zero, then non-point shapes will be approximated as a function of their overall size. This value is a fraction of a shape's approximate radius to be the allowable error, which in turn indicates to what grid level the shape will be maximally represented as. Under no circumstance can you have more precision than what maxDistErr yields. It applies to both indexed non-point shapes and query non-point shapes. There is a way to use a different value for a query shape if desired.

    Note

    If that's confusing, look again at the gridded map of France and look at the edge. If distErrPct is raised higher, the edge will eventually be even blockier. There are obviously scalability limitations if you attempt to index non-point shapes with a distErrPct of 0. Query shapes can handle it fine.

Indexing points

The simplest and most common spatial data to index is a point.

Note

If you have named locations (for example, Boston, MA) or addresses, then the data needs to be resolved to latitudes and longitudes using a geocoder. You can run your own with Gisgraphy—found at http://www.gisgraphy.com—or use a hosted service from Google, Yahoo, or others. Most hosted services have caps and/or fees.

When providing data to this field, it is formatted as a string with the latitude then the longitude, separated by a comma. Here's an example in Solr's Update-XML format:

<field name="store">43.17614,-90.57341</field>

Whether you use SolrJ, the DIH, or any other client/format, it appears as a string going in and coming out of Solr. If you have multiple points to index, simply supply them as additional values, as you would for any other multivalued field. LatLonType can't handle this but RPT will.

If the field type is PointType, then the dimension order is x,y. If the field type is RPT and geo is false, then you should supply points as x y (a space in between). If you use a comma in this circumstance, then the dimensions will be flipped. This is probably a bug, so don't rely on that behavior.

Filtering by distance or rectangle

Perhaps the most common geospatial need is to filter search results to those documents within a distance radius from a center point. If you are building an application in which the user is presented with a map, perhaps using Google Maps, then the center point is the center of the map the user is looking at and the distance radius is the distance from the center to the nearest map edge. Such a query is generally specified using a Solr filter query (fq parameter) leaving q open for the possibility of a combined keyword search if desired. Both the geofilt and bbox query parsers perform geospatial filtering. The geofilt query parser implements a point-distance based filter, a circle shape, whereas bbox uses the minimum bounding latitude-longitude box surrounding that circle. You can also specify an arbitrary rectangle using Solr's standard range syntax but using points.

Tip

LatLonType tuning

If you are using LatLonType, bbox and the rectangle range query syntax are faster than geofilt because they are able to make simple/scalable latitude and longitude numeric range searches, whereas geofilt computes the distance to every point it sees. If you need geofilt and if your spatial queries aren't very cacheable (tend to not be in your filter cache), then try adding these local-params: cache=false cost=100.

Here is a quick example based on Solr's example schema and dataset, showing the URL parameters needed to do the search:

q=*:*&fq={!bbox}&sfield=store&pt=45.15,-93.85&d=5

The parameters that geofilt and bbox require can be specified as either local-params (between the parser name and closing bracket) or standard request parameters, as shown above. To be clear, here's the same query using local-params:

q=*:*&fq={!bbox sfield=store pt=45.15,-93.85 d=5}

The advantage of not using local-params is that a combined distance sort can reuse the same parameters, as you'll see in a bit. Here are geofilt and bbox's parameters:

  • sfield: The name of the spatial field
  • pt: A latitude-comma-longitude pair for the center point of the query
  • d: The query distance from pt in kilometers (see sphere_radius)
  • sphere_radius: The radius of the sphere (Earth) in desired units for d. It defaults to the Earth's mean radius in kilometers.

To query by an arbitrary latitude-longitude rectangle, use a range query between the lower-left corner (smallest latitude and longitude) to the upper-right like so:

q=*:*&fq=store:[43.2,-94.1 TO 46.3,-92.0]

Note

If you use this syntax with LatLonType, it won't work if the dateline is crossed (bug SOLR-2609). RPT does not have that limitation.

Sorting by distance

Solr can sort search results by the distance between a document's point and another point supplied at query time. That point is typically the center point of a combined spatial filter. The typical way to spatially sort is to use Solr's ability to sort by a function query—a spatial function, geodist() in particular. The geodist() function calculates the geospatial distance (the "great circle distance") as calculated using the Haversine formula between a pair of points. The points are each taken from the first available of an argument, the pt parameter, or the sfield parameter. Any of these might be absent, but at least two must be specified. When a point is specified as an argument (within the parenthesis), it can either be a geospatial field name or a pair of arguments that are in latitude then longitude order. The latitude or longitude can be constants, or they may reference numeric fields' names. Here's an example of this:

&sort=geodist(store,42.4,-71.1) asc

Note

By design, these parameter names align with those for the geofilt and bbox query parsers, which pair well with geodist(). Consequently, it is rare to supply arguments if you are also spatially filtering.

Before the RPT field supported geodist() in Solr 4.5, a different, more awkward syntax was needed. You used to have to add a score=distance local-param to the spatial query and then put the spatial query into q to sort by score and filter at the same time. Or, even more awkwardly, you could sort by the query() function query. You can find examples of that syntax, if you must, online.

Returning the distance

It's often desirable to show the distance to users in search results. The most straightforward way to do this is to use a new Solr 4 feature that allows putting a function query into the field list: the fl param. Doing this is independent of any filtering or sorting that may exist, and so it's easy to use.

&fl=*,score,dist:geodist()&sfield=store&pt=45.15,-93.8

That example also gave a label of dist to this number in the search results; another label could have been chosen. Without a label, it is named by the actual function query itself in the results.

Boosting by distance

Perhaps you don't want to sort by distance, but you want to influence the relevancy (so-called boosting) by distance. Relevancy tuning is covered in the next chapter, so you will need to read that to better understand this section. The base formula to use for distance-based boosting should be the reciprocal. A Solr function query for the reciprocal used for boosting in this way should look like recip(x,1,c,c), where x is the distance and c is 1/10th of what I call the horizon distance. If you have a spatial filter in place, then use the radius of the query shape (center to edge or corner) as the horizon. Otherwise, pick an approximate distance at which any greater distance is unlikely to be relevant to the user. The result of the reciprocal function used in this way will range from 1.0 at the center of the query point to 0.1 at the horizon distance; and it approaches 0 further away.

For example, let's say you have a spatial filter of a 100-kilometer radius at 42.15° N, 93.85° W; you have a keyword search in place using the edismax query parser; and you want to multiply the distance boost to the score:

&defType=edismax&q=…&qf=…&sort=score desc
&fq={!geofilt}&sfield=store&pt=45.15,-93.85&d=100
&boost=recip(geodist(),1,10,10)

Memory and performance of distance sorting and boosting

Sorting or boosting by distance will require all indexed points to be in memory. Use floats instead of doubles, if possible, to reduce this footprint with LatLonType. The RPT field type still has a sub-par point cache implementation that has a high memory overhead on a per-point and per-document basis. If you can, use LatLonType for sorting instead of RPT, until that is rectified.

Sorting and boosting on a search that matches a great many documents can be slow. This is because the distance needs to be computed to each matching point, even if you only return the top 10 of them. Furthermore, the Haversine formula involves a fair amount of trigonometry that can computationally really add up when computed between many points. What can be done to help is use a cheaper distance function that is less accurate in a geospatial sense but is faster to compute. For example, if all the indexed data is just in one region of the world, then you could project the data onto a 2D plane with minimal distortion around the edges. The data would go in a pair of float fields and then you could replace use of geodist() with sqedist()—squared Euclidean distance. It takes four arguments, the x and y of the indexed fields, and another pair for the query point. Projecting data is well beyond the scope of this book. For further information, visit http://trac.osgeo.org/proj4j/.

Advanced spatial

This book sets aside a fair amount of space to cover spatial search in Solr for the needs of most applications, but there's more that couldn't be included:

  • Non-geodetic (for example, Euclidean) spatial such as PointType and dist(), and the implications of geo="false" on RPT.
  • Spatial Well Known Text (WKT) syntax: WKT is a standard for expressing a variety of shapes, including Polygons. Lucene/Solr can index and search by them. Related to this is including JTS with Solr.
  • Spatial predicates that include Intersects, Contains, IsWithin, and IsDisjointTo. When indexing non-point shapes, there are more applicable relations than what can occur with just points.
  • The BBoxField field type supports nearly every predicate and has area-overlap relevancy. It is new in Solr 4.10.
  • Indexing and searching on multi-value time durations: If you want to index time or other numeric durations in Solr, particularly when there's a variable number of them per document, then the only way to do this is to express the times as points in spatial. I bet you'll find this fascinating: http://wiki.apache.org/solr/SpatialForTimeDurations.

For more coverage of these spatial topics, see the Solr Reference Guide at https://cwiki.apache.org/confluence/display/solr/Spatial+Search.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.15.218.254