CHAPTER 5

image

Spatial Indexing

The previous chapter introduced an index for a specific data type, XML. That is not the only data type within SQL Server that has a specialized index that caters to the unique needs of the data type. Spatial data, introduced in SQL Server 2008, also comes with specific indexing needs because of the nature of the data stored within.

Spatial data storage advances the storage capabilities of SQL Server into an area that defines shape and location as a specific data type. Before these enhancements, spatial data was often stored as string or numeric values without meaning within the database and required cumbersome conversions and calculations to resolve the information into something meaningful. As part of the spatial data support, SQL Server introduced the GEOMETRY and GEOGRAPHY data types. These types support planar and geodetic data, respectively. Planar data is composed of lines, points, and polygons on a 2D plane, while geodetic is composed of the same but on a geodetic ellipsoid, a fancy term describing a map of Earth. In simple terms, you can look at these two data types like so: GEOMETRY is a flat representation of the shape described, and GEOGRAPHY encompasses a rounded global representation.

Spatial data indexes are unique in how they are created and interpreted. Each index is composed of a set of grids. These grids consists of a set of cells, laid out kind of like a square spreadsheet. The grids can be up to 16×16 and as small as 4×4. The cells within the grid contain the values that define the objects that define the spatial data being stored. There is a distinct difference between the GEOGRAPHY and GEOMETRY data types in this type of indexing. The GEOMETRY data type requires a bounding box, which is a limit on the size of the area defined by the index. The GEOGRAPHY data type does not have a bounding box since it’s basically bound by the size of the planet.

This chapter will explore spatial indexes, their behaviors, and their use within queries to help enhance the performance of your spatial data.

How Spatial Data Is Indexed

The grids that make up a spatial index are actually nested within each other. At the top layer, known as level 1, you can have, for example, a 4×4 grid. Each cell within that level 1 grid then contains another grid, consisting of the number of cells defined for that level, in this example 4×4. This second grid defines level 2. The cells in level 2 each have a grid that defines level 3, and the cells there contain another grid, level 4. Figure 5-1 shows how a GEOMETRY index consists of these four levels. The index is then made up of these four grids, each one of which is composed of a series of cells. This layering and grid hierarchy, called decomposing, is created when the index is created.

9781484211199_Fig05-01.jpg

Figure 5-1. Grid storage representation of the GEOMETRY index storage and cells

As many as four billion cells are possible, as shown in Figure 5-1. This is important when creating the index and determining what density to use at creation. Each layer, or level, can have a specified density. There are three levels of density (low = 4×4, medium = 8×8, and high = 16×16). If the density is omitted at the time an index is created, the default is medium. Manipulating the density is most commonly useful for tuning the actual space of the index. All layers may not be required at a high density. Save space by not using more density than you need.

All this is necessary because the actual storage of the information from within these grids is the same B-tree that is used to store standard indexes. But the definitions within the storage, and obviously the retrieval of those definitions, are radically different within spatial indexes than they are within standard indexes. To get the information into the B-tree, additional processing on top of the grid is necessary.

The next step in the indexing process that SQL Server performs is tessellation. Tessellation is the process that places or fits the objects into the grid hierarchy starting at layer 1. This process may require only the first layer of the grid but can require all four depending on the objects involved. Tessellation is essentially taking all the data from the spatial column and placing it onto the grids in cells while retaining each cell that is touched. The index then knows exactly how to go back to find the cells in each grid when a request is evaluated, using the B-tree.

So far, I’ve gone over how the cells in a grid are filled and how the overall tessellation process is achieved. Having the cells in a grid storage and tessellation process, however, doesn’t sit well in theory because there are openings for the cells to be misused or not used efficiently based on the extreme number of touched cells to retain. With the GEOMETRY data type and indexes created on it, the bounding box is required because SQL Server needs a finite space. Creating such a box is done by using the coordinates xmin, xmax and ymin, ymax. The result can be visualized as a square having the x-coordinate and y-coordinate of the lower-left corner and the x-coordinate and y-coordinate of the upper-right corner. What is most critical when determining the bounding box with an index on a GEOMETRY data type is to ensure that all the objects are within the bounding box, without making the bounding box excessively large, with lots and lots of empty cells, a balancing act. An index will be effective only for the objects, or shapes, within the bounding box. Not containing objects within a bounding box could severely impact performance and cause poor performance with spatial queries.

Furthermore, to retain the ability to use an index efficiently in the tessellation process, rules are applied. These rules are as follows:

  • Covering rule: The covering rule is the most basic rule applied in tessellation. Not to be confused with the common term of covering index, this rule states that any cell that is completely covered is not recorded individually for that object. Covered cells are counted for the object. Not storing covered cells saves processing and data storage time and space.
  • Cells-per-object rule: The cells-per-object rule is a more in-depth rule that applies a limit to the number of cells that can be counted for a specific object. In Figure 5-2, the circle shown covers two cells in level 1 and 14 in level 2. The circle is tessellated to the second layer because of a cells-per-object default of 16. If the circle did cover more than 16 cells at level 2, tessellation would not continue through to level 2. Since the object would cover a lot more than 16 cells at level 3, tessellation stops here. Tuning the cells per object can enhance the accuracy of an index. Tuning this value based on the data stored can be very effective. Given the importance of the cells-per-object rule, the setting is exposed in a dynamic management view, sys.spatial_index_tessellations. You will review this setting later in this chapter.

    9781484211199_Fig05-02.jpg

    Figure 5-2. Visual representation of an object and how many cells the object covers within the grid layers

  • Deepest cell rule: The last rule of the tessellation process is the deepest cell rule. As discussed, each layer of grids and the cells within them are referenced in each deeper layer. So in Figure 5-2, cells defined in level 2 are the only ones needed to completely refer to any other levels, in this case level 1, effectively. This rule cannot be broken and is built into the optimizer’s processing of retrieving the data from the index.

With the GEOGRAPHY type, there is the added challenge of projecting the form in a flattened representation through the tessellation process. This process first divides the GEOGRAPHY grid into two hemispheres. Each hemisphere is projected onto the facets of a quadrilateral pyramid and flattened, and then the two are joined into a non-Euclidean plane. Once this process is complete, the plane is decomposed into the aforementioned grid hierarchy.

Creating Spatial Indexes

The Create Spatial Index statement has most of the same options of a normal clustered or nonclustered index. However, there are specific options that are also required for this index type, as listed in Table 5-1.

Table 5-1. Spatial Index Options

Option Name

Description

USING

The USING clause specifies the spatial data type. This will be GEOMETRY_GRID or GEOGRAPHY_GRID and cannot be NULL.

WITH GEOMETRY_GRID,

GEOGRAPHY_GRID

The WITH options include the setting of the tessellation schema for either the GEOMETRY_GRID or the GEOGRAPHY_GRID based on the column data type.

BOUNDING_BOX

The BOUNDING_BOX is used in the GEOMETRY data type to define the bounding box of the cells. This option does not have defaults and must be specified when creating indexes on the GEOMETRY data type. The CREATE SPATIAL INDEX IDX_CITY_GEOM

(in Listing 5-1) shows the syntax for this option. Setting the BOUNDING_BOX is done by setting the xmin and ymin and xmax and ymax coordinates, like so: BOUNDING_BOX = (XMIN = xmin, YMIN = ymin, XMAX = xmax, YMAX = ymax).

GRIDS

The GRIDS option is used for altering the density of each grid layer. All layer defaults are medium density but can be altered to low or high to further tune spatial indexes and density settings.

Take the CREATE TABLE statement in Listing 5-1 as an example.

This table will consist of the primary key, the city name, and then a GEOMETRY column that holds map data for the city itself. The city’s density may affect tuning the cell-per-object rule in tessellation as well as the density of each layer in the grid hierarchy.

To index the CITY_GEOM column, the CREATE statement should in Listing 5-2 would be used with a grid-layer density of LOW for the first two layers and then MEDIUM and HIGH for third and fourth layers. This density change allows for tuning the object in the index and the covering cells as the layers go deeper in the grid. The cell-per-object setting is 24 maximum cells an object can cover. The bounding box coordinates are also set.

To utilize and test the index created, you will need to review the estimated and actual execution plans to determine whether the index has been used. In the case of spatial data, reviewing the actual results that a query will yield is also beneficial. SQL Server Management Studio has a built-in spatial data viewer that can be used for reviewing spatial data.

Listing 5-3 creates a table that can benefit from spatial indexing. The table is created to store ZIP codes and other data from the U.S. Census Bureau. This table will be created in the AdventureWorks2014 database.

The GEOM column will store the GEOMETRY data. This column will be used to query the data from SQL Server Management Studio to show the imaging that can be done from other applications.

Image Note  You can, as I’m going to do with this example, import data from the U.S Census Bureau. Go to http://1.usa.gov/1vhKa4F and download the county shape files from that page. This data can be loaded directly into SQL Server using a tool that can read and load the data into tables. There are a number of commercial products for doing this but few that are freely available. I’ll be using the ogr2ogr utility. You can read about what it is, how it works, and how to integrate it with SQL Server at http://bit.ly/1E1YdMe. This can be a cumbersome part of the process.

Reviewing the actual data from a query of a GEOMETRY data type column is not useful in the normal grid and tabular resultset from within SSMS. To take advantage of the spatial data features, using the Spatial Results tab in SSMS is much more effective. Given the table from Listing 5-3, a simple SELECT on the column geom can be executed, and the results of the SELECT statement will automatically generate the Spatial Results tab. For example, the query in Listing 5-4 will result in an image generated of the state of Washington, coding each County area in a different color.

Click the Spatial Results tab in the result window of SSMS to reveal the image generated by the query. You should see something like that in Figure 5-3.

9781484211199_Fig05-03.jpg

Figure 5-3. Output from spatial query against County data

The query in Listing 5-4 used a standard column, STATEFP, to filter the information so that you are looking only at counties within a particular state. Before using this data, though, it’s a good idea to ensure that you are working with only good shapes within the GEOM column. It is possible to have improper data stored, so cleaning your data may be required. To do this, you can use the MakeValid() method to modify any GEOMETRY instances, making them valid. According to the documentation from Microsoft, using the function can cause shapes to “shift slightly,” but it’s unclear the extent to which it may affect the shapes under your control. Executing Listing 5-5 will result in an update to any invalid GEOMETRY instances in the GEOM column.

The MakeValid() method should be used sparingly, and all invalid GEOMETRY instances that are found should be reviewed in a production setting. You should plan on reviewing your shapes after using the MakeValid function because it could possibly modify those shapes.

You can also use the spatial columns to filter the data being returned based on the behavior of locations and distances. Listing 5-6 shows an example of invoking one of the special methods that have been defined to work with spatial information. The query returns the ten counties closest to the county of Tulsa in Oklahoma (see Figure 5-4).

9781484211199_Fig05-04.jpg

Figure 5-4. Narrowing the results of the ZIP code data using STDistance()

The query from Listing 5-6 creates the execution plan shown in Figure 5-5.

9781484211199_Fig05-05.jpg

Figure 5-5. Execution plan generated from STDistance() without indexing

If you review the Spatial Results tab, the northeast corner of Oklahoma containing the ten counties will look like Figure 5-4. However, the query’s execution plan shown in Figure 5-5 is less than ideal, with an index scan on the clustered index created from the primary key and a high-cost Filter operation. With the use of the STDistance predicate, the query is a candidate for using an index on the GEOMETRY column, so an index should be added.

Supporting Methods with Indexes

With GEOMETRY and GEOGRAPHY data types, only certain methods are supported with the use of indexes. The STDistance() method will support indexing, which would benefit the query shown in Listing 5-6. Before diving deeply into indexing the query, the methods that do support indexing should be pointed out. These methods have rules in how respective predicates are written. The following is a list of supported methods for the GEOMETRY type:

  • GEOMETRY.STContains() = 1
  • GEOMETRY.STDistance() < number
  • GEOMETRY.STDistance() <= number
  • GEOMETRY.STEquals() = 1
  • GEOMETRY.STIntersects() = 1
  • GEOMETRY.STOverlaps() = 1
  • GEOMETRY.STTouches() = 1
  • GEOMETRY.STWithin() = 1

And the following are the supported methods for the GEOGRAPHY type:

  • GEOGRAPHY.STIntersects() = 1
  • GEOGRAPHY.STEquals() = 1
  • GEOGRAPHY.STDistance() < number
  • GEOGRAPHY.STDistance() <= number

For both GEOMETRY and GEOGRAPHY, to return any result that is not null, the first parameter and the second parameter must have the same spatial reference identifier (SRID), which is a spatial reference system based on a specific ellipsoid used to flatten or round the earth.

Recall that the query used in Figure 5-6 to return the counties around Tulsa uses the STDistance() method in the expression STDistance(@polygon) < 1. Based on the methods supported and analyzing the options and CREATE syntax for spatial indexing, you could use the INDEX CREATE statement shown in Listing 5-7 in an attempt to optimize the query.

Executing the query in Listing 5-6 results in the much different execution plan, shown in Figure 5-6. It results in a shorter duration when executing and returning the results, plus spatial results. The largest difference in the execution plan is the use of the index IDX_COUNTY_GEOM.

9781484211199_Fig05-06.jpg

Figure 5-6. Optimized details of a tuned execution plan using spatial data

You can see an overall improvement and more optimal execution plan from the creation of the spatial index. The index and optimal execution plan are good, but validating the actual improvement by checking the overall duration in execution time should not be skipped. Capturing the execution time using Extended Events, an overall review of the execution of the statement can be retrieved. In the case of the query that searches for the counties near Tulsa, the results with the index in place returned 500 milliseconds. Dropping the index and executing the same query returns 1,500 milliseconds for total execution time. This test is extremely basic, but it’s a solid foundation upon which you can begin to form a strategy for indexing existing spatial data to improve overall performance.

Understanding Statistics, Properties, and Information

Indexes in general have many data management views and functions that make the administration of the indexes much easier and more efficient than manual statistics gathering. With spatial indexes, there are additional catalog views that are added to assist in the unique settings and administration of them. In addition to the views, there are also some built-in procedures that you can invoke to get information about spatial indexes.

The Views

There are two catalog views: sys.spatial_index and sys.spatial_index_tessellation. The sys.spatial_index view provides the type and tessellation scheme as well as basic information about each spatial index. The spatial_index_type column returned by sys.spatial_index returns a 1 for GEOMETRY indexes and a 2 for GEOGRAPHY indexes. Listing 5-8 is an example query against the view, and Figure 5-7 shows the results.

9781484211199_Fig05-07.jpg

Figure 5-7. Querying sys.spatial_indexes and results showing IDX_WIZIP_GEOM index

Now query the sys.spatial_index_tessellation view to see the parameters of the index and the tessellation scheme. Listing 5-9 is the query, and Figure 5-8 shows the results.

9781484211199_Fig05-08.jpg

Figure 5-8. Querying sys.spatial_index_tessellations and partial results

Both of these catalog views can be joined on the object_id to become extremely useful for tuning and maintenance tasks. At times, it may prove effective to manipulate and re-create indexes as needed when the spatial data dictates.

The Procedures

As well as the additional catalog views, four other procedures have been provided internally for further analysis of the spatial indexes. These procedures return a complete listing of properties that are set on the indexes. The four procedures and their parameters are as follows:

sp_help_spatial_GEOMETRY_index [ @tabname =] 'tabname'
     [ , [ @indexname = ] 'indexname' ]
     [ , [ @verboseoutput = ] 'verboseoutput'
     [ , [ @query_sample = ] 'query_sample']

sp_help_spatial_GEOMETRY_index_xml [ @tabname =] 'tabname'
     [ , [ @indexname = ] 'indexname' ]
     [ , [ @verboseoutput = ]'{ 0 | 1 }]
     [ , [ @query_sample = ] 'query_sample' ]
     [ ,.[ @xml_output = ] 'xml_output' ]

sp_help_spatial_GEOGRAPHY_index [ @tabname =] 'tabname'
[ , [ @indexname = ] 'indexname' ]
[ , [ @verboseoutput = ] 'verboseoutput' ]
[ , [ @query_sample = ] 'query_sample' ]

sp_help_spatial_GEOGRAPHY_index_xml [@tabname = 'tabname'
[ , [ @indexname = ] 'indexname' ]
[ , [ @verboseoutput = ] 'verboseoutput' ]
[ , [ @query_sample = ] 'query_sample' ]
[ ,.[ @xml_output = ] 'xml_output' ]

Listing 5-10 is an example showing how to execute these stored procedures. The example returns information about the GEOMETRY index IDX_COUNTY_GEOM created earlier in Listing 5-7. Figure 5-9 shows the results.

9781484211199_Fig05-09.jpg

Figure 5-9. sp_help_spatial_GEOMETRY_index example and results (results may vary)

This information can be useful for adjusting the index to make it function better. The information returned functions in a similar way to the statistics for an index. You can see how many objects are available in each of the levels of the index. You can also see how it returns data that match the provided query sample. Seeing that a particular number of intersecting objects match the query sample shows you whether a given object will be returned by the index. You can also see the percentage of objects in the index that are not returned from the query sample by comparing the objects in the index to the ones that match. All this helps you understand how well the index is meeting your query requirements.

Tuning Spatial Indexes

As you saw in Listing 5-7, when a spatial index is created, you have some options. Manipulating these options allows you to adjust the behavior of your spatial indexes. Some experimentation will be necessary to arrive at the right set of options for the optimal behavior of your index. Use a combination of the execution plan and the query performance metrics just as you did earlier in this chapter.

For a GEOMETRY column, you can add a bounding box to the index. This limits the area that the index covers, which can allow you to create an index that can help satisfy certain query criteria better than a general index. For example, if I change the bounding box and re-create the index as in Listing 5-11, I see about a 10 percent reduction in execution time.

But, by changing the bounding box, I am excluding some objects from the index. This could lead to poor performance if other parameters are used to query the counties in the United States. As you can see, there is no easy answer here.

Another adjustment you can make is to change the grids of the index. The choices made in the examples so far are a fairly standard choice if you’re not sure how your data is distributed and if you’re unsure of how many matches you’re likely to get from any one query. If your query has a higher percentage of inclusive results, a different distribution on the grids can result in higher speed. It’s largely a question of experimentation. But, just as with the bounding box, changing the grid distribution for one data set could hurt another. You’ll have to perform rigorous testing to get this right.

Using the same example, if I were to make the level 1 grid into a HIGH detailed grid, I lose 10 percent of my performance, making the query run slower. Changing it to MEDIUM neither benefited nor hurt the execution time. In this case, adjusting the grid levels in any combination didn’t result in a significant improvement in speed, but having the HIGH level of detail on either level 1 or level 2 of the grid would negatively impact performance. With this experiment complete, I would choose to leave the default grids alone in this instance.

Restrictions on Spatial Indexes

Spatial indexes provide some unique features and restrictions. The following is a comprehensive listing of restrictions for spatial indexing:

  • A spatial index can be created only on a column of type GEOMETRY or GEOGRAPHY.
  • Spatial indexes can be defined only on a table that has a primary key. The maximum number of primary key columns on the table is 15.
  • The maximum size of index key records is 895 bytes. Larger sizes raise an error.
  • The use of Database Tuning Advisor is not supported.
  • You cannot perform an online rebuild of a spatial index.
  • Spatial indexes cannot be specified on indexed views.
  • You can create only up to 249 spatial indexes on any of the spatial columns in a supported table. Creating more than one spatial index on the same spatial column can be useful, for example, to index different tessellation parameters in a single column.
  • You can create only one spatial index at a time.
  • An index build of spatial data cannot use available process parallelism.

Summary

Indexing spatial data is a complicated form of data storage and manipulation. This chapter covered the main points of how spatial data is processed and stored to help in managing and reviewing an implementation of the spatial data types in databases.

With spatial indexes, you now have the ability to quickly determine whether points lie within regions or whether regions overlap other regions. Instead of having to fully render each spatial artifact, spatial indexes allow queries to quickly calculate the results of the spatial function. Remember to always examine the execution plan to ensure that the spatial index is actually being used.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.147.79.84