Down the index rabbit-hole

So far in this chapter we have looked at the introduction of features and functionalities that were available in disk-based objects to the In-Memory OLTP engine. The improvements in indexing and the ability to alter your indexing strategy with memory-optimized tables without dropping it are particularly attractive. With that in mind, we will now spend some time investigating how the indexes are treated inside the storage engine. This will include a journey through the system catalogs and Dynamic Management Views (DMVs) to allow us to see how index information can be queried.

We will now take this opportunity to explore what the alteration of a bucket count can have on the data in our demonstration table.

Let's begin by rolling back two of our changes from earlier in the chapter by dropping the HSH_ANewColumn hash index, dropping the two columns ANewColumn and AnotherColumn, and finally re-creating a HSH_LoginTime hash index, using an extremely irregular and sub-optimal value of 2, as shown in the following code:

 
USE InMemoryTest 
GO 
ALTER TABLE dbo.InMemoryTable DROP INDEX HSH_ANewColumn 
GO 
ALTER TABLE dbo.InMemoryTable DROP COLUMN ANewColumn 
GO 
ALTER TABLE dbo.InMemoryTable DROP COLUMN AnotherColumn 
GO 
ALTER TABLE dbo.InMemoryTable ADD INDEX HSH_LoginTime NONCLUSTERED HASH (LoginTime) WITH (BUCKET_COUNT=2); 
GO

We will begin by taking a look at the system catalog views we've used in past versions of SQL Server to inspect the table and indexes we have created so far, using the following code:

USE InMemoryTest 
GO  
SELECT OBJECT_NAME(i.object_id) AS [table_name], 
    COALESCE(i.name,'--HEAP--') AS [index_name], 
    i.index_id, 
    i.type, 
    i.type_desc 
FROM sys.indexes AS i 
WHERE i.object_id = OBJECT_ID('InMemoryTable'); 

table_name      index_name           index_id  type  type_desc
--------------- ------------------ ----------- ---- ------------------
InMemoryTable    --HEAP--               0       0     HEAP
InMemoryTable      HSH_UserName         2       7     NONCLUSTERED HASH
InMemoryTable      PK_InMemory          3       2     NONCLUSTERED
InMemoryTable      HSH_LoginTime        4       7     NONCLUSTERED HASH

The results of this first query against sys.indexes shows us that we have a HEAP hash index(the data pages of the memory-optimized table), as well as the three additional indexes we have already created. Particularly noteworthy here are the two non-clustered hash indexes that we created: HSH_UserName and HSH_LoginTime. Both appear as index type 7 and with the index description NONCLUSTERED HASH. This should come as no great surprise, but it shows us that the old system views have been extended to include information regarding memory-optimized tables and indexes. The previous listing may be executed without the WHERE clause to see that the details for both memory-optimized and disk-based tables and indexes can be queried simultaneously.

We already know that hash indexes are exclusive to memory-optimized tables. However, if we go through the demo code so far, we also know that the primary key of this table is also memory-optimized. The query referencing sys.indexes, however, only shows us that the primary key is a non-clustered index. So, we have no way of knowing if this non-clustered index is a memory-optimized or disk-based index.

The information about whether a table is memory-optimized or disk-based is stored in the newly extended sys.tables catalog view, as shown in the following code:

SELECT COALESCE(i.name,'--HEAP--') AS [index_name], 
    i.index_id, 
    i.type, 
   t.is_memory_optimized, 
   t.durability, 
   t.durability_desc 
FROM sys.tables t 
    INNER JOIN sys.indexes AS i 
        ON i.object_id = t.object_id 
WHERE t.name = 'InMemoryTable' 

index_name index_id type is_memory_optimized durability durability_desc
---------- -------- ----- ------------------ ----------- ------------
--HEAP--      0       0      1                  0       SCHEMA_AND_DATA
HSH_UserName  2       7      1                  0       SCHEMA_AND_DATA
PK_InMemory   3       2      1                  0       SCHEMA_AND_DATA
HSH_LoginTime 4       7      1                  0       SCHEMA_AND_DATA

As the results of the query show, the table and all indexes are memory-optimized (is_memory_optimized = 1). We are also able to determine which durability option has been chosen for the table.

With these two queries, we are able to take our first look at the index information for memory-optimized tables. However, we are not able to see any particular information regarding our hash index (bucket counts or chain lengths). To access that information, we leave the general system catalog views sys.indexes and sys.tables behind and venture forwards into more feature specific catalog views.

Our first port of call is the obviously named sys.hash_indexes , which displays information on any hash index in a database. The sys.hash_indexes catalog view has the same columns as the sys.indexes catalog view with the exception of the bucket_count column. This column displays the bucket count for the index from creation time, or the last rebuild, with the BUCKET_COUNT option supplied:

USE InMemoryTest 
GO  
SELECT hi.name AS [index_name], 
   hi.index_id, 
   hi.type, 
   hi.bucket_count 
FROM sys.hash_indexes AS hi; 

index_name     index_id   type bucket_count
------------- ----------- ---- ------------
HSH_UserName      2         7     16384
HSH_LoginTime     4         7       2

This catalog view only provides us with information about the index structure from the time of creation, it does not provide us with any details on chain length inside each bucket (the number of rows that are assigned to each bucket via the hashing algorithm).

To access this more important and relevant information, we must access another catalog view specifically created for the memory-optimized indexes, sys.dm_db_xtp_hash_index_stats

By joining the system sys.dm_db_xtp_hash_index_stats catalog view to sys.indexes (or even directly to sys.hash_indexes), we are able to finally see our bucket count and our usage statistics. We are now able to see how many buckets are empty, the average chain length, and the maximum chain length for each hash index. The following code illustrates this technique:

SELECT COALESCE(i.name, '--HEAP--') AS [index_name], 
    i.index_id, 
    i.type, 
    ddxhis.total_bucket_count AS [total_buckets], 
    ddxhis.empty_bucket_count AS [empty_buckets], 
    ddxhis.avg_chain_length, 
    ddxhis.max_chain_length 
FROM sys.indexes AS i 
    LEFT JOIN sys.dm_db_xtp_hash_index_stats AS ddxhis 
        ON ddxhis.index_id = i.index_id 
           AND ddxhis.object_id = i.object_id 
WHERE i.object_id = OBJECT_ID('InMemoryTable');

We will want to monitor the statistics of our hash indexes to see whether the bucket count is optimal or not. As each bucket takes up 8 bytes of memory, the higher the bucket count, the more memory we will use for the index. A higher bucket count also results in a slower traversal of the index when performing index scans. So, if we note a high number of empty buckets, we may consider reducing the bucket count to improve potential scan speeds. In general, it is best to avoid performing index scans on a hash index, as the scan must first scan the hash buckets and then the chains of rows for each bucket. This has considerable overhead versus scanning a non-clustered index, which doesn't have the added complexity of the bucket structure to traverse.

Equally, if we see a high average or maximum chain length, this could indicate that there are not enough buckets available for the data coming into the index. Higher chain lengths can also impact performance when performing seeks or inserts.

The general idea of a hash index is to have just enough buckets to cover all possible unique values going into the index (or as near to this as possible), so that when a query seeks a single hash value, the storage engine can retrieve the one (or very few) row(s) that belong to that hash value.

The recommendation from Microsoft is to aim for an empty bucket percentage of 33% or higher. If the bucket count matches the number of unique values for the index, the hashing algorithm's distribution will cause around 33% of buckets to remain empty.

The recommendation from Microsoft is to aim for a chain length of less than 10, with the ideal value being 1—one row per hash value and therefore bucket.

If you have wildly non-unique values with many duplicates, a non-clustered index would, more than likely, be a better choice versus a hash index.

We can see from the following code how the empty buckets and chain lengths are affected by inserting a handful of rows into the test table and re-running the previous query:

USE InMemoryTest 
GO  
INSERT INTO dbo.InMemoryTable 
        ( UserId, UserName , LoginTime, LoginCount ) 
VALUES 
         (2, 'Donald Duck'    , '2016-01-02', 1), 
         (3, 'Steve Jobs'     , '2016-01-03', 1), 
         (4, 'Steve Ballmer'  , '2016-01-04', 1), 
         (5, 'Bill Gates'     , '2016-01-05', 1), 
         (6, 'Ted Codd'       , '2016-01-06', 1), 
         (7, 'Brian Kernighan', '2016-01-07', 1), 
         (8, 'Dennis Ritchie' , '2016-01-08', 1); 
GO  
SELECT COALESCE(i.name, '--HEAP--') AS [index_name], 
    i.index_id, 
    i.type, 
    ddxhis.total_bucket_count AS [total_buckets], 
    ddxhis.empty_bucket_count AS [empty_buckets], 
    ddxhis.avg_chain_length, 
    ddxhis.max_chain_length 
FROM sys.indexes AS i 
    LEFT JOIN sys.dm_db_xtp_hash_index_stats AS ddxhis 
        ON ddxhis.index_id = i.index_id 
           AND ddxhis.object_id = i.object_id 
WHERE i.object_id = OBJECT_ID('InMemoryTable');

Here, we can see that the two hash indexes now have slightly different information in them. We now have a total of eight rows in the table (and the indexes). HSH_UserName has 16384 buckets with a hash function applied to the UserName column. As we inserted seven new rows, each with a unique value for UserName, they will all be stored in an empty bucket. This leaves the average and maximum chain lengths at 1. The data inserted into LoginTime was also unique for each of the seven rows. However, there are only two buckets assigned to HSH_LoginTime. This results in the seven rows hashing to one of two possible values and being placed in one of the two available buckets. The average and maximum chain lengths are then no longer 1.

This example is simplified, but allows us to recognize that the implementation of the hash index on LoginTime requires attention. Either the bucket count needs raising, or the choice of a hash index may be incorrect. Equally, the hash index on UserName provides ideal chain lengths, but has an excessive empty bucket count. The index therefore uses more memory than is necessary and may need the bucket count reducing to release memory for other memory-optimized objects. Now that we know that bucket counts and chain lengths can affect the amount of memory required to store memory-optimized tables and indexes, we should also take a quick look at how the two hash indexes in the example can differ by using the following code:

USE InMemoryTest 
GO  
SELECT COALESCE(i.name, '--HEAP--') AS [index_name], 
    i.index_id, 
    i.type, 
    c.allocated_bytes, 
    c.used_bytes 
FROM sys.dm_db_xtp_memory_consumers c 
    JOIN sys.memory_optimized_tables_internal_attributes a 
        ON a.object_id = c.object_id 
           AND a.xtp_object_id = c.xtp_object_id 
    LEFT JOIN sys.indexes i 
        ON c.object_id = i.object_id 
           AND c.index_id = i.index_id 
WHERE c.object_id = OBJECT_ID('InMemoryTable') 
      AND a.type = 1 
ORDER BY i.index_id; 

index_name  index_id type  allocated_bytes used_bytes
---------- -------- ----  --------------- -------------
--HEAP--      NULL   NULL     131072         696
HSH_UserName  2      7        131072         131072
PK_UserId     3      2        196608         1368
HSH_LoginTime 4      7        16             16

The results show an interesting fact. The index HSH_LoginTime was created with two buckets and has 16 bytes allocated and used. This makes sense when we think about how each bucket takes up 8 bytes of memory. However, HSH_UserName takes up 128 MB of memory even though the index has 10,000 buckets. We would expect ~78 MB (10,000 * 8 bytes = 80,000 bytes), however the memory allocation follows base—2 size allocations. As the size is larger than 64 MB, the next largest size in base—2 is 128, and therefore 128 MB are allocated. We can test this theory by altering the number of buckets to a value low enough for the size to be below 64 (the next step down the base—2 scale), as shown in the following code:

USE InMemoryTest 
GO  
ALTER TABLE dbo.InMemoryTable ALTER INDEX HSH_UserName REBUILD WITH (BUCKET_COUNT=8000); 
GO 
SELECT COALESCE(i.name, '--HEAP--') AS [index_name], 
    i.index_id, 
    i.type, 
    c.allocated_bytes, 
    c.used_bytes 
FROM sys.dm_db_xtp_memory_consumers c 
    JOIN sys.memory_optimized_tables_internal_attributes a 
        ON a.object_id = c.object_id 
           AND a.xtp_object_id = c.xtp_object_id 
    LEFT JOIN sys.indexes i 
        ON c.object_id = i.object_id 
           AND c.index_id = i.index_id 
WHERE c.object_id = OBJECT_ID('InMemoryTable') 
      AND a.type = 1 
ORDER BY i.index_id; 

index_name  index_id type allocated_bytes used_bytes
---------- --------  ---- --------------- -------------
--HEAP--      NULL   NULL      131072        760
HSH_UserName   2      7        65536         65536
PK_InMemory    3      2        196608        1368
HSH_LoginTime  4      7        16            16

We now see that the index HSH_UserName has 64 MB allocated, although 8000 buckets equates to ~62.5 MB. This gives us further information regarding memory use for indexing memory-optimized tables, and tells us that we have a certain amount of overhead for the storage of indexes in relation to the number of buckets we assign to a hash index.

Table of Contents for Down the index rabbit-hole

Create new playlist

Sign In

Sign Up

Table of Contents for
Down the index rabbit-hole