© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
G. FritcheySQL Server 2022 Query Performance Tuninghttps://doi.org/10.1007/978-1-4842-8891-7_20

20. Graph Databases

Grant Fritchey1  
(1)
Grafton, MA, USA
 

Introduced in SQL Server 2017, the Graph database stores nodes and edges within SQL Server, allowing you to use T-SQL to query graph data. Graph data is specifically oriented around many-to-many relationships. Graph data doesn’t lend itself to traditional relational structures but instead is focused on hierarchies and relationships past those more traditional structures. Graph data is separated from either traditional analytical data, better served by columnstore indexes, or from OLTP systems, better served by rowstore indexes. However, when you have a data set that consists of complex hierarchical data, focused on the relationships between that data, the graph database comes into play.

In this chapter, I’ll cover the following:
  • Introduction to Graph databases

  • Querying graph data

  • Performance considerations of graph data

Introduction to Graph Databases

You’re allowed to create one graph per database. A graph consists of node and edge tables. Node tables describe an object. Edge tables describe the relationships between the values in the node tables. For this chapter, I’m going to create this simple graph data set in Figure 20-1.

A diagram depicts the relationships of radio operators, frequency, and radio to call and uses. The first three items are the nodes and the last two are the edges.

Figure 20-1

A graph relationship with nodes and edges

In the example, I’ll have three nodes, or objects:
  • Radio operators

  • Frequencies

  • Radios

This will be with two edges, or relationships:
  • Call

  • Uses

In short, many radio operators will use many frequencies using radios to call out to make many contacts with each other.

While the definition is that we’re going to create a database, in fact, we’re mainly just working with node and edge tables, more or less like regular tables. We could easily build out a structure using traditional relational mechanisms to store the data for functionality outlined previously. However, graph queries give us a much more efficient way to query that data.

You can easily see deeper relationships in the data, determining the number of different steps between radio operators who haven’t contacted each other but have contacted others for example.

To work on this, I’m going to create a database along with a set of node and edge tables. Listing 20-1 is the start.
CREATE DATABASE RadioGraph;
GO
USE RadioGraph;
GO
--Create schema to hold different structures
CREATE SCHEMA grph;
GO
CREATE SCHEMA rel;
GO
Listing 20-1

Creating a database for Graph data

Notice that there’s nothing special about the database. All the work will be done with tables and T-SQL. I’ve created two schemas in the database:
  • grph: Will hold the Graph structures

  • rel: Will hold a relational equivalent so that we can compare some query performance

Listing 20-2 shows the three node tables.
CREATE TABLE grph.RadioOperator
(
    RadioOperatorID int IDENTITY(1, 1) NOT NULL,
    OperatorName varchar(50) NOT NULL,
    CallSign varchar(9) NOT NULL
) AS NODE;
CREATE TABLE grph.Frequency
(
    FrequencyID int IDENTITY(1, 1) NOT NULL,
    FrequencyValue decimal(6, 3) NOT NULL,
    Band varchar(12) NOT NULL,
    FrequencyUnit varchar(3) NOT NULL
) AS NODE;
CREATE TABLE grph.Radio
(
    RadioID int IDENTITY(1, 1),
    RadioName varchar(50) NOT NULL
) AS NODE;
Listing 20-2

Creating node tables

As you can see, the only thing special I do while creating the tables is use the AS NODE command. Other than that definition, there’s nothing special that you have to do. Creating the edge tables is quite a lot different (Listing 20-3).
CREATE TABLE grph.Calls AS EDGE;
CREATE TABLE grph.Uses AS EDGE;
Listing 20-3

Creating edge tables

In this case, I don’t define columns of any kind. They are created for me. Additionally, the node tables also get new columns. Figure 20-2 shows the new folder, graph tables, and the associated tables with columns.

A tree folder for RadioGraph displays two subfolders, database diagrams, and tables. The subfolder tables have many more subfolders and files under them.

Figure 20-2

Tables inside of a graph database

For the node table, grph.Frequency highlighted in Figure 20-2, you can see that a $graph_id and a $node_id column have been added. We’ll be using the node_id quite a bit. The edge table, in this case, grph.Calls, has a bunch of columns created, despite our CREATE TABLE statement not including any. Before I explain the $node_id, $from_id, and $to_id columns, let’s load some data.

First, I’m going to load the node tables using Listing 20-4.
INSERT INTO grph.RadioOperator
(
    OperatorName,
    CallSign
)
VALUES
('Grant Fritchey', 'KC1KCE'),
('Bob McCall', 'QQ5QQQ'),
('Abigail Serrano', 'VQ5ZZZ'),
('Josephine Wykovic', 'YQ9LLL');
INSERT INTO grph.Frequency
(
    FrequencyValue,
    Band,
    FrequencyUnit
)
VALUES
(14.250, '20 Meters', 'MHz'),
(145.520, '2 Meters', 'MHz'),
(478, '630 Meters', 'kHz'),
(14.225, '20 Meters', 'MHz'),
(14.3, '20 Meters', 'MHz'),
(7.18, '40 Meters', 'MHz');
INSERT INTO grph.Radio
(
    RadioName
)
VALUES
('Yaesu FT-3'),
('Baofeng UV5'),
('Icom 7300'),
('Raddiodity GD-88'),
('Xiegu G90');
Listing 20-4

Adding data to Node tables

The only really important point to understand here is that I’m not populating the $node_id column. It’s taken care of similar to an IDENTITY column by internal processes.

Note

Just as an aside, my call sign is real. The other two call signs are fake along with the names of the people.

I am going to use the $node_id column when I go to load the edge table as shown in Listing 20-5.
INSERT INTO grph.Uses
(
    $from_id,
    $to_id
)
VALUES
(
    (
        SELECT $node_id FROM grph.RadioOperator AS ro WHERE ro.RadioOperatorID = 1
    ),
    (
        SELECT $node_id FROM grph.Radio AS r WHERE r.RadioID = 1
    ));
Listing 20-5

Adding data to an edge table

Here’s where graph databases get a little tricky. I’m going to define the $from_id and $to_id however I want, but I have to do it consistently. In my case, I’m saying that a RadioOperator uses a radio. So the $from_id is the $node_id from the RadioOperator table. In this case, for RadioOperatorID = 1.

We can load all sorts of relationships within the Uses table. In the next code listing, I also make the association between radios and frequencies to understand which radio uses which frequency. However, you must be consistent. If you accidentally reverse the $from_id and $to_id, you’re creating a new type of relationship, not simply a new relationship within the table.

Listing 20-6 loads some more data into the grph edge tables.
INSERT INTO grph.Uses
(
    $from_id,
    $to_id
)
VALUES
(
    (
        SELECT $node_id FROM grph.RadioOperator AS ro WHERE ro.RadioOperatorID = 1
    ),
    (
        SELECT $node_id FROM grph.Radio AS r WHERE r.RadioID = 2
    )),
(
    (
        SELECT $node_id FROM grph.RadioOperator AS ro WHERE ro.RadioOperatorID = 1
    ),
    (
        SELECT $node_id FROM grph.Radio AS r WHERE r.RadioID = 3
    )),
(
    (
        SELECT $node_id FROM grph.RadioOperator AS ro WHERE ro.RadioOperatorID = 2
    ),
    (
        SELECT $node_id FROM grph.Radio AS r WHERE r.RadioID = 2
    )),
(
    (
        SELECT $node_id FROM grph.RadioOperator AS ro WHERE ro.RadioOperatorID = 3
    ),
    (
        SELECT $node_id FROM grph.Radio AS r WHERE r.RadioID = 4
    )),
(
    (
        SELECT $node_id FROM grph.RadioOperator AS ro WHERE ro.RadioOperatorID = 1
    ),
    (
        SELECT $node_id FROM grph.Radio AS r WHERE r.RadioID = 5
    )),
(
    (
        SELECT $node_id FROM grph.RadioOperator AS ro WHERE ro.RadioOperatorID = 3
    ),
    (
        SELECT $node_id FROM grph.Radio AS r WHERE r.RadioID = 1
    )),
(
    (
        SELECT ro.$node_id FROM grph.RadioOperator AS ro WHERE ro.RadioOperatorID = 4
    ),
    (
        SELECT r.$node_id FROM grph.Radio AS r WHERE r.RadioID = 1
    ));
--edges for radio uses frequency
INSERT INTO grph.Uses
(
    $from_id,
    $to_id
)
VALUES
(
    (
        SELECT $node_id FROM grph.Radio AS r WHERE r.RadioID = 1
    ),
    (
        SELECT $node_id FROM grph.Frequency AS F WHERE F.FrequencyID = 2
    )),
(
    (
        SELECT $node_id FROM grph.Radio AS r WHERE r.RadioID = 2
    ),
    (
        SELECT $node_id FROM grph.Frequency AS F WHERE F.FrequencyID = 2
    )),
(
    (
        SELECT $node_id FROM grph.Radio AS r WHERE r.RadioID = 1
    ),
    (
        SELECT $node_id FROM grph.Radio AS r WHERE r.RadioID = 2
    ));
--edges for calls
INSERT INTO grph.Calls
(
    $from_id,
    $to_id
)
VALUES
(
    (
        SELECT $node_id FROM grph.RadioOperator AS ro WHERE ro.RadioOperatorID = 1
    ),
    (
        SELECT $node_id FROM grph.RadioOperator AS ro WHERE ro.RadioOperatorID = 4
    )),
(
    (
        SELECT $node_id FROM grph.RadioOperator AS ro WHERE ro.RadioOperatorID = 1
    ),
    (
        SELECT $node_id FROM grph.RadioOperator AS ro WHERE ro.RadioOperatorID = 3
    )),
(
    (
        SELECT $node_id FROM grph.RadioOperator AS ro WHERE ro.RadioOperatorID = 2
    ),
    (
        SELECT $node_id FROM grph.RadioOperator AS ro WHERE ro.RadioOperatorID = 3
    )),
(
    (
        SELECT $node_id FROM grph.RadioOperator AS ro WHERE ro.RadioOperatorID = 3
    ),
    (
        SELECT $node_id FROM grph.RadioOperator AS ro WHERE ro.RadioOperatorID = 4
    )),
(
    (
        SELECT $node_id FROM grph.RadioOperator AS ro WHERE ro.RadioOperatorID = 3
    ),
    (
        SELECT $node_id FROM grph.RadioOperator AS ro WHERE ro.RadioOperatorID = 1
    )),
(
    (
        SELECT $node_id FROM grph.RadioOperator AS ro WHERE ro.RadioOperatorID = 3
    ),
    (
        SELECT $node_id FROM grph.RadioOperator AS ro WHERE ro.RadioOperatorID = 2
    ));
Listing 20-6

Loading the remaining data into the edge tables

It’s not a large data set, but it does take up quite a bit of space here.

With the data defined, we can query the information. This query shows which of our operators has called out to other operators (Listing 20-7).
SELECT Calling.OperatorName,
       Calling.CallSign,
       Called.OperatorName,
       Called.CallSign
FROM grph.RadioOperator AS Calling,
     grph.Calls AS C,
     grph.RadioOperator AS Called
WHERE MATCH(Calling-(C)->Called);
Listing 20-7

Querying graph tables

This is a new way of querying data, so we should break things down a bit. First up, your FROM clause simply lists the tables. We’re not defining JOIN operations using an ON clause. Instead, we have a new command, MATCH. In that, we’re looking for where the first table RadioOperator, aliased as Calling, through the edge table of grph.Calls, aliased as C, matches values in RadioOperator again, but this time, aliased as Called. The results look like Figure 20-3.

A table has four columns and six rows. The headers operator name and call sign are repeated twice, and there are six row entries.

Figure 20-3

Which radio operators have called out to which other operators

This is just one example of how a graph database works. However, it’s not an exciting example. If we take Listing 20-8 and create and load traditional tables, I can show the same behavior.
CREATE TABLE rel.RadioOperator
(
    RadioOperatorID int IDENTITY(1, 1) NOT NULL,
    OperatorName varchar(50) NOT NULL,
    CallSign varchar(9) NOT NULL
);
CREATE TABLE rel.RadioOperatorCall
(
    CallingOperatorID INT NOT NULL,
    CalledOperatorID INT NOT NULL
);
INSERT INTO rel.RadioOperator
(
    OperatorName,
    CallSign
)
VALUES
('Grant Fritchey', 'KC1KCE'),
('Bob McCall', 'QQ5QQQ'),
('Abigail Serrano', 'VQ5ZZZ');
INSERT INTO rel.RadioOperatorCall
(
    CallingOperatorID,
    CalledOperatorID
)
VALUES
(1, 2),
(1, 3),
(2, 3),
(3, 1),
(3, 2),
(3, 4);
Listing 20-8

Building a relational table to do the same thing

With the same data loaded into a many-to-many relationship, we can then write a query like Listing 20-9.
SELECT Calling.OperatorName,
       calling.CallSign,
       CALLED.OperatorName,
       CALLED.CallSign
FROM rel.RadioOperator AS Calling
    JOIN rel.RadioOperatorCall AS roc
        ON Calling.RadioOperatorID = roc.CallingOperatorID
    JOIN rel.RadioOperator AS CALLED
        ON roc.CalledOperatorID = CALLED.RadioOperatorID;
Listing 20-9

Querying the relational table

The results from Listing 20-7 and Listing 20-9 are the same. However, the performance is quite different with the graph query running in 219mcs and the relational query returning in over 1ms, five times slower.

Querying Graph Data

We’ve already seen one example of a query for graph data in Listing 20-7, but getting relationships like that is just the start of what can be done with Graph data. The syntax is relatively straightforward. As was already demonstrated in Listing 20-7, there are no substantive changes to the SELECT and FROM clauses. The only real change there is that you won’t use an ON clause to put the relationships together. Instead, within the WHERE clause, you’ll use the new MATCH function along with ASCII-art to define the relationship. Here is the WHERE clause again:
WHERE MATCH(Calling-(C)->Called);
I’m defining a node, Calling, to node, Called, relationship through the edge table, C. The direction, consisting of a dash on either side of the edge table, and a dash and arrow showing the direction of the relationship on the other side of the graph table. I could rewrite it this way:
WHERE MATCH(Called<-(C)-Calling);
The use of the arrow is what is meant by ASCII-art. You can even extend the relationships and add filtering. For example, Listing 20-10 shows how a query shows all the people I called and who they called.
SELECT Calling.OperatorName,
       Calling.CallSign,
       CALLED.OperatorName,
       CALLED.CallSign,
       TheyCalled.OperatorName,
       TheyCalled.CallSign
FROM grph.RadioOperator AS Calling,
     grph.Calls AS C,
     grph.RadioOperator AS CALLED,
     grph.Calls AS C2,
     grph.RadioOperator AS TheyCalled
WHERE MATCH(Calling-(C)->CALLED-(C2)->TheyCalled)
            AND Calling.RadioOperatorID = 1;
Listing 20-10

Multiple relationships

I can add traditional filtering to limit the data returned. It all functions as you would expect. The results are in Figure 20-4.

A table has six columns and four rows. The headers operator name and call sign are repeated thrice, and there are four row entries.

Figure 20-4

The people called by the people I called

You can also write a query to find everyone who called the same person as shown in Listing 20-11.
SELECT AllCalled.CallSign AS WasCalledBy,
       WeCalled.CallSign AS CALLED,
       TheyCalled.CallSign AS AlsoCalled
FROM grph.RadioOperator AS AllCalled,
     grph.RadioOperator AS WeCalled,
     grph.RadioOperator AS TheyCalled,
     grph.Calls AS C,
     grph.Calls AS C2
WHERE MATCH(Wecalled-(C)->Allcalled<-(C2)-TheyCalled)
ORDER BY WasCalledBy ASC;
Listing 20-11

Relationships can occur in multiple directions

So each set of relation is expressed to a common person in the same way, from a node table, through an edge table to a node table. The results are shown in Figure 20-5.

A table has three columns and ten rows. The headers are was called by, called, and also called, and there are ten row entries.

Figure 20-5

Relationships can be expressed in multiple ways

There is one additional piece of functionality within the Graph data that we should explore: the ability to track a path through relationships.

Shortest Path

The shortest path function, SHORTEST_PATH, allows you to find several different things:
  • The single shortest path between two nodes

  • The shortest path between multiple nodes

  • A single source path

Basically, this action is performed by aggregating and ordering a set of relationships as a PATH. Once that data is defined, using functions and syntax within the MATCHING clause, you can then apply aggregate functions such as STRING_AGG to aggregate on a string, COUNT to see a path count, and LAST_VALUE to see the end of the path.

All this requires some changes to the standard syntax. Listing 20-12 illustrates the changes.
SELECT op1.OperatorName,
       STRING_AGG(op2.OperatorName, '->')WITHIN GROUP(GRAPH PATH) AS Friends,
       LAST_VALUE(op2.OperatorName)WITHIN GROUP(GRAPH PATH) AS LastNode,
       COUNT(op2.OperatorName)WITHIN GROUP(GRAPH PATH) AS levels
FROM grph.RadioOperator AS op1,
     grph.Calls FOR PATH AS C,
     grph.RadioOperator FOR PATH AS op2
WHERE MATCH(SHORTEST_PATH(op1(-(C)->op2)+));
Listing 20-12

Implementing SHORTEST_PATH in a graph query

I’ll start at the FROM clause. You’ll notice that we’ve now added the syntax FOR PATH to two of the tables: grph.Calls and grph.RadioOperator. That informs the engine that these tables are involved in a path operation.

Next, the MATCH command has added the function SHORTEST_PATH. Further, the definition of the path has also changed. We’re starting with op1 the same as any of the other examples. However, we’re placing the path definition within parenthesis so that it is separated from just normal relationships. We’ve also added a plus (+) sign to indicate that the pattern can be repeated as many times as necessary to find the path. You can limit that using an integer value.

Finally, back in the SELECT operation, I’ve used three aggregate values to show the STRING_AGG, LAST_VALUE, and COUNT. That lets us see the basic layout of the paths as shown in Figure 20-6.

A table has four columns and twelve rows. The headers are the operator name, friends, last node, and levels, and there are twelve row entries.

Figure 20-6

The paths and levels showing which operators have called who

We can also take advantage of a function called LAST_NODE that lets us use the final node in a path with other paths such as this one showing only radio operators using a radio called “Xiegu G90” in Listing 20-13.
SELECT op1.OperatorName,
       STRING_AGG(op2.OperatorName, '->')WITHIN GROUP(GRAPH PATH) AS Friends,
       LAST_VALUE(op2.OperatorName)WITHIN GROUP(GRAPH PATH) AS LastNode,
       COUNT(op2.OperatorName)WITHIN GROUP(GRAPH PATH) AS levels
FROM grph.RadioOperator AS op1,
     grph.Calls FOR PATH AS C,
     grph.RadioOperator FOR PATH AS op2,
        grph.Uses AS u,
        grph.Radio AS r
WHERE MATCH(SHORTEST_PATH(op1(-(C)->op2)+) AND LAST_NODE(op2)-(u)->r)
AND r.RadioName = 'Xiegu G90';
Listing 20-13

Adding additional relationships to a path

The LAST_NODE knows that it’s going to be using values from the defined node, in this case, op2, to match to another path, through our grph.Uses and grph.Radio edge and node. Note that neither is defined as FOR PATH since they’re used in a normal graph set.

There is more that can be done with querying graph data, but from these examples, you get the understanding. Now, let’s examine the performance implications of graph data.

Performance Considerations of Graph Data

As you’ve already seen, for a simple comparison between traditional many-to-many relational structures and graph structures, the performance comparison between Listing 20-7 and Listing 20-9 is stark. For this small data set, we saw a five times improvement in speed. However, more can be attained. Let’s take a look at the execution plan from Figure 20-7.

A diagram depicts a simple execution plan from a graph query. It includes two nested loops and three table scans for calls, radio operator called, and radio operator calling.

Figure 20-7

A fairly normal execution plan from a graph query

As you can see, this is a relatively standard execution plan with few surprises. Indexes are created on the graph tables as you create the tables, a unique, nonclustered index acting as a primary key. However, you can add your own indexes. In fact, Microsoft recommends that you define an index on the $NodeID columns of node tables as you define them. You should also add an index to edge tables on the $from_id and $to_id columns. In Listing 20-14, I do just that, putting a clustered index in place.
CREATE UNIQUE CLUSTERED INDEX CallsToFrom ON grph.Calls ($from_id, $to_id);
Listing 20-14

Creating an index on an edge table

With that index in place, let’s take a look at the execution plan now (Figure 20-8).

A diagram depicts an execution plan from a graph query. It includes two nested loops, two table scans for radio operator called, and radio operator calling, and a clustered index seek for calls.

Figure 20-8

Table scan has been replaced

The calls table is no longer being scanned. Instead, we have a Clustered Index Seek. With this tiny data set, performance only improved marginally, from 219mcs to 200mcs. Reads went from 39 to 38. However, with a larger data set, those improvements will be much higher.

Obviously, you can keep going from there. Graph tables support clustered and nonclustered indexes, both rowstore and columnstore. You can use INCLUDE columns and more. In short, they act similarly to relational tables.

Graph queries will suffer from the same problems as other queries, so putting functions on columns in the WHERE clause and other bad practices will hurt performance here in the same way.

Summary

This chapter introduced the concept of a graph database within SQL Server consisting of node and edge tables. The most important thing to focus on when working with graph tables is the type of queries you intend to run. If they are many to many, or hierarchical, the graph data type will absolutely improve performance. Get in the habit of creating indexes on the $node_id columns when you create your tables. Also, make sure you add an index to the $from_id and $to_id columns. These can improve performance. Lastly, write your queries with the same thoughts in mind as with relational data avoiding bad code smells.

The next chapter deals with all the mechanisms that Microsoft has introduced to dynamically enhance query performance on the fly: Intelligent Query Processing.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.16.51.157