Chapter 27

Graphs in SQL

Abstract

Graphs are important because they are a general way to represent many different types of data and their relationships. SQL was not meant to be a graph language (there are graph databases today), but you can model graphs in SQL.

Keywords

Graph theory

Indegree

Outdegree

Adjacency list

John Gilson

Edge

Node

Tree

Directed graph

Acyclic graph

Nonreconvergent

Reconvergent

Reachable nodes

Source

Sink

Isolated nodes

Internal nodes

Path

Adjacency matrix

Taxi cab geometry

Equivalence class

Cliques

The terminology in graph theory pretty much explains itself; if it does not, you can read some of the books suggested in the appendix for graph theory. Graphs are important because they are a general way to represent many different types of data and their relationships. Here is a quick review of terms.

A graph is a data structure made up of nodes connected by edges. Edges can be directed (permit travel in only one direction) or undirected (permit travel in both directions). The number of edges entering a node is its indegree; likewise, the number of edges leaving a node is its outdegree. A set of edges that allow you to travel from one node to another is called a path. A cycle is a path that comes back to the node from which it started without crossing itself (this means that a big ‘O’ is fine but a figure ‘8’ is not).

A tree is a type of directed graph that is enough important to have its own terminology. Its special properties and frequent use have made it enough important to be covered in a separate chapter. The following section will stress other useful kinds of generalized directed graphs. Generalized directed graphs are classified into nonreconvergent and reconvergent graphs. In a reconvergent graph, there are multiple paths between at least one pair of nodes. Reconvergent graphs are either cyclic or acyclic.

The most common way to model a graph in SQL is with an adjacency list model. Each edge of the graph is shown as a pair of nodes in which the ordering matters and then any values associated with that edge is shown in another column.

27.1 Basic Graph Characteristics

The following code is from John Gilson. It uses an adjacency list model of the graph, with nodes in a separate table. This is the most common method for modeling graphs in SQL.

CREATE TABLE Nodes
(node_id INTEGER NOT NULL PRIMARY KEY);
CREATE TABLE Adjacency_List_Graph
(begin_node_id INTEGER NOT NULL REFERENCES Nodes (node_id),
 end_node_id INTEGER NOT NULL REFERENCES Nodes (node_id),
 PRIMARY KEY (begin_node_id, end_node_id),
 CHECK (begin_node_id <> end_node_id));

It is also possible to load an acyclic-directed graph into a nested set model by splitting the nodes.

CREATE TABLE Nested_Sets_Graph
(node_id INTEGER NOT NULL REFERENCES Nodes (node_id),
 lft INTEGER NOT NULL CHECK (lft >= 1) PRIMARY KEY,
 rgt INTEGER NOT NULL UNIQUE,
 CHECK (rgt > lft),
 UNIQUE (node_id, lft));

You split nodes by starting at the sink nodes and move up the tree. When you come to a node of (indegree > 1), replace it with that many copies of the node under each of its superiors. Continue to do this until you get to the root. The acyclic graph will become a tree, but with duplicated node values. There are advantages to this model. We will discuss them in Section 27.3.

27.1.1 All Nodes in the Graph

CREATE VIEW Graph_Nodes (node_id)
AS
SELECT DISTINCT node_id FROM Nested_Sets_Graph;

27.1.2 Path Endpoints

A path through a graph is a traversal of consecutive nodes along a sequence of edges. Clearly, the node at the end of one edge in the sequence must also be the node at the beginning of the next edge in the sequence. The length of the path is the number of edges that are traversed along the path.

Path endpoints are the first and last nodes of each path in the graph. For a path of length zero, the path endpoints are the same node. If there is more than one path between two nodes, then each path will be distinguished by its own distinct set of number pairs for the nested-set representation.

If there is only one path p between two nodes but this path is a sub-path of more than one distinct path, then the endpoints of p will have number pairs for each of these greater paths. As a canonical form, the least numbered pairs are returned for these endpoints.

CREATE VIEW Path_End_Points
(begin_node_id, end_node_id,
 begin_lft, begin_rgt,
 end_lft, end_rgt)
AS
SELECT G1.node_id, G2.node_id,
G1.lft, G1.rgt, G2.lft, G2.rgt
 FROM (SELECT node_id, MIN(lft), MIN(rgt)
FROM Nested_Sets_Graph
 GROUP BY node_id) AS G1 (node_id, lft, rgt)
INNER JOIN
Nested_Sets_Graph AS G2
ON G2.lft >= G1.lft
AND G2.lft < G1.rgt;

27.1.3 Reachable Nodes

If a node is reachable from another node, then a path exists from one node to the other. It is assumed that every node is reachable from itself.

CREATE VIEW Reachable_Nodes (begin_node_id, end_node_id)
AS
SELECT DISTINCT begin_node_id, end_node_id
 FROM Path_End_Points;

27.1.4 Edges

Edges are pairs of adjacent connected nodes in the graph. If edge E is represented by the pair of nodes (n0, n1), then (n1) is reachable from (n0) in a single traversal.

CREATE VIEW Edges (begin_node_id, end_node_id)
AS
SELECT begin_node_id, end_node_id
 FROM Path_Endpoints AS PE
 WHERE begin_node_id <> end_node_id
AND NOT EXISTS
(SELECT *
FROM Nested_Sets_Graph AS G
 WHERE G.lft > PE.begin_lft
AND G.lft < PE.end_lft
AND G.rgt > PE.end_rgt);

27.1.5 Indegree and Outdegree

The indegree of a node n is the number of distinct edges ending at n. Nodes that have 0 indegree are not returned. Indegree of all nodes in the graph:

CREATE VIEW Indegree (node_id, node_indegree)
AS
SELECT N.node_id, COUNT(E.begin_node_id)
 FROM Graph_Nodes AS N
LEFT OUTER JOIN
Edges AS E
ON N.node_id = E.end_node_id
 GROUP BY N.node_id;

Outdegree of a node (n) is the number of distinct edges beginning at (n). Nodes that have zero outdegree are not returned. Outdegree of all nodes in the graph:

CREATE VIEW Outdegree (node_id, node_outdegree)
AS
SELECT N.node_id, COUNT(E.end_node_id)
 FROM Graph_Nodes AS N
LEFT OUTER JOIN
Edges AS E
ON N.node_id = E.begin_node_id
 GROUP BY N.node_id;

27.1.6 Source, Sink, Isolated, and Internal Nodes

A source node of a graph has a positive outdegree but 0 indegree, that is, it has edges leading from, but not to, the node. This assumes there are no isolated nodes (nodes belonging to no edges).

CREATE VIEW Source_Nodes (node_id, lft, rgt)
AS
SELECT node_id, lft, rgt
 FROM Nested_Sets_Graph AS G1
 WHERE NOT EXISTS
(SELECT *
FROM Nested_Sets_Graph AS G
 WHERE G1.lft > G2.lft
AND G1.lft < G2.rgt);

Likewise, a sink node of a graph has positive indegree but 0 outdegree. It has edges leading to, but not from, the node. This assumes there are no isolated nodes.

CREATE VIEW Sink_Nodes (node_id)
AS
SELECT node_id
 FROM Nested_Sets_Graph AS G1
 WHERE lft = rgt - 1
AND NOT EXISTS
 (SELECT *
 FROM Nested_Sets_Graph AS G2
WHERE G1.node_id = G2.node_id
AND G2.lft < G1.lft);

An isolated node belongs to no edges, i.e., it has zero indegree and zero outdegree.

CREATE VIEW Isolated_Nodes (node_id, lft, rgt)
AS
SELECT node_id, lft, rgt
 FROM Nested_Sets_Graph AS G1
 WHERE lft = rgt - 1
AND NOT EXISTS
(SELECT *
FROM Nested_Sets_Graph AS G2
 WHERE G1.lft > G2.lft
 AND G1.lft < G2.rgt);

An internal node of a graph has an (indegree > 0) and an (outdegree > fPa0), that is, it acts as both a source and a sink.

CREATE VIEW Internal_Nodes (node_id)
AS
SELECT node_id
 FROM (SELECT node_id, MIN(lft) AS lft, MIN(rgt) AS rgt
FROM Nested_Sets_Graph
 WHERE lft < rgt - 1
GROUP BY node_id) AS G1
 WHERE EXISTS
(SELECT *
 FROM Nested_Sets_Graph AS G2
WHERE G1.lft > G2.lft
AND G1.lft < G2.rgt)

27.2 Paths in a Graph

Finding a path in a graph is the most important operation done with graphs in commercial use. They mode transportation networks, electrical and cable systems, process control flow, and thousands of other things.

A path P of length L from a node (n0) to a node (nk) in the graph is defined as a traversal of (L + 1) contiguous node along a sequence of edges where the first node is node number 0 and the last is node number (k).

CREATE VIEW Paths
(begin_node_id, end_node_id, this_node_id,
 seq_nbr,
 begin_lft, begin_rgt, end_lft, end_rgt,
 this_lft, this_rgt)
AS
SELECT PE.begin_node_id, PE.end_node_id, G1.node_id,
(SELECT COUNT(*)
FROM Nested_Sets_Graph AS G2
 WHERE G2.lft > PE.begin_lft
 AND G2.lft <= G1.lft
 AND G2.rgt >= G1.rgt),
 PE.begin_lft, PE.begin_rgt,
 PE.end_lft, PE.end_rgt,
 G1.lft, G1.rgt
FROM Path_End_Points AS PE
 INNER JOIN
 Nested_Sets_Graph AS G1
 ON G1.lft BETWEEN PE.begin_lft
AND PE.end_lft
 AND G1.rgt >= PE.end_rgt

27.2.1 Length of Paths

The length of a path is the number of edges that are traversed along the path. A path of N nodes has a length of (N − 1).

CREATE VIEW Path_Lengths
(begin_node_id, end_node_id,
path_length,
begin_lft, begin_rgt,
end_lft, end_rgt)
AS
SELECT begin_node_id, end_node_id, MAX(seq_nbr),
 begin_lft, begin_rgt, end_lft, end_rgt
FROM Paths
GROUP BY begin_lft, end_lft, begin_rgt, end_rgt,
 begin_node_id, end_node_id;

27.2.2 Shortest Path

This gives the shortest path length between all nodes, but it does not tell you what the actual path is. There are other queries that use the new CTE feature and recursion which we will discuss in Section 27.3.

CREATE VIEW Shortest_Path_Lengths
(begin_node_id, end_node_id, path_length,
begin_lft, begin_rgt, end_lft, end_rgt)
AS
SELECT PL.begin_node_id, PL.end_node_id,
 PL.path_length,
 PL.begin_lft, PL.begin_rgt,
 PL.end_lft, PL.end_rgt
FROM (SELECT begin_node_id, end_node_id,
 MIN(path_length) AS path_length
FROM PathLengths
 GROUP BY begin_node_id, end_node_id) AS MPL
 INNER JOIN
 PathLengths AS PL
 ON MPL.begin_node_id = PL.begin_node_id
AND MPL.end_node_id = PL.end_node_id
AND MPL.path_length = PL.path_length;

27.2.3 Paths by Iteration

First let’s build a graph with a cost associated with each edge and put it into an adjacency list model.

INSERT INTO Edges (out_node, in_node, cost)
VALUES ('A', 'B', 50),
('A', 'C', 30),
('A', 'D', 100),
('A', 'E', 10),
('C', 'B', 5),
('D', 'B', 20),
('D', 'C', 50),
('E', 'D', 10);

To find the shortest paths from one node to those that it can reach, we can write this recursive VIEW.

CREATE VIEW Shortest_Paths (out_node, in_node, path_length)
AS
WITH RECURSIVE Paths (out_node, in_node, path_length)
AS
(SELECT out_node, in_node, 1
 FROM Edges
UNION ALL
SELECT E1.out_node, P1.in_node, P1.path_length + 1
 FROM Edges AS E1, Paths AS P1
WHERE E1.in_node = P1.out_node)
SELECT out_node, in_node, MIN(path_length)
FROM Paths
GROUP BY out_node, in_node;
out_nodein_nodepath_length
‘A’‘B’1
‘A’‘C’1
‘A’‘D’1
‘A’‘E’1
‘C’‘B’1
‘D’‘B’1
‘D’‘C’1
‘E’‘B’2
‘E’‘D’1

To find the shortest paths without recursion, stay in a loop and add one edge at a time to the set of paths defined so far.

CREATE PROCEDURE Iterate_Paths()
LANGUAGE SQL
MODIFIES SQL DATA
BEGIN
DECLARE old_path_tally INTEGER;
SET old_path_tally = 0;
DELETE FROM Paths; -- clean out working table
INSERT INTO Paths
SELECT out_node, in_node, 1
FROM Edges; -- load the edges
-- add one edge to each path
WHILE old_path_tally < (SELECT COUNT(*) FROM Paths)
DO SET old_path_tally = (SELECT COUNT(*) FROM Paths);
 INSERT INTO Paths (out_node, in_node, lgth)
 SELECT E1.out_node, P1.in_node, (1 + P1.lgth)
 FROM Edges AS E1, Paths AS P1
WHERE E1.in_node = P1.out_node
AND NOT EXISTS -- path is not here already
(SELECT *
 FROM Paths AS P2
WHERE E1.out_node = P2.out_node
AND P1.in_node = P2.in_node);
END WHILE;
END;

The Least Cost Path is basically the same algorithm, but instead of a constant of one for the path length, we use the actual costs of the edges.

CREATE PROCEDURE Iterate_Cheap_Paths ()
LANGUAGE SQL
MODIFIES SQL DATA
BEGIN
DECLARE old_path_cost INTEGER;
SET old_path_cost = 0;
DELETE FROM Paths; -- clean out working table
INSERT INTO Paths
SELECT out_node, in_node, cost
FROM Edges; -- load the edges
-- add one edge to each path
WHILE old_path_cost < (SELECT COUNT(*) FROM Paths)
DO SET old_path_cost = (SELECT COUNT(*) FROM Paths);
 INSERT INTO Paths (out_node, in_node, cost)
 SELECT E1.out_node, P1.in_node, (E1.cost + P1.cost)
 FROM Edges AS E1
INNER JOIN
(SELECT out_node, in_node, MIN(cost)
FROM Paths
 GROUP BY out_node, in_node)
AS P1 (out_node, in_node, cost)
ON E1.in_node = P1.out_node
 AND NOT EXISTS
 (SELECT *
FROM Paths AS P2
 WHERE E1.out_node = P2.out_node
 AND P1.in_node = P2.in_node
 AND P2.cost <= E1.cost + P1.cost);
END WHILE;
END;

27.2.4 Listing the Paths

I got data for this table from the book Introduction to Algorithms by Cormen, Leiserson, and Rivest (ISBN 0-262-03141-8), page 518. This book was very popular in college courses in the United States. I made one decision that will be important later; I added self-traversal edges (i.e., the node is both the out_node and the in_node of an edge) with weights of zero.

INSERT INTO Edges
VALUES ('s', 's', 0);
INSERT INTO Edges
VALUES ('s', 'u', 3),
('s', 'x', 5),
('u', 'u', 0),
('u', 'v', 6),
('u', 'x', 2),
('v', 'v', 0),
('v', 'y', 2),
('x', 'u', 1),
('x', 'v', 4),
('x', 'x', 0),
('x', 'y', 6),
('y', 's', 3),
('y', 'v', 7),
('y', 'y', 0);

I am not happy about this approach, because I have to decide the maximum number of edges in path before I start looking for an answer. But this will work and I know that a path will have no more than the total number of nodes in the graph. Let’s create a table to hold the paths:

CREATE TABLE Paths
(step1 CHAR(2) NOT NULL,
step2 CHAR(2) NOT NULL,
step3 CHAR(2) NOT NULL,
step4 CHAR(2) NOT NULL,
step5 CHAR(2) NOT NULL,
total_cost INTEGER NOT NULL,
path_length INTEGER NOT NULL,
PRIMARY KEY (step1, step2, step3, step4, step5));

The “step1” node is where I begin the path. The other columns are the second step, third step, fourth step, and so forth. The last step column is the end of the journey. The “total_cost” column is the total cost, based on the sum of the weights of the edges, on this path. The path length column is harder to explain, but for now, let’s just say that it is a count of the nodes visited in the path.

To keep things easier, let’s look at all the paths from ‘s’ to ‘y’ in the graph. The INSERT INTO statement for constructing that set looks likes this:

INSERT INTO Paths
SELECT G1.out_node, -- it is 's' in this example
 G2.out_node,
 G3.out_node, G4.out_node,
 G4.in_node, -- it is 'y' in this example
 (G1.cost + G2.cost + G3.cost + G4.cost),
 (CASE WHEN G1.out_node NOT IN (G2.out_node,  G3.out_node, G4.out_node) THEN 1 ELSE 0 END
+ CASE WHEN G2.out_node NOT IN (G1.out_node,  G3.out_node, G4.out_node) THEN 1 ELSE 0 END
+ CASE WHEN G3.out_node NOT IN (G1.out_node,  G2.out_node, G4.out_node) THEN 1 ELSE 0 END 
+ CASE WHEN G4.out_node NOT IN (G1.out_node,  G2.out_node, G3.out_node) THEN 1 ELSE 0 END)
FROM Edges AS G1,
 Edges AS G2, 
 Edges AS G3, 
 Edges AS G4 
WHERE G1.out_node = 's'
 AND G1.in_node = G2.out_node
 AND G2.in_node = G3.out_node
 AND G3.in_node = G4.out_node
 AND G4.in_node = 'y';

I put in ‘s’ and ‘y’ as the out_node and in_node of the path and made sure that the in_node of each step in the path was the out_node of the next step in the path. This is a combinatorial explosion, but it is easy to read and understand.

The sum of the weights is the cost of the path, which is easy to understand. The path_length calculation is a bit harder. This sum of CASE expressions looks at each node in the path. If it is unique within the row, it is assigned a value of one, if it is not unique within the row, it is assigned a value of zero.

All paths will have five steps in them because the way to table is declared. But what if a path exists between the two nodes which is shorter than five steps? That is where the self-traversal rows are used. Consecutive pairs of steps in the same row can be repetitions of the same node.

Here is what the rows of the Paths table look like after this INSERT INTO statement, ordered by descending path_length, and then by ascending cost.

Paths

step_1step_2step_3step_4step_5total_costpath_length
ssxxy110
sssxy111
sxxxy111
sxuxy142
ssuvy112
ssuxy112
ssxvy112
ssxyy112
suuvy112
suuxy112
suvvy112
suxxy112
sxvvy112
sxxvy112
sxxyy112
sxyyy112
sxyvy204
sxuvy144
suvyy114
suxvy114
suxyy114
sxvyy114

t0015

Clearly, all pairs of nodes could be picked from the original Edges table and the same INSERT INTO run on them with a minor change in the WHERE clause. However, this example is big enough for a short magazine article. And it is too big for most applications. It is safe to assume that people really want the cheapest path. In this example, the total_cost column defines the cost of a path, so we can eliminate some of the paths from the Paths table with this statement.

DELETE FROM Paths
WHERE total_cost
 > (SELECT MIN(total_cost)
FROM Paths);

Again, if you had all the paths for all possible pairs of nodes, the subquery expression would have a WHERE clause to correlate it to the subset of paths for each possible pair.

In this example, it got rid of 3 out of 22 possible paths. It is helpful and in some situations we might like having all the options. But these are not distinct options.

As one of many examples, the paths

(s, x, v, v, y, 11, 2)

and

(s, x, x, v, y, 11, 2)

are both really the same path, (s, x, v, y). Before we decide to write a statement to handle these equivalent rows, let’s consider another cost factor. People do not like to change airplanes or trains. If they can go from Amsterdam to New York City on one plane without changing planes for the same cost, they are happy. This is where that path_length column comes in. It is a quick way to remove the paths that have more edges than they need to get the job done.

DELETE FROM Paths
WHERE path_length
 > (SELECT MIN(path_length)
FROM Paths);

In this case, that last DELETE FROM statement will reduce the table to one row: (s, s, x, x, y, 11, 0) which reduces to (s, x, y). This single remaining row is very convenient for my article, but if you look at the table, you will see that there was also a subset of equivalent rows that had higher path_length numbers.

(s, s, s, x, y, 11, 1)

(s, x, x, x, y, 11, 1)

(s, x, x, y, y, 11, 2)

(s, x, y, y, y, 11, 2)

Your task is to write code to handle equivalent rows. Hint: the duplicate nodes will always be contiguous across the row.

27.3 Acyclic Graphs as Nested Sets

Let’s start with a simple graph in an adjacency list model.

INSERT INTO Nodes (node_id)
VALUES ('a'), ('b'), ('c'), ('d'),
 ('e'), ('f'), ('g'), ('h'),
INSERT INTO Adjacency_List_Graph (begin_node_id, end_node_id)
VALUES ('a', 'b'), ('a', 'c'), ('b', 'd'), ('c', 'd'),
 ('c', 'g'), ('d', 'e'), ('d', 'f'), ('e', 'h'),
 ('g', 'h'),

We can convert this adjacency list model to the nested sets model with a simple stack algorithm.

-- Stack to keep track of nodes being traversed in depth-first fashion
CREATE TABLE NodeStack
(node_id INTEGER NOT NULL PRIMARY KEY
REFERENCES Nodes (node_id),
distance INTEGER NOT NULL CHECK (distance >= 0),
lft INTEGER CHECK (lft >= 1),
rgt INTEGER,
CHECK (rgt > lft));
CREATE PROCEDURE Adjacency_Lists_To_Nested_Sets_Graph ()
LANGUAGE SQL
READS SQL DATA
BEGIN
DECLARE path_length INTEGER;
DECLARE current_number INTEGER;
SET path_length = 0;
SET current_number = 0;
-- Clear the table that will hold the result
DELETE FROM Nested_Sets_Graph;
-- Initialize stack by inserting all source nodes of graph
INSERT INTO NodeStack (node_id, distance)
SELECT DISTINCT G1.begin_node_id, path_length
FROM Adjacency_List_Graph AS G1
WHERE NOT EXISTS
 (SELECT *
FROM Adjacency_List_Graph AS G2
 WHERE G2.end_node_id = G1.begin_node_id);
WHILE EXISTS (SELECT * FROM NodeStack)
DO
SET current_number = current_number + 1;
IF EXISTS (SELECT * FROM NodeStack WHERE distance = path_length)
THEN UPDATE NodeStack
SET lft = current_number
WHERE distance = path_length 
AND NOT EXISTS 
(SELECT *
 FROM NodeStack AS S2
WHERE distance = path_length
AND S2.node_id < NodeStack.node_id);
 INSERT INTO NodeStack (node_id, distance)
 SELECT G.end_node_id, (S.distance + 1)
 FROM NodeStack AS S,
Adjacency_List_Graph AS G
WHERE S.distance = path_length
AND S.lft IS NOT NULL
AND G.begin_node_id = S.node_id;
 SET path_length = (path_length + 1);
ELSE SET path_length = (path_length - 1);
 UPDATE NodeStack
SET rgt = current_number
WHERE lft IS NOT NULL
AND distance = path_length;
 INSERT INTO Nested_Sets_Graph (node_id, lft, rgt)
 SELECT node_id, lft, rgt
 FROM NodeStack
WHERE lft IS NOT NULL
AND distance = path_length;
DELETE FROM NodeStack
 WHERE lft IS NOT NULL 
 AND distance = path_length;
END IF;
END WHILE;
END;

You can now use modified versions of the nested set queries you already know on this kind of graph. However, be sure to use DISTINCT options to remove duplicate node references.

27.4 Adjacency Matrix Model

An adjacency matrix is a square array whose rows are out-node and columns are in-nodes of a graph. A one in a cell means that there is edge between the two nodes. Using the graph in figure 30.1, we would have an array like this:

ABCDEFGH
A11100000
B01010000
C00110010
D00011100
E00001001
F00000100
G00000011
H00000001

t0020

Many graph algorithms are based on the adjacency matrix model and can be translated into SQL. Go back to the chapter on modeling matrices in SQL and in particular matrix multiplication in SQL. For example, Dijkstra’s algorithm for the shortest distances between each pair of nodes in a graph looks like this in pseudo-code.

FOR k = 1 TO n
DO FOR i = 1 TO n
 DO FOR j = 1 TO n
IF a[i,k] + a[k,j] < a[i,j]
THEN a[i,j] = a[i,k] + a[k,j]
END IF;
 END FOR;
END FOR;
END FOR;

You need to be warned that for a graph of (n) nodes, the table will be of size (n^2). The algorithms often run in (n^3) time. The advantage it has is that once you have completed a table it can be used for look ups rather than recomputing distances over and over.

27.5 Points Inside Polygons

While not actually part of graph theory, this seemed to be the reasonable place to put this section since it is also related to spatial queries. A polygon can be described as a set of corner point in an (x, y) coordinate system. The usual query is to tell if a given point is inside or outside of the polygon.

This algorithm is due to Darel R. Finley. The main advantage it has is that it can be done in Standard SQL without trigonometry functions. The disadvantage is that it does not work for concave polygons. The work-around is to dissect the convex polygons into concave polygons and then add column for the name of the original area.

-- set up polygon, with any ordering of the corners

CREATE TABLE Polygon
(x FLOAT NOT NULL,
y FLOAT NOT NULL,
PRIMARY KEY (x, y));
INSERT INTO Polygon
VALUES (2.00, 2.00), (1.00, 4.00),
 (3.00, 6.00), (6.00, 4.00), (5.00, 2.00);
--set up some sample points
CREATE TABLE Points
(xx FLOAT NOT NULL,
yy FLOAT NOT NULL,
location VARCHAR(10) NOT NULL, -- answer the question in advance!
PRIMARY KEY (xx, yy));
INSERT INTO Points
VALUES (2.00, 2.00, 'corner'),
 (1.00, 5.00, 'outside'),
 (3.00, 3.00, 'inside'),
 (3.00, 4.00, 'inside'),
 (5.00, 1.00, 'outside'),
 (3.00, 2.00, 'side'),
-- do the query
SELECT P1.xx, P1.yy, p1.location, SIGN(
SUM
(CASE WHEN (polyY.y < P1.yy AND polyY.x >= P1.yy
 OR polyY.x < P1.yy AND polyY.y >= P1.yy)
 THEN CASE WHEN polyX.y + (P1.yy - polyY.y)
 /(polyY.x - polyY.y) * (polyX.x - polyX.y) < P1.xx
 THEN 1 ELSE 0 END
 ELSE 0 END))AS flag
FROM Polygon AS polyY, Polygon AS polyX, Points AS P1
GROUP BY P1.xx, P1.yy, p1.location;

When flag = 1, the point is inside; when flag = 0, it is outside.

xxyyLocationFlag
15Outside0
22Corner0
33Inside1
34Inside1
51Outside0
32Side1

t0025

Sides are counted as inside, but if you want to count the corner points as inside, you should start the CASE expression with:

CASE WHEN EXISTS
 (SELECT * FROM Polygon
 WHERE x = P1.xx AND y = P1.yy)
 THEN 1 ..".

27.6 Taxicab Geometry

Taxicab geometry is a form of geometry in which the usual plane is replaced with a grid of points in a Cartesian coordinate system, connected by horizontal and vertical paths. Think of a city map, with streets and intersections. The usual distance function or metric of Euclidean geometry is replaced by a new metric, the taxi distance. This is a count of the number of blocks you have to travel (by taxi) to get from one point to another.

Imagine that you are in a taxi and want to get from point (x1, y1) to (x2, y2), how would you compute the distance? You would use high school math and the Euclidean distance: d = √((x1 − x2)2 + (y1 − y2)2). To make this concrete, let’s take a taxi from the center of town (0,0) and want to get to the Chez Vermin Hotel at (3,4), how far do you have to travel? (Figure 27.1)

f27-01-9780128007617
Figure 27.1 Taxicab Distance.

Your Euclidean formula gives us five blocks, but look at the map. You go three block east and then four blocks north for a total of seven blocks in the taxi. He cannot fly, so he has to drive through the streets. The shortest distance is seven blocks in Taxicab geometry. But that means there are many ways to walk between two points. I could walk three blocks east and then four blocks north; four blocks north and then three blocks east; there are all kinds of zig-zags, but as long as the total number of blocks north is three and the total number of blocks east is four (Figure 27.2).

f27-02-9780128007617
Figure 27.2 Taxicab Circle.

27.6.1 Taxi Versus Euclidean Distance

Let’s invent some notations that we can turn into programs. Use uppercase letters for point (hotel) names:

A = (x1, y1), B = (x2, y2)

De (A, B) = Euclidean Distance = d = √((x1 − x2)2 + (y1 − y2)2)

Dt (A, B) = Taxicab Distance = (|x1 − x2| + |y1 − y2|).

You might want to convince yourself that

Dt (A, A) = 0

Dt (A, B) = Dt (B, A)

Unlike Euclidean geometry, there is more than one shortest path between two points. It is a combinatorial problem, with the formula N = r!/(n!(r − n)!). The idea is that if you need to travel (i) blocks North and (j) blocks East, then any permutation of (i) blocks North and (j) blocks East will get you to your destination.

27.6.2 Taxi Shapes

The definition of a circle in Taxicab geometry is that all points (hotels) in the set are the same distance from the center. Just like a Euclidean circle, but with a finite number of points.

Circle = {X: Dt (X, P) = k}, k is the radius and P is the center

When you look at diagram, you will see a diamond shape. Remember the definition of π? The ratio of the circumference to the radius of a circle. This gives us a value of 4 for taxi-π.

Triangles and other polygons are defined by replacing Euclidean Distance with Taxicab Distance. But the results are not as pretty as the circle. Congruent triangles are hard to find; you cannot use the simple side and angle rules. Rectangles and other polygons are not visually obvious.

But one of the interesting tools is Pick’s Theorem. Given a general polygon, you can find the area by Pick’s formula,

A = i + b/2 − 1, where A = area, i = the number of interior points, b = number of boundary points (Figure 27.3).

f27-03-9780128007617
Figure 27.3 Pick’s Theorem

The math is much simpler than a Euclidean model, which would involve trigonometry.

27.7 Equivalence Classes and Cliques

We keep harping that SQL is based on sets, but how many of us ever go back and re-read any old text books on set theory? In particular, one concept gets forgotten is equivalence relations on sets, which create equivalence classes. These classes are disjoint, and we can put an element from a set into one of them with some kind of rule.

Let’s use ~ as the notation for an equivalence relation. The definition is that it has three properties:

 The relation is reflexive: A ~ A. This means the relation applies to itself.

 The relation is symmetric: if (A ~ B) ⇔ (B ~ A). This means when the relation applies to two different elements in the set, it applies both ways.

 The relation is transitive: (A ~ B) ∧ (B ~ C) ⇒ (A ~ C). This means we can deduce all elements in a class by applying the relation to any known members of the class. This is pretty important for programming, as we will see.

So, if this is such a basic set operation, why isn’t it in SQL? The problem is that it produces classes, not a set. A class is a set of sets, but a table is just a set. The notation for each class is a pair of square bracket that contains some representative of each set. Assume we have …

a ∈ A

b ∈ B

These expressions are all the same:

A ~ B

[a] = [b]

[a] ∩ [b] ≠ ∅

A common example is the set Z of integers and any MOD() operator will give you equivalence classes. The MOD(n, 2) operation gives you one class consisting of all even numbers, and the other consisting of all odd numbers. The nice part is that you can compute the MOD() function with arithmetic.

An equivalence classes can also be defined by a set of sets. This is an important concept in graph theory and graph database. If you have not seen it, Google “Six Degrees of Kevin Bacon” (http://en.wikipedia.org/wiki/Six_Degrees_of_Kevin_Bacon). The idea is that any individual involved in the Hollywood film industry can be linked through his or her film roles to Kevin Bacon within six steps.

Let’s steal a sample set from a SQL Forum posting of friends, where ~ now means “is a friend of” and that we can use “foaf”—“Friend of a friend”—to build cliches.

{Fred, Jessica, John, Mary}

{Albert, Nancy, Peter}

{Abby}

{Frank, Joe}

A complete graph means that each node is connected to every other node by one edge. If a subgraph is complete, it is actually called a Clique in graph theory.

The complete graph Kn of order n is a simple graph with n vertices in which every vertex is adjacent to every other. The complete graph on n vertices has n(n − 1)/2 edges, which corresponding to all possible choices of pairs of vertices. But this is not an equivalence relation because it does not include the reference of each node to itself. Remember A ~ A? Think about Abby in this set, assuming that she likes herself, in spite of her lack of social skills.

Obviously, if you have a clique with (k) nodes in it, you can a complete subgraph of any size (j < k). A maximal clique is a clique that is not a subset of any other clique (some authors reserve the term clique for maximal clique.

SQL seems to have two equivalence relations that are major parts of the language. The first is plain old vanilla equal (=) for scalar values in the base data types. SQL is just like any other programming language, but we also have NULLs to consider. Opps! We all know that (NULL = NULL) is not true. So we have to exclude NULLs or work around them.

The other “almost” equivalence relation is GROUP BY in which the class is the grouping columns. A quick example would be “SELECT city_name, state_code FROM Customers GROUP BY state_code”; this is still not quite right because we have to do some kind of aggregation on the nongrouping columns in the query.

27.7.1 Graph for Equivalence Relations

As we have seen, the idiom for graphs in SQL is to use a table with the nodes involved and a second table with pairs of nodes (a, b) to model the edges, which model our ~ relation. The bad news is that when you use it for equivalence relations, it can get big. A class with one member is has one row to show the edge back to itself. A class with two members {a, b} has {(a, a), (b, b), (a, b), (b, a)} to show the ~ properties as edges. Three members give us six rows: {(a, a), (b, b), (c, c), (a, b), (a, c), (b, a), (b, c), (c, a), (c, b)}. The general formula is (n * (n − 1)) + n, which is pretty close to (n!).

First, load the table a few facts we might already know:

CREATE TABLE Friends
(lft_member VARCHAR(20) NOT NULL,
rgt_member VARCHAR(20) NOT NULL,
PRIMARY KEY (lft_member, rgt_member))
INSERT INTO Friends (lft_member, rgt_member)
VALUES
('John', 'Mary'),
('Mary', 'Jessica'),
('Peter', 'Albert'),
('Peter', 'Nancy'),
('Abby', 'Abby'),
('Jessica', 'Fred'),
('Joe', 'Frank'),

27.7.2 Reflexive Rows

The first thing we noticed is that only Abby has a reflexive property row. We need to add those rows and can do this with basic set operators. Bu it is not that easy, as you can see with this insertion statement:

INSERT INTO Friends
SELECT X.lft_member, X.rgt_member
FROM
((SELECT lft_member, lft_member FROM Friends AS F1
UNION
SELECT rgt_member, rgt_member FROM Friends AS F2)
EXCEPT
SELECT lft_member, rgt_member FROM Friends AS F3)
AS X (lft_member, rgt_member);

The use of table aliases is tricky. You have to be sure that the SQL engine will construct a table in such a way that you do not get scoping problems. The UNION and EXCEPT operators are used to assure that we do not have primary key violations.

AbbyAbby
AlbertAlbert
FrankFrank
FredFred
JessicaJessica
JessicaFred
JoeJoe
JoeFrank
JohnMary
JohnJohn
MaryMary
MaryJessica
NancyNancy
PeterPeter
PeterNancy
PeterAlbert

27.7.3 Symmetric Rows

Look at the update table and there are no {(a, b), (b, a)} pairs. We need to add this second property to the relation table. This follows the pattern of set operators we just used:

INSERT INTO Friends
SELECT X.lft_member, X.rgt_member
FROM
(
(SELECT F1.rgt_member, F1.lft_member
FROM Friends AS F1
WHERE F1.lft_member <> F1.rgt_member)
EXCEPT
(SELECT F2.lft_member, F2.rgt_member FROM Friends AS F2)
) AS X (lft_member, rgt_member);

This will give us:

AbbyAbby
AlbertPeter
AlbertAlbert
FrankJoe
FrankFrank
FredJessica
FredFred
JessicaMary
JessicaJessica
JessicaFred
JoeJoe
JoeFrank
JohnMary
JohnJohn
MaryMary
MaryJohn
MaryJessica
NancyPeter
NancyNancy
PeterPeter
PeterNancy
PeterAlbert

If you do quick GROUP BY query, you see that Abby is seriously antisocial with a count of 1, but Mary and Jessica look very social with a count of 3. This is not quite true because the property could be the dreaded “Rock-paper-scissors-lizard-Spock” relationship. But this is what we are using.

27.7.4 Transitive Rows

Transitivity goes on forever. Well, until we have what mathematicians call a closure. This means we have gotten a set of elements that have all of the valid relationships. In some cases, these sets are infinite. But in database, we can only have insanely large tables. Let’s pull a subset that is part of a clique of four friends.

FredJessica
FredFred
JessicaMary
JessicaJessica
JessicaFred
JohnMary
JohnJohn
MaryMary
MaryJohn
MaryJessica

Now apply the transitive relation to get some of the missing edges of a graph:

INSERT INTO Friends
SELECT X.lft_member, X.rgt_member
FROM
((SELECT F1.lft_member, F2.rgt_member
FROM Friends AS F1, Friends AS F2
WHERE F1.rgt_member = F2.lft_member)
EXCEPT
(SELECT F3.lft_member, F3.rgt_member FROM Friends AS F3)
) AS X (lft_member, rgt_member);

This is a simple statement using the definition of transitivity, without a universal quantifier or loop on it. It will add these rows to this subset:

FredMary
JessicaJohn
JohnJessica
MaryFred

When you look at Mary and Jessica have all of their friends. However, Fred and John do not know that they are friends. So we invoke the statement again and get those two rows. If you try it for a third time, there are zero rows added.

The first thought is this sounds like a job for a recursive statement. But it is not that easy. If the original graph has a cycle in it, you can hang in an infinite loop if you try to use a recursive CTE. The assumption is that each clique has a spanning graph in the pairs in. Oops! New term: a spanning graph is a subgraph that includes the nodes and some or all of the edges of the graph. A complete graph is the spanning graph has all the possible edges, so each node is directly connected to any other node.

27.7.5 Cliques

Now change your mindset from graphs to sets. Let’s look at the size of the cliques:

WITH X (member, clique_size)
AS
(SELECT lft_member, COUNT(*) FROM Friends GROUP BY lft_member)
SELECT * FROM X ORDER BY clique_size;
Abby1
Frank2
Joe2
Nancy3
Peter3
Albert3
Fred4
Jessica4
John4
Mary4

t0050

What we want to do it is assign a number to each clique. This sample data is biased by the fact that the cliques are all of different sizes. You cannot simply use the size to assign a clique_nbr.

Create this table and load it with the names.

CREATE TABLE Cliques
(clique_member VARCHAR(20) NOT NULL PRIMARY KEY,
clique_nbr INTEGER);

We can start by assigning everyone their own clique, using whatever your favorite method is:

INSERT INTO Cliques
VALUES
('Abby', 1),
('Frank', 2),
('Joe', 3),
('Nancy', 4),
('Peter', 5),
('Albert', 6),
('Fred', 7),
('Jessica', 8),
('John', 9),
('Mary', 10);

Everyone is equally a member of their clique in this model. That means we could start with anyone to get the rest of their clique. The update is very straight forward. Take any clique number, find the member to whom it belongs and use that name to build that clique. But which number to we use? We could use the MIN, MAX, or a random number in the clique; I will use the MAX for no particular reason.

I keep thinking that there is a recursive update statement that will do this in one statement. But I know it will not port (Standard SQL has no recursive update statement right now. We would do it with SQL/PSM or a host language) and I think it would be expensive. Recursion will happen for each member of the whole set, but if we consolidate a clique for one person, we have removed his friends from consideration.

The worst situation would be a bunch of hermits, so the number of clique would be the cardinality of set. That is not likely or useful in the real world. Let’s put the update in the body of a loop.

BEGIN
DECLARE clique_loop INTEGER;
DECLARE node CHAR(10);
SET clique_loop = (SELECT MAX(clique_nbr) FROM Cliques);
WHILE clique_loop > 0
DO
SET node
= (SELECT clique_member
FROM Cliques
 WHERE clique_nbr = clique_loop);
UPDATE Cliques
 SET clique_nbr
 = (SELECT clique_nbr
FROM Cliques
 WHERE clique_member = node)
WHERE clique_member
IN (SELECT rgt_member
FROM Friends
 WHERE lft_member = mode);
SET clique_loop = clique_loop - 1;
END WHILE;
END;

As a programming assignment, replace the simple counting loop with one that does not do updates with clique_loop values that were removed in the prior iteration. This can save a lot of work in a larger social network than the sample data used here.

27.7.6 Adding Members

Now that we have a Cliques table, we can do a simple insertion if the new guy belongs to one clique. However, we can have someone who knows a people in different cliques. This will merge the two cliques into one.

CREATE PROCEDURE Clique_Merge
(IN clique_member_1 CHAR(10),
IN clique_member_2 CHAR(10))
UPDATE Cliques
SET clique_nbr
 =(SELECT MAX(clique_nbr)
 FROM Cliques
WHERE clique_member
IN (clique_member_1, clique_member_2))
WHERE clique_nbr
 =(SELECT MIN (clique_nbr)
 FROM Cliques
WHERE clique_member
IN (clique_member_1, clique_member_2));

Deleting a member or moving him to another clique is trivial.

27.8 Conclusion

This chapter is full of handy tricks to use with SQL. But graphs are better handled with a graph database than with SQL. This is no surprise; that is what they were meant to do! The graph database can also handle relationships that do not have just the simple mathematical properties needed for a single relationship. Over Publications is a great source for books on math and they have a good offering in Graph Theory.

 “A First Course in Graph Theory” by Gary Chartrand, Ping Zhang (ISBN: 978-0486483689)

 “Graph Theory” by Ronald Gould (ISBN: 978-0486498065)

 “Introduction to Graph Theory” by Richard J. Trudeau (ISBN: 978-0486678702)

 “Introductory Graph Theory” by Gary Chartrand (ISBN: 978-0486247755)

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.133.158.230