
8. Multi-model Capabilities


Azure SQL is not just a traditional relational database platform. Every modern database must enable you to work with various data formats for different scenarios. Although the traditional normalized relational format is battle-tested and proven optimal for a wide range of scenarios, in some cases you might find that another format is a better fit for the problem you are solving.

One common example is denormalization. Traditionally, you represent domain objects as a set of tables in the so-called normal form described in the previous chapters. This is a well-known database design technique where every complex domain structure, such as a list or array, is placed in a separate table. The normalized model is optimal for highly concurrent workloads where many threads simultaneously update and read different parts of a domain object without affecting the threads that are updating other parts. However, if you have a workload where one thread accesses a single object at a time, a highly normalized database might force you to join many tables to reconstruct the domain object from its physical parts. In this case, you might consider serializing the object as a single unit instead of decomposing it.

Azure SQL enables you to use both relational and non-relational structures in the same physical data model.

The ability to work with relational, structured, and several types of semi-structured formats categorizes Azure SQL as a multi-model database. If you visit https://db-engines.com/en/ranking, a site that scores the popularity of databases, you will notice that all highly ranked relational databases are also classified as multi-model databases, because supporting various data formats is a must-have requirement for all modern database systems.

Leveraging multi-model capabilities in database design

Before we dive deep into the multi-model capabilities of Azure SQL, let us see when you should use them in your applications. Some software engineers like NoSQL concepts because they make a database easier to use: they (apparently) don't require a lot of design upfront, so you can start using the database right away without much effort, and they avoid complex joins between tables. Others prefer the relational model because it is battle-tested and guarantees that you can model any domain regardless of complexity. As a mature software engineer, you should not make dogmatic architectural decisions based on arbitrary preference. The choice between relational and non-relational models, or their combination, should be based on the domain model of your application.

If you are using something like Domain-Driven Design (DDD), you are probably identifying Aggregates in your domain. Aggregate objects in DDD are the core entities in your domain model that represent the main access point for data and reference other entities and values. As an example, Customer might be the primary entity that you will use in your sales management module. This entity has some value objects and references to other entities like Order or Invoice. If you know that client code will access customer orders and invoices through the Customer object, then Customer is an aggregate that binds the other entities that belong to it and represents the main access point for any external component that references any of those entities.

Attention!

Do not confuse Aggregates in DDD with aggregate functions (SUM, AVG, MAX) in SQL terminology.

The dogmatic NoSQL approach would be to serialize the entire Customer aggregate and all related entities (Orders, Invoices) as one big collection that contains everything that you need in your application layer. This is a perfect choice for applications that can fetch all necessary data using a single data access operation. The dogmatic relational approach would be to break every entity into a separate highly normalized table so that different services in your application can access only the minimal set of information they need. This is a perfect choice for highly concurrent applications where different threads simultaneously access and modify different parts of entities or aggregates.

The drawback of dogmatic decisions is that in both cases people assume the access patterns instead of analyzing the actual usage patterns in the business domain. The proper approach is to analyze the data access patterns that will be used and choose the best fit for the domain. Here are some examples of data access patterns that might influence your data design decisions:
  • If you have big and complex forms where you need to preload both Order and Invoice entities after fetching Customer root aggregate and update information in Customer, Invoice, and Order entities once you save data on the form, it makes sense to store dependent entities as denormalized collections. This way you will read and persist all data with a single data access and avoid complex joins that gather data and transactions that span over multiple tables.

  • If you know that different forms and services in your application might simultaneously update Invoice or Order entities that belong to the same Customer aggregate or you are frequently using lazy loading techniques, it makes sense to break entities into separate tables. This way, you can implement granular services that write and read only the objects that are needed. Breaking the entities to separate tables increases concurrency of your system, because different transactions can touch different parts of the entities without blocking each other. Otherwise, they would be synchronized and potentially blocked on each other.

The application access patterns are the key factors that can guide your decision to use normalized or denormalized models.

Classifying domain models

Behavior analysis of your services and identification of data access patterns will enable you to determine the optimal storage design for your domain models. Depending on the data access patterns in your application, you can classify your data models as
  • Highly normalized relational model where you have dependencies and relationships between the entities (like classes in UML diagrams). This model is a perfect choice for highly concurrent applications and services updating or loading different parts of entities.

  • Graph models are special kinds of models where nodes are interconnected with edges, forming a logical graph structure. This structure is a perfect choice for services that frequently break or establish bonds between entities and traverse the relationships to find the best paths or perform "friend-of-a-friend" types of analysis.

  • Non-relational data where information is self-contained into the isolated data entities with very weak or non-existing relationships between them. This structure is perfect for the services that read and save domain objects as a single unit.

Depending on the structure of your data entities, your models can be classified as
  • Structured where all data entities have a uniform or fixed schema. These types of entities can be easily visualized as rows and columns in Excel tables or serialized in CSV format.

  • Semi-structured where data entities have some structure, but it is not always strict or uniform. Data entities have some common properties that are repeated across all entities, and some properties vary. Imagine a key-value collection with the different keys across entities or a hierarchically organized document with some missing values and sub-objects. These objects are typically serialized in JSON format.

  • Unstructured data where patterns vary highly between data entities, and it is hard to find a common structure. Typical examples are textual documents, images, or videos – there is a well-defined format that enables you to read the information, but the combination of information within them makes each entity very different.

Azure SQL is the best fit for relational structured data and also a good fit for semi-structured data that participates in relationships. The querying capabilities of the SQL language enable you to apply the same processing rules over large sets of structured and semi-structured data and to easily traverse foreign key and edge relationships.

Structured information without relationships might be more efficiently stored as CSV or Excel files, especially if you need to store a large amount of data on Azure Data Lake or Azure Blob Storage. It doesn’t mean that you are losing query capabilities because Azure SQL still enables you to easily load these files from external storage and query them as in-database rows. Semi-structured and self-contained documents that are not related to other entities might be more efficiently stored in specialized document databases such as Azure Cosmos DB. Azure Cosmos DB provides many functionalities specialized for querying and indexing self-contained documents that you might leverage if you don’t need to cross-relate entities or implement some complex reports.
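For example, here is a minimal sketch of querying a CSV file stored in Azure Blob Storage as if it were a table (the external data source MyAzureBlobStorage, the file path, and the format file are hypothetical and must already be defined):
SELECT *
FROM OPENROWSET(
        BULK 'data/customers.csv',
        DATA_SOURCE = 'MyAzureBlobStorage',
        FORMATFILE = 'data/customers.fmt',
        FORMATFILE_DATA_SOURCE = 'MyAzureBlobStorage') AS customers;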

Why would you choose Azure SQL for non-relational models?

Azure SQL is a multi-model database that enables you to combine different relational and non-relational models to find the best fit for your scenario.

Unlike traditional relational and NoSQL databases, where you need to decide upfront what physical model you want to use, Azure SQL enables you to combine relational and non-relational concepts and find the model that is the best fit for your needs.

One of the key advantages of multi-model support in Azure SQL is the fact that data models are not mutually exclusive. Azure SQL enables you to seamlessly combine multiple models and leverage the best from all of them. You can create a classical relational model with some columns containing JSON or Spatial data, declare some of the tables as graph nodes and connect them using edges, place JSON columns in memory-optimized tables to leverage the speed and non-locking behavior. Multi-model capabilities can leverage all advanced language and storage features that Azure SQL provides. You can use the same T-SQL language to query both structured and semi-structured data, which enables a variety of applications and libraries to use any data format that you store in Azure SQL database.

The main reason why you would select the multi-model capabilities of Azure SQL is that they are seamlessly integrated with the core battle-tested features of relational databases. The combination of JSON and graph features with advanced querying capabilities, the possibility to use any collation to process strings in JSON documents, Columnstore and memory-optimized objects that provide extreme performance, and in-database machine learning with Python/R gives you an advanced data processing experience that you might not get even in a fully specialized NoSQL database.

Azure SQL enables you to represent your models using the following non-relational concepts:
  • JSON that enables you to integrate your databases with a broad range of web/mobile applications and log file formats or even to denormalize and simplify your relational schema. JSON functionality also simplifies a lot of the work needed to be done by a developer to communicate with Azure SQL. You may ask Azure SQL, in fact, to return the result as JSON documents instead of a table with columns and rows, if doing so can simplify your code.

  • Graph capabilities that enable you to represent your data model as a set of nodes and edges. This structure is an ideal choice in the domains where the domain entities are organized in network structure and where you can take advantage of a specialized query language to query graph data.

  • Spatial support that enables you to store geometrical and geographical information in databases, index them using specialized spatial indexes, and use advanced spatial queries to retrieve the data.

  • XML support that enables you to store XML documents in the tables, index XML information using specialized XML indexes, query XML data using T-SQL or XQuery languages, and transform your relational data to or from XML format.

In the following sections, you will learn about the core multi-model capabilities that exist in Azure SQL Database.

JSON support

JSON (JavaScript Object Notation) is a popular data format initially used as a data exchange format to transfer data between web servers and browsers, but it is also used to store semi-structured information such as settings and log information. It is the mainstream format for representing self-contained objects, especially in modern NoSQL databases such as Azure Cosmos DB, MongoDB, and so on.

Azure SQL enables you to parse JSON text and extract information from JSON documents, store JSON text in tables like any other textual value, and produce JSON text from a set of rows.

The core JSON functionalities are shown in Figure 8-1.
Figure 8-1

JSON functionalities in Azure SQL

Azure SQL enables you to work with JSON text that you store in tables or send from your application by applying the built-in functions JSON_VALUE, JSON_QUERY, ISJSON, and JSON_MODIFY. These functions enable you to parse JSON text and to extract and modify information in JSON documents.

If you want to transform your semi-structured data represented in JSON format and load JSON documents into tables, you can use the OPENJSON function. The OPENJSON function takes an array of objects in JSON format and splits it into a set of rows. You can use this Table-Valued Function to transform JSON into tables and load data in a relational format.

Azure SQL also enables you to format the results of SQL queries as JSON. The FOR JSON clause specifies that the results of a query should be returned in JSON format instead of as a set of rows.

These functionalities will enable you to implement most of the JSON handling that your applications require. In addition to the built-in JSON functions and operators, you can combine other features of Azure SQL like indexes, collations, and advanced query capabilities on JSON data. Thanks to JSON support, it is very easy to integrate Azure SQL in Full-Stack or Back-End solutions, as – from a developer perspective – you will just send and receive JSON from Azure SQL.

Here's a Full-Stack example of the ToDoMVC sample app, implemented using NodeJS, Azure Functions, and Azure SQL, using JSON as the transport format:

https://github.com/Azure-Samples/azure-sql-db-todo-mvc

Formatting query results as JSON document

Modern applications are commonly implemented as distributed services that exchange data via HTTP endpoints. In most cases, the data is exchanged in JSON format.

Back-end developers spend a lot of their time getting data from the database and serializing the results as JSON text that will be returned to the caller (e.g., front-end code). You might find a large amount of back-end code in a REST API that is just a wrapper around SQL queries. Some model classes are used just to temporarily load a set of rows returned by SQL queries into memory and then immediately serialize these memory objects as JSON text that will be returned to the caller. This is known as the DTO (Data Transfer Object) pattern, where model classes just transfer data between application layers. Many DTO-like model classes are not even used by back-end code and represent just a template that data access frameworks use to load data and serialize it as JSON. This might affect performance and especially memory consumption, because data returned by a query is copied into model memory objects and then copied again into the JSON text returned as the result. Besides resource consumption, this approach might increase the latency of your services: every transformation is blocking, because the results returned by the query must be fully loaded into a collection of memory objects before a serializer (like JSON.NET in .NET languages) serializes the entire collection as JSON text. Not to mention the fact that such code is just very low added-value plumbing code.

Azure SQL enables you to dramatically simplify this process. The FOR JSON query clause enables you to specify that the query results should be returned as a JSON string and not as a collection of rows. This way, you can directly stream the results of your query to the client instead of building layers and wrappers that just pass parameters from the client to the query and transform the results into JSON.

The following example shows a simple ASP.NET MVC action method that will be executed once a user sends an HTTP GET request to the URL http://<web-application-domain>/LogAnalytic/CountBySeverity. The action gets the results of the query as JSON text and directly streams it into the HTTP response body:
public class LogAnalyticController : Controller {
  // Connection to Azure SQL (initialization omitted for brevity)
  private SqlConnection connection;

  [HttpGet]
  public void CountBySeverity() {
    var QUERY = @"SELECT severity, total = COUNT(*)
           FROM WebSite.Logs
           GROUP BY severity
           FOR JSON PATH";
    // QueryInto (Dapper.Stream) streams the JSON result into the response body
    connection.QueryInto(Response.Body, QUERY);
  }
}

This action method uses .NET, the micro-ORM Dapper, and the Dapper.Stream extension that enables you to execute a SQL query on a connection to Azure SQL. The query has a FOR JSON clause that instructs Azure SQL to return the results in JSON format instead of tabular format. The QueryInto method – from the Dapper.Stream extension – streams the JSON result into the body of the HTTP response. This way, you need two C# lines of code to turn your query into a JSON web service. The FOR JSON clause enables you to go from any SQL query to a fully functional REST API with a couple of lines of code.

Storing JSON documents

In some cases, you will have data structures that have high variety, and these structures could not be effectively represented as normalized schema. In this case, you might follow the NoSQL approach and serialize complex structures as JSON text.

JSON is, by definition, a Unicode text format, so it is represented using the NVARCHAR type in Azure SQL. You don't need specialized types, special client-side drivers, or APIs in your applications to send or retrieve JSON data. Any string type in most programming languages can be used to read or write content in JSON columns. The following example shows a simple table that stores website log messages; it has a few fixed columns (logDate and severity) that are common to all log messages and a log column that contains all the variable information that can be found in logs:
create table WebSite.Logs (
    _id bigint identity,
    logDate datetime2,
    severity tinyint,
    log nvarchar(max)
);

This way, you can store semi-structured data or data with volatile structure without creating a custom subset of tables for every new variation. In this scenario, log messages are write once and read many times, so we don’t need to worry about efficient updates of semi-structured values in JSON log columns.

If you decide to store data in a table, you need to understand how to parse the values from a JSON column and use them in queries, speed up queries using indexes, and ensure that JSON content is valid.

Querying JSON data

Azure SQL enables you to run standard SQL queries that use both classic scalar columns and the values extracted from JSON text. Azure SQL provides four simple functions that process JSON text:
  • JSON_VALUE(json, path_to_value) function will return a scalar value from JSON document on the specified path.

  • JSON_QUERY(json, path_to_object) function will return a complex object or array from JSON document on the specified path.

  • JSON_MODIFY(json, path_to_object, json_object) takes the JSON document provided as the first argument, locates the value or object on the path specified by the second parameter, and replaces it with the value provided as the third parameter. The first argument is not modified; the modified JSON text is returned. If you provide a NULL value as the third parameter, the value on the path will be deleted.

  • ISJSON(json_string) returns value 1 if the string provided as argument is properly formatted JSON and 0 otherwise.

Using these simple functions, you can parse any JSON text in columns, parameters, or variables and extract the values in any query clause. The path in these functions represents the location of a value within the JSON document. The syntax used in paths is easy to understand because it resembles the way most modern object-oriented languages reference fields within an object (e.g., $.info[3].name.firstName).
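As a small self-contained sketch, the following snippet applies all four functions to a JSON document stored in a variable (the document and its values are made up for illustration):
DECLARE @json NVARCHAR(MAX) = N'{"info": {"name": {"firstName": "Anna"}, "skills": ["SQL", "JSON"]}}';
SELECT
    ISJSON(@json) AS is_valid,                                 -- 1: the text is well-formed JSON
    JSON_VALUE(@json, '$.info.name.firstName') AS first_name,  -- scalar value: 'Anna'
    JSON_QUERY(@json, '$.info.skills') AS skills,              -- complex value: ["SQL", "JSON"]
    JSON_MODIFY(@json, '$.info.name.firstName', 'Ann') AS updated_json; -- returns a modified copy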

The following query creates a report that uses both standard and JSON columns of the WebSite.Logs table shown in the previous section:
SELECT
    severity,
    ip = JSON_VALUE(log, '$.ip'),
    duration = AVG(CAST(JSON_VALUE(log,'$.duration') as int))
FROM
    WebSite.Logs
WHERE
    CAST(JSON_VALUE(log,'$.date') as datetime) > @datetime
GROUP BY
    severity, JSON_VALUE(log, '$.ip')
HAVING
    AVG(CAST(JSON_VALUE(log,'$.duration') as int) ) > 100
ORDER BY
    AVG(CAST(JSON_VALUE(log,'$.duration') as int) )

Scalar values from the severity column are directly referenced, and scalar values from the JSON column are extracted using the JSON_VALUE function. In this query, you can see the true advantage of the multi-model capabilities of Azure SQL: you are using one SQL language with a few additional functions to process semi-structured data mixed with scalar relational columns. If you are familiar with the standard SQL language, learning the JSON extension will be an easy task.

JSON paths

In the previous examples, you have seen that all JSON functions use a path that references properties within the JSON that is going to be parsed. JSON paths in Azure SQL can have the following elements:
  • $ represents the current JSON document that is provided as a first argument of JSON function.

  • Field references start with a dot and reference a sub-property within the current context. As an example, $.info.name.firstName will find an "info" property, then the name property within that object, and then the firstName property within name.

  • Array references that reference array elements by index. As an example, $.children[2].name will find a "children" array, take the element with index 2, and then find the name property within that object. Indexes in array references are zero-based.

With this syntax of JSON paths, you can reference any property within the JSON document using the same style that is used in object-oriented languages.
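As a quick illustration of these path elements, the following hypothetical snippet combines field and array references:
DECLARE @doc NVARCHAR(MAX) = N'{"children": [{"name": "Ann"}, {"name": "Bob"}, {"name": "Cleo"}]}';
-- $.children[2] references the third element because array indexes are zero-based
SELECT JSON_VALUE(@doc, '$.children[2].name') AS third_child;  -- returns 'Cleo'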

Collation awareness

Collations in Azure SQL are the rules that define how the query processor must compare and sort values. For example, in the French language, the last accent in each word determines the sort order, so the following four words would be sorted this way:
  1. cote

  2. côte

  3. coté

  4. côté

JSON standards don't define how values should be compared and sorted in documents; however, you can leverage Azure SQL collations to apply language-specific rules while working with JSON data. If you specify the collation French_100_CS_AS on an NVARCHAR column that contains JSON documents, Azure SQL will use French language rules to compare and sort results based on the values extracted from the JSON content of this column. Even if the collation is not specified on the column, you can always explicitly apply the required collation in the query:
SELECT JSON_VALUE(log,'$.msg')
FROM WebSite.Logs
ORDER BY JSON_VALUE(log,'$.msg') COLLATE French_100_CS_AS

Collation awareness is an important feature, especially for international applications. The fact that JSON functions can leverage it makes them a powerful addition to the multi-model capabilities of Azure SQL.

Ensuring data integrity in JSON documents

Azure SQL will not check whether you have placed a valid JSON document in an NVARCHAR column. If you are sure that you always land valid JSON documents in the table, or if you have validation in your application logic that ensures this, this is not an issue. Validating JSON format requires some time and resources, so Azure SQL will not slow down inserts and updates by doing it implicitly. If you want to ensure that your JSON documents are always valid, you can explicitly add a CHECK CONSTRAINT that performs this validation:
ALTER TABLE WebSite.Logs
ADD CONSTRAINT [Data should be formatted as JSON]
CHECK (ISJSON(log) = 1);

This way, you have full control over the storage process, and you choose when JSON data should be validated. In this example, a simple rule is used to make sure that values inserted in the log column are valid JSON documents. You can create custom rules that extract the values from a JSON document using JSON_VALUE function and compare the returned value in a separate CHECK CONSTRAINT.
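For example, a hypothetical custom rule could verify that the severity property embedded in the JSON document falls within an expected range (the range is assumed here for illustration):
ALTER TABLE WebSite.Logs
ADD CONSTRAINT [Log severity must be in range]
CHECK (CAST(JSON_VALUE(log, '$.severity') AS tinyint) BETWEEN 0 AND 5);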

Indexing JSON values

Indexes generally help to boost the performance of your queries. Azure SQL doesn't have a specialized index for JSON documents, but you can effectively use the following standard indexes to improve the performance of JSON queries:
  • Clustered Columnstore indexes compress JSON documents and enable high-performance analytics on JSON values.

  • B-Tree indexes enable you to quickly find the rows with some value in the JSON column.

A Clustered Columnstore index is a core analytical technology in Azure SQL that provides high compression of data and fast analytic capabilities. It might be a good solution if you have a large number of documents in the table because it can efficiently compress JSON documents, decreasing the IO required to load documents and improving the performance of the analytic reports that you might run on your documents. The following code adds a Clustered Columnstore index on a table containing JSON documents:
CREATE CLUSTERED COLUMNSTORE INDEX cci ON WebSite.Logs;

Clustered Columnstore indexes on top of tables with JSON documents are a good choice if you need to do analytics on JSON documents. Columnstore indexes will apply so-called batch mode processing where they will leverage vector processing and SIMD instructions that can boost performance of your analytic queries over JSON data.

In the scenarios where you need to filter or sort documents using some specific property in JSON text, you can also leverage standard SQL language features such as computed columns and indexes. Computed columns can expose a value from JSON content with the JSON_VALUE function, and you can create standard B-tree index on this computed expression, as it is shown in the following code:
alter table WebSite.Logs
add [$severity] AS JSON_VALUE(log, '$.severity');
go
create index ix_severity on WebSite.Logs ([$severity]);

The computed column $severity is just a named expression that exposes a value from the JSON content stored in the log column. It doesn't use additional space unless you explicitly pre-compute the extracted value by adding the PERSISTED keyword. Whenever you filter or sort rows using the JSON_VALUE(log, '$.severity') expression, Azure SQL will know that there is an index on the matching expression and use it to speed up your query.
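For example, a filter like the following hypothetical one matches the indexed expression, so Azure SQL can satisfy it using ix_severity:
SELECT logDate, severity
FROM WebSite.Logs
WHERE JSON_VALUE(log, '$.severity') = '3';  -- same expression as the computed column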

Importing JSON documents

Modern services send data formatted as JSON, and in many cases, you will need to store these JSON values in Azure SQL tables. Azure SQL provides the OPENJSON Table-Valued Function that accepts a textual parameter containing JSON text, parses it, and returns the values from the JSON document in tabular format. If you want to import an array of objects by providing them as a parameter to a procedure, you can use something like the following code to insert an incoming JSON message into a table:
CREATE OR ALTER PROCEDURE InsertDeviceLog (@msg nvarchar(max))
AS
INSERT INTO WebSite.Logs(logDate, severity, log)
SELECT *
FROM OPENJSON(@msg)
     WITH (
        logDate datetime2 '$.properties.DeviceTime',
        severity tinyint '$.properties.severity',
        log nvarchar(max) '$.info' AS JSON
     );
GO

The first argument of the OPENJSON function is the text containing the JSON array that should be parsed. The function expects an array of JSON objects, where every object is converted into a new row in the result. If the array that should be converted is placed somewhere within the document, you can provide a second parameter that represents the JSON path where the OPENJSON function should find the array of objects to transform. Every object in the referenced array becomes one row in the resultset that will be returned.

The WITH clause is used to define the output schema and to map the properties of the objects in the JSON array to the output columns. The clause contains one entry per output column with the following information:
  • Name of the output column. If an optional JSON path is not provided, this name is also the key of the JSON property that should be returned as a result value in this column.

  • SQL type of the output column. OPENJSON will use the semantics of the CONVERT T-SQL function to convert the textual value in the parsed JSON text to a SQL value.

  • Optional JSON path that will be used to reference the property in the JSON object that contains the value that should be returned. If this path is not specified, the name of the column will be used to reference the property. As an example, if the column name is severity and the JSON path is not specified, then OPENJSON will try to find a value on the $.severity path.

  • Optional AS JSON clause. By default, OPENJSON will use JSON_VALUE function to get the value from the object that is currently converted. Therefore, it cannot return sub-objects or sub-arrays. If there is a JSON object or array on that path, it will not be returned unless you specify the AS JSON clause.

The result of the OPENJSON function will be a resultset that can be inserted into the table.
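As a usage sketch, the stored procedure above could be called with a JSON array shaped to match the WITH clause mapping (the payload below is hypothetical):
EXEC InsertDeviceLog @msg = N'[
    {"properties": {"DeviceTime": "2021-06-01T08:15:00", "severity": 2},
     "info": {"ip": "10.0.0.1", "duration": 120}},
    {"properties": {"DeviceTime": "2021-06-01T08:16:00", "severity": 4},
     "info": {"ip": "10.0.0.2", "duration": 42}}
]';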

Graph structures

Graphs are complex structures with objects (nodes) that are connected with relationships (edges). Scenarios where you would use graphs might be:
  • Social networks where you have people connected with relationships like friends, family, partners, or co-workers

  • Transportation maps where you have towns and places connected with roads, rivers, and flight lines

  • Bill-Of-Materials solutions where you have parts connected to other parts that are themselves connected to other parts and so on

Graph structures can be represented as tables and foreign key relationships in some scenarios. The key reason to choose a graph model instead of a relational model is a project where relationships are dominant and a small number of entities are connected to each other in many different, direct and indirect, ways. In fact, the key difference in that domain is the transitive nature of the relationships. The term transitive is used here in its mathematical meaning: if a is connected to b and b is connected to c, then a is connected to c. In these cases, you are not interested only in single-hop relationships (like fetching order lines for an order via a foreign key relationship), but in all the hops needed to move from one node to another. To do that efficiently, you want to leverage graph-specific semantics for query processing. Examples might be the ability to find the transitive closure (is a connected to z?), to find the shortest path between two objects, or to recursively traverse all relationships starting from a specified object.

In Azure SQL, nodes and edges are represented using special tables. You may be wondering why using relational tables has been the chosen implementation to represent graph elements and if that is a good idea at all.

First, it is very common to have some information tied to nodes and edges, and tables are a perfect structure to hold that information. A second benefit of using tables is that the Azure SQL query optimizer can be leveraged to improve overall query performance. The third benefit of this choice is that, as they are just tables, columnstore and indexes can be used to improve performance even further.

As an example, let’s look at the simple structure that models Airports and the Flight lines that represent connection between them. This domain might be modeled using the following structure:
CREATE TABLE Airport (
    AirportID int PRIMARY KEY,
    Name NVARCHAR(100),
    CityID int FOREIGN KEY REFERENCES Application.Cities(CityID)
) AS NODE
GO
CREATE TABLE Flightline (
    Name NVARCHAR(10)
) AS EDGE;

Airport is a node of the graph, but it behaves as a regular table. We can add any column, index, or constraint to describe the information in this node. This is another example of how Azure SQL multi-model capabilities enable you to combine advanced battle-tested database features in new scenarios. A node table has a hidden column that represents a unique identifier of the node, which will be explained later.

Flightline is an edge table that links two Airport nodes. The columns in this table represent additional information that describes the relationship between two nodes. The edge table shown in this example is a generic edge; there is no specification that it is designed to connect two Airports and not some other nodes. In order to explicitly specify that Flightline connects Airports, we need to introduce the following edge constraint:
ALTER TABLE Flightline
ADD CONSTRAINT [Connecting airports]
                    CONNECTION (Airport TO Airport)
                    ON DELETE CASCADE;

This constraint specifies that you cannot use Flightline edge to connect nodes other than Airport. This constraint also defines what would happen with the edge if the node is deleted.
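If you need to connect two nodes manually, you insert a row into the edge table and bind it through the $from_id and $to_id pseudo-columns. A minimal sketch, assuming that airports with AirportID values 1 and 2 already exist and using a hypothetical flight line name:
INSERT INTO Flightline ($from_id, $to_id, Name)
VALUES (
    (SELECT $node_id FROM Airport WHERE AirportID = 1),  -- source airport node
    (SELECT $node_id FROM Airport WHERE AirportID = 2),  -- destination airport node
    'JU110'
);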

Loading graph data

Node tables are classic tables that can be loaded or read like any other table. Every node table has a hidden column called $NODE_ID that you will use only in some special scenarios. An edge table has hidden "foreign key relationship" columns that are used under the hood to connect edges with associated nodes using $NODE_ID values. This is important to know when you import data into the edges, because you need to fetch the $NODE_ID value of the related nodes to bind them. Let us assume that the Airport nodes are already loaded and that we are importing a set of edges from blob storage using the OPENROWSET function. In order to load data into the edge table, we need to join the loaded records with the nodes, find the $NODE_ID values, and insert them into the edge table together with the imported flight line name:
INSERT INTO Flightline ($from_id, $to_id, Name)
SELECT f.$NODE_ID, t.$NODE_ID, a.Name
FROM OPENROWSET(
         BULK 'data/flightlines.csv',
         DATA_SOURCE = 'MyAzureBlobStorage',
         FORMATFILE='data/flightlines.fmt',
         FORMATFILE_DATA_SOURCE = 'MyAzureBlobStorage') as a
     JOIN Airport f ON f.Name = a.FromAirport
     JOIN Airport t ON t.Name = a.ToAirport;

The flight lines CSV file contains information about the source and destination airports and the name of the flight line between them. We need to join this data with Airport nodes by the Name column and get the $NODE_ID values that should be imported.

Querying graph data

Once we load the data, we can query it using pattern-matching expressions inspired by the Cypher language (www.opencypher.org/), one of the most common ways to query graph data. The following query will traverse all paths from a source airport node via a flight line edge to another airport node:
SELECT
      src.Name, line.Name, dest.Name
FROM
      Airport src, Flightline line, Airport dest
WHERE
      MATCH(src-(line)->dest)
AND
      src.Name='Belgrade';

This query will return all destinations reachable from Belgrade. The MATCH clause defines that a path from the source airport (src) to the destination airport (dest) should be established via the flight line table (line).
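MATCH patterns can also be chained across multiple edges. As a sketch, the following hypothetical query finds all destinations reachable from Belgrade with exactly one stop:
SELECT src.Name, via.Name, dest.Name
FROM Airport src, Flightline f1, Airport via, Flightline f2, Airport dest
WHERE MATCH(src-(f1)->via-(f2)->dest)
AND src.Name = 'Belgrade';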

The main advantage of graph processing support in Azure SQL is the ability to query across the edges of the nodes. As an example, SHORTEST_PATH enables you to find the shortest path between two nodes. You can leverage this function to find the shortest route between two towns, as shown in the following code:
WITH routes AS (
   SELECT
      src.Name,
      STRING_AGG(dest.name, '->')
            WITHIN GROUP (GRAPH PATH) AS path,
      COUNT(dest.name)
            WITHIN GROUP (GRAPH PATH) AS stops,
      LAST_VALUE(line.name)
            WITHIN GROUP (GRAPH PATH) AS lastFlight,
      LAST_VALUE(dest.name)
            WITHIN GROUP (GRAPH PATH) AS destination
FROM
      Airport src,
      Flightline FOR PATH line,
      Airport FOR PATH dest
WHERE
      MATCH(SHORTEST_PATH(src(-(line)->dest)+))
AND
      src.Name='BEG'
)
SELECT TOP (10) path, stops, lastFlight, destination
FROM routes
WHERE destination IN ('JFK', 'SEA');

The SHORTEST_PATH clause within the MATCH clause will find the shortest path between starting and ending locations that are not directly connected. The STRING_AGG aggregate concatenates all airport names on the path and displays them with the arrow (->) separator. COUNT and LAST_VALUE show the number of stops on the itinerary and the final flight and town on the shortest route. Once the shortest path exploration is finished, we select the destination towns in the final query.

Graph processing capabilities in Azure SQL enable you to reduce complexity of your models and queries that should analyze different paths and relationships between tables.

Spatial data

Representing spatial objects (places, roads, country borders) is something that doesn't ideally fit into a structured relational model in normal form. Although you could represent a road or a border as a set of small straight lines, where every line is stored in a separate row with its ends connected to the lines that continue the road, this is not an efficient representation.

The queries that you would run against spatial objects usually have conditions like “is this place within the shape” or “how far is the place from the road.” These are not the typical queries that you would describe using standard SQL language.

Azure SQL has specialized functionalities that conform to the Open Geospatial Consortium (OGC) standards that enable you to implement applications that work with spatial data:
  • Specialized types that can be used to represent complex geometrical and geographical objects and shapes (Point, LineString, Polygon). All shapes can be represented as geometry or geography models, which will be described in more detail soon.

  • Functionalities specialized for spatial querying, such as finding the distance between two points (STDistance) or determining whether an area contains a specified point (STContains).

  • Specialized indexes that are optimized for spatial types of queries.

This set of capabilities enables you to create advanced queries that are specific for spatial domains.

Remember the airport and flight line model described in the previous section. A graph model that connects airports (nodes) using flight lines (edges) might be perfect for finding the shortest route between two towns. However, imagine that you need to find all flight lines that cross Nebraska, or unnamed crossroads where two highways intersect. If there is no explicit relationship between highways and all crossroads or countries, it would be impossible to answer these questions.

In order to solve these problems, we would need to extend the graph model with geographical data, as shown in the following script:
CREATE TABLE Airport (
    AirportID int PRIMARY KEY,
    Name NVARCHAR(100),
    Location GEOGRAPHY,
    CityID int FOREIGN KEY REFERENCES Application.Cities(CityID)
) AS NODE
GO
CREATE TABLE FlightLine (
    Name NVARCHAR(10),
    Route GEOGRAPHY
) AS EDGE;

Azure SQL has two main base types that can be used to represent geometrical and geographical figures:
  • GEOMETRY type represents data in a Euclidean (flat) coordinate system.

  • GEOGRAPHY type represents data in a round-earth coordinate system.

Geometry is perfect for representing relatively small objects like buildings or interiors; Geography is better suited to represent much bigger shapes, like rivers, cities, or nation boundaries, and in general anything that needs to work on a close approximation of the Earth's surface to avoid errors.

Within these two types, as specified by Open Geospatial Consortium , you can create more specific types:
  • Point used to represent 2D places like towns

  • LineString and CircularString that can represent open or closed lines like roads or borders

  • Polygon and CurvePolygon used to represent areas like countries

  • MultiPoint, MultiLine, and MultiPolygon representing a set of disconnected geographical objects that logically belong together (an archipelago with a set of islands might be represented with MultiPolygon)

In Azure SQL, once you have created a Geometry or Geography column in a table, you can use any of these types to build the shape you need. You can even use more than one at the same time, using Collections or “Multi” types.
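As a small sketch, assuming the WGS84 coordinate system (SRID 4326) and made-up coordinates, here is how you might construct a Point and a Polygon and combine them in a query:
DECLARE @town GEOGRAPHY = GEOGRAPHY::STGeomFromText('POINT(-100.0 42.0)', 4326);
DECLARE @area GEOGRAPHY = GEOGRAPHY::STGeomFromText(
    'POLYGON((-104.05 41.00, -98.50 41.00, -98.50 43.00, -104.05 43.00, -104.05 41.00))', 4326);
SELECT @area.STContains(@town) AS town_inside_area;  -- returns 1: the point falls inside the polygon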

Querying spatial data

Spatial data types in Azure SQL have built-in methods that enable you to easily query spatial data. Under the assumption that all information about the flight lines and airports is populated, we can easily find the routes crossing the state of Nebraska:
DECLARE @nebraska GEOGRAPHY = (
      SELECT TOP (1) Border FROM Application.StateProvinces
      WHERE StateProvinceName = 'Nebraska'
);
SELECT *
FROM FlightLine
WHERE Route.STIntersects(@nebraska) = 1;

The STIntersects method determines whether two shapes intersect at some place. This method returns 1 if a geography instance intersects another geography instance and 0 otherwise.

The STDistance method measures the distance between two objects and enables you to find the objects closest to a specific coordinate. The following query returns the five airports closest to the current location of some object:
DECLARE @currentLocation GEOGRAPHY = 'POINT(-121.626 47.8315)';
SELECT TOP(5) *
FROM Airport
ORDER BY Location.STDistance(@currentLocation) ASC;

Spatial queries enable you to easily perform analyses that would otherwise require a lot of time spent on specific mathematical transformations. You don't have to write those transformations yourself or move the data to another, more specialized solution to perform the calculation, which makes your solution much more efficient.

Spatial indexes

In theory, the STIntersects method might be implemented as a self-contained function with complex mathematical calculations that try to determine the relationships between the figures. However, due to the complexity of the calculations, running that kind of function on many objects would be both time- and CPU-consuming. For efficient processing, Azure SQL uses a special type of index: the spatial index.

Spatial indexes internally create a grid (shown in Figure 8-2) where cells may or may not overlap with the parts of the figures that should be indexed.
Figure 8-2

Indexing spatial objects in 4x4 grid

Azure SQL creates a grid, and for every spatial object that should be indexed, it records whether it fully overlaps, partially overlaps, or doesn't overlap at all with the cells in the grid. This process is known as tessellation. With this technique, a method like STOverlaps that needs to determine whether two objects overlap does not immediately need to apply complex mathematical calculations to determine whether there is some intersection between the objects. If an index is available, it will first use the index to check whether there is at least one cell in the grid that belongs to both spatial objects or whether a cell that belongs to one object also partially overlaps with the other object. If this is true, then the objects overlap, and this is the faster way to determine the intersection. If there are no cells that at least partially overlap with both objects, then the objects do not overlap. If there are some cells that partially overlap with both objects, the objects might or might not overlap. Only in this case will Azure SQL apply complex spatial calculations, and not on the entire area of the objects, just on the smaller cells where they might potentially overlap. Although this might be a CPU-consuming operation, it is performed on a small cell and probably the small part of the objects that is within this cell. Therefore, this action will be a few orders of magnitude faster than the naïve approach that would compare all parts of the objects.

Spatial indexes are created using the special CREATE SPATIAL INDEX syntax:
CREATE SPATIAL INDEX SI_Flightline_Routes
   ON Flightline(Route)
   USING GEOGRAPHY_GRID
   WITH (
   GRIDS = ( MEDIUM, LOW, MEDIUM, HIGH ),
    CELLS_PER_OBJECT = 64 );

Besides the spatial column that should be indexed, you can specify the characteristics of the grid that will be created to index the spatial values, such as the area that should be covered or the density of the tessellation grid used for indexing. More granular indexes will be bigger and need more time to scan all grid cells and determine whether the parts of the routes overlap with every cell. However, a denser grid makes the worst-case scenario, where objects partially overlap, much faster, because smaller parts of the objects are processed using the complex math rules. The right size and parameters of the indexes depend on your data, and you might need to experiment and rebuild the index with different parameters to find the best fit for your data.

If you are unsure at the beginning, you can omit the bounding box specification, and Azure SQL will try to guess the best bounding box and tessellation for you. Of course, the automatically defined values may not be the perfect ones for your scenario, so it is good to know that you can specify them manually if needed.

Geometry vs. Geography

As mentioned before, Azure SQL has two classes of spatial data types that are used in different scenarios:
  • Geometry data types are used to represent planar mathematical shapes in a classic 2D coordinate system.

  • Geography data types are used to represent spherical objects and shapes projected onto a 2D plane.

The difference between geometry and geography types is one of the most important things that you need to understand to develop spatial applications.

Geometry types are used to represent objects placed in a classic 2D coordinate system, and you can imagine them as objects that you could draw on a plain piece of paper. Distances and sizes of the objects are measured the same way you would measure objects drawn on a paper or a board. If you take a map and want to find the shortest flight trajectory between Belgrade, Serbia, Europe, and Seattle, Washington, US, you would probably draw a straight horizontal line going via France and the US east coast. This is geometrically the shortest line between them. However, due to the rounded shape of the Earth, the shortest trajectory (called a geodesic) goes via Iceland, Greenland, and Canada. The Geography data model considers the Earth's actual shape and is able to find the real-world shortest distance and path.

Mapping the Earth's surface to a 2D plane is one of the most difficult spatial problems. The famous mathematician Carl Gauss proved in his Theorema Egregium (Latin for "Remarkable Theorem") that spherical surfaces cannot be mapped to 2D planes without distortion. You might notice on some maps that territories closer to the poles, such as Greenland, Antarctica, the north of Canada, and Russia, look stretched or sometimes bigger than they actually are. This happens because dense coordinates closer to the poles must be "stretched" to project them into 2D coordinates. To make things even harder, the Earth's surface is neither spherical nor even an ellipsoid. The irregular shape of the Earth and proximity to the poles force people to use different strategies for mapping to the 2D plane. There are mapping rules that preserve correct distances, shortest paths, or shapes, but in every strategy, something will be distorted. That's why every geography object has an associated Spatial Reference Identifier (SRID) that describes what spatial transformation strategy is used to translate the object from the Earth globe onto the 2D plane. The SRID describes what coordinates are used (latitude/longitude, easting/northing), the unit of measure, where the coordinate origin is, and so on. Azure SQL uses the information from the SRID to compare the positions of objects.

In the following example, you can see how to transform coordinates to the Geography line by specifying that the World Geodetic System 1984 (WGS84) with SRID 4326 is used for transformation:
DECLARE @g GEOGRAPHY;
SET @g = GEOGRAPHY::STGeomFromText('LINESTRING(-122.360 47.656, -122.343 47.656)', 4326);
SELECT @g.STSrid;

Some countries need to use multiple SRIDs on their territory, especially if their north and south borders are far apart, as in Chile. In these scenarios, you would need to align the SRIDs before comparing the positions of geography objects. Specifying different SRIDs would result in different positions or shapes, and measuring the differences between spatial objects whose coordinates were determined using different SRIDs would lead to wrong results. Therefore, in Azure SQL, spatial operations cannot be performed between spatial objects with different SRIDs.

WGS84 is the most commonly used standard and is the one used by the GPS system in your phone or car. If you are unsure which SRID to use, chances are very good that WGS84 will work perfectly for you.

The good news is that all these complex transformations are built into Azure SQL. The only thing you need to do is leverage the functionalities and learn the basic principles that will help you understand how to use the spatial features.

XML data

The XML data type is the older brother of JSON. This feature was introduced in the SQL Server database engine between the 2000 and 2005 releases, when XML was the mainstream format for data exchange between different applications.

XML support in Azure SQL is similar to the JSON support described in the previous sections. If you need to parse XML data or format results of the queries as XML, you can use the following functionalities:
  • OPENXML table value function that can parse an XML document

  • FOR XML clause that can format results of the query as XML document

  • XML type with methods for processing values in XML documents

If you have understood the OPENJSON function and FOR JSON clause explained in the section about JSON support, then you already understand the purpose of the OPENXML function and FOR XML clause. The differences between these XML functionalities and the matching JSON functionalities are trivial, so they will not be explained in more detail.
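For instance, here is a minimal sketch, mirroring the earlier FOR JSON example, that returns the same aggregation as an XML document instead:
SELECT severity, total = COUNT(*)
FROM WebSite.Logs
GROUP BY severity
FOR XML PATH('log'), ROOT('logs');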

The key difference between JSON and XML support in Azure SQL is the native XML type. Unlike JSON support, where JSON text is stored in the native NVARCHAR type, XML has a dedicated SQL type. XML is a standardized type in many languages (e.g., System.Xml.XmlDocument in the .NET Framework), so it makes sense to have parity in a SQL type. The practical difference is that with JSON you use string-like functions to parse the text, while XML content is represented as an object with various methods to extract data.

Querying XML data

XML type has the following member methods that you can use to extract value from and manipulate XML:
  • value(path, type) that returns the value of a node or attribute from the XML object and automatically converts it into a SQL type. You need to specify a standard XPath expression that targets a single value in the XML document.

  • query(path) – This method returns an object from the XML document on a specified XPath expression.

  • nodes(path) is very similar to OPENXML/OPENJSON functions, and it is used to transform an array of XML elements on the specified path to a set of rows that can be used in a FROM clause.

  • exist(path) is a method that checks whether there is an element on the specified path.

  • modify(xml_dml) enables you to insert, delete, or replace values of some nodes in the XML document using an XML DML expression.

XML querying capabilities will be explained using the following example:
DECLARE @i INT = 47;
DECLARE @x xml;
SET @x='<Family id="1804">
    <row id="17"><name>Robin</name></row>
    <row id="47"><name>Lana</name></row>
    <row id="81"><name>Merriam</name></row>
</Family>';
SELECT
  family_id = @x.value('(/Family/@id)[1]', 'int'),
  family_81_name = @x.value('(//row[@id=81]/name)[1]', 'varchar(20)'),
  family_name = @x.query('//row[@id=sql:variable("@i")]/name')
SELECT
  family_member = xrow.value('name[1]', 'varchar(20)'),
  family_member_id = xrow.value('@id[1]', 'varchar(20)'),
  family_member_xml = xrow.query('.')
FROM
  @x.nodes('/Family/row') AS Members(xrow)
WHERE
  xrow.value('@id[1]', 'int') < 50
AND
  xrow.exist('.[@id > 5]') = 1

The first query uses the value member function of the @x variable to extract the family identifier and the name of the family member with id 81, and the query function to get the name of the member whose identifier equals the variable @i.

The second query takes all /Family/row nodes from the XML document as a rowset, under the condition that the id attribute of each row is less than 50 and greater than 5. Every node that satisfies this condition is returned as a column named xrow. The value() method is used to extract the value that will be compared with 50 using a SQL operator, while the exist() method is used to push the predicate directly down to the XML variable. Finally, the value() and query() methods are used to get the name, identifier, and XML content from each returned row.

Another interesting feature is the ability to bind the values of SQL variables or columns in XPath expressions. In the preceding example, you can see that the third expression in the first SELECT clause uses the SQL variable @i from the outer script. This is a flexible way to specify how to find the data.

Another handy functionality is the ability to update XML documents directly, without needing to parse them, transform them into relational format, and then reconstruct the XML again using the FOR XML clause. Azure SQL provides the modify() method, where you can specify an expression that will insert, delete, or replace a value in the XML document:
SET @x.modify('insert <row id="109"><name>Danica</name></row>
              into (/Family)[1]') ;
SELECT @x;

Although XML functionalities in Azure SQL are not core scenarios that you will frequently use, given that XML is not a mainstream format anymore, you can still use them to resolve various problems that require querying and transforming XML data.

XPath and XQuery languages

Azure SQL supports XML standards that enable you to implement complex processing and querying over XML documents. XML support in Azure SQL is based on the following languages:
  • XPath (XML Path Language) is a query language for selecting nodes from an XML document.

  • XQuery (XML Query) is a query and functional programming language that queries and transforms collections of XML data.

XPath is an expression-based syntax used in value(), nodes(), and query() methods to specify the element within the XML document that should be located:
  • Hierarchical expressions – Use XPath to specify the path from the root of the XML document to the desired element within the document. As an example, the XPath /Family/row/name references the <name> elements placed under the <row> element, which is placed under the <Family> element that is the root of the XML document. There can be multiple elements that match the same XPath expression, so you should use the indexing operator [] to specify which of the matching elements should be referenced.

  • Node and attribute references – XPath enables you to reference either nodes or their attributes. Any name that doesn't start with @ will be treated as the name of an XML node, while names starting with @ (e.g., @id in the preceding example) will be treated as attributes.

  • Recursive expressions – In some cases, you don't want to or cannot reference the entire path from the root, or you need to find elements that are positioned in different locations of the document. The recursive operator // enables you to specify a "detached path": the XML functions will try to find any path that matches the expression to the right of the recursive operator. As an example, //row/name will find any <name> element within a <row> element placed anywhere in the XML document.

  • Predicates enable you to specify a condition that elements must meet to be matched by the XPath expression. As an example, //row[@id=17]/name specifies that the XML methods should find any <name> element within a <row> element that has an id attribute with value 17. Predicates are an easy way to filter out nodes that don't satisfy some condition.

XPath can be a very powerful language that you can use to declaratively specify criteria for selecting the information from XML documents.
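
The following minimal, self-contained sketch shows these expressions in action, mirroring the Family document used throughout this chapter:
DECLARE @x XML = N'<Family><row id="17"><name>Robin</name></row><row id="47"><name>Lana</name></row></Family>';
SELECT @x.value('(/Family/row/name)[1]', 'NVARCHAR(50)');    -- hierarchical path with [] indexing: Robin
SELECT @x.value('(//row[@id=47]/name)[1]', 'NVARCHAR(50)');  -- recursive operator with attribute predicate: Lana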

You can also use other XPath features such as namespaces, which enable you to define scopes of names and match only names within the same namespace; navigation axes, which enable you to reference parents, siblings, and descendants; or built-in functions that can help you transform the results within the expressions.
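
As a small illustration of namespaces, an XQuery prolog can declare a namespace prefix that is then used in the path (the urn:family namespace here is purely made up):
DECLARE @nsx XML = N'<Family xmlns="urn:family"><row id="17"><name>Robin</name></row></Family>';
SELECT @nsx.value(
    'declare namespace f="urn:family"; (/f:Family/f:row/f:name)[1]',
    'NVARCHAR(50)');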

XPath is complemented by the XQuery language, which is also supported in Azure SQL. XQuery is a standardized language for querying and processing collections of elements in XML documents. XQuery defines a SQL-like syntax called FLWOR (pronounced "flower"), which stands for the FOR, LET, WHERE, ORDER BY, and RETURN operators that you can use to transform XML nodes. These operators enable you to select, transform, and return new objects based on existing data. XQuery also enables you to use the values from the nodes being processed via the curly braces ({...}) template syntax. An example of FLWOR operations in the XML query() method, selecting nodes from an XML variable and transforming the content into a new XML structure, is shown in the following listing:
SELECT xrow.query(
'let $r := self::node()
  return <person id="{$r/@id}">{$r/name/text()}</person>')
FROM @x.nodes('/Family/row') AS Members(xrow);
------------------------------------------------------
<person id="17">Robin</person>
<person id="47">Lana</person>
<person id="81">Merriam</person>

The XML nodes() method emits three XML rows from the variable @x used in the previous example, and then the query() method processes them using an XQuery expression. In the body of the XQuery expression, the current node is assigned to the variable $r, and the return statement creates a new XML node where the id attribute and the content of the name node are injected into the template. As a result, this XQuery expression returns the transformed XML shown below the query. As you can see, XQuery has a lot of power that can be used to implement complex processing of XML elements.

XML indexes

Due to the very specific nature of XML queries, standard B-tree indexes might not provide a sufficient performance boost. Azure SQL therefore provides several specialized types of indexes that enable efficient processing of XPath/XQuery expressions. The following types of indexes are available in Azure SQL:
  • The primary XML index is a pre-computed structure that contains the shredded values and nodes from the XML column. Azure SQL uses the values from the primary index instead of invoking expensive parsing of the XML type with the value(), nodes(), or query() methods. This is very similar to the automatic index on JSON documents that Azure Cosmos DB uses.

  • A secondary XML index improves the performance of queries that search or filter XML documents using the exist() method or return multiple values from an XML document using the value() method.

  • Selective XML indexes index only the specified paths in an XML column. This is very similar to having multiple B-tree indexes on predefined XML expressions.

Selective XML indexes are the recommended approach for indexing, based on the learnings from many XML scenarios in SQL Server. The SQL Server team found that automatically indexing all possible fields leads to very large XML indexes, while, on the other hand, most of the indexed paths are never used. Therefore, selective XML indexes became the best trade-off between usability, performance, and size.

In the following example, you can see how to create a selective XML Index on a simple table:
CREATE TABLE XmlDocs (
  id INT IDENTITY PRIMARY KEY,
  doc XML
);
GO
CREATE SELECTIVE XML INDEX sxi_docs
ON XmlDocs(doc)
FOR (
    path_price = '/row/info/price' AS SQL INT,
    path_name = '/row/info/name' AS SQL NVARCHAR(100)
)

In selective indexes, you can choose the paths that should be included in the index and specify their types. Queries that use the value() function on the doc column with XPath expressions matching the index specification are able to leverage the sxi_docs index, gaining a huge performance boost while keeping the index very small.
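
For instance, assuming some rows have been loaded into XmlDocs, queries like the following, whose paths match the path_price and path_name specifications, can be answered through sxi_docs (the literal values are made up for illustration):
SELECT id
FROM XmlDocs
WHERE doc.value('(/row/info/price)[1]', 'INT') < 100;
SELECT id
FROM XmlDocs
WHERE doc.exist('/row/info/name[.="Laptop"]') = 1;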

Key-value pairs

Azure SQL doesn't have a specialized structure that holds key-value pairs. The reason is simple: a key-value map can be implemented as a simple two-column table.
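
A minimal sketch of such a table (the names here are illustrative) could look like this:
CREATE TABLE KeyValue (
    [key] NVARCHAR(128) NOT NULL PRIMARY KEY,  -- B-tree index on the key column
    value NVARCHAR(MAX) NULL                   -- arbitrary payload, e.g., a JSON document
);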

Azure SQL enables you to customize your two-column table and index the key column using various index types. With memory-optimized tables, which will be discussed in detail in the next chapter, you can index the key column using B-tree or hash indexes. In the following example, you can see a memory-optimized, lock-free, natively compiled key-value table with a HASH index that enables faster access to the keys:
CREATE TABLE [Cache] (
    [key] BIGINT IDENTITY,
    value NVARCHAR(MAX),
    INDEX IX_Hash_Key HASH ([key]) WITH (BUCKET_COUNT = 100000)
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_ONLY);

This structure enables fast retrieval of the keys through the hash index, which is optimal for elementary get/put operations. Memory-optimized tables use optimistic, lock-free data access, and SCHEMA_ONLY durability ensures faster updates because data is not persisted to disk. In addition, if the values are formatted as JSON, we can use the native JSON functions to filter and process data right in the database, as you learned at the beginning of this chapter.
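
Elementary get/put operations against this table are plain INSERT and SELECT statements (the JSON payload below is just an illustrative value):
-- put: store a value; the IDENTITY column generates the key
INSERT INTO [Cache] (value) VALUES (N'{"user":"alice","cart":[1,2,3]}');
DECLARE @k BIGINT = SCOPE_IDENTITY();
-- get: point lookup resolved through the hash index
SELECT value FROM [Cache] WHERE [key] = @k;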

One of the scenarios for key-value structures in Azure SQL is centralized caching. There is a well-known SQL Server 2016 case study showing how a customer replaced a distributed cache mechanism that achieved 150K requests/sec across 19 distributed SQL Server nodes with a memory-optimized table on a single server, increasing performance to 1.2 million requests/sec. The targeted scenario was the implementation of an ASP.NET Session cache. Azure SQL uses the same technology as SQL Server, so the same approach can be used for caching in the Azure cloud.

In many real-world projects, caching with a specialized engine is usually the preferred choice, but that means you need to master another technology and figure out how to best integrate it with your solution. This effort is often worthwhile, as dedicated caching solutions are much cheaper than a full-blown Azure SQL database; but if you are already using Azure SQL in the first place, knowing that you have this ability right in the database gives you an additional option to evaluate for simplifying the overall architecture.

How to handle unstructured text?

The most difficult to handle, but not so uncommon, case is unstructured textual data. In some cases, you will have textual data that cannot be nicely organized in JSON or XML format, but you still need to implement searches on that text. One common example is HTML code stored in the database. Ideally, HTML would be valid XML if it conformed to the XHTML specification, but in many cases, HTML has variations that break the strict XML structure.

Azure SQL enables you to use the LIKE predicate to determine whether a text column matches some pattern. The following query finds all stock items where the SearchDetails or Tags columns contain the text entered in a search text box in the user interface:
DECLARE @SearchText NVARCHAR(100) = N'USB';  -- hypothetical value coming from the search box
SELECT si.StockItemID, si.StockItemName, si.Tags
FROM Warehouse.StockItems AS si
WHERE si.SearchDetails LIKE N'%' + @SearchText + N'%'
   OR si.Tags LIKE N'%' + @SearchText + N'%';

The LIKE predicate uses the percent sign (%) to match zero or more characters and the underscore (_) to match exactly one character. These special characters in the pattern expression on the right side of the LIKE operator enable you to define various patterns, such as text beginning or ending with some sequence. The LIKE operator is a very handy tool that is commonly used for text searches on small datasets. Azure SQL can optimize LIKE queries with indexes, but only when the pattern has a known prefix: a search for all text that starts with some prefix can be resolved with an index seek, while a pattern with a leading wildcard forces a scan. If you instead need a more complex search, for example, looking for specific words contained somewhere in your text, especially on bigger text sets, you might want to consider the text indexing solution described in the next section to improve performance even more.
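
To illustrate the difference, the first query below can use an index seek (assuming an index on StockItemName exists), while the second, with its leading wildcard, cannot:
SELECT StockItemID, StockItemName
FROM Warehouse.StockItems
WHERE StockItemName LIKE N'USB%';   -- known prefix: index seek possible
SELECT StockItemID, StockItemName
FROM Warehouse.StockItems
WHERE StockItemName LIKE N'%USB%';  -- leading wildcard: full scan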

Indexing unstructured text

If you need to search huge amounts of unstructured textual data, you will need some kind of specialized indexing. Azure has a generic Azure Cognitive Search service that enables you to index various data sources, but using an external service adds complexity to your solution. While Azure Cognitive Search provides a great set of specialized features, if you don't need all of them, you may be happy to know that Azure SQL includes a similar, built-in text search capability: the Full-Text Search (FTS) index. An FTS index is a structure that indexes unstructured text fields in specified tables. An example of an FTS index created on three text fields of the StockItems table is shown in the following listing:
CREATE FULLTEXT CATALOG [Main] AS DEFAULT;
GO
CREATE FULLTEXT INDEX
ON Warehouse.StockItems (SearchDetails, CustomFields, Tags)
KEY INDEX PK_Warehouse_StockItems
WITH CHANGE_TRACKING AUTO;
GO

An FTS index contains a set of text fragments (tokens) produced by a set of word breakers. The tokens in the FTS index carry the keys of the rows where the text is found. FTS enables you to provide a simple description of a text pattern and get back the keys of the rows that match the criterion.

Querying unstructured text

Once you set up your FTS index, you can use the following functionalities to search rows using text match:
  • Predicates CONTAINS and FREETEXT that check whether the values in some columns match the specified search condition

  • Table-valued functions CONTAINSTABLE and FREETEXTTABLE that return the identifiers of the rows where the text matches some criterion

The following example shows how to find all keys from the Warehouse.StockItems table where the SearchDetails column contains text that matches the search criterion defined in the @SearchCondition variable:
DECLARE @SearchCondition NVARCHAR(200) = 'blue car';
SELECT StockItemID = ft.[KEY], ft.[RANK]
FROM FREETEXTTABLE(Warehouse.StockItems, SearchDetails, @SearchCondition) AS ft
The KEY column is the identifier of a row in the StockItems table that is used in the FTS index and represents the record returned by the FREETEXTTABLE or CONTAINSTABLE functions. The RANK column describes how well the row matched the selection criteria. This resultset can be joined with the original table using the KEY column to retrieve the remaining columns. The following example shows how to find all stock items with the keys returned by the FTS function:
DECLARE @SearchCondition NVARCHAR(200) = 'blue AND car';
SELECT
  si.StockItemID,
  si.StockItemName,
  ft.[RANK]
FROM
  Warehouse.StockItems AS si
INNER JOIN
 CONTAINSTABLE(Warehouse.StockItems, SearchDetails, @SearchCondition) AS ft ON si.StockItemID = ft.[KEY]
ORDER BY
  ft.[RANK];

The table-valued functions CONTAINSTABLE and FREETEXTTABLE match text based on exact or fuzzy matching. One difference between these functions is that CONTAINSTABLE does more exact matching, while FREETEXTTABLE uses fuzzy matching based on a thesaurus, synonyms, and inflectional forms. If the word "children" is provided as a search criterion, FREETEXTTABLE will also match rows containing "child", but CONTAINSTABLE will not. With CONTAINSTABLE, you need to explicitly specify the expression FORMSOF(INFLECTIONAL,children) to instruct Azure SQL to include the inflectional forms of the word.

Another difference is that CONTAINSTABLE enables you to specify operators like AND, OR, or NEAR to define how you want to search the text. In the previous examples, you may have noticed that we provided the set of words "blue car" to FREETEXTTABLE, and this function returns all rows that contain any of these words, as most web search engines do. In the CONTAINSTABLE example, we need to explicitly combine the terms with operators such as AND, OR, or NEAR, for example, "blue AND car" or "blue OR car".

Instead of the functions FREETEXTTABLE and CONTAINSTABLE, you can use the equivalent predicates FREETEXT and CONTAINS. These predicates can be used in the WHERE clause of a query and are functionally equivalent to an explicit JOIN with FREETEXTTABLE or CONTAINSTABLE, with the exception that you won't have the RANK column available:
DECLARE @SearchCondition NVARCHAR(200) =
          'FORMSOF(INFLECTIONAL,children) OR car';
SELECT
  si.StockItemID,
  si.StockItemName,
  si.SearchDetails
FROM
  Warehouse.StockItems AS si
WHERE
  CONTAINS(SearchDetails, @SearchCondition);

If there is a stock item containing the word 'cars', it will not be returned in the result because CONTAINS performs an exact match on the word 'car'. However, if we use FREETEXT with the 'children cars' search expression, the predicate will use the inflectional forms of both words.
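
A minimal sketch of that FREETEXT variant:
DECLARE @SearchCondition NVARCHAR(200) = N'children cars';
SELECT si.StockItemID, si.StockItemName, si.SearchDetails
FROM Warehouse.StockItems AS si
WHERE FREETEXT(SearchDetails, @SearchCondition);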

Full-text search is a very powerful tool that can help you to implement very complex searches with a simple expression.

How to leverage unstructured indexes on semi-structured data?

Full-text search is not limited to unstructured text. You can use FTS indexes to improve the performance of JSON search queries where you need to filter documents by key-value pairs defined by the client. After all, an FTS index is very similar to the Generalized Inverted Index (GIN) that many databases use exactly for indexing JSON data.

Let’s imagine that we need to implement functionality that searches a large set of JSON documents using arbitrary key-value combinations. Adding a B-Tree index on every possible key would be inefficient, and a CLUSTERED COLUMNSTORE index on JSON data is designed for analytical use cases and thus is not a good solution for filtering.

If we know that the value is a single word, we can leverage the fact that JSON is stored as text and that a key and its value sit physically near each other. In that case, an FTS query that filters all JSON texts where the required key and value appear near each other might provide the correct results. The query that finds all JSON documents with the pairs
 {"Color":"Silver","MakeFlag":true,"SafetyStockLevel":100}
is shown in the following listing:
SELECT si.StockItemID, si.StockItemName, si.Tags
FROM Warehouse.StockItems AS si
WHERE CONTAINS(CustomFields, 'NEAR((Color,Silver),1)
                      AND NEAR((MakeFlag,true),1)
                      AND NEAR((SafetyStockLevel,100),1)')

The NEAR operator in the CONTAINS predicate is a good choice for JSON scenarios, because the key of a JSON property sits right next to its value. This kind of predicate quickly finds all text cells where the words "Color" and "Silver" are close together, which is the case in the JSON structure. In addition, the CONTAINS clause enables us to specify complex predicates with AND, OR, and other relational operators. However, FTS does not guarantee that "Color" is a key and "Silver" is its value, because it doesn't understand the semantics of the text parts in the JSON structure. If some JSON document contains the text "color silver", it will also be returned by the CONTAINS predicate, even though it is not a key-value pair.

In order to remove the false-positive results, we can apply standard JSON predicates that double-check the condition and guarantee the correctness of the results (note that this example uses a different table, ProductCatalog, with a JSON column named Data):
SELECT ProductID, Name
FROM ProductCatalog
WHERE CONTAINS(Data, 'NEAR((Color,Silver),1)
AND NEAR((MakeFlag,true),1)
AND NEAR((SafetyStockLevel,100),1)')
AND JSON_VALUE(Data,'$.Color') = 'Silver'
AND JSON_VALUE(Data,'$.MakeFlag') = 'true'
AND JSON_VALUE(Data,'$.SafetyStockLevel') = '100'
This query leverages the best of FTS and JSON features to return the required result:
  • CONTAINS quickly filters out most of the entries that don't satisfy the condition and can significantly reduce the number of candidate rows that might contain the needed data. Without this part, we would end up with a full table scan, applying the JSON functions to every row.

  • JSON_VALUE performs an exact check on the smaller candidate set returned by FTS. These predicates guarantee that correct results are returned, without having to be applied to every document.

This example again shows how Azure SQL features nicely fit together and enable you to implement various scenarios.

Multi-model in Azure SQL: why and when

Azure SQL is a modern multi-model database platform that enables you to use different data formats and combine them to design the data model that best matches the requirements of your domain. Depending on your scenario, you can represent relationships as classic foreign keys or as graph nodes and edges, while semi-structured data can be stored in JSON, spatial, or XML columns.

You just learned how you can use all these features. It’s now time to discuss why and when.

One of the biggest advantages of Azure SQL is the interoperability between core database features and multi-model capabilities. You can easily combine Columnstore with graph or JSON data to get high-performance analytics on graph/JSON data, use built-in language processing rules to customize your application for any market, use all the features that the T-SQL language provides to create any query or powerful report, and integrate with a variety of tools that understand T-SQL.

With Azure SQL, you are getting the core functionalities that other NoSQL databases provide, plus a lot of standard relational database functionalities that can be easily integrated with NoSQL features. This is the most important reason why you should choose the multi-model capabilities of Azure SQL.

So, should you choose Azure SQL with its multi-model capabilities or some specialized NoSQL Database engine that has more advanced features in these areas? That’s a very interesting – and not easy – question to answer.

Azure SQL is not a NoSQL database. If you have a classic NoSQL scenario that requires advanced graph or document support, and you don't expect to deal with other data that would be better stored in tables, then you should certainly evaluate full-fledged graph or document databases such as Azure Cosmos DB, MongoDB, Neo4j, and so on. These engines are fully oriented toward NoSQL scenarios and implement richer, more advanced NoSQL features. They address a very specific area, and they are extremely good at it.

The most important question to ask yourself is what kind of additional NoSQL functionality your application needs. Both Azure SQL and NoSQL database engines provide similar levels of fundamental graph and document processing functionality (e.g., inserting, modifying, indexing, and searching). If you need more than these basic features, Azure SQL gives you the ability to leverage advanced querying with T-SQL, Columnstore technology for analytics, built-in machine learning capabilities with R/Python support, collations, replication mechanisms, and other functionalities that have proven necessary in most real-world applications. If you believe these functionalities might be important for your application, then Azure SQL is the right choice for you.

If you want to know more

Lots of concepts and technologies have been discussed in this chapter. As usual, if you want to know more, you can find more food for your brain here: