© Davide Mauri, Silvano Coriani, Anna Hoffman, Sanjay Mishra, Jovan Popovic 2021
D. Mauri et al., Practical Azure SQL Database for Modern Developers, https://doi.org/10.1007/978-1-4842-6370-9_4

4. Developing with Azure SQL – Foundations

Davide Mauri (1), Silvano Coriani (2), Anna Hoffman (3), Sanjay Mishra (1), and Jovan Popovic (4)
(1) Redmond, WA, USA
(2) Legnano, Milano, Italy
(3) Sherman Oaks, CA, USA
(4) Belgrade, Serbia

Now that you know how to create an Azure SQL database, how to connect to it, and how to restore a sample database, it’s time to start exploring what Azure SQL offers to a modern developer.

In this chapter, we will see all the major features that can help you to create modern applications, using the sample database WideWorldImporters as our playground.

As usual for this book, we’ll keep a very pragmatic approach, assuming that you have a basic idea of a table: an entity composed of columns, each one with its own name and data type, that contains rows of data. Columns define the shape, the schema, of data. To be stored in a table, data must adhere to its schema, meaning that column data types and constraints must be observed; otherwise, the database will reject the data. This approach has recently been named schema-on-write: consistency with the schema is checked when data is written into the database. This ensures data is coherent with what the applications using it are expecting. Some other databases offer a schema-on-read approach, where any data can be stored, and the schema is checked only when someone tries to read the data. Both approaches have, as usual, pros and cons; since data is useful only when coherent and meaningful, no matter which approach you choose, you still must be sure that the served data is logically correct and sound.

With a schema-on-write approach, Azure SQL works with you to make sure this happens, as it will check for you that data is always consistent with the defined schema. If not, it will return an error. This behavior may be seen as a bit rigid, and sometimes it is, so that’s why Azure SQL also supports schema-on-read to some extent, by natively allowing you to store and manipulate JSON documents. Keep in mind that with a schema-on-read approach, it’s up to you to write code inside your application to validate data before using it. In most common cases, that’s quite a big burden, one that is better taken care of by a specialized entity, like Azure SQL, alleviating a lot of your work as a developer and, again, helping to have a good separation of concerns. Data consistency and integrity are mainly a matter for the database, which can take care of them in a centralized way for all solutions accessing its data.

A deeper discussion on tables will come in Chapter 6, but given that it is pretty common for a developer to start working on an existing database, we felt it was better to start by understanding what we can do with the data we already have.

And there are lots of things we can do! For this reason, this chapter aims to give you a full overview of all the features you should know to make sure you are really using all the potential that Azure SQL has.

As there are so many features for developers packed in Azure SQL, in this chapter you’ll find just the tip of the iceberg. Make sure you use the code samples included with the book to get more details on features and to see them in action. Use the referenced resources to dive deeper into all the features and their options.

Remember, this book is like a diving board that aims to give you a boost to better dive into the data ocean!

Pushing compute to data

I strongly recommend pushing as much of your computation as possible to the data instead of bringing the data to your code. Moving data is a heavy, complex, resource-consuming task. Moving computing logic is much easier, as it is far more compact and lightweight. Also, with the usage of Views, Functions, or Stored Procedures that we will discuss in a moment, you don’t even have to move source code or compiled binaries around.

But the main benefit of pushing compute to data is that you can really take advantage of all the money you are paying for a database like Azure SQL. If you had to implement all the querying, filtering, and optimization logic in your own code anyway, then why use a database at all? Just use much cheaper file storage and save a lot of money.

Well, that could be the impression, but in reality, you would still have to spend time actually implementing the features you are giving up by not using Azure SQL. So, it wouldn’t really be that much cheaper.

It turns out it would actually be much more expensive if you also take into account that you would probably need to maintain and evolve those features. And it will be just one or a few of you, while for Azure SQL, there are literally hundreds of engineers working on it every day, with just one objective: improve its performance and capabilities as much as possible. By pushing compute to data, you can get the benefit of all those engineers working behind the scenes. This can affect you and the performance of your application: let me give you a very practical sample of that.

If you need to find distinct values in a table that holds 50 million rows of data, until not so many years ago, you would have had to wait something like 5 seconds. Not bad, right? Still, if you are creating an API to serve a mobile app, 5 seconds is a lot in today’s world. In the last few years, the engine powering Azure SQL has been improved by adding columnstore tables and vectorized (or batch mode) execution, taking advantage of the SIMD CPU instruction set and advanced optimization techniques like Grouped Aggregate Pushdown. With all these improvements, the same query now takes single-digit milliseconds – single digit – down from the original 5,000 milliseconds.

Can you imagine how much effort and cost you would have had to go through if, instead of pushing the calculation to the database, you had moved the data to the application to reach the same improvement? Instead, you can use all those resources for something unique to your use case or business problem, something that the database cannot solve for you.

Push compute to data: it’s really the best approach to have a scalable, modern, performant solution.

Declarative vs. imperative

Azure SQL allows you to use a different approach to create solutions that manipulate data, as opposed to the one you are used to working with every day. If you are using C#, Java, Python, or NodeJS, in fact, you are most probably used to working with the imperative paradigm. With the imperative paradigm, you tell the system exactly how you want a process, more precisely an algorithm, to be applied. For example, if you have a list of numbers and you want to count how many distinct values there are in that list, you may implement the solution using a loop, iterating over the values, comparing the current value to a list of values you have already seen and adding it to that list only if it is not yet there.

If you use a declarative approach instead, you only tell the system what you want, not how you want to have it done. Basically, you describe where you want to be, using the data structure you already have as a starting point. It’s really not much different, conceptually, than setting your navigation system to bring you from point “A” to point “B.”

So, using the same sample mentioned before, with Azure SQL (and any relational database), you would write something like
SELECT DISTINCT Number FROM Values;

We’ll discuss the details of the preceding line of code in the next sections, but you can already understand intuitively what it is doing.

Python and C# provide some declarative support. C# supports Language-Integrated Query (LINQ), and Python uses List Comprehensions, so you may have some experience, and that will help you a lot in using Azure SQL. On the other hand, if you don’t have experience with that yet, by reading this book, you'll get familiar with the power and beauty of the declarative paradigm.

Relational databases, and Azure SQL in particular, bring the idea of the declarative approach to the next level. Azure SQL will not only take care of the implementation details for you, so that you can focus on what you want and not how you can get it, but it will do so by taking into account an impressive number of additional factors. For example, how much data is in a table, how much memory it will need to manipulate it, if existing indexes can help, and even how data is distributed within a table.

If you think about it, a modern navigation system works exactly in the same way: it not only allows you to specify the start and the destination, but it will also take into account real-time traffic, roadblocks, and other conditions so that it can try to give you the best solution within a given finite amount of time, not just a solution.

In modern development, this is vital to make sure your application is always responsive, even if the data it uses keeps changing in value and size: adapting to changes is a key factor also for applications, not just for developers.

Query and data manipulation

The language you will use to interact with data in an Azure SQL database is, as one could guess, SQL (Structured Query Language). SQL is a standard that is adopted across many relational and non-relational databases. Learning it well is useful both to use Azure SQL and to take advantage of many other systems like Apache Spark, just to name another very well-known data platform. Every database usually implements the ANSI/ISO SQL standard to a certain extent, with some variations to exploit specific features available in the platform. The implemented SQL is therefore called a dialect. For Azure SQL, the dialect is the same used for SQL Server: Transact-SQL or T-SQL for brevity.

To manipulate data, all you need is a handful of commands, usually referred to as DML – Data Manipulation Language  – that cover all the needed functionalities:
  • SELECT

  • INSERT

  • UPDATE

  • DELETE

  • MERGE

As you may recognize if you are familiar with the CRUD – Create, Read, Update, Delete – set of functions common in application development, only the first four SQL commands are really needed to implement CRUD operations; a fifth, MERGE, has been added so that insert, update, and delete can be done together in just one command, simplifying code by leaving a lot of the details to the query engine and thus making it more understandable and, in many cases, also more performant.

Retrieving data

The SELECT command allows you to get a set of data back from the database. It requires the list of columns you want to get, the table name, and optionally, but almost always used, a filter to limit the returned rows to only those you are interested in, for example:
SELECT
      InvoiceID,
      InvoiceDate,
      DeliveryInstructions,
      ConfirmedDeliveryTime
FROM
      Sales.Invoices
WHERE
      CustomerID = 998
ORDER BY
      ConfirmedDeliveryTime;

This query will look into the Sales.Invoices table, return all the rows that have the value 998 in the CustomerID column, and limit the results to contain only the columns InvoiceID, InvoiceDate, DeliveryInstructions, and ConfirmedDeliveryTime.

Even such a simple SELECT statement is quite interesting. First, it is worth noting that there is no inherent or natural order. If no ORDER BY is specified, data is returned without any specific order.

For performance reasons, the data will likely be returned in the order in which Azure SQL accesses it, which is something that may change over time; if you rely on data being in a specific order, you must specify the ORDER BY clause. Second, unless specified otherwise with an option called COLLATION, database objects are not case sensitive. This also applies to data within a table. It’s always possible to change this, but it may have performance implications: a bit more discussion on this will come in the next chapters.
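For instance, if you ever need a case-sensitive comparison for a single predicate without changing the database collation, you can apply a collation explicitly in the query. Here’s a small illustrative sketch, using one of the many available collations (the customer name is just an example value):
SELECT CustomerID, CustomerName
FROM Sales.Customers
WHERE CustomerName COLLATE Latin1_General_100_CS_AS = N'Tailspin Toys (Head Office)';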

Let’s move to a more complex sample now. Let’s say that our customer 998 is browsing her order history to look for a specific item that she received in the first quarter of 2016. By using a set of filters on a fictitious mobile app, this could be the T-SQL statement we need to execute:
SELECT
      il.InvoiceLineID AS LineID,
      i.InvoiceID,
      il.[Description],
      il.Quantity,
      il.UnitPrice,
      il.UnitPrice * il.Quantity AS TotalPrice,
      i.ConfirmedDeliveryTime
FROM
      Sales.Invoices AS i
INNER JOIN
      Sales.InvoiceLines AS il ON i.InvoiceID = il.InvoiceID
WHERE
      i.CustomerID = 998
AND
      il.[Description] LIKE N'%red shirt%'
AND
      CAST(i.ConfirmedDeliveryTime AS DATE) BETWEEN '2016-01-01' AND '2016-03-31';
When executed in the WideWorldImporters database, the result will be the one visible in Figure 4-1.
Figure 4-1

Sample SELECT result

Even though the sample query just used is simple, it contains a few notable things that are very common in day-to-day use.

Caution

Instead of specifying the columns you want in the result, you may use the star character * to include all the columns in the table automatically. While this may be useful when executing queries to just take a look at the data or to do some data exploration, it is not a best practice when creating something that will be used by an application. The column order may change at any time, and columns may be added or removed if someone changes the table definition. Always specify the columns you want to be returned, so that you will have a deterministic result. If you don’t need all the columns, you will also avoid wasting network bandwidth and improve transfer speed.

Two-part names

Objects, like tables, have their own names like the Invoices table. Since the same name could be used in a different context, it is a good practice, though not required, to always qualify the object name with the schema it belongs to. A schema is a way to organize objects that belong to the same group, for example, Sales. All objects related to Sales would be placed in the same schema. Besides removing any ambiguity, schemas also help to simplify management especially from a security perspective, as permissions can be given to schemas, and they will affect all objects belonging to the same schema.
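For example, assuming a database user named ReportReaderUser already exists (the name is purely hypothetical), a single statement is enough to grant it read access to every table in the Sales schema:
GRANT SELECT ON SCHEMA::Sales TO ReportReaderUser;
Any object added to the Sales schema later on is automatically covered by that permission.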

Aliasing

You have the ability to alias the name of the tables using the AS keyword, so that if a column is present in more than one table, you can just use the alias to prefix the column name and explicitly decide from which table it must be taken. Some column names will repeat quite frequently (think of, for example, Name, ID, Description, and so on), so a way to eliminate ambiguity is needed. An alias allows you to do that, also removing the need to use the full table name as a column prefix. Less code, nicer results.

Besides tables, columns can also be aliased. You may want to change the column name when returned to you, as it happens for the column InvoiceLineID which is returned as LineID instead. As you may have noticed, that column has been prefixed anyway with the alias of the table where it exists. It was not needed as there are no other columns with the same name in the tables used in the query, but prefixing also helps to quickly understand where a column is coming from. It is just a good practice to make the code more understandable and human-friendly.

Columns must be aliased when they would not have a name at all, for example, when a resulting column is created from an expression, as happens for the column TotalPrice. In the sample, the expression is really simple, but keep in mind that expressions can be much more complex and apply elaborate transformations to column values.

Quoted identifiers

Objects may have almost any name you like (well, there are some naming rules: https://aka.ms/rdddi), but if you use a name that may clash with a keyword, you just have to use the square brackets to make sure the query engine can understand that you are referring to a column or a table and not to something else, like a system function.

Joins

The sample also shows how you can return values from different tables. Using the INNER JOIN statement, we’re saying that for each value that exists in table Sales.Invoices, column InvoiceID, we want to get all the rows with the same value in the same column but in table Sales.InvoiceLines. Invoice data has been saved into two different tables to avoid data duplication and to create easier access to each invoice line, but since the CustomerID is available only in the Sales.Invoices table, a join operation is needed to return the desired data.

Choosing whether it is better to keep everything in fewer tables and then deal with data duplication challenges, or to decompose data into different tables to avoid duplication and then put the data back together when needed, is quite an important topic to understand. It falls under the umbrella of database modeling and, more specifically, a process called normalization. There are entire books dedicated to the pros and cons of normalization and denormalization, so you’ll find more references at the end of this chapter. Keep in mind that the normalization process is about organizing data to minimize redundancy. Having duplicated information in a database brings a lot of not-so-obvious effects and challenges, and it is therefore absolutely important to understand, even if you’re using non-relational databases: all the normalization concepts will apply anyway. It’s not by chance, in fact, that normalization is a concept you can also find in NoSQL manuals.

There are different kinds of joins that allow you to define exactly which rows must be returned in the result. Let’s say, for example, that you want a list of all customers with all their related invoices and, as you can guess, you have two separate tables to hold the two different sets of data: Customers and Invoices. It could happen that you have a new customer who hasn’t received any invoices yet. With an inner join, such a customer would not be included in the result of our hypothetical query. If you want all customers, no matter whether they have an invoice or not, you need to use a left join. If instead you want to return all invoices, even those that do not belong to any customer, you have to use a right join. The terms left and right refer to the position of the table with respect to the JOIN clause (Figure 4-2).
Figure 4-2

Join types

The full join combines all three behaviors together: you’ll get all rows from both tables participating in the join, matched where possible.
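To make the left join case concrete, here’s a minimal sketch that returns every customer together with their invoices, if any; customers without invoices still appear, with NULL in the invoice columns:
SELECT
     c.CustomerID,
     c.CustomerName,
     i.InvoiceID,
     i.InvoiceDate
FROM
     Sales.Customers AS c
LEFT JOIN
     Sales.Invoices AS i ON i.CustomerID = c.CustomerID;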

Joins can be used along with other joins, so that you can represent complex relationships. If we wanted to expand the original code sample to also return customer data, we would change it so that Customers, Invoices, and InvoiceLines are joined together:
SELECT
      c.CustomerName,
      il.InvoiceLineID AS LineID,
      i.InvoiceID,
      il.[Description],
      il.Quantity,
      il.UnitPrice,
      il.UnitPrice * il.Quantity AS TotalPrice,
      i.ConfirmedDeliveryTime
FROM
      Sales.Customers AS c
INNER JOIN
      Sales.Invoices AS i ON i.CustomerID = c.CustomerID
INNER JOIN
      Sales.InvoiceLines AS il ON i.InvoiceID = il.InvoiceID
WHERE
      i.CustomerID = 998
AND
      il.[Description] LIKE N'%red shirt%'
AND
      CAST(i.ConfirmedDeliveryTime AS DATE) BETWEEN '2016-01-01' AND '2016-03-31';

And the result would be the following:


As you can see, the CustomerName and CustomerID are repeated for each row of Sales.Invoices related to that customer, which is in turn repeated for each Sales.InvoiceLines row related to that invoice.

This allows you to have the full set of information needed by your application on each line, but without having to store duplicate data in the database. While this may seem complex at the very beginning, and maybe even counterproductive compared to saving all the data into just one JSON document per line so that you wouldn’t need to put the pieces together when needed, it actually helps a lot to simplify things.

By deconstructing complex data in smaller pieces, you allow each piece to be used and reused together with some other set of data. It’s the same concept of code reuse just applied to data.

As mentioned before, the process of decomposing data into smaller pieces without losing information is called normalization and is not specifically tied to relational databases. On the contrary, it is helpful in many different areas and technologies, as it is nothing more than a process that helps to avoid duplication of information, which usually brings more challenges than it solves. Not surprisingly, it is referenced both in NoSQL documentation and even in Object Modeling documents. Learning it will be helpful in a variety of cases, from development to data science: you’ll find several resources to dive deeper into these concepts at the end of this chapter.

Filtering

Quite often you’ll want to get only a few rows, or even just one, from the database. To filter out the rows you don’t need, the WHERE clause is what you must use. As you have seen in the sample query, the WHERE clause can contain one or more filter predicates that allow you to specify the values you are interested in for any column available to the query. Any column from the tables specified in the FROM clause or the JOIN can be used.

Values used in the WHERE clause can be literals, variables, sets of data coming from other queries, or results of a function call. This allows you to exactly target the data you want to work on.
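For example, here’s a small sketch that mixes a variable, a function call, and a literal in the same WHERE clause:
DECLARE @CustomerID INT = 998;

SELECT
      InvoiceID,
      InvoiceDate
FROM
      Sales.Invoices
WHERE
      CustomerID = @CustomerID
AND
      InvoiceDate >= DATEFROMPARTS(2016, 1, 1);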

Subqueries

A subquery is a query that is nested within another query. As it happens for the Linux shell or Windows PowerShell, where you can pipe the result of one command into another one to easily build complex transformations, Azure SQL allows you to do something conceptually similar:
SELECT
      OrderId,
      OrderDate,
      (SELECT COUNT(*) FROM Sales.[OrderLines] AS ol WHERE ol.[OrderID] = o.OrderId) AS OrderSize
FROM
      (SELECT * FROM Sales.[Orders] WHERE SalespersonPersonID = 2) AS o
WHERE
      o.[CustomerID] IN
      (
           SELECT
                c.CustomerID
           FROM
                Sales.[Customers] AS c
           WHERE
                [CustomerName] = 'Daniel Martensson'
      )
AND
      OrderDate >= '2015-01-01'
ORDER BY
      [o].[OrderID];

In the preceding sample, there are three subqueries. The first one is used in the FROM clause, and it is used to limit the orders to only those generated by a specific salesperson. Then there is another subquery in the WHERE clause that limits the orders only to those done by a specific customer. And, finally, the third subquery is in the SELECT list, used to enrich the result with the count of lines in each order.

If you already have some experience with the SQL language, you may recognize that the sample query could have been written also using some joins, instead of using the subqueries in the FROM and in the WHERE clause. The Azure SQL Query Optimizer is smart enough to realize that too, and usually, there are no performance differences if a query is written in one way or the other.

Common Table Expressions

Common Table Expressions, or CTEs for short, are an improved alternative to subqueries. As you may have guessed, if you have a very complex SQL statement with many subqueries, it becomes very hard to read.

What happens, also, if the same subquery should be used twice in the same query? You’ll end up duplicating code. But, as we strive to be better developers, we don’t want to have any code duplication, if possible.

CTEs are exactly what we need to organize our code so that it can be more easily understood and, above all, changed and maintained. A CTE is a temporary query definition that will exist only for the duration of the SELECT, INSERT, UPDATE, or MERGE statement that uses it.

Here’s the same code shown in the subquery paragraph, rewritten so that it uses CTEs instead of subqueries (where it makes sense to do so):
WITH cteOrders AS
(
     SELECT * FROM Sales.[Orders] WHERE SalespersonPersonID = 2
),
cteCustomers AS
(
     SELECT
           c.CustomerID
     FROM
           Sales.[Customers] AS c
     WHERE
           [CustomerName] = 'Daniel Martensson'
)
SELECT
     OrderId,
     OrderDate,
     (SELECT COUNT(*) FROM Sales.[OrderLines] AS ol WHERE ol.[OrderID] = o.OrderId) AS OrderSize
FROM
     cteOrders AS o
INNER JOIN
     cteCustomers c ON [c].[CustomerID] = [o].[CustomerID]
AND
     OrderDate >= '2015-01-01'
ORDER BY
     [o].[OrderID];

As you can see, after the WITH statement, two CTEs are defined, one named cteOrders and the other cteCustomers. Each time we reference one of those names in the query, it is as if we had put a subquery there. The difference is that the code is much easier to understand. I personally like to think of CTEs as a nice way to clearly define the sets of data I’ll need to work with in my main query: I can define these sets at the top, so that it is also easier to understand how they operate on data, and then I can refer to them as much as I need and wherever I need, without having to type the same code twice.

Remember

A CTE’s lifetime is scoped to the execution of the query that defines it. If you need to reuse the query encapsulated into the CTE with other queries, you may want to create a VIEW or a FUNCTION. More on this in the next chapters.

CTEs can also refer to other CTEs, so you can create very complex queries while keeping the code very clean and easy to read. Also, CTEs can even refer to themselves, so that you can create recursive queries. This can become quite handy when you have to work with data structured in a hierarchical way, a tree, for example, and you want to traverse the entire tree but don’t know how deep it is, so you need some “smart” algorithm that stops only when there is no more data to process, not just after a certain number of iterations.

A recursive CTE sample is available in accompanying code.
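Just to give you an idea of the shape of a recursive CTE, here’s a minimal sketch that walks a hypothetical dbo.Employees table (not part of WideWorldImporters), where each row points to its manager through a ManagerID column:
WITH cteOrgChart AS
(
     -- Anchor member: start from the employees that have no manager
     SELECT EmployeeID, ManagerID, 0 AS [Level]
     FROM dbo.Employees
     WHERE ManagerID IS NULL

     UNION ALL

     -- Recursive member: add the direct reports of the rows found so far
     SELECT e.EmployeeID, e.ManagerID, o.[Level] + 1
     FROM dbo.Employees AS e
     INNER JOIN cteOrgChart AS o ON e.ManagerID = o.EmployeeID
)
SELECT EmployeeID, ManagerID, [Level]
FROM cteOrgChart;
By default, recursion stops with an error after 100 levels; the OPTION (MAXRECURSION n) query hint lets you change that limit.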

Union

In case you have two queries and you want to concatenate their results to return just one resultset to the user, you can use the UNION statement:
WITH cteContacts AS
(
     SELECT
           [CustomerID],
           [PrimaryContactPersonID] AS ContactPersonId,
           'Primary' AS [ContactType]
     FROM
           Sales.[Customers]
     UNION
     SELECT
           [CustomerID],
           [AlternateContactPersonID],
           'Alternate' AS [ContactType]
     FROM
           Sales.[Customers]
)
SELECT
     [ContactPersonId],
     [ContactType]
FROM
     [cteContacts] c
WHERE
     c.CustomerId = 42

The code will return the result of the two queries on Sales.Customers as a single result. To work without errors, the UNION requires that the two resultsets have compatible schemas.

Keep in mind that UNION will also remove any duplicate values from the resulting resultset, so it could be quite an expensive process. If you already know in advance that you can’t have any duplicate values or you don’t care about duplicate values, then you can use UNION ALL instead, which is more lightweight as it doesn’t have to look for duplicates and remove them from the result.
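For example, here’s a minimal sketch that concatenates the primary and alternate contact IDs for all customers, keeping any duplicates:
SELECT [PrimaryContactPersonID] AS ContactPersonId FROM Sales.[Customers]
UNION ALL
SELECT [AlternateContactPersonID] FROM Sales.[Customers];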

Semicolon

All Azure SQL statements should be terminated by a semicolon. Though this is not mandatory today, it is a best practice, as it’s part of the ANSI SQL-92 standard and it will be required in future versions. Already today, in fact, some commands require it to function correctly. For example, the WITH statement that starts a CTE must either be the first statement in the batch or be preceded by a statement terminated with a semicolon.

Unicode strings

You may have noticed that the string N'%red shirt%' is prefixed with a capital N. This prefix tells Azure SQL that the string is a Unicode string.

Adding data

INSERT INTO is the command used to add data into a table. It’s very easy to use as it needs only three things to work: the table where you want data to be added, the columns you’ll be targeting, and the values to be added.

The following code, for example, will add two rows to the table Warehouse.Colors:
INSERT INTO [Warehouse].[Colors]
      ([ColorID], [ColorName], [LastEditedBy])
VALUES
      (50, 'Deep Sea Blue', 1),
      (99, 'Out of space', 1);

The table may or may not have other columns, but we must provide values for the columns we specify. As you’ll learn later, a table may have some specific constraints in order to make some columns mandatory, for example, the email for a User table; if no constraints are in place, for all the columns existing in the table but not specified in the INSERT statement, a default or a NULL value will be used.

In addition to specifying the values manually, as I did in the previous sample, INSERT can also take values from a SELECT statement, for example:
INSERT INTO
      [Warehouse].[Colors]
      ([ColorID], [ColorName], [LastEditedBy])
SELECT
      ColorID, ColorName, LastEditedBy
FROM
      [External].[Colors]

In this case, you may have data in another table, named External.Colors, that you want to bring into Warehouse.Colors. By using INSERT INTO … SELECT, you can copy the data with just one command.

Modifying data

To change existing data in a table, you can use the UPDATE command. Similarly to the INSERT command, you have to specify the table that contains the data you want to update and the new values for each column you want to update. Here’s an example:
UPDATE
      [Warehouse].[Colors]
SET
      [ColorName] = N'Unknown',
      [LastEditedBy] = 2
WHERE
      [ColorID] = 99;

The WHERE part is optional but almost always specified, as it allows you to limit the scope of the changes that would otherwise be applied to all rows in the table. As explained in the “Retrieving data” section, the WHERE clause will make sure that you can exactly target only the rows you want to update.

The N character tells Azure SQL that the text being used to update the table is Unicode text. You’ll notice that if you don’t specify it, everything will still work fine. This happens because Azure SQL performs an implicit conversion, automatically casting the string to a Unicode string. While in this case there is no harm done, implicit conversions can badly impact performance, so make sure data types are correct, especially in WHERE predicates or JOIN clauses.

Removing data

To remove data from a table, you can use the DELETE command. For this command, only the table must be specified: DELETE will remove the entire row from the specified table, so there is no need to specify the columns as it happens for the other DML commands.

The WHERE clause is also, obviously, supported, and with that, you can make sure you remove only the rows you want to remove by specifying a predicate that targets only those.

Without a WHERE clause, all rows will be deleted. Beware! No warning will be issued, and the table will be wiped out:
DELETE FROM
     [Warehouse].[Colors]
WHERE
     [ColorID] = 99;

Azure SQL automatically performs backups of your data, and it allows you to restore the database to any point in time within the last 7 to 35 days, depending on the offer you are using. There is no additional cost for this. Even if you delete everything by mistake, you’ll be able to easily restore the database to the moment right before the unwanted action happened.

Merging data

Merge is a command that allows you to execute inserts, updates, and deletes at the same time, so that one set of data can be merged into an existing one. The table that contains the data you want to merge into another table is your source table, while the other one is the target table. By merging a source table into a target table, this is what may happen, depending on what you specify in the MERGE command:
  • All rows that exist in the source but not in the target will be inserted in the target table.

  • All rows that exist in both tables will be updated in the target table.

  • All rows that exist in the target table and not in the source will be deleted from the target table.

I said “may” as you are in total control of if and how the insert, update, and delete operations will happen and if they will happen at all:
MERGE INTO
     [Warehouse].[Colors] AS [target]
USING
     (VALUES
           (50, 'Deep Sea Blue'),
           (51, 'Deep Sea Light Blue'),
           (52, 'Deep Sea Dark Blue')
     ) [source](Id, [Name])
ON
     [target].[ColorID] = [source].[Id]
WHEN MATCHED THEN
     UPDATE SET [target].[ColorName] = [source].[Name]
WHEN NOT MATCHED THEN
     INSERT ([ColorID], [ColorName], [LastEditedBy]) VALUES ([source].Id, [source].[Name], 1)
WHEN NOT MATCHED BY SOURCE AND [target].[ColorID] BETWEEN 50 AND 100 THEN
     DELETE
;

After the MERGE INTO, there is the target table. USING specifies the source. In this case, the code is using a Table-Valued Constructor to create a table on the fly, aliasing it with the name source. Such a table is completely volatile and will be gone once the statement has completed execution; it also requires you to specify the names of its columns, as there is no metadata available anywhere to figure out how they are named. Data types will be automatically inferred from the provided values.

The ON part is very similar to how JOIN uses it, and it defines the rules needed by Azure SQL to understand how to match rows coming from the source with rows in the target. In the sample case, we are using Id from the source and ColorID from the target. After that, you have to tell MERGE what to do when there is a match and also when there isn’t one:
  • WHEN MATCHED – If there is a match between rows in the source and target table, the code will update the ColorName column in the target table using the related value from the Name column in the source table.

  • WHEN NOT MATCHED – If the target table doesn’t have any rows with a ColorID that exists also in the source table, those rows will be taken from the source and inserted into the target.

  • WHEN NOT MATCHED BY SOURCE – If the target table contains some rows with a ColorID that doesn’t exist in the source table, those will be deleted from the target. The sample adds an additional predicate to better define the scope. Not only must there be a ColorID in the target that doesn’t exist in the source, but the ColorID values must also be between 50 and 100. This means that rows in the target table with ColorID values from 0 to 49, for example, will not be deleted, as they fall outside the scope of the defined rule.

Additional useful features

So far, the basic concepts have been discussed. They are more than enough to start creating great solutions, but there are a few more features that you will want to start using right away, as they can be helpful in several different scenarios to simplify code and to improve the performance and concurrency of your solution.

Output clause: inserted and deleted virtual tables

Returning to the application the result of an INSERT, UPDATE, or DELETE (or MERGE) operation is a very common requirement. For example, let’s say that in the API solution you’re creating, you accept updates to your entities via an HTTP PUT method. The API you created will apply the new data received via PUT to the database using an UPDATE or MERGE statement. To follow an established good practice, you want to return the full entity as a result of the PUT request, so that the caller can have the fully updated entity without the need to issue a dedicated GET request.

This would mean to do something like the following:
UPDATE
    [Warehouse].[Colors]
SET
    [ColorName] = 'Unknown'
WHERE
    [ColorID] = 99;
SELECT
    [ColorID],
    [ColorName],
    [LastEditedBy],
    [ValidFrom],
    [ValidTo]
FROM
    [Warehouse].[Colors]
WHERE
    [ColorID] = 99;

The problem with the preceding code is that, on a highly concurrent system, it could happen that between the UPDATE and the subsequent SELECT, another connection could apply some changes to the data, which is a behavior that in general you want to avoid. A deeper discussion on this will be done in the chapter dedicated to transactions, but in the meantime, you can see that the problem comes from the fact that we have two different commands that we want to execute one after the other, without anything in between, just like if they were just one logical operation.

A very elegant, scalable, and performant way to solve this issue is to ask the UPDATE command to also generate an output, so everything will be executed as one command and the problem would be solved right at the root.

The OUTPUT statement does exactly this:
UPDATE
    [Warehouse].[Colors]
SET
    [ColorName] = 'Unknown'
OUTPUT
    [Inserted].[ColorID],
    [Inserted].[ColorName],
    [Inserted].[LastEditedBy],
    [Inserted].[ValidFrom],
    [Inserted].[ValidTo]
WHERE
    [ColorID] = 99;

The result will be the equivalent of the code with the separate UPDATE and SELECT statement, but without the described potential issue.

Inserted is a virtual table that exists only for the duration of the statement that uses it. The Inserted virtual table gives access to the data as it is after it has been modified; there is also a Deleted virtual table that gives access to the data as it was before the modification took place.

As an UPDATE statement can be thought of as a logical pair of DELETE/INSERT statements, the UPDATE allows you to use both virtual tables. An INSERT statement will give you access only to the Inserted virtual table and, of course, the DELETE statement only to the Deleted virtual table. The MERGE statement, obviously, will give access to both Inserted and Deleted virtual tables too.
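For example, here’s a small sketch of a DELETE that returns the removed rows through the Deleted virtual table:
DELETE FROM
    [Warehouse].[Colors]
OUTPUT
    [Deleted].[ColorID],
    [Deleted].[ColorName]
WHERE
    [ColorID] = 99;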

Identity and sequences

When you need to create an Id number to be assigned to an entity or a row, usually to uniquely identify it, you can generate it in your application or you can rely on the database to do it. Using the database will make sure that, by default, no two equal Ids can be generated, simplifying your code a lot.

In Azure SQL, there are two ways you can use to achieve this. One is using the IDENTITY feature, which exists mostly for backward compatibility purposes. When you create a table, you can elect one integer column to be an identity column. This means that Azure SQL will generate values for that column automatically for you, every time a new row is inserted:
CREATE TABLE dbo.SampleID
(
     Id INT IDENTITY(1,1) NOT NULL,
     OtherColumns NVARCHAR(10) NULL
);

IDENTITY(1,1) means that numbers will be generated starting at one and incremented by one.

While it worked nicely for many years, it has two main disadvantages. The first is that you cannot manually provide a value for that column when executing an INSERT command, unless you temporarily disable the identity behavior using a specific SET option. Not exactly user-friendly. The second, more importantly, is that if you have more than one table with an identity column (and this could be pretty common), they won’t be aware of the existence of each other and thus different tables will have rows with the same Id values. This may not be a big deal, but sometimes you want to have rows that could be uniquely identified throughout all the database or at least among a group of logically related tables.

A sequence is exactly what you need to overcome old limitations and to get all the flexibility you need. A sequence is created at the database level and can be used anywhere you need:
CREATE SEQUENCE dbo.BookSequence
AS BIGINT
START WITH 1
INCREMENT BY 1;
CREATE TABLE dbo.SampleID
(
  Id INT NOT NULL DEFAULT(NEXT VALUE FOR dbo.BookSequence),
  OtherColumns NVARCHAR(10) NULL
);

With this approach, you can decide if you want to use the automatically generated value or provide one of your own when writing your INSERT statement, without having to set any specific option before executing it.
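For example, here’s a sketch of the options you have when inserting into the dbo.SampleID table defined above:
-- Use the default, sequence-generated value
INSERT INTO dbo.SampleID (OtherColumns) VALUES (N'auto');

-- Provide an explicit value of your own
INSERT INTO dbo.SampleID (Id, OtherColumns) VALUES (1000, N'manual');

-- Ask for the next sequence value explicitly
INSERT INTO dbo.SampleID (Id, OtherColumns) VALUES (NEXT VALUE FOR dbo.BookSequence, N'explicit');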

Sequences also generally offer better performance and more control over how numbers are generated. You can even reserve a range of numbers if you need to.

So, if you are starting to create a new database, the recommendation is to use one or more SEQUENCE to generate your Ids. Just keep in mind that neither IDENTITY nor SEQUENCE will give you any guarantee that no duplicate numbers will be generated: you can always reset the number generator and start from an already generated number. The SEQUENCE command even allows you to automatically restart from the beginning once a certain value has been reached, as sometimes this ability to cycle among a set of defined numbers could be very useful.
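As an illustrative sketch (the sequence name here is made up for the example), a sequence can be asked to cycle within a defined range, and a block of numbers can be reserved in advance with the sys.sp_sequence_get_range system procedure:
CREATE SEQUENCE dbo.TicketSequence
AS INT
START WITH 1
INCREMENT BY 1
MINVALUE 1
MAXVALUE 1000
CYCLE;

-- Reserve a block of 100 numbers in a single call
DECLARE @firstValue SQL_VARIANT;
EXEC sys.sp_sequence_get_range
     @sequence_name = N'dbo.TicketSequence',
     @range_size = 100,
     @range_first_value = @firstValue OUTPUT;
SELECT @firstValue AS FirstReservedValue;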

A sequence can be dropped using the DROP SEQUENCE command:
DROP SEQUENCE dbo.BookSequence;

Top and Offset/Fetch

Sometimes, especially if you are just exploring data, you don’t really need to get all the data in a table. This is particularly true on big tables and even more as we’re talking about a database in the cloud. There is no point moving around huge amounts of data if you’re not really using it. Just the first 100 rows, for example, could be good enough to peek at the data you have to work with.

Another reason you may want to limit the number of rows returned is pagination. For example, supporting pagination would be a common requirement if you are implementing a REST API that must expose an OData endpoint.

In Azure SQL, you have two options to make sure that only the requested number of rows are returned: TOP and the pair OFFSET/FETCH.

TOP is the easiest to use but also the most limited one. OFFSET/FETCH is a bit more complex but gives you more flexibility and makes pagination very easy to implement:
SELECT TOP (50)
     *
FROM
     [Sales].[Orders]
ORDER BY
     [ExpectedDeliveryDate] DESC;

The preceding code will return the first 50 rows, ordered by ExpectedDeliveryDate in descending order.

The next code sample will do the same, but thanks to the offset option, it will skip the first 50 rows and return the next 50. Basically, if you have a page size of 50 for the pagination feature you’re implementing, the code is effectively returning the second page of data. If OFFSET had been set to 0, the query would have produced the same results as TOP, but as you can see, OFFSET/FETCH provides a bit more flexibility. From a performance point of view, they are exactly the same:
SELECT
     *
FROM
     [Sales].[Orders]
ORDER BY
     [ExpectedDeliveryDate] DESC
OFFSET
     50 ROWS
FETCH
     NEXT 50 ROWS ONLY
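In a real pagination scenario, the offset is usually computed from a page number and a page size; a parameterized sketch could look like this:
DECLARE @PageNumber INT = 2;
DECLARE @PageSize INT = 50;

SELECT
     *
FROM
     [Sales].[Orders]
ORDER BY
     [ExpectedDeliveryDate] DESC
OFFSET
     (@PageNumber - 1) * @PageSize ROWS
FETCH
     NEXT @PageSize ROWS ONLY;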

Aggregations

Azure SQL offers extensive support for aggregations so that you can efficiently write queries that can aggregate and analyze datasets using the most advanced optimization techniques. You can get great performance while keeping the complexity of the code you need to write at the lowest level possible.

Grouping data

The GROUP BY clause can be used in a SELECT statement to apply an aggregation function, for example, COUNT or MAX (but there are many more) to all the groups that exist in a table. A group is defined as a set of rows that have the same value for the specified columns. For example, the following code returns which and how many products a warehouse has in stock, grouped by SupplierID and ColorID:
SELECT
     [SupplierID],
     [ColorID],
     COUNT(*) AS ProductsInStock,
     SUM(QuantityPerOuter) AS ProductsQuantity
FROM
     [Warehouse].[StockItems]
GROUP BY
     [SupplierID], [ColorID]
ORDER BY
     [SupplierID], [ColorID]
Here’s a sample result.
Figure 4-3

Data aggregated by Supplier and Color

Of course, you may want to JOIN the resulting data with other tables to return not only the IDs but also the name of the supplier and the color.

You can easily do that by starting to put together what you have learned so far, using a Common Table Expression and a couple of JOINs to elegantly solve the problem and return to your application one resultset with all the data needed for doing its job.

A fully commented code that shows such a query is available in the code accompanying the book.
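To give you an idea of one possible shape for such a query (the version in the accompanying code may differ slightly), here’s a sketch that joins the grouped data with the supplier and color names:
WITH cteStock AS
(
     SELECT
           [SupplierID],
           [ColorID],
           COUNT(*) AS ProductsInStock,
           SUM(QuantityPerOuter) AS ProductsQuantity
     FROM
           [Warehouse].[StockItems]
     GROUP BY
           [SupplierID], [ColorID]
)
SELECT
     sp.SupplierName,
     c.ColorName,
     s.ProductsInStock,
     s.ProductsQuantity
FROM
     cteStock AS s
INNER JOIN
     [Purchasing].[Suppliers] AS sp ON sp.SupplierID = s.SupplierID
LEFT JOIN
     [Warehouse].[Colors] AS c ON c.ColorID = s.ColorID
ORDER BY
     sp.SupplierName, c.ColorName;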

Multiple grouping

A very interesting feature that Azure SQL provides is the ability to perform aggregations on different groups at the same time. It may sound confusing, so an example will help. Let’s take the previous sample and make it more realistic. You need to return data that can be used to create a matrix report where the end user can analyze how many products there are in the warehouse per supplier and per color. The rows and columns will contain suppliers and colors, respectively. At the intersection of a row and a column, one can find the number of products in stock with that color and from that supplier. Since it is a matrix report, the user expects to have the total number of products per supplier in the rightmost column and the total number of products per color in the last row.
Figure 4-4

A typical matrix report

This means that you must group data using four different settings:
  • Supplier and Color

  • Supplier

  • Color

  • All (the grand total across all products)

This would usually mean four different queries: on large datasets, performance will surely be affected, as you must read and aggregate the data four times.

I’m sure as a developer you are already thinking of using some caching to avoid this waste of resources, and of moving some calculations into your code so that you don’t have to read the same data four times – clever, but that would increase the complexity of your code. If possible, we would like to avoid that. Luckily, such additional complexity is not needed, and you can keep your code simple and lean. All you need to do is ask Azure SQL to do this multiple concurrent aggregation for you, using the GROUPING SETS feature:
SELECT
     [SupplierID],
     [ColorID],
     COUNT(*) AS ProductsInStock,
     SUM(QuantityPerOuter) AS ProductsQuantity,
     GROUPING(ColorID) as IsAllColors,
     GROUPING(SupplierID) as IsAllSuppliers
FROM
     [Warehouse].[StockItems]
GROUP BY
     GROUPING SETS
     (
           ([SupplierID], [ColorID]),
           ([SupplierID]),
           ([ColorID]),
           ()
     )
ORDER BY
     IsAllColors, IsAllSuppliers
;
The result of the preceding query will contain data grouped by all four defined grouping sets, so that you will have all the needed data without any added complexity – and with much better performance.
Figure 4-5

Grouping Sets in action

The GROUPING function present in the SELECT list helps the end user understand whether a row represents the total for a specific value. For example, a row with IsAllColors equal to one is a row that must be used in the rightmost part of the matrix, as it represents the total products available for a specific supplier, regardless of the color.

Windowing Functions

Windowing Functions and all the related features are probably one of the most powerful constructs that you can use in Azure SQL to manipulate data. The number of problems that can be elegantly solved with them is amazing, and there are entire books dedicated only to explaining their usage in detail. The goal of this section is to make you familiar with their usage so that you can start taking advantage of their power right away.

While extremely powerful, Windowing Functions are quite easy to understand. In very simple terms, they allow you to access data that sits before and after the row you are currently processing.

The easiest practical example is provided by the calculation of a running total. A running total is defined as “a total that is continually adjusted to take account of items as they are added.”

Here’s an example.
Figure 4-6

Calculating the Running Total for an Order

To calculate the running total for the third line, which equals 27, Azure SQL, from a logical perspective, has to get the values of all the previous rows for the Quantity column and sum them together. The same goes for when it has to calculate the running total value for the fourth row, and so on.

These additional rows that need to be taken into consideration for the calculation represent the window of data on which Azure SQL is operating. For the running total, that window starts at the very first line of the table and ends at the current row.

Once the window is defined, you can tell Azure SQL which function you want to use on the data available in the window. Aggregate functions like SUM are common. And in fact, SUM is exactly what you need to implement a running total. The code needed to generate the result shown in the previous picture is the following:
SELECT
     [OrderLineID],
     [Description],
     [Quantity],
     SUM(Quantity) OVER (ORDER BY [OrderLineID] ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS RunningTotal
FROM
     [Sales].[OrderLines]
WHERE
     [OrderID] = 37
Another common aggregate function is AVG, which gives you an easy way to calculate moving averages. You just need to define how big the window used for calculating the average should be, for example:
AVG(Quantity) OVER (ORDER BY [OrderLineID] ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS MovingAvg
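Plugged into a complete query against the same order used for the running total sample, a moving-average sketch could look like this (Quantity is an integer, so cast it if you want a decimal average):
SELECT
     [OrderLineID],
     [Description],
     [Quantity],
     AVG(Quantity) OVER (ORDER BY [OrderLineID] ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS MovingAvg
FROM
     [Sales].[OrderLines]
WHERE
     [OrderID] = 37;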
The size of the window is called a frame. Besides aggregates, you can also use analytical functions that increase the power of windowing functions incredibly. For example, you can use the LAG function to access a value in a row preceding the current one. Again, a practical sample will make this very easy to understand. Let’s say you want to calculate how much time passes between two consecutive orders for a specific customer.
Figure 4-7

Calculating elapsed days between orders

For each row, you would need to access the previous row, take the order date, and compare it with the one in the current row. Thanks to windowing functions, this is very easy:
SELECT
     [OrderID],
     [OrderDate],
     DATEDIFF(
           [DAY],
           LAG(OrderDate, 1) OVER (ORDER BY [OrderDate]),
           [OrderDate]
     ) AS ElapsedDays
FROM
     [Sales].[Orders]
WHERE
     [CustomerID] = 832
ORDER BY
     [OrderDate]

This is very simple and elegant code that can also be nicely optimized by the Azure SQL engine; it also saves you from the huge amount of code that you would have had to write otherwise.

So far, we have focused on just one specific CustomerID or OrderID. But, going back to the running total sample, what if we want to calculate the running total for all the orders in the entire table? Of course, we don’t want to mix lines of Order Id 37 with Order Id 39. Windowing functions can take care of this for us too. We just need to specify how to partition the calculation process. In this context, partitioning allows us to tell Azure SQL when it should start a new calculation from scratch. You can think of it as something like “as long as you are processing values that belong to the same group, keep accumulating values.” Let’s modify the running total sample to use two orders instead of only one.
Figure 4-8

Partitioning calculations per order

As you can see, the running total must start from scratch again once the Order ID changes from 37 to 39, since we want to calculate the Running Total for each order.

The code to do so is very similar to the original one, with just the addition of PARTITION BY [OrderID], which tells Azure SQL to keep accumulating values as long as the value in the OrderID column doesn’t change. When it changes, Azure SQL must start a new calculation:
SELECT
     [OrderID],
     [OrderLineID],
     [Description],
     [Quantity],
     SUM(Quantity) OVER (
           PARTITION BY [OrderID]
           ORDER BY [OrderLineID] ROWS BETWEEN
                UNBOUNDED PRECEDING AND
                CURRENT ROW
     ) AS RunningTotal
FROM
     [Sales].[OrderLines]
WHERE
     [OrderID] in (37, 39)

Windowing functions can help a lot both in simplifying your code and in achieving great performance, and we have just seen a very small part of what they can do, so make sure to check them out and use them whenever appropriate.

Bulk operations

If you need to load a lot of data into Azure SQL, where a lot means hundreds of thousands, millions, or even billions of rows, you need to use a specific API called the Bulk Copy API. As it only allows inserting data into a destination table, it is sometimes also referred to as the Bulk Insert API or Bulk Load API.

This API is called directly by client libraries, and it allows you to massively load data with extreme speed, easily loading tens of thousands of rows, and more, per second. Of course, I’m talking about performance within Azure, from a Web API application or a VM to the Azure SQL database being used in the back end. If the API or the application is in another cloud or running on premises, the speed at which you can load data will depend on the local network speed.

To run a bulk load, .NET provides the SqlBulkCopy class , while Java has the SQLServerBulkCopy class. Here’s an excerpt of a sample written using .NET:
using(var conn = new SqlConnection(Environment.GetEnvironmentVariable("CS_AzureSQL")))
{
    conn.Open();
    var bc = new SqlBulkCopy(conn);
    bc.DestinationTableName = "dbo.BulkLoadedUsers";
    bc.BatchSize = 10000;
    bc.WriteToServer(userDataTable);
}

Just to give you an idea of how fast this is (and you can test it yourself, as the sample is part of the code accompanying the book): if executed from an Azure VM in the same region as the database you are loading data into, it will load 100,000 rows in 0.91 seconds, or about 110,000 rows per second, even using a small BC_Gen5_2 database. Guess we can call it fast, right?

If you want to know more

In this chapter, you learned a lot: the foundations needed to understand why pushing compute to data is important, along with the knowledge needed to do it efficiently, starting from the power of the declarative approach up to manipulating data using Windowing Functions.

We also briefly discussed normalization, why it is important, and how that idea is useful beyond relational databases. Here’s a list of resources to dig deeper into everything you have just learned: