Core Transact-SQL SELECT statement elements

The simplest query to retrieve the data that you can write includes the SELECT and the FROM clauses. In the SELECT clause, you can use the star character (*), literally SELECT *, to denote that you need all columns from a table in the result set. The following code switches to the WideWorldImportersDW database context and selects all data from the Dimension.Customer table:

USE WideWorldImportersDW; 
SELECT * 
FROM Dimension.Customer;

The code returns 403 rows, all customers with all columns.

Using SELECT * is not recommended in production. Such queries can return an unexpected result when the table structure changes, and is also not suitable for good optimization.

Better than using SELECT * is to explicitly list only the columns you need. This means you are returning only a projection on the table. The following example selects only four columns from the table:

SELECT [Customer Key], [WWI Customer ID], 
  [Customer], [Buying Group] 
FROM Dimension.Customer;

Here is the shortened result, limited to the first three rows only:

You can see that the column names in the WideWorldImportersDW database include spaces. Names that include spaces are called delimited identifiers. In order to make SQL Server properly understand them as column names, you must enclose delimited identifiers in parentheses. However, if you prefer to have names without spaces, or if you use computed expressions in the column list, you can add column aliases. The following query returns completely the same data as the previous one, just with columns renamed by aliases to avoid delimited names:

SELECT [Customer Key] AS CustomerKey, 
  [WWI Customer ID] AS CustomerId, 
  [Customer], 
  [Buying Group] AS BuyingGroup 
FROM Dimension.Customer;

You might have noticed in the result set returned from the last two queries that there is also a row in the table for an unknown customer. You can filter this row with the WHERE clause:

SELECT [Customer Key] AS CustomerKey, 
  [WWI Customer ID] AS CustomerId, 
  [Customer],  
  [Buying Group] AS BuyingGroup 
FROM Dimension.Customer 
WHERE [Customer Key] <> 0;

In a relational database, you typically have data spread in multiple tables. Each table represents a set of entities of the same kind, such as customers in the examples you have seen so far. In order to get result sets that are meaningful for the business your database supports, most of the time you need to retrieve data from multiple tables in the same query. You need to join two or more tables based on some conditions. The most frequent kind of a join is the inner join. Rows returned are those for which the condition in the join predicate for the two tables joined evaluates to true. Note that in a relational database, you have three-valued logic, because there is always a possibility that a piece of data is unknown. You mark the unknown with the NULL keyword. A predicate can thus evaluate to true, false, or NULL. For an inner join, the order of the tables involved in the join is not important. In the following example, you can see the Fact.Sale table joined with an INNER JOIN to the Dimension.Customer table:

SELECT c.[Customer Key] AS CustomerKey, 
  c.[WWI Customer ID] AS CustomerId, 
  c.[Customer],  
  c.[Buying Group] AS BuyingGroup, 
  f.Quantity, 
  f.[Total Excluding Tax] AS Amount, 
  f.Profit 
FROM Fact.Sale AS f 
  INNER JOIN Dimension.Customer AS c 
    ON f.[Customer Key] = c.[Customer Key];

In the query, you can see that table aliases are used. If a column's name is unique across all tables in the query, then you can use it without the table name. If not, you need to use the table name in front of the column, to avoid ambiguous column names, in the format table.column. In the previous query, the [Customer Key] column appears in both tables. Therefore, you need to precede this column name with the table name of its origin to avoid ambiguity. You can shorten the two-part column names by using table aliases. You specify table aliases in the FROM clause. Once you specify table aliases, you must always use the aliases; you can't refer to the original table names in that query anymore. Please note that a column name might be unique in the query at the moment when you write the query. However, somebody could add a column later with the same name in another table involved in the query. If the column name is not preceded by an alias or by the table name, you would get an error when executing the query because of the ambiguous column name. In order to make the code more stable and more readable, you should always use table aliases for each column in the query.

The previous query returns 228,265 rows. It is always recommended that you know, at least approximately, the number of rows your query should return. This number is the first control of the correctness of the result set, or in other words, whether the query is written in a logically correct way. The query returns the unknown customer and the orders associated with this customer—or, more precisely, associated to this placeholder for an unknown customer. Of course, you can use the WHERE clause to filter the rows in a query that joins multiple tables, as you would for a single table query. The following query filters the unknown customer rows:

SELECT c.[Customer Key] AS CustomerKey, 
  c.[WWI Customer ID] AS CustomerId, 
  c.[Customer],  
  c.[Buying Group] AS BuyingGroup, 
  f.Quantity, 
  f.[Total Excluding Tax] AS Amount, 
  f.Profit 
FROM Fact.Sale AS f 
  INNER JOIN Dimension.Customer AS c 
    ON f.[Customer Key] = c.[Customer Key] 
WHERE c.[Customer Key] <> 0;

The query returns 143,968 rows. You can see that a lot of sales are associated with the unknown customer.

Of course, the Fact.Sale table cannot be joined to the Dimension.Customer table. The following query joins it to the Dimension.Date table. Again, the join performed is an inner join:

SELECT d.Date, f.[Total Excluding Tax], 
  f.[Delivery Date Key] 
FROM Fact.Sale AS f 
  INNER JOIN Dimension.Date AS d 
    ON f.[Delivery Date Key] = d.Date;

The query returns 227,981 rows. The query that joined the Fact.Sale table to the Dimension.Customer table returned 228,265 rows. It looks as if not all Fact.Sale table rows have a known delivery date, and not all rows can match the Dimension.Date table rows. You can use an outer join to check this.

With an outer join, you preserve the rows from one or both tables, even if they don't have a match in the other table. The result set returned includes all of the matched rows like you get from an inner join plus the preserved rows. Within an outer join, the order of the tables involved in the join might be important. If you use LEFT OUTER JOIN, then the rows from the left table are preserved. If you use RIGHT OUTER JOIN, then the rows from the right table are preserved. Of course, in both cases, the order of the tables involved in the join is important. With a FULL OUTER JOIN, you preserve the rows from both tables, and the order of the tables is not important. The following query preserves the rows from the Fact.Sale table, which is on the left side of the join to the Dimension.Date table. In addition, the query sorts the result set by the invoice date descending using the ORDER BY clause:

SELECT d.Date, f.[Total Excluding Tax], 
  f.[Delivery Date Key], f.[Invoice Date Key] 
FROM Fact.Sale AS f 
  LEFT OUTER JOIN Dimension.Date AS d 
    ON f.[Delivery Date Key] = d.Date 
ORDER BY f.[Invoice Date Key] DESC;

The query returns 228,265 rows. Here is the partial result of the query:

    Date       Total Excluding Tax  Delivery Date Key Invoice Date Key
    ---------- -------------------- ----------------- ----------------
    NULL       180.00               NULL              2016-05-31
    NULL       120.00               NULL              2016-05-31
    NULL       160.00               NULL              2016-05-31
    ...          ...                    ...                 ...
    2016-05-31 2565.00              2016-05-31        2016-05-30
    2016-05-31 88.80                2016-05-31        2016-05-30
    2016-05-31 50.00                2016-05-31        2016-05-30

For the last invoice date (2016-05-31), the delivery date is NULL. The NULL in the Date column from the Dimension.Date table is there because the data from this table is unknown for the rows with an unknown delivery date in the Fact.Sale table.

Joining more than two tables is not tricky if all of the joins are inner joins. The order of joins is not important. However, you might want to execute an outer join after all of the inner joins. If you don't control the join order with the outer joins, it might happen that a subsequent inner join filters out the preserved rows of an outer join. You can control the join order with parentheses. The following query joins the Fact.Sale table with an inner join to the Dimension.Customer, Dimension.City, Dimension.[Stock Item], and Dimension.Employee tables, and with a left outer join to the Dimension.Date table:

SELECT cu.[Customer Key] AS CustomerKey, cu.Customer, 
  ci.[City Key] AS CityKey, ci.City,  
  ci.[State Province] AS StateProvince, ci.[Sales Territory] AS SalesTeritory, 
  d.Date, d.[Calendar Month Label] AS CalendarMonth,  
  d.[Calendar Year] AS CalendarYear, 
  s.[Stock Item Key] AS StockItemKey, s.[Stock Item] AS Product, s.Color, 
  e.[Employee Key] AS EmployeeKey, e.Employee, 
  f.Quantity, f.[Total Excluding Tax] AS TotalAmount, f.Profit 
FROM (Fact.Sale AS f 
  INNER JOIN Dimension.Customer AS cu 
    ON f.[Customer Key] = cu.[Customer Key] 
  INNER JOIN Dimension.City AS ci 
    ON f.[City Key] = ci.[City Key] 
  INNER JOIN Dimension.[Stock Item] AS s 
    ON f.[Stock Item Key] = s.[Stock Item Key] 
  INNER JOIN Dimension.Employee AS e 
    ON f.[Salesperson Key] = e.[Employee Key]) 
  LEFT OUTER JOIN Dimension.Date AS d 
    ON f.[Delivery Date Key] = d.Date;

The query returns 228,265 rows. Note that with the usage of the parentheses, the order of joins is defined in the following way:

Perform all inner joins, with an arbitrary order among them
Execute the left outer join after all of the inner joins

So far, I have tacitly assumed that the Fact.Sale table has 228,265 rows, and that the previous query needed only one outer join of the Fact.Sale table with the Dimension.Date to return all of the rows. It would be good to check this number in advance. You can check the number of rows by aggregating them using the COUNT(*) aggregate function. The following query introduces that function:

SELECT COUNT(*) AS SalesCount 
FROM Fact.Sale;

Now you can be sure that the Fact.Sale table has exactly 228,265 rows.

You will often need to aggregate data in groups. This is the point where the GROUP BY clause becomes handy. The following query aggregates the sales data for each customer:

SELECT c.Customer, 
  SUM(f.Quantity) AS TotalQuantity, 
  SUM(f.[Total Excluding Tax]) AS TotalAmount, 
  COUNT(*) AS SalesCount 
FROM Fact.Sale AS f 
  INNER JOIN Dimension.Customer AS c 
    ON f.[Customer Key] = c.[Customer Key] 
WHERE c.[Customer Key] <> 0 
GROUP BY c.Customer;

The query returns 402 rows, one for each known customer. In the SELECT clause, you can have only the columns used for grouping, or aggregated columns. You need to get a scalar, a single aggregated value for each row and for each column not included in the GROUP BY list.

Sometimes you need to filter aggregated data. For example, you might need to find only frequent customers, defined as customers with more than 400 rows in the Fact.Sale table. You can filter the result set on the aggregated data by using the HAVING clause, as the following query shows:

SELECT c.Customer, 
  SUM(f.Quantity) AS TotalQuantity, 
  SUM(f.[Total Excluding Tax]) AS TotalAmount, 
  COUNT(*) AS SalesCount 
FROM Fact.Sale AS f 
  INNER JOIN Dimension.Customer AS c 
    ON f.[Customer Key] = c.[Customer Key] 
WHERE c.[Customer Key] <> 0 
GROUP BY c.Customer 
HAVING COUNT(*) > 400;

The query returns 45 rows for 45 most frequent known customers. Note that you can't use column aliases from the SELECT clause in any other clause introduced in the previous query. The SELECT clause logically executes after all other clauses from the query, and the aliases are not known yet. However, the ORDER BY clause executes after the SELECT clause, and therefore the columns aliases are already known and you can refer to them. The following query shows all of the basic SELECT statement clauses used together to aggregate the sales data over the known customers, filters the data to include the frequent customers only, and sorts the result set descending by the number of rows of each customer in the Fact.Sale table:

-- Note: can use column aliases in ORDER BY 
SELECT c.Customer, 
  SUM(f.Quantity) AS TotalQuantity, 
  SUM(f.[Total Excluding Tax]) AS TotalAmount, 
  COUNT(*) AS SalesCount 
FROM Fact.Sale AS f 
  INNER JOIN Dimension.Customer AS c 
    ON f.[Customer Key] = c.[Customer Key] 
WHERE c.[Customer Key] <> 0 
GROUP BY c.Customer 
HAVING COUNT(*) > 400 
ORDER BY SalesCount DESC;

The query returns 45 rows. Here is the shortened result set:

Table of Contents for Core Transact-SQL SELECT statement elements

Create new playlist

Sign In

Sign Up

Table of Contents for
Core Transact-SQL SELECT statement elements