Chapter 11. Query expressions and LINQ to Objects

This chapter covers

  • Streaming sequences of data

  • Deferred execution

  • Standard query operators

  • Query expression translation

  • Range variables and transparent identifiers

  • Projecting, filtering, and sorting

  • Joining and grouping

You may well be tired of all the hyperbole around LINQ by now. We’ve seen some examples in chapters 1 and 3, and you’ve almost certainly read some examples and articles on the Web. This is where we separate myth from reality:

  • LINQ isn’t going to turn the most complicated query into a one-liner.

  • LINQ isn’t going to mean you never need to look at raw SQL again.

  • LINQ isn’t going to magically imbue you with architectural genius.

Given all that, LINQ is still going to change how most of us think about code. It’s not a silver bullet, but it’s a very powerful tool to have in your development armory. We’ll explore two distinct aspects of LINQ: the framework support, and the compiler translation of query expressions. The latter can look odd to start with, but I’m sure you’ll learn to love them.

Query expressions are effectively “preprocessed” by the compiler into “normal” C# 3, which is then compiled in a perfectly ordinary way. This is a neat way of integrating queries into the language without changing its semantics all over the place: it’s syntactic sugar in its purest form. Most of this chapter is a list of the preprocessing translations performed by the compiler, as well as the effects achieved when the result uses the Enumerable extension methods.

You won’t see any SQL or XML here—all that awaits us in chapter 12. However, with this chapter as a foundation you should be able to understand what the more exciting LINQ providers do when we meet them. Call me a spoilsport, but I want to take away some of their magic. Even without the air of mystery, LINQ is still very cool.

First let’s consider what LINQ is in the first place, and how we’re going to explore it.

Introducing LINQ

A topic as large as LINQ needs a certain amount of background before we’re ready to see it in action. In this section we’ll look at what LINQ is (as far as we can discern it), a few of the core principles behind it, and the data model we’re going to use for all the examples in this chapter and the next. I know you’re likely to be itching to get into the code, so I’ll keep it fairly brief. Let’s start with an issue you’d think would be quite straightforward: what counts as LINQ?

What’s in a name?

LINQ has suffered from the same problem that .NET had early in its life: it’s a term that has never been precisely defined. We know what it stands for: Language INtegrated Query—but that doesn’t actually help much. LINQ is fundamentally about data manipulation: it’s a means of accessing data from different sources in a unified manner. It allows you to express logic about that manipulation in the language of your choice—C# in our case—even when the logic needs to be executed in an entirely different environment. It also attempts to eliminate (or at least vastly reduce) the impedance mismatch[1]—the difficulties introduced when you need to integrate two environments with very different data models, such as the object-oriented model of .NET and the relational data model of SQL.

All of the C# 3 features we’ve seen so far (with the exception of partial methods and automatic properties) contribute to the LINQ story, so how many of these do you need to use before you can consider yourself to be using LINQ? Do you need to be using query expressions, which are the final genuine language feature in C# 3? If you use the extension methods of Enumerable but do it without query expressions, does that count? What about plain lambda expressions without any querying going on?

As a concrete example of this question, let’s consider four slightly different ways of achieving the same goal. Suppose we have a List<Person> (where Person has properties Name and Age as in chapter 8) and we wish to apply a transformation so that we get a sequence of strings, just the names of the people in the list. Using lambda expressions in preference to anonymous methods, we can still use the standard ConvertAll method of List<T>, as follows:

var names = people.ConvertAll(p => p.Name);

Alternatively we can use the Select method of the Enumerable class, because List<T> implements IEnumerable<T>. The result of this will be an IEnumerable<string> rather than a List<string>, but in many cases that’s fine. The code becomes

var names = Enumerable.Select(people, p => p.Name);

Knowing that Select is an extension method, we can simplify things a little bit:

var names = people.Select(p => p.Name);

Finally, we could use a query expression:

var names = from p in people select p.Name;

Four one-liners, all of which accomplish much the same goal.[2] Which of them count as LINQ? My personal answer is that the first isn’t really using LINQ (even though it uses a lambda expression), but the rest are LINQ-based solutions. The second form certainly isn’t idiomatic, but the last two are both perfectly respectable LINQ ways of achieving the same aim, and in fact all of the last three samples compile to the same IL.

In the end, there are no bonus points available for using LINQ, so the question is moot. However, it’s worth being aware that when a fellow developer says they’re using LINQ to solve a particular problem, that statement could have a variety of meanings.

The long and the short of it is that LINQ is a collection of technologies, including the language features of C#3 (and VB9) along with the framework libraries provided as part of .NET 3.5. If your project targets .NET 3.5, you can use LINQ as much or as little as you like. You can use just the support for in-memory querying, otherwise known as LINQ to Objects, or providers that target XML documents, relational databases, or other data sources. The only provider we’ll use in this chapter is LINQ to Objects, and in the next chapter we’ll see how the same concepts apply to the other providers.

There are a few concepts that are vital to LINQ. We’ve seen them tangentially in chapter 10, but let’s look at them a little more closely.

Fundamental concepts in LINQ

Most of this chapter is dedicated to exactly what the C# 3 compiler does with query expressions, but it won’t make much sense until we have a better understanding of the ideas underlying LINQ as a whole. One of the problems with reducing the impedance mismatch between two data models is that it usually involves creating yet another model to act as the bridge. This section describes the LINQ model, beginning with its most important aspect: sequences.

Sequences

You’re almost certainly familiar with the concept of a sequence: it’s encapsulated by the IEnumerable and IEnumerable<T> interfaces, and we’ve already looked at those fairly closely in chapter 6 when we studied iterators. Indeed, in many ways a sequence is just a slightly more abstract way of thinking of an iterator. A sequence is like a conveyor belt of items—you fetch them one at a time until either you’re no longer interested or the sequence has run out of data. There are three other fairly obvious examples: a Stream represents a sequence of bytes, a TextReader represents a sequence of characters, and a DataReader represents a sequence of rows from a database[3].

The key difference between a sequence and other collection data structures such as lists and arrays is that when you’re reading from a sequence, you don’t generally know how many more items are waiting, or have access to arbitrary items—just the current one. Indeed, some sequences could be never-ending: you could easily have an infinite sequence of random numbers, for example. Only one piece of data is provided at a time by the sequence, so you can implement an infinite sequence without having infinite storage. Lists and arrays can act as sequences, of course—just as List<T> implements IEnumerable<T>—but the reverse isn’t always true. You can’t have an infinite array or list, for example.

Sequences are the bread and butter of LINQ. When you read a query expression, it’s really helpful to think of the sequences involved: there’s always at least one sequence to start with, and it’s usually transformed into other sequences along the way, possibly being joined with yet more sequences. We’ll see examples of this as we go further into the chapter, but I can’t emphasize enough how important it is. Examples of LINQ queries are frequently provided on the Web with very little explanation: when you take them apart by looking at the sequences involved, things make a lot more sense. As well as being an aid to reading code, it can also help a lot when writing it. Thinking in sequences can be tricky—it’s a bit of a mental leap sometimes—but if you can get there, it will help you immeasurably when you’re working with LINQ.

Note

Think in sequences!

As a simple example, let’s take another query expression running against a list of people. We’ll apply the same transformation as before, but with a filter involved that keeps only adults in the resulting sequence:

var adultNames = from person in people
                 where person.Age >= 18
                 select person.Name;

Figure 11.1 shows this query expression graphically, breaking it down into its individual steps. I’ve included a number of similar figures in this chapter, but unfortunately for complicated queries there is simply not enough room on the printed page to show as much data as we might like. More detailed diagrams are available on the book’s website.

A simple query expression broken down into the sequences and transformations involved

Figure 11.1. A simple query expression broken down into the sequences and transformations involved

Each arrow represents a sequence—the description is on the left side, and some sample data is on the right. Each box is a transformation from our query expression. Initially, we have the whole family (as Person objects); then after filtering, the sequence only contains adults (again, as Person objects); and the final result has the names of those adults as strings. Each step simply takes one sequence and applies an operation to produce a new sequence. The result isn’t the strings “Holly” and “Jon”—instead, it’s an IEnumerable<string>, which, when asked for its elements one by one, will first yield “Holly” and then “Jon.”

This example was straightforward to start with, but we’ll apply the same technique later to more complicated query expressions in order to understand them more easily. Some advanced operations involve more than one sequence as input, but it’s still a lot less to worry about than trying to understand the whole query in one go.

So, why are sequences so important? They’re the basis for a streaming model for data handling—one that allows us to process data only when we need to.

Deferred Execution and Streaming

When the query expression shown in figure 11.1 is created, no data is processed. The original list of people isn’t accessed at all. Instead, a representation of the query is built up in memory. Delegate instances are used to represent the predicate testing for adulthood and the conversion from a person to that person’s name. It’s only when the resulting IEnumerable<string> is asked for its first element that the wheels start turning.

This aspect of LINQ is called deferred execution. When the first element of the result is requested, the Select transformation asks the Where transformation for its first element. The Where transformation asks the list for its first element, checks whether the predicate matches (which it does in this case), and returns that element back to Select. That in turn extracts the name and returns it as the result.

That’s all a bit of a mouthful, but a sequence diagram makes it all much clearer. I’m going to collapse the calls to MoveNext and Current to a single fetch operation: it makes the diagram a lot simpler. Just remember that each time the fetch occurs, it’s effectively checking for the end of the sequence as well. Figure 11.2 shows the first few stages of our sample query expression in operation, when we print out each element of the result using a foreach loop.

Sequence diagram of the execution of a query expression

Figure 11.2. Sequence diagram of the execution of a query expression

As you can see in figure 11.2, only one element of data is processed at a time. If we decided to stop printing output after writing “Holly,” we would never execute any of the operations on the other elements of the original sequence. Although several stages are involved here, processing data in a streaming manner like this is efficient and flexible. In particular, regardless of how much source data there is, you don’t need to know about more than one element of it at any one point in time. That’s the difference between using List.ConvertAll and Enumerable.Select: the former creates a whole new in-memory list, whereas the latter just iterates through the original sequence, yielding a single converted element at a time.

This is a best-case scenario, however. There are times where in order to fetch the first result of a query, you have to evaluate all of the data from the source. We’ve already seen one example of this in the previous chapter: the Enumerable.Reverse method needs to fetch all the data available in order to return the last original element as the first element of the resulting sequence. This makes Reverse a buffering operation—which can have a huge effect on the efficiency (or even feasibility) of your overall operation. If you can’t afford to have all the data in memory at one time, you can’t use buffering operations.

Just as the streaming aspect depends on which operation you perform, some transformations take place as soon as you call them, rather than using deferred execution. This is called immediate execution. Generally speaking, operations that return another sequence (usually an IEnumerable<T> or IQueryable<T>) use deferred execution, whereas operations that return a single value use immediate execution.

The operations that are widely available in LINQ are known as the standard query operators—let’s take a brief look at them now.

Standard Query Operators

LINQ’s standard query operators are a collection of transformations that have well-understood meanings. LINQ providers are encouraged to implement as many of these operators as possible, making the implementation obey the expected behavior. This is crucial in providing a consistent query framework across multiple data sources. Of course, some LINQ providers may expose more functionality, and some of the operators may not map appropriately to the target domain of the provider—but at least the opportunity for consistency is there.

C# 3 has support for some of the standard query operators built into the language via query expressions, but they can always be called manually. You may be interested to know that VB9 has more of the operators present in the language: as ever, there’s a trade-off between the added complexity of including a feature in the language and the benefits that feature brings. Personally I think the C# team has done an admirable job: I’ve always been a fan of a small language with a large library behind it.

We’ll see some of these operators in our examples as we go through this chapter and the next, but I’m not aiming to give a comprehensive guide to them here: this book is primarily about C#, not the whole of LINQ. You don’t need to know all of the operators in order to be productive in LINQ, but your experience is likely to grow over time. The appendix gives a brief description of each of the standard query operators, and MSDN gives more details of each specific overload. When you run into a problem, check the list: if it feels like there ought to be a built-in method to help you, there probably is!

Having mentioned examples, it’s time to introduce the data model that most of the rest of the sample code in this chapter will use.

Defining the sample data model

In section 10.3.5 I gave a brief example of bug tracking as a real use for extension methods and lambda expressions. We’ll use the same idea for almost all of the sample code in this chapter—it’s a fairly simple model, but one that can be manipulated in many different ways to give useful information. It’s also a domain that most professional developers are familiar with, while not involving frankly tedious relationships between customers and orders. You can find further examples using the same model in the downloadable source code, along with detailed comments. Some of these are more complicated than the ones presented in this chapter and provide good exercises for understanding larger query expressions.

Our fictional setting is SkeetySoft, a small software company with big ambition. The founders have decided to attempt to create an office suite, a media player, and an instant messaging application. After all, there are no big players in those markets, are there?

The development department of SkeetySoft consists of five people: two developers (Deborah and Darren), two testers (Tara and Tim), and a manager (Mary). There’s currently has a single customer: Colin. The aforementioned products are SkeetyOffice, SkeetyMediaPlayer, and SkeetyTalk, respectively. We’re going to look at defects logged during August 2007, using the data model shown in figure 11.3.

Class diagram of the SkeetySoft defect data model

Figure 11.3. Class diagram of the SkeetySoft defect data model

As you can see, we’re not recording an awful lot of data. In particular, there’s no real history to the defects, but there’s enough here to let us demonstrate the query expression features of C# 3. For the purposes of this chapter, all the data is stored in memory. We have a class named SampleData with properties AllDefects, AllUsers, AllProjects, and AllSubscriptions, which each return an appropriate type of IEnumerable<T>. The Start and End properties return DateTime instances for the start and end of August respectively, and there are nested classes Users and Projects within SampleData to provide easy access to a particular user or project. The one type that may not be immediately obvious is NotificationSubscription: the idea behind this is to send an email to the specified address every time a defect is created or changed in the relevant project.

There are 41 defects in the sample data, created using C#3 object initializers. All of the code is available on the book’s website, but I won’t include the sample data itself in this chapter.

Now that the preliminaries are dealt with, let’s get cracking with some queries!

Simple beginnings: selecting elements

Having brought up some general LINQ concepts beforehand, I’ll introduce the concepts that are specific to C#3 as they arise in the course of the rest of the chapter. We’re going to start with a simple query (even simpler than the ones we’ve seen so far in the chapter) and work up to some quite complicated ones, not only building up your expertise of what the C#3 compiler is doing, but also teaching you how to read LINQ code.

All of our examples will follow the pattern of defining a query, and then printing the results to the console. We’re not interested in binding queries to data grids or anything like that—it’s all important, but not directly relevant to learning C#3.

We can use a simple expression that just prints out all our users as the starting point for examining what the compiler is doing behind the scenes and learning about range variables.

Starting with a source and ending with a selection

Every query expression in C# 3 starts off in the same way—stating the source of a sequence of data:

from element in source

The element part is just an identifier, with an optional type name before it. Most of the time you won’t need the type name, and we won’t have one for our first example. Lots of different things can happen after that first clause, but sooner or later you always end with a select clause or a group clause. We’ll start off with a select clause to keep things nice and simple. The syntax for a select clause is also easy:

select expression

The select clause is known as a projection. Combining the two together and using the most trivial expression we can think of gives a simple (and practically useless) query, as shown in listing 11.1.

Example 11.1. Trivial query to print the list of users

var query = from user in SampleData.AllUsers
            select user;

foreach (var user in query)
{
    Console.WriteLine(user);
}

The query expression is the part highlighted in bold. I’ve overridden ToString for each of the entities in the model, so the results of listing 11.1 are as follows:

User: Tim Trotter (Tester)
User: Tara Tutu (Tester)
User: Deborah Denton (Developer)
User: Darren Dahlia (Developer)
User: Mary Malcop (Manager)
User: Colin Carton (Customer)

You may be wondering how useful this is as an example: after all, we could have just used SampleData.AllUsers directly in our foreach statement. However, we’ll use this query expression—however trivial—to introduce two new concepts. First we’ll look at the general nature of the translation process the compiler uses when it encounters a query expression, and then we’ll discuss range variables.

Compiler translations as the basis of query expressions

The C#3 query expression support is based on the compiler translating query expressions into “normal” C# code. It does this in a mechanical manner that doesn’t try to understand the code, apply type inference, check the validity of method calls, or any of the normal business of a compiler. That’s all done later, after the translation. In many ways, this first phase can be regarded as a preprocessor step. The compiler translates listing 11.1 into listing 11.2. before doing the real compilation.

Example 11.2. The query expression of listing 11.1 translated into a method call

var query = SampleData.AllUsers.Select(user => user);

foreach (var user in query)
{
   Console.WriteLine(user);
}

The C#3 compiler translates the query expression into exactly that code before properly compiling it further. In particular, it doesn’t assume that it should use Enumerable. Select, or that List<T> will contain a method called Select. It merely translates the code and then lets the next phase of compilation deal with finding an appropriate method—whether as a straightforward member or as an extension method. The parameter can be a suitable delegate type or an Expression<T> for an appropriate type T.

This is where it’s important that lambda expressions can be converted into both delegate instances and expression trees. All the examples in this chapter will use delegates, but we’ll see how expression trees are used when we look at the other LINQ providers in chapter 12. When I present the signatures for some of the methods called by the compiler later on, remember that these are just the ones called in LINQ to Objects—whenever the parameter is a delegate type (which most of them are), the compiler will use a lambda expression as the argument, and then try to find a method with a suitable signature.

It’s also important to remember that wherever a normal variable (such as a local variable within the method) appears within a lambda expression after translation has been performed, it will become a captured variable in the same way that we saw back in chapter 5. This is just normal lambda expression behavior—but unless you understand which variables will be captured, you could easily be confused by the results of your queries.

The language specification gives details of the query expression pattern, which must be implemented for all query expressions to work, but this isn’t defined as an interface as you might expect. It makes a lot of sense, however: it allows LINQ to be applied to interfaces such as IEnumerable<T> using extension methods. This chapter tackles each element of the query expression pattern, one at a time.

Listing 11.3 proves how the compiler translation works: it provides a dummy implementation of both Select and Where, with Select being a normal instance method and Where being an extension method. Our original simple query expression only contained a select clause, but I’ve included the where clause to show both kinds of methods in use. Unfortunately, because it requires a top-level class to contain the extension method it can’t be represented as a snippet but only as a full listing.

Example 11.3. Compiler translation calling methods on a dummy LINQ implementation

Compiler translation calling methods on a dummy LINQ implementation

Running listing 11.3 prints “Where called” and then “Select called” just as we’d expect, because the query expression has been translated into this code:

var query = source.Where(dummy => dummy.ToString()=="Ignored")
                  .Select(dummy => "Anything");

Of course, we’re not doing any querying or transformation here, but it shows how the compiler is translating our query expression. If you’re puzzled as to why we’ve selected "Anything" instead of just dummy, it’s because a projection of just dummy (which is a “do nothing” projection) would be removed by the compiler in this particular case. We’ll look at that later in section 11.3.2, but for the moment the important idea is the overall type of translation involved. We only need to learn what translations the C# compiler will use, and then we can take any query expression, convert it into the form that doesn’t use query expressions, and then look at what it’s doing from that point of view.

Notice how we don’t implement IEnumerable<T> at all in Dummy<T>. The translation from query expressions to “normal” code doesn’t depend on it, but in practice almost all LINQ providers will expose data either as IEnumerable<T> or IQueryable<T> (which we’ll look at in chapter 12). The fact that the translation doesn’t depend on any particular types but merely on the method names and parameters is a sort of compile-time form of duck typing. This is similar to the same way that the collection initializers presented in chapter 8 find a public method called Add using normal overload resolution rather than using an interface containing an Add method with a particular signature. Query expressions take this idea one step further—the translation occurs early in the compilation process in order to allow the compiler to pick either instance methods or extension methods. You could even consider the translation to be the work of a separate preprocessing engine.

Note

Why from ... where ... select instead of select ... from ... where ? Many developers find the order of the clauses in query expressions confusing to start with. It looks just like SQL—except back to front. If you look back to the translation into methods, you’ll see the main reason behind it. The query expression is processed in the same order that it’s written: we start with a source in the from clause, then filter it in the where clause, then project it in the select clause. Another way of looking at it is to consider the diagrams throughout this chapter. The data flows from top to bottom, and the boxes appear in the diagram in the same order as their corresponding clauses appear in the query expression. Once you get over any initial discomfort due to unfamiliarity, you may well find this approach appealing—I certainly do.

So, we know that a source level translation is involved—but there’s another crucial concept to understand before we move on any further.

Range variables and nontrivial projections

Let’s look back at our original query expression in a bit more depth. We haven’t examined the identifier in the from clause or the expression in the select clause. Figure 11.4 shows the query expression again, with each part labeled to explain its purpose.

A simple query expression broken down into its constituent parts

Figure 11.4. A simple query expression broken down into its constituent parts

The contextual keywords are easy to explain—they specify to the compiler what we want to do with the data. Likewise the source expression is just a normal C# expression—a property in this case, but it could just as easily have been a method call, or a variable.

The tricky bits are the range variable declaration and the projection expression. Range variables aren’t like any other type of variable. In some ways they’re not variables at all! They’re only available in query expressions, and they’re effectively present to propagate context from one expression to another. They represent one element of a particular sequence, and they’re used in the compiler translation to allow other expressions to be turned into lambda expressions easily.

We’ve already seen that our original query expression was turned into

SampleData.AllUsers.Select(user => user)

The left side of the lambda expression—the part that provides the parameter name—comes from the range variable declaration. The right side comes from the select clause. The translation is as simple as that (in this case). It all works out OK because we’ve used the same name on both sides. Suppose we’d written the query expression like this:

from user in SampleData.AllUsers
select person

In that case, the translated version would have been

A simple query expression broken down into its constituent parts
SampleData.AllUsers.Select(user => person)

At that point the compiler would have complained because it wouldn’t have known what person referred to. Now that we know how simple the process is, however, it becomes easier to understand a query expression that has a slightly more complicated projection. Listing 11.4 prints out just the names of our users.

Example 11.4. Query selecting just the names of the users

var query = from user in SampleData.AllUsers
            select user.Name;

foreach (string name in query)
{
   Console.WriteLine(name);
}

This time we’re using user.Name as the projection, and we can see that the result is a sequence of strings, not of User objects. The translation of the query expression follows the same rules as before, and becomes

SampleData.AllUsers.Select(user => user.Name)

The compiler allows this, because the Select extension method as applied to AllUsers effectively has this signature, acting as if it were a member of IEnumerable<T>:

IEnumerable<TResult> Select<TResult> (Func<T,TResult> selector)

Note

Extension methods pretending to be instance methods—Just to be clear, Select isn’t a member of IEnumerable<T> itself, and the real signature has an IEnumerable<T> parameter at the start. However, the translation process doesn’t care whether the result is a call to a normal instance method or an extension method. I find it easier to pretend it’s a normal method just for the sake of working out the query translation. I’ve used the same convention for the remainder of the chapter.

The type inference described in chapter 9 kicks in, converting the lambda expression into a Func<User,TResult> by deciding that the user parameter must be of type User, and then inferring that the return type (and thus TResult) should be string. This is why lambda expressions allow implicitly typed parameters, and why there are such complicated type inference rules: these are the gears and pistons of the LINQ engine.

Note

Why do you need to know all this? You can almost ignore what’s going on with range variables for a lot of the time. You may well have seen many, many queries and understood what they achieve without ever knowing about what’s going on behind the scenes. That’s fine for when things are working (as they tend to with examples in tutorials), but it’s when things go wrong that it pays to know about the details. If you have a query expression that won’t compile because the compiler is complaining that it doesn’t know about a particular identifier, you should look at the range variables involved.

So far we’ve only seen implicitly typed range variables. What happens when we include a type in the declaration? The answer lies in the Cast and OfType standard query operators.

Cast, OfType, and explicitly typed range variables

Most of the time, range variables can be implicitly typed; in .NET 3.5, you’re likely to be working with generic collections where the specified type is all you need. What if that weren’t the case, though? What if we had an ArrayList, or perhaps an object[] that we wanted to perform a query on? It would be a pity if LINQ couldn’t be applied in those situations. Fortunately, there are two standard query operators that come to the rescue: Cast and OfType. Only Cast is supported directly by the query expression syntax, but we’ll look at both in this section.

The two operators are similar: both take an arbitrary untyped sequence and return a strongly typed sequence. Cast does this by casting each element to the target type (and failing on any element that isn’t of the right type) and OfType does a test first, skipping any elements of the wrong type.

Listing 11.5 demonstrates both of these operators, used as simple extension methods from Enumerable. Just for a change, we won’t be using our SkeetySoft defect system for our sample data—after all, that’s all strongly typed! Instead, we’ll just use two ArrayList objects.

Example 11.5. Using Cast and OfType to work with weakly typed collections

ArrayList list = new ArrayList { "First", "Second", "Third"};
IEnumerable<string> strings = list.Cast<string>();
foreach (string item in strings)
{
   Console.WriteLine(item);
}

list = new ArrayList { 1, "not an int", 2, 3};
IEnumerable<int> ints = list.OfType<int>();
foreach (int item in ints)
{
   Console.WriteLine(item);
}

The first list has only strings in it, so we’re safe to use Cast<string> to obtain a sequence of strings. The second list has mixed content, so in order to fetch just the integers from it we use OfType<int>. If we’d used Cast<int> on the second list, an exception would have been thrown when we tried to cast “not an int” to int. Note that this would only have happened after we’d printed “1”—both operators stream their data, converting elements as they fetch them.

When you introduce a range variable with an explicit type, the compiler uses a call to Cast to make sure the sequence used by the rest of the query expression is of the appropriate type. Listing 11.6 shows this, with a projection using the Substring method to prove that the sequence generated by the from clause is a sequence of strings.

Example 11.6. Using an explicitly typed range variable to automatically call Cast

ArrayList list = new ArrayList { "First", "Second", "Third"};
var strings = from string entry in list
              select entry.Substring(0, 3);
foreach (string start in strings)
{
   Console.WriteLine(start);
}

The output of listing 11.6 is “Fir,” “Sec,” “Thi”—but what’s more interesting is the translated query expression, which is

list.Cast<string>().Select(entry => entry.Substring(0,3));

Without the cast, we wouldn’t be able to call Select at all, because the extension method is only defined for IEnumerable<T> rather than IEnumerable. Even when you’re using a strongly typed collection, you might still want to use an explicitly typed range variable, though. For instance, you could have a collection that is defined to be a List<ISomeInterface> but you know that all the elements are instances of MyImplementation. Using a range variable with an explicit type of MyImplementation allows you to access all the members of MyImplementation without manually inserting casts all over the code.

We’ve covered a lot of important conceptual ground so far, even though we haven’t achieved any impressive results. To recap the most important points briefly:

  • LINQ is based on sequences of data, which are streamed wherever possible.

  • Creating a query doesn’t immediately execute it: most operations use deferred execution.

  • Query expressions in C# 3 involve a preprocessing phase that converts the expression into normal C#, which is then compiled properly with all the normal rules of type inference, overloading, lambda expressions, and so forth.

  • The variables declared in query expressions don’t act like anything else: they are range variables, which allow you to refer to data consistently within the query expression.

I know that there’s a lot of somewhat abstract information to take in. Don’t worry if you’re beginning to wonder if LINQ is worth all this trouble. I promise you that it is. With a lot of the groundwork out of the way, we can start doing genuinely useful things—like filtering our data, and then ordering it.

Filtering and ordering a sequence

You may be surprised to learn that these two operations are some of the simplest to explain in terms of compiler translations. The reason is that they always return a sequence of the same type as their input, which means we don’t need to worry about any new range variables being introduced. It also helps that we’ve seen the corresponding extension methods in chapter 10.

Filtering using a where clause

It’s remarkably easy to understand the where clause. The format is just

where filter-expression

The compiler translates this into a call to the Where method with a lambda expression, which uses the appropriate range variable as the parameter and the filter expression as the body. The filter expression is applied as a predicate to each element of the incoming stream of data, and only those that return true are present in the resulting sequence. Using multiple where clauses results in multiple chained Where calls—only elements that match all of the predicates are part of the resulting sequence. Listing 11.7 demonstrates a query expression that finds all open defects assigned to Tim.

Example 11.7. Query expression using multiple where clauses

User tim = SampleData.Users.TesterTim;

var query = from bug in SampleData.AllDefects
            where bug.Status != Status.Closed
            where bug.AssignedTo == tim
            select bug.Summary;

foreach (var summary in query)
{
   Console.WriteLine(summary);
}

The query expression in listing 11.7 is translated into

SampleData.AllDefects.Where (bug => bug.Status != Status.Closed)
          .Where (bug => bug.AssignedTo == tim)
          .Select(bug => bug.Summary)

The output of listing 11.7 is as follows:

Installation is slow
Subtitles only work in Welsh
Play button points the wrong way
Webcam makes me look bald
Network is saturated when playing WAV file

Of course, we could write a single where clause that combined the two conditions as an alternative to using multiple where clauses. In some cases this may improve performance, but it’s worth bearing the readability of the query expression in mind, too. Once more, this is likely to be a fairly subjective matter. My personal inclination is to combine conditions that are logically related but keep others separate. In this case, both parts of the expression deal directly with a defect (as that’s all our sequence contains), so it would be reasonable to combine them. As before, it’s worth trying both forms to see which is clearer.

In a moment, we’ll start trying to apply some ordering rules to our query, but first we should look at a small detail to do with the select clause.

Degenerate query expressions

While we’ve got a fairly simple translation to work with, let’s revisit a point I glossed over earlier in section 11.2.2 when I first introduced the compiler translations. So far, all our translated query expressions have included a call to Select. What happens if our select clause does nothing, effectively returning the same sequence as it’s given? The answer is that the compiler removes that call to Select—but only if there are other operations being performed within the query expression. For example, the following query expression just selects all the defects in the system:

from defect in SampleData.AllDefects
select defect

This is known as a degenerate query expression. The compiler deliberately generates a call to Select even though it seems to do nothing:

SampleData.Select (defect => defect)

There’s a big difference between this and the simple expression SampleData, however. The items returned by the two sequences are the same, but the result of the Select method is just the sequence of items, not the source itself. The result of a query expression is never the same object as the source data, unless the LINQ provider has been poorly coded. This can be important from a data integrity point of view—a provider can return a mutable result object, knowing that changes to the returned data set won’t affect the “master” even in the face of a degenerate query.

When other operations are involved, there’s no need for the compiler to keep “noop” select clauses. For example, suppose we change the query expression in listing 11.7 to select the whole defect rather than just the name:

from bug in SampleData.AllDefects
where bug.Status != Status.Closed
where bug.AssignedTo == SampleData.Users.TesterTim
select bug

We now don’t need the final call to Select, so the translated code is just this:

SampleData.AllDefects.Where (bug => bug.Status != Status.Closed)
                     .Where (bug => bug.AssignedTo == tim)

These rules rarely get in the way when you’re writing query expressions, but they can cause confusion if you decompile the code with a tool such as Reflector—it can be surprising to see the Select call go missing for no apparent reason.

With that knowledge in hand, let’s improve our query so that we know what Tim should work on next.

Ordering using an orderby clause

It’s not uncommon for developers and testers to be asked to work on the most critical defects before they tackle more trivial ones. We can use a simple query to tell Tim the order in which he should tackle the open defects assigned to him. Listing 11.8 does exactly this using an orderby clause, printing out all the details of the bugs, in descending order of priority.

Example 11.8. Sorting by the severity of a bug, from high to low priority

User tim = SampleData.Users.TesterTim;

var query = from bug in SampleData.AllDefects
            where bug.Status != Status.Closed
            where bug.AssignedTo == tim
            orderby bug.Severity descending
            select bug;

foreach (var bug in query)
{
   Console.WriteLine("{0}: {1}", bug.Severity, bug.Summary);
}

The output of listing 11.8 shows that we’ve sorted the results appropriately:

Showstopper: Webcam makes me look bald
Major: Subtitles only work in Welsh
Major: Play button points the wrong way
Minor: Network is saturated when playing WAV file
Trivial: Installation is slow

However, you can see that we’ve got two major defects. Which order should those be tackled in? Currently no clear ordering is involved. Let’s change the query so that after sorting by severity in descending order, we sort by “last modified time” in ascending order. This means that Tim will test the bugs that have been fixed a long time ago before the ones that were fixed recently. This just requires an extra expression in the orderby clause, as shown in listing 11.9.

Example 11.9. Ordering by severity and then last modified time

User tim = SampleData.Users.TesterTim;

var query = from bug in SampleData.AllDefects
            where bug.Status != Status.Closed
            where bug.AssignedTo == tim
            orderby bug.Severity descending, bug.LastModified
            select bug;

foreach (var bug in query)
{
   Console.WriteLine("{0}: {1} ({2:d})",
                     bug.Severity, bug.Summary, bug.LastModified);
}

The results of listing 11.9 are shown here. Note how the order of the two major defects has been reversed.

Showstopper: Webcam makes me look bald (08/27/2007)
Major: Play button points the wrong way (08/17/2007)
Major: Subtitles only work in Welsh (08/23/2007)
Minor: Network is saturated when playing WAV file (08/31/2007)
Trivial: Installation is slow (08/15/2007)

So, that’s what the query expression looks like—but what does the compiler do? It simply calls the OrderBy and ThenBy methods (or OrderByDescending/ThenByDescending for descending orders). Our query expression is translated into

SampleData.AllDefects.Where (bug => bug.Status != Status.Closed)
                     .Where (bug => bug.AssignedTo == tim)
                     .OrderByDescending (bug => bug.Severity)
                     .ThenBy (bug => bug.LastModified)

Now that we’ve seen an example, let’s look at the general syntax of orderby clauses. They’re basically the contextual keyword orderby followed by one or more orderings. An ordering is just an expression (which can use range variables) optionally followed by descending, which has the obvious meaning. The translation for the first ordering is a call to OrderBy or OrderByDescending, and any other orderings are translated using a call to ThenBy or ThenByDescending, as shown in our example.

The difference between OrderBy and ThenBy is quite simple: OrderBy assumes it has primary control over the ordering, whereas ThenBy understands that it’s subservient to one or more previous orderings. For LINQ to Objects, ThenBy is only defined as an extension method for IOrderedEnumerable<T>, which is the type returned by OrderBy (and by ThenBy itself, to allow further chaining).

Note

Warning: only use one orderby clause!

It’s very important to note that although you can use multiple orderby clauses, each one will start with its own OrderBy or OrderByDescending clause, which means the last one will effectively “win.” There may be some reason for including multiple orderby clauses, but it would be very unusual. You should almost always use a single clause containing multiple orderings instead.

As noted in chapter 10, applying an ordering requires all the data to be loaded (at least for LINQ to Objects)—you can’t order an infinite sequence, for example. Hopefully the reason for this is obvious—you don’t know whether you’ll see something that should come at the start of the resulting sequence until you’ve seen all the elements, for example.

We’re about halfway through learning about query expressions, and you may be surprised that we haven’t seen any joins yet. Obviously they’re important in LINQ just as they’re important in SQL, but they’re also complicated. I promise we’ll get to them in due course, but in order to introduce just one new concept at a time, we’ll detour via let clauses first. That way we can learn about transparent identifiers before we hit joins.

Let clauses and transparent identifiers

Most of the rest of the operators we still need to look at involve transparent identifiers. Just like range variables, you can get along perfectly well without understanding transparent identifiers, if you only want to have a fairly shallow grasp of query expressions. If you’ve bought this book, I hope you want to know C# 3 at a deeper level, which will (among other things) enable you to look compilation errors in the face and know what they’re talking about.

You don’t need to know everything about transparent identifiers, but I’ll teach you enough so that if you see one in the language specification you won’t feel like running and hiding. You’ll also understand why they’re needed at all—and that’s where an example will come in handy. The let clause is the simplest transformation available that uses transparent identifiers.

Introducing an intermediate computation with let

A let clause simply introduces a new range variable with a value that can be based on other range variables. The syntax is as easy as pie:

let identifier = expression

To explain this operator in terms that don’t use any other complicated operators, I’m going to resort to a very artificial example. Suspend your disbelief, and imagine that finding the length of a string is a costly operation. Now imagine that we had a completely bizarre system requirement to order our users by the lengths of their names, and then display the name and its length. Yes, I know it’s somewhat unlikely. Listing 11.10 shows one way of doing this without a let clause.

Example 11.10. Sorting by the lengths of user names without a let clause

var query = from user in SampleData.AllUsers
            orderby user.Name.Length
            select user.Name;

foreach (var name in query)
{
   Console.WriteLine("{0}: {1}", name.Length, name);
}

That works fine, but it uses the dreaded Length property twice—once to sort the users, and once in the display side. Surely not even the fastest supercomputer could cope with finding the lengths of six strings twice! No, we need to avoid that redundant computation. We can do so with the let clause, which evaluates an expression and introduces it as a new range variable. Listing 11.11 achieves the same result as listing 11.10, but only uses the Length property once per user.

Example 11.11. Using a let clause to remove redundant calculations

var query = from user in SampleData.AllUsers
            let length = user.Name.Length
            orderby length
            select new { Name = user.Name, Length = length };

foreach (var entry in query)
{
   Console.WriteLine("{0}: {1}", entry.Length, entry.Name);
}

Listing 11.11 introduces a new range variable called length, which contains the length of the user’s name (for the current user in the original sequence). We then use that new range variable for both sorting and the projection at the end. Have you spotted the problem yet? We need to use two range variables, but the lambda expression passed to Select only takes one parameter! This is where transparent identifiers come on the scene.

Transparent identifiers

In listing 11.11, we’ve got two range variables involved in the final projection, but the Select method only acts on a single sequence. How can we combine the range variables? The answer is to create an anonymous type that contains both variables but apply a clever translation to make it look as if we’ve actually got two parameters for the select and orderby clauses. Figure 11.5 shows the sequences involved.

Sequences involved in listing 11.11, where a let clause introduces the length range variable

Figure 11.5. Sequences involved in listing 11.11, where a let clause introduces the length range variable

The let clause achieves its objectives by using another call to Select, creating an anonymous type for the resulting sequence, and effectively creating a new range variable whose name can never be seen or used in source code. Our query expression from listing 11.11 is translated into something like this:

SampleData.AllUsers.Select(user => new { user,
                                         length=user.Name.Length })
                   .OrderBy(z => z.length)
                   .Select(z => new { Name=z.user.Name,
                                      Length=z.length })

Each part of the query has been adjusted appropriately: where the original query expression referenced user or length directly, if the reference occurs after the let clause, it’s replaced by z.user or z.length. The choice of z as the name here is arbitrary—it’s all hidden by the compiler.

If you read the C#3 language specification on let clauses, you’ll see that the translation it describes is from one query expression to another. It uses an asterisk (*) to represent the transparent identifier introduced. The transparent identifier is then erased as a final step in translation. I won’t use that notation in this chapter, as it’s hard to come to grips with and unnecessary at the level of detail we’re going into. Hopefully with this background, the specification won’t be quite as impenetrable as it might be otherwise, should you need to refer to it.

The good news is that we can now take a look at the rest of the translations making up C# 3’s query expression support. I won’t go into the details of every transparent identifier introduced, but I’ll mention the situations in which they occur. Let’s look at the support for joins first.

Joins

If you’ve ever read anything about SQL, you probably have an idea what a database join is. It takes two tables and creates a result by matching one set of rows against another set of rows. A LINQ join is similar, except it works on sequences. Three types of joins are available, although not all of them use the join keyword in the query expression. We’ll start with the join that is closest to a SQL inner join.

Inner joins using join clauses

Inner joins involve two sequences. One key selector expression is applied to each element of the first sequence and another key selector (which may be totally different) is applied to each element of the second sequence. The result of the join is a sequence of all the pairs of elements where the key from the first element is the same as the key from the second element.

Note

Terminology clash! Inner and outer sequences—The MSDN documentation for the Join method used to evaluate inner joins unhelpfully calls the sequences involved inner and outer. This has nothing to do with inner joins and outer joins—it’s just a way of differentiating between the sequences. You can think of them as first and second, left and right, Bert and Ernie—anything you like that helps you. I’ll use left and right for this chapter. Aside from anything else, it makes it obvious which sequence is which in diagram form.

The two sequences can be anything you like: the right sequence can even be the same as the left sequence, if that’s useful. (Imagine finding pairs of people who were born on the same day, for example.) The only thing that matters is that the two key selector expressions must result in the same type of key.[4] You can’t join a sequence of people to a sequence of cities by saying that the birth date of the person is the same as the population of the city—it doesn’t make any sense.

The syntax for an inner join looks more complicated than it is:

[query selecting the left sequence]
join right-range-variable in right-sequence
on left-key-selector equals right-key-selector

Seeing equals as a contextual keyword rather than using symbols can be slightly disconcerting, but it makes it easier to distinguish the left key selector from the right key selector. Often (but not always) at least one of the key selectors is a trivial one that just selects the exact element from that sequence.

Let’s look at an example from our defect system. Suppose we had just added the notification feature, and wanted to send the first batch of emails for all the existing defects. We need to join the list of notifications against the list of defects, where their projects match. Listing 11.12 performs just such a join.

Example 11.12. Joining the defects and notification subscriptions based on project

var query = from defect in SampleData.AllDefects
            join subscription in SampleData.AllSubscriptions
               on defect.Project equals subscription.Project
            select new { defect.Summary, subscription.EmailAddress };


foreach (var entry in query)
{
   Console.WriteLine("{0}: {1}", entry.EmailAddress, entry.Summary);
}

Listing 11.12 will show each of the media player bugs twice—once for “[email protected]” and once for “[email protected]” (because the boss really cares about the media player project).

In this particular case we could easily have made the join the other way round, reversing the left and right sequences. The result would have been the same entries but in a different order. The implementation in LINQ to Objects returns entries so that all the pairs using the first element of the left sequence are returned (in the order of the right sequence), then all the pairs using the second element of the left sequence, and so on. The right sequence is buffered, but the left sequence is streamed—so if you want to join a massive sequence to a tiny one, it’s worth using the tiny one as the right sequence if you can.

One error that might trip you up is putting the key selectors the wrong way round. In the left key selector, only the left sequence range variable is in scope; in the right key selector only the right range variable is in scope. If you reverse the left and right sequences, you have to reverse the left and right key selectors too. Fortunately the compiler knows that it’s a common mistake and suggests the appropriate course of action.

Just to make it more obvious what’s going on, figure 11.6 shows the sequences as they’re processed.

The join from listing 11.12 in graphical form, showing two different sequences (defects and subscriptions) used as data sources

Figure 11.6. The join from listing 11.12 in graphical form, showing two different sequences (defects and subscriptions) used as data sources

Often you want to filter the sequence, and filtering before the join occurs is more efficient than filtering it afterward. At this stage the query expression is simpler if the left sequence is the one requiring filtering. For instance, if we wanted to show only defects that are closed, we could use this query expression:

from defect in SampleData.AllDefects
where defect.Status == Status.Closed
join subscription in SampleData.AllSubscriptions
   on defect.Project equals subscription.Project
select new { defect.Summary, subscription.EmailAddress }

We can perform the same query with the sequences reversed, but it’s messier:

from subscription in SampleData.AllSubscriptions
join defect in (from defect in SampleData.AllDefects
                where defect.Status == Status.Closed
                select defect)
   on subscription.Project equals defect.Project
select new { defect.Summary, subscription.EmailAddress }

Notice how you can use one query expression inside another—indeed, the language specification describes many of the compiler translations in these terms. Nested query expressions are useful but hurt readability as well: it’s often worth looking for an alternative, or using a variable for the right-hand sequence in order to make the code clearer.

Note

Are inner joins useful in LINQ to Objects? Inner joins are used all the time in SQL. They are effectively the way that we navigate from one entity to a related one, usually joining a foreign key in one table to the primary key on another. In the object-oriented model, we tend to navigate from one object to another via references. For instance, retrieving the summary of a defect and the name of the user assigned to work on it would require a join in SQL—in C# we just use a chain of properties. If we’d had a reverse association from Project to the list of NotificationSubscription objects associated with it in our model, we wouldn’t have needed the join to achieve the goal of this example, either. That’s not to say that inner joins aren’t useful sometimes even within object-oriented models—but they don’t naturally occur nearly as often as in relational models.

Inner joins are translated by the compiler into calls to the Join method. The signature of the overload used for LINQ to Objects is as follows (when imagining it to be an instance method of IEnumerable<T>):

IEnumerable<TResult> Join<TInner,TKey,TResult> (
   IEnumerable<TInner> inner,
   Func<T,TKey> outerKeySelector,
   Func<T,TKey> innerKeySelector,
   Func<T,TInner,TResult> resultSelector
)

The first three parameters are self-explanatory when you’ve remembered to treat inner and outer as right and left, respectively, but the last one is slightly more interesting. It’s a projection from two elements (one from the left sequence and one from the right sequence) into a single element of the resulting sequence. When the join is followed by anything other than a select clause, the C# 3 compiler introduces a transparent identifier in order to make the range variables used in both sequences available for later clauses, and creates an anonymous type and simple mapping to use for the resultSelector parameter.

However, if the next part of the query expression is a select clause, the projection from the select clause is used directly as the resultSelector parameter—there’s no point in creating a pair and then calling Select when you can do the transformation in one step. You can still think about it as a “join” step followed by a “select” step despite the two being squished into a single method call. This leads to a more consistent mental model in my view, and one that is easier to reason about. Unless you’re looking at the generated code, just ignore the optimization the compiler is performing for you.

The good news is that having learned about inner joins, our next type of join is much easier to approach.

Group joins with join ... into clauses

We’ve seen that the result sequence from a normal join clause consists of pairs of elements, one from each of the input sequences. A group join looks similar in terms of the query expression but has a significantly different outcome. Each element of a group join result consists of an element from the left sequence (using its original range variable), and also a sequence of all the matching elements of the right sequence, exposed as a new range variable specified by the identifier coming after into in the join clause.

Note

Result contains embedded subsequences

Let’s change our previous example to use a group join. Listing 11.13 again shows all the defects and the notifications required for each one, but breaks them out in a per-defect manner. Pay particular attention to how we’re displaying the results, and to the nested foreach loop.

Example 11.13. Joining defects and subscriptions with a group join

var query = from defect in SampleData.AllDefects
            join subscription in SampleData.AllSubscriptions
               on defect.Project equals subscription.Project
            into groupedSubscriptions
            select new { Defect=defect,
                         Subscriptions=groupedSubscriptions };

foreach (var entry in query)
{
   Console.WriteLine(entry.Defect.Summary);
   foreach (var subscription in entry.Subscriptions)
   {
      Console.WriteLine ("  {0}", subscription.EmailAddress);
   }
}

The Subscriptions property of each entry is the embedded sequence of subscriptions matching that entry’s defect. Figure 11.7 shows how the two initial sequences are combined.

Sequences involved in the group join from listing 11.13. The short arrows indicate embedded sequences within the result entries. In the output, some entries contain multiple email addresses for the same bug.

Figure 11.7. Sequences involved in the group join from listing 11.13. The short arrows indicate embedded sequences within the result entries. In the output, some entries contain multiple email addresses for the same bug.

One important difference between an inner join and a group join—and indeed between a group join and normal grouping—is that for a group join there’s a one-to-one correspondence between the left sequence and the result sequence, even if some of the elements in the left sequence don’t match any elements of the right sequence. This can be very important, and is sometimes used to simulate a left outer join from SQL. The embedded sequence is empty when the left element doesn’t match any right elements. As with an inner join, a group join buffers the right sequence but streams the left one.

Listing 11.14 shows an example of this, counting the number of bugs created on each day in August. It uses a DateTimeRange (as described in chapter 6) as the left sequence, and a projection that calls Count on the embedded sequence in the result of the group join.

Example 11.14. Counting the number of bugs raised on each day in August

var dates = new DateTimeRange(SampleData.Start, SampleData.End);

var query = from date in dates
            join defect in SampleData.AllDefects
               on date equals defect.Created.Date
            into joined
            select new { Date=date, Count=joined.Count() };
foreach (var grouped in query)
{
   Console.WriteLine("{0:d}: {1}", grouped.Date, grouped.Count);
}

Count itself uses immediate execution, iterating through all the elements of the sequence it’s called on—but we’re only calling it in the projection part of the query expression, so it becomes part of a lambda expression. This means we still have deferred execution: nothing is evaluated until we start the foreach loop.

Here is the first part of the results of listing 11.14, showing the number of bugs created each day in the first week of August:

08/01/2007: 1
08/02/2007: 0
08/03/2007: 2
08/04/2007: 1
08/05/2007: 0
08/06/2007: 1
08/07/2007: 1

The compiler translation involved for a group join is simply a call to the GroupJoin method, which has the following signature:

IEnumerable<TResult> GroupJoin<TInner,TKey,TResult> (
   IEnumerable<TInner> inner,
   Func<T,TKey> outerKeySelector,
   Func<TInner,TKey> innerKeySelector,
   Func<T,IEnumerable<TInner>,TResult> resultSelector
)

The signature is exactly the same as for inner joins, except that the resultSelector parameter has to work with a sequence of right-hand elements, not just a single one. As with inner joins, if a group join is followed by a select clause the projection is used as the result selector of the GroupJoin call; otherwise, a transparent identifier is introduced. In this case we have a select clause immediately after the group join, so the translated query looks like this:

dates.GroupJoin(SampleData.AllDefects,
                date => date,
                defect => defect.Created.Date,
                (date, joined) => new { Date=date,
                                        Count=joined.Count() })

Our final type of join is known as a cross join—but it’s not quite as straightforward as it might seem at first.

Cross joins using multiple from clauses

So far all our joins have been equijoins—a match has been performed between elements of the left and right sequences. Cross joins don’t perform any matching between the sequences: the result contains every possible pair of elements. They’re achieved by simply using two (or more) from clauses. For the sake of sanity we’ll only consider two from clauses for the moment—when there are more, just mentally perform a cross join on the first two from clauses, then cross join the resulting sequence with the next from clause, and so on. Each extra from clause adds its own range variable.

Listing 11.15 shows a simple (but useless) cross join in action, producing a sequence where each entry consists of a user and a project. I’ve deliberately picked two completely unrelated initial sequences to show that no matching is performed.

Example 11.15. Cross joining users against projects

var query = from user in SampleData.AllUsers
            from project in SampleData.AllProjects
            select new { User=user, Project=project };

foreach (var pair in query)
{
   Console.WriteLine("{0}/{1}",
                     pair.User.Name,
                     pair.Project.Name);
}

The output of listing 11.15 begins like this:

Tim Trotter/Skeety Media Player
Tim Trotter/Skeety Talk
Tim Trotter/Skeety Office
Tara Tutu/Skeety Media Player
Tara Tutu/Skeety Talk
Tara Tutu/Skeety Office

Figure 11.8 shows the sequences involved to get this result.

Sequences from listing 11.15, cross joining users and projects. All possible combinations are returned in the results.

Figure 11.8. Sequences from listing 11.15, cross joining users and projects. All possible combinations are returned in the results.

If you’re familiar with SQL, you’re probably quite comfortable so far—it looks just like a Cartesian product obtained from a query specifying multiple tables. Indeed, most of the time that’s exactly how cross joins are used. However, there’s more power available when you want it: the right sequence used at any particular point in time can depend on the “current” value of the left sequence. When this is the case, it’s not a cross join in the normal sense of the term. The query expression translation is the same whether or not we’re using a true cross join, so we need to understand the more complicated scenario in order to understand the translation process.

Before we dive into the details, let’s see the effect it produces. Listing 11.16 shows a simple example, using sequences of integers.

Example 11.16. Cross join where the right sequence depends on the left element

var query = from left in Enumerable.Range(1, 4)
            from right in Enumerable.Range(11, left)
            select new { Left=left, Right=right };

foreach (var pair in query)
{
   Console.WriteLine("Left={0}; Right={1}",
                     pair.Left, pair.Right);
}

Listing 11.16 starts with a simple range of integers, 1 to 4. For each of those integers, we create another range, beginning at 11 and having as many elements as the original integer. By using multiple from clauses, the left sequence is joined with each of the generated right sequences, resulting in this output:

Left=1; Right=11
Left=2; Right=11
Left=2; Right=12
Left=3; Right=11
Left=3; Right=12
Left=3; Right=13
Left=4; Right=11
Left=4; Right=12
Left=4; Right=13
Left=4; Right=14

The method the compiler calls to generate this sequence is SelectMany. It takes a single input sequence (the left sequence in our terminology), a delegate to generate another sequence from any element of the left sequence, and a delegate to generate a result element given an element of each of the sequences. Here’s the signature of the method, again written as if it were an instance method on IEnumerable<T>:

public IEnumerable<TResult> SelectMany<TCollection, TResult> (
   Func<T,IEnumerable<TCollection>> collectionSelector,
   Func<T,TCollection,TResult> resultSelector
)

As with the other joins, if the part of the query expression following the join is a select clause, that projection is used as the final argument; otherwise, a transparent identifier is introduced to make both the left and right sequences’ range variables available.

Just to make this all a bit more concrete, here’s the query expression of listing 11.16, as the translated source code:

Enumerable.Range(1, 4)
          .SelectMany (left => Enumerable.Range(11, left),
                       (left, right) => new {Left=left,
                                             Right=right})

One interesting feature of SelectMany is that the execution is completely streamed—it only needs to process one element of each sequence at a time, because it uses a freshly generated right sequence for each different element of the left sequence. Compare this with inner joins and group joins: they both load the right sequence completely before starting to return any results. You should bear in mind the expected size of sequence, and how expensive it might be to evaluate it multiple times, when considering which type of join to use and which sequence to use as the left and which as the right.

This behavior of flattening a sequence of sequences, one produced from each element in an original sequence, can be very useful. Consider a situation where you might want to process a lot of log files, a line at a time. We can process a seamless sequence of lines, with barely any work. The following pseudo-code is filled in more thoroughly in the downloadable source code, but the overall meaning and usefulness should be clear:

var query = from file in Directory.GetFiles(logDirectory, "*.log")
            from line in new FileLineReader(file)
            let entry = new LogEntry(line)
            where entry.Type == EntryType.Error
            select entry;

In just five lines of code we have retrieved, parsed, and filtered a whole collection of log files, returning a sequence of entries representing errors. Crucially, we haven’t had to load even a single full log file into memory all in one go, let alone all of the files—all the data is streamed.

Having tackled joins, the last items we need to look at are slightly easier to understand. We’re going to look at grouping elements by a key, and continuing a query expression after a group ... by or select clause.

Groupings and continuations

One common requirement is to group a sequence of elements by one of its properties. LINQ makes this easy with the group ... by clause. As well as describing this final type of clause, we’ll also revisit our earliest one (select) to see a feature called query continuations that can be applied to both groupings and projections. Let’s start with a simple grouping.

Grouping with the group ... by clause

Grouping is largely intuitive, and LINQ makes it simple. To group a sequence in a query expression, all you need to do is use the group ... by clause, with this syntax:

group projection by grouping

This clause comes at the end of a query expression in the same way a select clause does. The similarities between these clauses don’t end there: the projection expression is the same kind of projection a select clause uses. The outcome is somewhat different, however.

The grouping expression determines what the sequence is grouped by—the key of the grouping. The overall result is a sequence where each element is itself a sequence of projected elements, and also has a Key property, which is the key for that group; this combination is encapsulated in the IGrouping<TKey,TElement> interface, which extends IEnumerable<TElement>.

Let’s have a look at a simple example from the SkeetySoft defect system: grouping defects by their current assignee. Listing 11.17 does this with the simplest form of projection, so that the resulting sequence has the assignee as the key, and a sequence of defects embedded in each entry.

Example 11.17. Grouping defects by assignee—trivial projection

Grouping defects by assignee—trivial projection

Listing 11.17 might be useful in a daily build report, to quickly see what defects each person needs to look at. We’ve filtered out all the defects that don’t need any more attention Grouping defects by assignee—trivial projection and then grouped using the AssignedTo property. Although this time we’re just using a property, the grouping expression can be anything you like—it’s just applied to each entry in the incoming sequence, and the sequence is grouped based on the result of the expression. Note that grouping cannot stream the results, although it streams the input, applying the key selection and projection to each element and buffering the grouped sequences of projected elements.

The projection we’ve applied in the grouping Grouping defects by assignee—trivial projection is trivial—it just selects the original element. As we go through the resulting sequence, each entry has a Key property, which is of type User Grouping defects by assignee—trivial projection, and each entry also implements IEnumerable<Defect>, which is the sequence of defects assigned to that user Grouping defects by assignee—trivial projection.

The results of listing 11.17 start like this:

Darren Dahlia
  (Showstopper) MP3 files crash system
  (Major) Can't play files more than 200 bytes long
  (Major) DivX is choppy on Pentium 100
  (Trivial) User interface should be more caramelly

After all of Darren’s defects have been printed out, we see Tara’s, then Tim’s, and so on. The implementation effectively keeps a list of the assignees it’s seen so far, and adds a new one every time it needs to. Figure 11.9 shows the sequences generated throughout the query expression, which may make this ordering clearer.

Sequences used when grouping defects by assignee. Each entry of the result has a Key property and is also a sequence of defect entries.

Figure 11.9. Sequences used when grouping defects by assignee. Each entry of the result has a Key property and is also a sequence of defect entries.

Within each entry’s subsequence, the order of the defects is the same as the order of the original defect sequence. If you actively care about the ordering, consider explicitly stating it in the query expression, to make it more readable.

If you run listing 11.17, you’ll see that Mary Malcop doesn’t appear in the output at all, because she doesn’t have any defects assigned to her. If you wanted to produce a full list of users and defects assigned to each of them, you’d need to use a group join like the one used in listing 11.14.

The compiler always uses a method called GroupBy for grouping clauses. When the projection in a grouping clause is trivial—in other words, when each entry in the original sequence maps directly to the exact same object in a subsequence—the compiler uses a simple method call, which just needs the grouping expression, so it knows how to map each element to a key. For instance, the query expression in listing 11.17 is translated into this nonquery expression:

SampleData.AllDefects.Where(defect => defect.AssignedTo != null)
                     .GroupBy(defect => defect.AssignedTo)

When the projection is nontrivial, a slightly more complicated version is used. Listing 11.18 gives an example of a projection so that we only capture the summary of each defect rather than the Defect object itself.

Example 11.18. Grouping defects by assignee—projection retains just the summary

var query = from defect in SampleData.AllDefects
            where defect.AssignedTo != null
            group defect.Summary by defect.AssignedTo;

foreach (var entry in query)
{
   Console.WriteLine(entry.Key.Name);
   foreach (var summary in entry)
   {
      Console.WriteLine("  {0}", summary);
   }
   Console.WriteLine();
}

I’ve highlighted the differences between listing 11.18 and listing 11.17 in bold. Having projected a defect to just its summary, the embedded sequence in each entry is just an IEnumerable<string>. In this case, the compiler uses an overload of GroupBy with another parameter to represent the projection. The query expression in listing 11.18 is translated into the following expression:

SampleData.AllDefects.Where(defect => defect.AssignedTo != null)
                     .GroupBy(defect => defect.AssignedTo,
                              defect => defect.Summary)

There are more complex overloads of GroupBy available as extension methods on IEnumerable<T>, but they aren’t used by the C# 3 compiler when translating query expressions. You can call them manually, of course—if you find you want more powerful grouping behavior than query expressions provide natively, then they’re worth looking into.

Grouping clauses are relatively simple but very useful. Even in our defect-tracking system, you could easily imagine wanting to group defects by project, creator, severity, or status, as well as the assignee we’ve used for these examples.

So far, we’ve ended each query expression with a select or group ... by clause, and that’s been the end of the expression. There are times, however, when you want to do more with the results—and that’s where query continuations are used.

Query continuations

Query continuations provide a way of using the result of one query expression as the initial sequence of another. They apply to both group ... by and select clauses, and the syntax is the same for both—you simply use the contextual keyword into and then provide the name of a new range variable. That range variable can then be used in the next part of the query expression.

The C# 3 specification explains this in terms of a translation from one query expression to another, changing

first-query into identifier
second-query-body

into

from identifier in (first-query)
second-query-body

An example will make this a lot clearer. Let’s go back to our grouping of defects by assignee, but this time imagine we only want the count of the defects assigned to each person. We can’t do that with the projection in the grouping clause, because that only applies to each individual defect. We want to project an assignee and the sequence of their defects into the assignee and the count from that sequence, which is achieved using the code in listing 11.19.

Example 11.19. Continuing a grouping with another projection

var query = from defect in SampleData.AllDefects
            where defect.AssignedTo != null
            group defect by defect.AssignedTo into grouped
            select new { Assignee=grouped.Key,
                         Count=grouped.Count() };

foreach (var entry in query)
{
   Console.WriteLine("{0}: {1}",
                     entry.Assignee.Name,
                     entry.Count);
}

The changes to the query expression are highlighted in bold. We can use the grouped range variable in the second part of the query, but the defect range variable is no longer available—you can think of it as being out of scope. Our projection simply creates an anonymous type with Assignee and Count properties, using the key of each group as the assignee, and counting the sequence of defects associated with each group. The results of listing 11.19 are as follows:

Darren Dahlia: 14
Tara Tutu: 5
Tim Trotter: 5
Deborah Denton: 9
Colin Carton: 2

Following the specification, the query expression from listing 11.19 is translated into this one:

from grouped in (from defect in SampleData.AllDefects
                 where defect.AssignedTo != null
                 group defect by defect.AssignedTo)
select new { Assignee=grouped.Key, Count=grouped.Count() }

The rest of the translations are then performed, resulting in the following code:

SampleData.AllDefects
          .Where (defect => defect.AssignedTo != null)
          .GroupBy(defect => defect.AssignedTo)
          .Select(grouped => new { Assignee=grouped.Key,
                                   Count=grouped.Count() })

An alternative way of understanding continuations is to think of two separate statements. This isn’t as accurate in terms of the actual compiler translation, but I find it makes it easier to see what’s going on. In this case, the query expression (and assignment to the query variable) can be thought of as the following two statements:

var tmp = from defect in SampleData.AllDefects
          where defect.AssignedTo != null
          group defect by defect.AssignedTo;

var query = from grouped in tmp
            select new { Assignee=grouped.Key,
                         Count=grouped.Count() };

Of course, if you find this easier to read there’s nothing to stop you from breaking up the original expression into this form in your source code. Nothing will be evaluated until you start trying to step through the query results anyway, due to deferred execution.

Let’s extend this example to see how multiple continuations can be used. Our results are currently unordered—let’s change that so we can see who’s got the most defects assigned to them first. We could use a let clause after the first continuation, but listing 11.20 shows an alternative with a second continuation after our current expression.

Example 11.20. Query expression continuations from group and select

var query = from defect in SampleData.AllDefects
            where defect.AssignedTo != null
            group defect by defect.AssignedTo into grouped
            select new { Assignee=grouped.Key,
                         Count=grouped.Count() } into result
            orderby result.Count descending
            select result;

foreach (var entry in query)
{
   Console.WriteLine("{0}: {1}",
                     entry.Assignee.Name,
                     entry.Count);
}

The changes between listing 11.19 and 11.20 are highlighted in bold. We haven’t had to change any of the output code as we’ve got the same type of sequence—we’ve just applied an ordering to it. This time the translated query expression is as follows:

SampleData.AllDefects
          .Where (defect => defect.AssignedTo != null)
          .GroupBy(defect => defect.AssignedTo)
          .Select(grouped => new { Assignee=grouped.Key,
                                   Count=grouped.Count() })
          .OrderByDescending(result => result.Count);

By pure coincidence, this is remarkably similar to the first defect tracking query we came across, in section 10.3.5. Our final select clause effectively does nothing, so the C# compiler ignores it. It’s required in the query expression, however, as all query expressions have to end with either a select or a group ... by clause. There’s nothing to stop you from using a different projection or performing other operations with the continued query—joins, further groupings, and so forth. Just keep an eye on the readability of the query expression as it grows.

Summary

In this chapter, we’ve looked at how LINQ to Objects and C# 3 interact, focusing on the way that query expressions are first translated into code that doesn’t involve query expressions, then compiled in the usual way. We’ve seen how all query expressions form a series of sequences, applying a transformation of some description at each step. In many cases these sequences are evaluated using deferred execution, fetching data only when it is first required.

Compared with all the other features of C# 3, query expressions look somewhat alien—more like SQL than the C# we’re used to. One of the reasons they look so odd is that they’re declarative instead of imperative—a query talks about the features of the end result rather than the exact steps required to achieve it. This goes hand in hand with a more functional way of thinking. It can take a while to click, and it’s certainly not suitable for every situation, but where declarative syntax is appropriate it can vastly improve readability, as well as making code easier to test and also easier to parallelize.

Don’t be fooled into thinking that LINQ should only be used with databases: plain in-memory manipulation of collections is common, and as we’ve seen it’s supported very well by query expressions and the extension methods in Enumerable.

In a very real sense, you’ve seen all the new features of C# 3 now! Although we haven’t looked at any other LINQ providers yet, we now have a clearer understanding of what the compiler will do for us when we ask it to handle XML and SQL. The compiler itself doesn’t know the difference between LINQ to Objects, LINQ to SQL, or any of the other providers: it just follows the same rules blindly. In the next chapter we’ll see how these rules form the final piece of the LINQ jigsaw puzzle when they convert lambda expressions into the expression trees so that the various clauses of query expressions can be executed on different platforms.



[1] http://en.wikipedia.org/wiki/Object-Relational_impedance_mismatch explains this for the specific object-relational case, but impedance mismatches are present in other situations, such as XML representations and web services.

[2] There’s actually a big difference between ConvertAll and Select, as we’ll see in a minute, but in many cases either could be used.

[3] Interestingly, none of these actually implements IEnumerable<T>.

[4] It is also valid for there to be two key types involved, with an implicit conversion from one to the other. One of the types must be a better choice than the other, in the same way that the compiler infers the type of an implicitly typed array. In my experience you rarely need to consciously consider this detail.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.226.34.25