Goals of this chapter:
• Define the different types of operators and what makes them unique.
• Define the common patterns used in building each type of operator.
• Understand the best practices and patterns used for error handling.
LINQ to Objects is designed to be extended. This chapter discusses how new operators can be written to perform any custom requirement and then be used in exactly the same way as the standard query operators Microsoft has provided.
Although the standard query operators supplied are comprehensive, there is often the need to build more operators to satisfy a programming problem at hand. LINQ to Object operators are in their purest form, extension methods that add functionality to any IEnumerable<T>
collection type. Although writing new operators might seem ambitious at first, the implementation code for the standard query operators supplied by Microsoft rarely span more than a few lines, and it is unlikely your custom operators will be any more complex.
Query operators for LINQ to Objects fall into four categories based on their return type and action. The basic operator categories are
• Single element operator—Those that return a single element from a sequence (for example, First
, Last
, ElementAt
).
• Sequence operator—Those that return a sequence of elements (Where
, Select
).
• Aggregate operator—Those that return an aggregate result value (Count
, Min
, Max
, Average
).
• Grouping operator—Those that return groupings of elements from a sequence (GroupBy
, GroupJoin
).
This chapter looks at each type of operator in detail and constructs (and tests) a sample operator for each operator type.
The role of a single element operator is to add an extension method to any type that implements IEnumerable<T>
and return a single element from that collection. The built-in standard query operators First
, Last
and ElementAt
are examples of this category of operator, returning the first element, the last element, or the element at a given index position, respectively.
Understanding how Microsoft constructed the standard query operators is the first step to understanding how to write custom operators. To explore Microsoft’s approach to operator construction, let’s build our own version of the Last
operator before moving onto building a more complex operator.
To learn how to build an operator that returns a single element, let’s first look at how the built-in standard query operator Last
is implemented. The shortest compilable implementation for the Last
operator is shown in Listing 7-1.
Listing 7-1. Shortest compilable Last
operator implementation
The code in Listing 7-1 satisfies the compiler and is completely functional. It simply takes an IEnumerable<T>
collection as its source, iterates to the last element, and returns that element as its result. The one error condition trapped by this implementation is the case where there are no elements in the source collection. Throwing an InvalidOperationException
is the standard pattern used by Microsoft’s operator implementations, and custom operators should follow this pattern for consistency (omitting the InvalidOperationException
in the previous code would cause an error in compilation because not all code paths return a value, so it is not really optional).
Another error condition to be handled is when the source collection is itself null (not empty, but uninitialized). The pattern used by Microsoft in this case is to throw an ArgumentNullException
and to test the source argument for this condition at the beginning of any operator, as the following code demonstrates:
This implementation is fully functional and follows all of the error condition patterns that Microsoft employs in their operators. Once the error conditions are satisfied, performance improvement can be explored.
The current implementation iterates the entire source collection to get to the last element, all 100,000 of them if that is the size of the source collection. If the collection has a high element count, this could be considered a performance issue. For many collection types, the last element can be retrieved with a single statement. Many collections that implement IEnumerable<T>
also implement the interface IList<T>
. IList<T>
collections allow access by element index position, a zero-based count from the first element. If a collection implements IList<T>
, then custom operators should first exhaust that avenue of processing before using the slower IEnumerable<T>
enumeration algorithm approach. The code shown in Listing 7-2 demonstrates how to use index position if possible (otherwise using the enumeration pattern of choice).
Listing 7-2. Last
operator implementation with recommended error handling and performance optimizations
This implementation first tries to cast the source collection to the IList<T>
interface; if it is successful, it uses that interface to resolve the last element in a single statement. If the collection does not implement IList<T>
, the do-while enumeration pattern using the IEnumerable<T>
interface is employed. The same error-handling patterns are followed, and this implementation is equivalent to the patterns used by Microsoft in its operator implementations.
Throwing an exception when there are no source collection elements (an empty sequence) may not be desirable behavior when using single element return operators. It is difficult dealing with exceptions from within a query expression (nowhere to put the try-catch
statement). Many operators implement a variant of these operators that return a null value or a default value you specify. The operators FirstOrDefault
and LastOrDefault
are examples that carry out the basic function of their partners, but instead of throwing an exception when the source sequence is empty, they return the default value of the type by calling the default
keyword on the specific underlying generic type. For our Last
implementation, we would add an operator called LastOrDefault
, and the code for this operator would be (the only alteration from the code for the final Last
operator is the operator’s name and the last line):
To demonstrate how to build a custom operator that returns a single element from a source sequence, let’s look at an operator that returns an element at random. There are often times when picking a candidate at random from a collection is necessary, either for test data or deriving a random sample from a customer list to provide specific advertising or business offers. This new operator will be called RandomElement
and will return one element at random from a given sequence (pseudo-random to be exact—it can be difficult to get the randomness you need for a small collection or when retrieving more than one in a tight loop).
The RandomElement
operator has the following requirements:
• Take an input source IEnumerable<T>
sequence and return one element at random.
• Allow control over the random number generator by optionally accepting a seed value.
• Throw an ArgumentNullException
if the source sequence is null.
• Throw an InvalidOperationException
if the source sequence is empty.
As seen when implementing the Last
operator, although it is certain that sequences you extend implement the IEnumerable<T>
interface, many sequences also implement the IList<T>
and ICollection<T>
interfaces, which can be used to improve performance. Any collection type that implements IList
allows elements to be accessed by index position (for example, list[2]
will return the third element), and any collection implementing ICollection
exposes a Count
property that immediately retrieves element count. These performance enhancements should be used if possible.
The code shown in Listing 7-3 uses the built-in .NET Framework random number generator to find an index position that is within a sequence’s bounds and returns that element. It first tries to use the IList
indexer to retrieve the element (because it is faster), and if that interface isn’t supported by the collection, this operator uses a slower enumeration looping pattern (completely valid code, just slower).
Listing 7-3. Sample RandomElement
operator returns an element at random from a source sequence and takes an optional seed value for the random number generator
Although it is completely up to the implementer to decide behavior for empty and null source collections, following the same pattern used for the standard query operators is highly recommended, as it makes consuming your custom operators the same as the other LINQ operators. The general conventions are as follows:
ArgumentNullException
.InvalidOperationException
.InvalidOperationException
, but returns the value default(T)
instead.Listing 7-4 shows the basic unit tests written to confirm correct behavior of the RandomElement
operator. (I used NUnit—see www.nunit.org—but any unit testing framework would have worked.) The basic test cases for specific input sequence types that should be confirmed at a minimum are as follows:
• The source sequence is null
.
• The source sequence has no elements (is empty).
• The source sequence implements IList
.
• The source sequence implements IEnumerable<T>
.
These tests, shown in Listing 7-4, are the minimum required for the RandomElement
operator, and lightly confirm both the IList<T>
and IEnumerable
implementations with a single element source and a two-element source. Pay attention to boundary conditions when writing these tests; my first attempt at coding this operator never returned the last element—the unit tests allowed me to find and correct this error quickly.
Listing 7-4. Basic set of unit tests for the RandomElement
operator—these tests are written for the NUnit testing framework (www.nunit.org); more tests are included in the sample code.
The code for our RandomElement
operator in Listing 7-3 satisfies the requirements listed earlier; however, there is one more requirement that will make this single element operator conform to the patterns common in the standard query operators. We need to add a variation called RandomElementOrDefault
to cater to empty source sequences without throwing an exception. This is necessary to allow the single element operators to be used on sequences that might ordinarily be empty, and rather than throwing an exception, the operator should return null for reference types or zero for numeric value types.
To add this variation of our new operator, I simply cut and pasted the code from Listing 7-3, changed its name to RandomElementOrDefault
, and then altered the empty source check to return default(T);
rather than throw new InvalidOperationException
. The remaining part of the operator is unchanged (although good coding practice would have you refactor out the common code into a separate method). The additional operator begins with the following code:
Following are a few tips for developing new single element operators:
• Remember to test and make use of any performance efficiencies offered by collections that implement IList<T>
and ICollection
interfaces in addition to IEnumerable<T>
if present on the source collection. Make certain to support the IEnumerable<T>
sequences first before looking for performance optimizations.
• Implement an additional variation of these operators (suffixed with OrDefault
) that returns a legal default value for a given type, rather than throw an InvalidOperationException
(change the throw new InvalidOperationException
to return default(T);)
• Keep your extension methods in a consistent namespace so that consumers of your operators only need to add a single namespace using
clause to get access to your custom operators.
The role of a sequence operator is to return an IEnumerable<T>
sequence based on the elements of a source sequence that have been reordered, filtered, or transformed in some way. The Where
and Select
standard query operators are examples of sequence operators.
Sequence operators make use of the yield return
and yield break
keywords introduced in C# 2.0. They enable you to write code that implements an iterator
pattern, which allows consumers to use the familiar foreach
pattern (the foreach
uses the IEnumerable
interface methods, GetEnumerator
and MoveNext
, behind the scenes; you can call these methods directly to gain a small performance improvement if necessary).
Enumerators implemented using the yield return
keyword can be thought of as suspending execution after each yield return
until the next element is requested. When the next element is requested, execution begins at the point and in the exact state the loop was previously suspended in. The simplest example of this is shown in Listing 7-5, which simply writes out 1, 2, and 3 in the Console window as shown in Output 7-1.
Listing 7-5. Simple yield return example—execution begins at the point after each yield return each time around the loop—see Output 7-1
Output 7-1
The job of any sequence operator is to return elements using the yield return
keyword, until there are no more candidates in a sequence. The classic sequence operator to look to for guidance is the Where
operator included with the standard query operators.
The Where
operator extends any collection that implements IEnumerable<T>
and loops through the elements in the source sequence, testing each element as it goes. If an element passes the predicate function (predicate is a function that takes a single argument and returns true or false based on the value of the parameter), that element is returned. Otherwise, that element is skipped. The yield return
statement will only ever return one element and then suspend execution until the next element is requested, either manually (calling the MoveNext
method) or by using a foreach
loop.
The Where
operator (with the argument error checking code removed for clarity) is implemented in the following general form:
To demonstrate how to write a sequence operator, we can write an operator called TakeRange
, which will return elements in a sequence after an element satisfies the start
predicate in a sequence until an element satisfies the end
predicate in the same sequence.
The TakeRange
operator has the following requirements:
• Take an input source and return elements that fall between the first element that passes a “start” predicate until the first element that passes the “end” predicate.
• Allow the start predicate to be optional. In this case all elements in the sequence are returned until the end predicate is satisfied.
• Not fail if the start or end predicate is never satisfied; just return an empty sequence.
• Throw an ArgumentNullException
if the source sequence is null.
• Throw an ArgumentNullException
if either the start or end predicate is null in the appropriate overloads.
The TakeRange
operator will extend any collection that implements IEnumerable<T>
, as the LINQ to Objects standard query operators always do. Due to this knowledge, the only certainty we have about the source collections is that it can be enumerated using the enumerator pattern. (Call members defined in the IEnumerable<T>
interface or use the foreach
statement.) Listing 7-6 shows the functional part of the TakeRange
operator, the code that creates the iterator and carries out the predicate tests.
Listing 7-6. The iterator implementation for the TakeRange
operator—see Listing 7-7 for the actual extension method operator implementation
The convention when writing sequence operators is to separate the iterator code from the extension method implementation. This allows the same functional iterator to be called from multiple extension methods that have different overload signatures. The requirements described earlier for the TakeRange
operator will require two extension method signatures—one that takes a start and end predicate and one that takes only an end predicate (it will always start from the first element).
The actual extension methods that expose the TakeRange
operator to our collections are shown in Listing 7-7. When only one predicate is passed, it is assumed to be the end predicate, and all elements from the start of the source sequence until this end predicate passes are returned—this is just the way we have specified the operator to work.
Listing 7-7. The TakeRange
extension method declarations—both extension methods call the same iterator method as shown in Listing 7-6
Although optional, it is highly recommended to stay consistent with the error handling patterns followed by the standard query operators. The general conventions for sequence type operators are as follows:
ArgumentNullException
.InvalidOperationException
.ArgumentNullException
.Listing 7-8 shows the final code that makes up the TakeRange
operator. It follows the accepted practices for input argument errors, demonstrates how to factor out duplicate code into a separate iterator method, and demonstrates how to expose multiple overloaded operators.
Listing 7-8. The TakeRange
operator extension methods showing the error handling following the traditional pattern used in the Microsoft standard query operators
To ensure a custom operator works as expected and stays that way in future releases, a solid set of unit tests should be written. To test the TakeRange
operator and to demonstrate its basic usage, Listing 7-9 shows the minimum set of unit tests to prove the operator is functional. These tests confirm the following conditions when calling this operator:
• The operator works consistently with normal inputs. (It actually works!)
• The operator works consistently at the boundaries. (How does it handle the first or last element in the source?)
• The operator throws an exception when called from a null source sequence.
• The operator throws an exception if called with null predicates.
Listing 7-9. The following unit tests are a starting point for testing the TakeRange
operator
Following are a few tips for developing new sequence operators:
• Separate the iterator code from the extension method functions so you can provide many overloads that allow optional input arguments all calling the same main method implementation
• Test input arguments for null values and throw an ArgumentNullException("arg");
if these input arguments are essential for correct operation.
• Follow the same patterns and error handling that Microsoft used in their operators. (I’ve followed as many as possible throughout this book.)
The role of an aggregate operator is to traverse a source sequence and carry out a function (often arithmetic) on the elements. Example aggregate standard query operators are Sum
, Min
, and Max
, which return the sum of values, the smallest value, and the largest value, respectively.
To understand the basics of an aggregate style operator, let’s look at the implementation of the Min
standard query operator. Listing 7-10 shows the error handling and functional code to find the minimum integer in a sequence.
Listing 7-10. Implementation of the Min
standard query operator
Listing 7-10 first checks the source sequence for a null condition and throws an exception if that is the case. If the source sequence is not null, each element is compared to the current minimum element found so far (by value), except the first element. When flag
is false, the first element hasn’t yet been evaluated, and if the sequence is empty (no first element ever found), an InvalidOperationException
is thrown; otherwise, the first element’s value is assigned to the current minimum variable on the first iteration step. After iterating and testing each element’s value against the current minimum found, the lowest value is returned.
The code shown previously for the Min
operator is just one of the many overloads declared for this operator in the standard query operator (there are 22 total). Of the full set of Min
operators provided, 11 are similar to the Min
operator just described, and 11 also take a selector expression. The full set declared in the standard query operators is shown in Listing 7-11.
Listing 7-11. The full set of operator overloads for the Min
operator, showing the argument types and return types supported by aggregate operators
When building your own aggregate operators, you should consider a similar set of overloads, as laborious and clipboard-intensive as it is initially writing them. My approach is to start with the int32
variation and fully unit-test that implementation before duplicating the code with the rest of the numeric and nullable numeric types. Even with this set of overloads, some of the standard query operators still don’t fulfill every need—I found this out when using the Sum
operator, as is described in the next section.
Early on when working with LINQ to Objects I found the built-in Sum
operator that sums the values for a set of integer values returned the result as an integer type. This didn’t suit my requirements, and my sequences quickly overflowed the integer maximum boundary of the return type. A custom operator was needed to return the arithmetic sum of integer values with a result type of long
(overloading the built-in Min
operators didn’t solve the issue. Because I was only changing the return type and not an argument type, C# compiler’s overload resolution wasn’t satisfied—hence the name LongSum
).
The LongSum
operator has the following requirements:
IEnumerable<int>
collection as a source and return the arithmetic sum as type long
.Sum
operators. Create two extension method overloaded signatures, one that takes a selector expression and one that doesn’t.ArgumentNullException
if the source sequence is null.The obvious goal is to make using the new LongSum
extension method identical to using the Sum
extension methods. Listing 7-12 shows the full code for implementing the LongSum
operator. The code iterates the source collection using a foreach
loop and simply sums the element values as it goes. If a Selector
expression argument is passed into the operator, that Selector
expression is first evaluated before calling the other LongSum
overload with simply a source sequence of integer values.
Listing 7-12. The code for the LongSum operator, which takes a sequence of int
values and returns the mathematical sum up to the maximum value allowed in a Long
type
To test the LongSum
operator, Listing 7-13 shows the basic set of unit tests written to confirm correct behavior and demonstrate its usage. The basic set of unit tests you should consider for any custom aggregate style operators are:
• The operator works with normal inputs; positive and negative numbers, and all method overloads.
• The operator throws an exception when called on a null sequence (this is the standard error-handling pattern).
• The operator returns a safe value when called on an empty sequence. Return zero or an exception? The choice is up to the implementer.
Listing 7-13. Basic set of unit tests for the LongSum
operator
The following provides some tips for developing new aggregate style operators:
• For completeness, you should create the full set of numeric overloads when you create your own operators. Refer to the set shown in Listing 7-11 for the Min
operator as a good starting list for your operators.
• Provide an overload for each operator that takes a selector expression. Aggregate functions are often used in the select project of a grouped query result, where you want to define the numeric sequence you want aggregated. To implement these overloads, call the Select
standard query operator passing in the select argument before calling your custom operator, for example:
• Test input arguments for null values and throw an ArgumentNullException("arg");
if these input arguments are essential for correct operation.
• Look at the source code for the standard query operators by using Red Gate Reflector (http://www.red-gate.com/products/reflector/). Follow the same patterns and error handling that Microsoft uses in their operators.
Grouping operators return collections of the source elements (or variations thereof) grouped by some algorithm. Examples of grouping category operators from the standard query operators are GroupBy
and ToLookup
.
The general steps for building a new grouping operator are:
IGrouping<TKey, TElement>
list implementation to hold your groups. In a perfect world we could reuse Microsoft’s Grouping
type implementation, however it is marked as internal. I recommend copying it as a template for your own grouping list implementation (which is what I did in this example, and provided in Listing 7-14).IGrouping<TKey, TElement>
types.The ToLookup
standard query operator almost universally caters to any grouping requirement (for instance, it is the basis for the GroupBy
operator). If you can find an implementation by using the ToLookup
operator, you can save a lot of time. For educational purposes, we are going to build a grouping operator from scratch.
Each group result from a grouping operator should follow the IGrouping<TKey,TElement>
interface. This collection implementation should be read-only from the consuming application code and only populated by our grouping operator code. Listing 7-14 shows a basic implementation for a suitable grouping collection class. This implementation is a close facsimile of the Grouping
type used by Microsoft in the standard query operator’s; however, we need to duplicate that code because that class is marked internal. You will notice that any property setter or method call that can alter the contents of the collection throws a NotSupportedException
(the collection is read-only). You can iterate, access elements by index, and retrieve the key property for the group—but that is essentially all.
Listing 7-14. Example grouping element collection implementation—predominately read-only
To demonstrate writing a grouping operator, we will write an example operator called Segment
. The Segment
operator divides the elements in a source sequence into equal-sized groups (equal element count) where it can, except the last group if the elements can’t be distributed equally.
The Segment
operator has the following requirements:
• To take any collection implementing IEnumerable<T>
as an input source and divide it into multiple groups with the same number of elements.
• If the number of elements doesn’t allow all groups to be equally sized, the last group and all other groups of equal size should be reduced.
• If there are more segments than elements, put one element in as many as possible and return empty groups for the remaining.
• Each group should have a Key
property with the sequential segment number; that is, 1 for the first group, 2 for the second, and so on.
• Throw an ArgumentNullException
if the source sequence is null.
• Throw an ArgumentOutOfRangeException
if the number of segments is zero or less than zero.
A quick example of how the Segment
operator should work is shown in Listing 7-15. The Console output from this example is shown in Output 7-2.
Listing 7-15. Calling the Segment
operator to divide a string array into five groups—see Output 7-2
Output 7-2
A real-world example (why this operator was first built) of segmenting data into groups is when generating groups for multivariate testing. For instance, imagine you wanted to send an email to all contacts in your customer list. You want to test variations in order to determine what version gets the greatest response. The Segment
operator can be used to build these groups, as the code shown in Listing 7-16 demonstrates by creating two test groups. The Console output from this example is shown in Output 7-3.
Listing 7-16. Calling the Segment
operator to divide contacts into two groups—see Output 7-3
Output 7-3
The code to implement the Segment
operator is shown in Listing 7-17. This implementation first checks the input arguments and then calls a private implementation that encapsulates the building and iteration functionality. Separating the iterator function to support reuse with multiple overloaded variations of the Segment
operator is considered a good practice (as mentioned in the TakeRange
operator example). The return type of this operator is an IGrouping<TKey,TElement>
(declared publicly in the System.Linq
namespace, so it will be in scope); instances of our grouping collections are constructed as Grouping<int, T>[]
’s, using the collection type shown in Listing 7-14.
Listing 7-17. The Segment
operator code, which segments the contents of the source sequence into equal-sized groups
To confirm this segment operator matches the previously stated requirements and to confirm it stays that way throughout multiple future revisions, Listing 7-18 shows some basic unit tests. In writing this operator, these tests helped me debug the boundary conditions for odd, even, and too small a quantity source sequences.
Listing 7-18. Basic unit tests for the Segment
operator
The following are some tips for developing new grouping operators:
• Try and use the ToLookup
standard query operator in your implementation if possible.
• Separate the iterator logic into its own method, so that multiple operator overloads can re-use the same iterator implementation.
• Try to build the groups in one enumeration pass through the source sequence for optimal performance.
• Try to yield return a result grouping as soon as you can build one. Avoid building all groups up-front in the case that the user never iterates past the first group (just for performance, sometimes its unavoidable, but worth looking for an algorithm that performs best with deferred execution).
This chapter introduced how to build custom operators when required. LINQ to Object operators are simply extension methods for any type implementing the IEnumerable<T>
interface.
The general guidance for building operators is to look at the implementation of Microsoft’s standard query operators and follow their patterns. This ensures that your operators are consistent in how they handle errors and how they are used when other developers begin using these extensions in query expressions.
3.16.75.165