Chapter 7. Extending LINQ to Objects

Goals of this chapter:

• Define the different types of operators and what makes them unique.

• Define the common patterns used in building each type of operator.

• Understand the best practices and patterns used for error handling.

LINQ to Objects is designed to be extended. This chapter discusses how new operators can be written to perform any custom requirement and then be used in exactly the same way as the standard query operators Microsoft has provided.

Writing a New Query Operator

Although the standard query operators supplied are comprehensive, there is often the need to build more operators to satisfy a programming problem at hand. LINQ to Object operators are in their purest form, extension methods that add functionality to any IEnumerable<T> collection type. Although writing new operators might seem ambitious at first, the implementation code for the standard query operators supplied by Microsoft rarely span more than a few lines, and it is unlikely your custom operators will be any more complex.

Query operators for LINQ to Objects fall into four categories based on their return type and action. The basic operator categories are

Single element operator—Those that return a single element from a sequence (for example, First, Last, ElementAt).

Sequence operator—Those that return a sequence of elements (Where, Select).

Aggregate operator—Those that return an aggregate result value (Count, Min, Max, Average).

Grouping operator—Those that return groupings of elements from a sequence (GroupBy, GroupJoin).

This chapter looks at each type of operator in detail and constructs (and tests) a sample operator for each operator type.

Writing a Single Element Operator

The role of a single element operator is to add an extension method to any type that implements IEnumerable<T> and return a single element from that collection. The built-in standard query operators First, Last and ElementAt are examples of this category of operator, returning the first element, the last element, or the element at a given index position, respectively.

Understanding how Microsoft constructed the standard query operators is the first step to understanding how to write custom operators. To explore Microsoft’s approach to operator construction, let’s build our own version of the Last operator before moving onto building a more complex operator.

Building Our Own Last Operator

To learn how to build an operator that returns a single element, let’s first look at how the built-in standard query operator Last is implemented. The shortest compilable implementation for the Last operator is shown in Listing 7-1.

Listing 7-1. Shortest compilable Last operator implementation

image

The code in Listing 7-1 satisfies the compiler and is completely functional. It simply takes an IEnumerable<T> collection as its source, iterates to the last element, and returns that element as its result. The one error condition trapped by this implementation is the case where there are no elements in the source collection. Throwing an InvalidOperationException is the standard pattern used by Microsoft’s operator implementations, and custom operators should follow this pattern for consistency (omitting the InvalidOperationException in the previous code would cause an error in compilation because not all code paths return a value, so it is not really optional).

Another error condition to be handled is when the source collection is itself null (not empty, but uninitialized). The pattern used by Microsoft in this case is to throw an ArgumentNullException and to test the source argument for this condition at the beginning of any operator, as the following code demonstrates:

image

This implementation is fully functional and follows all of the error condition patterns that Microsoft employs in their operators. Once the error conditions are satisfied, performance improvement can be explored.

The current implementation iterates the entire source collection to get to the last element, all 100,000 of them if that is the size of the source collection. If the collection has a high element count, this could be considered a performance issue. For many collection types, the last element can be retrieved with a single statement. Many collections that implement IEnumerable<T> also implement the interface IList<T>. IList<T> collections allow access by element index position, a zero-based count from the first element. If a collection implements IList<T>, then custom operators should first exhaust that avenue of processing before using the slower IEnumerable<T> enumeration algorithm approach. The code shown in Listing 7-2 demonstrates how to use index position if possible (otherwise using the enumeration pattern of choice).

Listing 7-2. Last operator implementation with recommended error handling and performance optimizations

image

This implementation first tries to cast the source collection to the IList<T> interface; if it is successful, it uses that interface to resolve the last element in a single statement. If the collection does not implement IList<T>, the do-while enumeration pattern using the IEnumerable<T> interface is employed. The same error-handling patterns are followed, and this implementation is equivalent to the patterns used by Microsoft in its operator implementations.

Table 7-1. LINQ to Objects Built-in Performance Optimizations

image

Throwing an exception when there are no source collection elements (an empty sequence) may not be desirable behavior when using single element return operators. It is difficult dealing with exceptions from within a query expression (nowhere to put the try-catch statement). Many operators implement a variant of these operators that return a null value or a default value you specify. The operators FirstOrDefault and LastOrDefault are examples that carry out the basic function of their partners, but instead of throwing an exception when the source sequence is empty, they return the default value of the type by calling the default keyword on the specific underlying generic type. For our Last implementation, we would add an operator called LastOrDefault, and the code for this operator would be (the only alteration from the code for the final Last operator is the operator’s name and the last line):

image

Building the RandomElement Operator

To demonstrate how to build a custom operator that returns a single element from a source sequence, let’s look at an operator that returns an element at random. There are often times when picking a candidate at random from a collection is necessary, either for test data or deriving a random sample from a customer list to provide specific advertising or business offers. This new operator will be called RandomElement and will return one element at random from a given sequence (pseudo-random to be exact—it can be difficult to get the randomness you need for a small collection or when retrieving more than one in a tight loop).

The RandomElement operator has the following requirements:

• Take an input source IEnumerable<T> sequence and return one element at random.

• Allow control over the random number generator by optionally accepting a seed value.

• Throw an ArgumentNullException if the source sequence is null.

• Throw an InvalidOperationException if the source sequence is empty.

As seen when implementing the Last operator, although it is certain that sequences you extend implement the IEnumerable<T> interface, many sequences also implement the IList<T> and ICollection<T> interfaces, which can be used to improve performance. Any collection type that implements IList allows elements to be accessed by index position (for example, list[2] will return the third element), and any collection implementing ICollection exposes a Count property that immediately retrieves element count. These performance enhancements should be used if possible.

The code shown in Listing 7-3 uses the built-in .NET Framework random number generator to find an index position that is within a sequence’s bounds and returns that element. It first tries to use the IList indexer to retrieve the element (because it is faster), and if that interface isn’t supported by the collection, this operator uses a slower enumeration looping pattern (completely valid code, just slower).

Listing 7-3. Sample RandomElement operator returns an element at random from a source sequence and takes an optional seed value for the random number generator

image

image

Although it is completely up to the implementer to decide behavior for empty and null source collections, following the same pattern used for the standard query operators is highly recommended, as it makes consuming your custom operators the same as the other LINQ operators. The general conventions are as follows:

  1. If the source collection being extended is null, throw an ArgumentNullException.
  2. If the source collection is empty, throw an InvalidOperationException.
  3. Provide a variation of the operator with the suffix of “OrDefault” that doesn’t raise the InvalidOperationException, but returns the value default(T) instead.

Listing 7-4 shows the basic unit tests written to confirm correct behavior of the RandomElement operator. (I used NUnit—see www.nunit.org—but any unit testing framework would have worked.) The basic test cases for specific input sequence types that should be confirmed at a minimum are as follows:

• The source sequence is null.

• The source sequence has no elements (is empty).

• The source sequence implements IList.

• The source sequence implements IEnumerable<T>.

These tests, shown in Listing 7-4, are the minimum required for the RandomElement operator, and lightly confirm both the IList<T> and IEnumerable implementations with a single element source and a two-element source. Pay attention to boundary conditions when writing these tests; my first attempt at coding this operator never returned the last element—the unit tests allowed me to find and correct this error quickly.

Listing 7-4. Basic set of unit tests for the RandomElement operator—these tests are written for the NUnit testing framework (www.nunit.org); more tests are included in the sample code.

image

image

image

image

The code for our RandomElement operator in Listing 7-3 satisfies the requirements listed earlier; however, there is one more requirement that will make this single element operator conform to the patterns common in the standard query operators. We need to add a variation called RandomElementOrDefault to cater to empty source sequences without throwing an exception. This is necessary to allow the single element operators to be used on sequences that might ordinarily be empty, and rather than throwing an exception, the operator should return null for reference types or zero for numeric value types.

To add this variation of our new operator, I simply cut and pasted the code from Listing 7-3, changed its name to RandomElementOrDefault, and then altered the empty source check to return default(T); rather than throw new InvalidOperationException. The remaining part of the operator is unchanged (although good coding practice would have you refactor out the common code into a separate method). The additional operator begins with the following code:

image

Following are a few tips for developing new single element operators:

• Remember to test and make use of any performance efficiencies offered by collections that implement IList<T> and ICollection interfaces in addition to IEnumerable<T> if present on the source collection. Make certain to support the IEnumerable<T> sequences first before looking for performance optimizations.

• Implement an additional variation of these operators (suffixed with OrDefault) that returns a legal default value for a given type, rather than throw an InvalidOperationException (change the throw new InvalidOperationException to return default(T);)

• Keep your extension methods in a consistent namespace so that consumers of your operators only need to add a single namespace using clause to get access to your custom operators.

Writing a Sequence Operator

The role of a sequence operator is to return an IEnumerable<T> sequence based on the elements of a source sequence that have been reordered, filtered, or transformed in some way. The Where and Select standard query operators are examples of sequence operators.

Sequence operators make use of the yield return and yield break keywords introduced in C# 2.0. They enable you to write code that implements an iterator pattern, which allows consumers to use the familiar foreach pattern (the foreach uses the IEnumerable interface methods, GetEnumerator and MoveNext, behind the scenes; you can call these methods directly to gain a small performance improvement if necessary).

Enumerators implemented using the yield return keyword can be thought of as suspending execution after each yield return until the next element is requested. When the next element is requested, execution begins at the point and in the exact state the loop was previously suspended in. The simplest example of this is shown in Listing 7-5, which simply writes out 1, 2, and 3 in the Console window as shown in Output 7-1.

Listing 7-5. Simple yield return example—execution begins at the point after each yield return each time around the loop—see Output 7-1

image

Output 7-1

image

The job of any sequence operator is to return elements using the yield return keyword, until there are no more candidates in a sequence. The classic sequence operator to look to for guidance is the Where operator included with the standard query operators.

The Where operator extends any collection that implements IEnumerable<T> and loops through the elements in the source sequence, testing each element as it goes. If an element passes the predicate function (predicate is a function that takes a single argument and returns true or false based on the value of the parameter), that element is returned. Otherwise, that element is skipped. The yield return statement will only ever return one element and then suspend execution until the next element is requested, either manually (calling the MoveNext method) or by using a foreach loop.

The Where operator (with the argument error checking code removed for clarity) is implemented in the following general form:

image

Building the TakeRange Operator

To demonstrate how to write a sequence operator, we can write an operator called TakeRange, which will return elements in a sequence after an element satisfies the start predicate in a sequence until an element satisfies the end predicate in the same sequence.

The TakeRange operator has the following requirements:

• Take an input source and return elements that fall between the first element that passes a “start” predicate until the first element that passes the “end” predicate.

• Allow the start predicate to be optional. In this case all elements in the sequence are returned until the end predicate is satisfied.

• Not fail if the start or end predicate is never satisfied; just return an empty sequence.

• Throw an ArgumentNullException if the source sequence is null.

• Throw an ArgumentNullException if either the start or end predicate is null in the appropriate overloads.

The TakeRange operator will extend any collection that implements IEnumerable<T>, as the LINQ to Objects standard query operators always do. Due to this knowledge, the only certainty we have about the source collections is that it can be enumerated using the enumerator pattern. (Call members defined in the IEnumerable<T> interface or use the foreach statement.) Listing 7-6 shows the functional part of the TakeRange operator, the code that creates the iterator and carries out the predicate tests.

Listing 7-6. The iterator implementation for the TakeRange operator—see Listing 7-7 for the actual extension method operator implementation

image

The convention when writing sequence operators is to separate the iterator code from the extension method implementation. This allows the same functional iterator to be called from multiple extension methods that have different overload signatures. The requirements described earlier for the TakeRange operator will require two extension method signatures—one that takes a start and end predicate and one that takes only an end predicate (it will always start from the first element).

The actual extension methods that expose the TakeRange operator to our collections are shown in Listing 7-7. When only one predicate is passed, it is assumed to be the end predicate, and all elements from the start of the source sequence until this end predicate passes are returned—this is just the way we have specified the operator to work.

Listing 7-7. The TakeRange extension method declarations—both extension methods call the same iterator method as shown in Listing 7-6

image

Although optional, it is highly recommended to stay consistent with the error handling patterns followed by the standard query operators. The general conventions for sequence type operators are as follows:

  1. If the source collection being extended is null, throw an ArgumentNullException.
  2. If the source collection is empty, throw an InvalidOperationException.
  3. If either predicate is null, throw an ArgumentNullException.

Listing 7-8 shows the final code that makes up the TakeRange operator. It follows the accepted practices for input argument errors, demonstrates how to factor out duplicate code into a separate iterator method, and demonstrates how to expose multiple overloaded operators.

Listing 7-8. The TakeRange operator extension methods showing the error handling following the traditional pattern used in the Microsoft standard query operators

image

image

To ensure a custom operator works as expected and stays that way in future releases, a solid set of unit tests should be written. To test the TakeRange operator and to demonstrate its basic usage, Listing 7-9 shows the minimum set of unit tests to prove the operator is functional. These tests confirm the following conditions when calling this operator:

• The operator works consistently with normal inputs. (It actually works!)

• The operator works consistently at the boundaries. (How does it handle the first or last element in the source?)

• The operator throws an exception when called from a null source sequence.

• The operator throws an exception if called with null predicates.

Listing 7-9. The following unit tests are a starting point for testing the TakeRange operator

image

image

Following are a few tips for developing new sequence operators:

• Separate the iterator code from the extension method functions so you can provide many overloads that allow optional input arguments all calling the same main method implementation

• Test input arguments for null values and throw an ArgumentNullException("arg"); if these input arguments are essential for correct operation.

• Follow the same patterns and error handling that Microsoft used in their operators. (I’ve followed as many as possible throughout this book.)

Writing an Aggregate Operator

The role of an aggregate operator is to traverse a source sequence and carry out a function (often arithmetic) on the elements. Example aggregate standard query operators are Sum, Min, and Max, which return the sum of values, the smallest value, and the largest value, respectively.

The Min Operator

To understand the basics of an aggregate style operator, let’s look at the implementation of the Min standard query operator. Listing 7-10 shows the error handling and functional code to find the minimum integer in a sequence.

Listing 7-10. Implementation of the Min standard query operator

image

Listing 7-10 first checks the source sequence for a null condition and throws an exception if that is the case. If the source sequence is not null, each element is compared to the current minimum element found so far (by value), except the first element. When flag is false, the first element hasn’t yet been evaluated, and if the sequence is empty (no first element ever found), an InvalidOperationException is thrown; otherwise, the first element’s value is assigned to the current minimum variable on the first iteration step. After iterating and testing each element’s value against the current minimum found, the lowest value is returned.

The code shown previously for the Min operator is just one of the many overloads declared for this operator in the standard query operator (there are 22 total). Of the full set of Min operators provided, 11 are similar to the Min operator just described, and 11 also take a selector expression. The full set declared in the standard query operators is shown in Listing 7-11.

Listing 7-11. The full set of operator overloads for the Min operator, showing the argument types and return types supported by aggregate operators

image

When building your own aggregate operators, you should consider a similar set of overloads, as laborious and clipboard-intensive as it is initially writing them. My approach is to start with the int32 variation and fully unit-test that implementation before duplicating the code with the rest of the numeric and nullable numeric types. Even with this set of overloads, some of the standard query operators still don’t fulfill every need—I found this out when using the Sum operator, as is described in the next section.

Building the LongSum Operator

Early on when working with LINQ to Objects I found the built-in Sum operator that sums the values for a set of integer values returned the result as an integer type. This didn’t suit my requirements, and my sequences quickly overflowed the integer maximum boundary of the return type. A custom operator was needed to return the arithmetic sum of integer values with a result type of long (overloading the built-in Min operators didn’t solve the issue. Because I was only changing the return type and not an argument type, C# compiler’s overload resolution wasn’t satisfied—hence the name LongSum).

The LongSum operator has the following requirements:

  1. Take an input IEnumerable<int> collection as a source and return the arithmetic sum as type long.
  2. Follow the signature pattern of the existing Sum operators. Create two extension method overloaded signatures, one that takes a selector expression and one that doesn’t.
  3. Throw an ArgumentNullException if the source sequence is null.
  4. If the source sequence is empty, return a result of zero.

The obvious goal is to make using the new LongSum extension method identical to using the Sum extension methods. Listing 7-12 shows the full code for implementing the LongSum operator. The code iterates the source collection using a foreach loop and simply sums the element values as it goes. If a Selector expression argument is passed into the operator, that Selector expression is first evaluated before calling the other LongSum overload with simply a source sequence of integer values.

Listing 7-12. The code for the LongSum operator, which takes a sequence of int values and returns the mathematical sum up to the maximum value allowed in a Long type

image

To test the LongSum operator, Listing 7-13 shows the basic set of unit tests written to confirm correct behavior and demonstrate its usage. The basic set of unit tests you should consider for any custom aggregate style operators are:

• The operator works with normal inputs; positive and negative numbers, and all method overloads.

• The operator throws an exception when called on a null sequence (this is the standard error-handling pattern).

• The operator returns a safe value when called on an empty sequence. Return zero or an exception? The choice is up to the implementer.

Listing 7-13. Basic set of unit tests for the LongSum operator

image

image

The following provides some tips for developing new aggregate style operators:

• For completeness, you should create the full set of numeric overloads when you create your own operators. Refer to the set shown in Listing 7-11 for the Min operator as a good starting list for your operators.

• Provide an overload for each operator that takes a selector expression. Aggregate functions are often used in the select project of a grouped query result, where you want to define the numeric sequence you want aggregated. To implement these overloads, call the Select standard query operator passing in the select argument before calling your custom operator, for example:

image

• Test input arguments for null values and throw an ArgumentNullException("arg"); if these input arguments are essential for correct operation.

• Look at the source code for the standard query operators by using Red Gate Reflector (http://www.red-gate.com/products/reflector/). Follow the same patterns and error handling that Microsoft uses in their operators.

Writing a Grouping Operator

Grouping operators return collections of the source elements (or variations thereof) grouped by some algorithm. Examples of grouping category operators from the standard query operators are GroupBy and ToLookup.

The general steps for building a new grouping operator are:

  1. Write an IGrouping<TKey, TElement> list implementation to hold your groups. In a perfect world we could reuse Microsoft’s Grouping type implementation, however it is marked as internal. I recommend copying it as a template for your own grouping list implementation (which is what I did in this example, and provided in Listing 7-14).
  2. Write an enumeration method that builds the groups and yields the grouped results using instances of the IGrouping<TKey, TElement> types.
  3. Write an extension method signature(s) and implement each method by calling the enumeration method with the appropriate arguments.
  4. Write unit tests and ensure they pass (avoiding the religious debate as to whether tests should be written before the code or vice versa).

The ToLookup standard query operator almost universally caters to any grouping requirement (for instance, it is the basis for the GroupBy operator). If you can find an implementation by using the ToLookup operator, you can save a lot of time. For educational purposes, we are going to build a grouping operator from scratch.

Grouping Collection Implementation

Each group result from a grouping operator should follow the IGrouping<TKey,TElement> interface. This collection implementation should be read-only from the consuming application code and only populated by our grouping operator code. Listing 7-14 shows a basic implementation for a suitable grouping collection class. This implementation is a close facsimile of the Grouping type used by Microsoft in the standard query operator’s; however, we need to duplicate that code because that class is marked internal. You will notice that any property setter or method call that can alter the contents of the collection throws a NotSupportedException (the collection is read-only). You can iterate, access elements by index, and retrieve the key property for the group—but that is essentially all.

Listing 7-14. Example grouping element collection implementation—predominately read-only

image

image

image

image

Building the Segment Operator

To demonstrate writing a grouping operator, we will write an example operator called Segment. The Segment operator divides the elements in a source sequence into equal-sized groups (equal element count) where it can, except the last group if the elements can’t be distributed equally.

The Segment operator has the following requirements:

• To take any collection implementing IEnumerable<T> as an input source and divide it into multiple groups with the same number of elements.

• If the number of elements doesn’t allow all groups to be equally sized, the last group and all other groups of equal size should be reduced.

• If there are more segments than elements, put one element in as many as possible and return empty groups for the remaining.

• Each group should have a Key property with the sequential segment number; that is, 1 for the first group, 2 for the second, and so on.

• Throw an ArgumentNullException if the source sequence is null.

• Throw an ArgumentOutOfRangeException if the number of segments is zero or less than zero.

A quick example of how the Segment operator should work is shown in Listing 7-15. The Console output from this example is shown in Output 7-2.

Listing 7-15. Calling the Segment operator to divide a string array into five groups—see Output 7-2

image

Output 7-2

image

A real-world example (why this operator was first built) of segmenting data into groups is when generating groups for multivariate testing. For instance, imagine you wanted to send an email to all contacts in your customer list. You want to test variations in order to determine what version gets the greatest response. The Segment operator can be used to build these groups, as the code shown in Listing 7-16 demonstrates by creating two test groups. The Console output from this example is shown in Output 7-3.

Listing 7-16. Calling the Segment operator to divide contacts into two groups—see Output 7-3

image

Output 7-3

image

The code to implement the Segment operator is shown in Listing 7-17. This implementation first checks the input arguments and then calls a private implementation that encapsulates the building and iteration functionality. Separating the iterator function to support reuse with multiple overloaded variations of the Segment operator is considered a good practice (as mentioned in the TakeRange operator example). The return type of this operator is an IGrouping<TKey,TElement> (declared publicly in the System.Linq namespace, so it will be in scope); instances of our grouping collections are constructed as Grouping<int, T>[]’s, using the collection type shown in Listing 7-14.

Listing 7-17. The Segment operator code, which segments the contents of the source sequence into equal-sized groups

image

image

image

To confirm this segment operator matches the previously stated requirements and to confirm it stays that way throughout multiple future revisions, Listing 7-18 shows some basic unit tests. In writing this operator, these tests helped me debug the boundary conditions for odd, even, and too small a quantity source sequences.

Listing 7-18. Basic unit tests for the Segment operator

image

image

image

The following are some tips for developing new grouping operators:

• Try and use the ToLookup standard query operator in your implementation if possible.

• Separate the iterator logic into its own method, so that multiple operator overloads can re-use the same iterator implementation.

• Try to build the groups in one enumeration pass through the source sequence for optimal performance.

• Try to yield return a result grouping as soon as you can build one. Avoid building all groups up-front in the case that the user never iterates past the first group (just for performance, sometimes its unavoidable, but worth looking for an algorithm that performs best with deferred execution).

Summary

This chapter introduced how to build custom operators when required. LINQ to Object operators are simply extension methods for any type implementing the IEnumerable<T> interface.

The general guidance for building operators is to look at the implementation of Microsoft’s standard query operators and follow their patterns. This ensures that your operators are consistent in how they handle errors and how they are used when other developers begin using these extensions in query expressions.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.16.75.165