Chapter 6. Working with Set Data

Goals of this chapter:

• Define the two options for working with set data using LINQ.

• Introduce the HashSet type and how it relates to LINQ.

• Introduce the LINQ standard query operators that relate to working with set data.

There are two ways of applying set-based functions over data sequences using LINQ. This chapter explores the merits of both options and explains when and why to use one method over another.

Introduction

Set operations compare the elements of collections (and in some cases, of a single collection) against each other to determine which elements overlap and which are unique.

Framework libraries for set operations were missing from .NET Framework 1.0, 2.0, and 3.0. The HashSet<T> type was introduced in .NET Framework 3.5, and this collection type solves most of the set problems developers commonly face. LINQ extended set function capability with specific operators, some of which overlap with HashSet functionality. It is important to understand the benefits of both strategies and when to choose one over the other. This section looks in detail at the two main choices:

• LINQ standard query operators

• HashSet<T> class from the System.Collections.Generic namespace

The decision of how to approach a set-based task depends on problem specifics, but in general the strengths of each can be described in the following ways:

Use HashSet and its operators when

• Duplicate items are not allowed in the collections.

• Modifying the original collection is desired. These operators make changes to the original collection.

Use LINQ Operators when

• Duplicate items are allowed in the collections.

• Returning a new IEnumerable<T> is desired rather than modifying the original collection.
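The trade-off above can be illustrated with a minimal sketch. The arrays and variable names are made up for illustration; only HashSet<T>.UnionWith and the LINQ Union operator come from the text.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    static void Main()
    {
        int[] first = { 1, 2, 2, 3 };   // sample data, note the duplicate 2
        int[] second = { 3, 4 };

        // HashSet: duplicates are dropped when the set is built, and
        // UnionWith changes the set it is called on.
        var set = new HashSet<int>(first);   // holds 1, 2, 3
        set.UnionWith(second);               // set now holds 1, 2, 3, 4
        Console.WriteLine(set.Count);        // 4

        // LINQ: the source array is left untouched and a new sequence is returned.
        IEnumerable<int> union = first.Union(second);
        Console.WriteLine(first.Length);     // 4 (still 1, 2, 2, 3)
        Console.WriteLine(union.Count());    // 4 (1, 2, 3, 4)
    }
}
```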

The LINQ Set Operators

LINQ to Objects has standard query operators for working on sets of elements within collections. These operators allow two different collections (containing the same types of elements) to be merged into a single collection using various methods.

The set operators all use deferred execution: nothing is evaluated when the operator is called, and each element is produced only as the resulting sequence is iterated. Each operator is detailed in this section, including its method signatures.
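A small sketch of what deferred execution means in practice, using Concat with made-up data. The element added after the query is defined still shows up, because the sources are only read when the query is iterated.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    static void Main()
    {
        var first = new List<int> { 1, 2 };
        var second = new List<int> { 3 };

        // Defining the query does not read either list yet.
        IEnumerable<int> query = first.Concat(second);

        // This change happens after the query was defined...
        first.Add(10);

        // ...but before it is iterated, so the new element is included.
        Console.WriteLine(string.Join(", ", query));   // 1, 2, 10, 3
    }
}
```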

Concat Operator

Concat combines the contents of two collections. It operates by looping over the first collection and yielding each element, then looping over the second collection and yielding each element. If returning duplicate elements is not the desired behavior, consider using the Union operator instead. An ArgumentNullException is thrown if either collection is null when this operator is called.

Concat has a single overload with the following method signature:

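public static IEnumerable<TSource> Concat<TSource>(this IEnumerable<TSource> first, IEnumerable<TSource> second);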

Listing 6-1 demonstrates the simplest use of the Concat operator and the subtle difference between Concat and Union. The Console output from this example is shown in Output 6-1.

Listing 6-1. Simple example showing the difference between Concat and Union—see Output 6-1

image

Output 6-1

image
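As a stand-in illustration of the Concat and Union difference described above (a minimal sketch with made-up data, not the book's Listing 6-1):

```csharp
using System;
using System.Linq;

class Program
{
    static void Main()
    {
        int[] first = { 1, 2, 3 };
        int[] second = { 3, 4, 5 };

        // Concat keeps every element from both sequences, duplicates included.
        Console.WriteLine(string.Join(", ", first.Concat(second)));  // 1, 2, 3, 3, 4, 5

        // Union returns each equal element only once.
        Console.WriteLine(first.Union(second).Count());              // 5
    }
}
```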

A useful application of the Concat operator when binding a sequence to a control is its ability to add an additional entry at the start or end as a placeholder. For example, to make the first entry in a bound sequence the text “—none chosen—”, the code in Listing 6-2 can be used, with the result shown in Figure 6-1.

Figure 6-1. The Concat operator is useful for adding prompt text to bound sequences.

image

Listing 6-2. Using the Concat operator to add values to a sequence—see Figure 6-1

image
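The placeholder technique can be sketched without any UI control. The names and the prompt string below are assumptions for illustration; in a real application the resulting sequence would be assigned to the control's data source.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    static void Main()
    {
        string[] names = { "Ann", "Bob", "Cy" };   // hypothetical bound data

        // Wrap the prompt in a one-element array and Concat the real data after it,
        // so the prompt becomes the first entry of the sequence.
        IEnumerable<string> items = new[] { "(none chosen)" }.Concat(names);

        foreach (string item in items)
        {
            Console.WriteLine(item);
        }
        // Prints "(none chosen)" followed by Ann, Bob, Cy, one per line.
    }
}
```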

Distinct Operator

The Distinct operator removes duplicate elements from a sequence using either the default EqualityComparer or a supplied EqualityComparer. It operates by iterating the source sequence and returning each element of equal value once, effectively skipping duplicates. An ArgumentNullException is thrown if the source collection is null when this operator is called.

The method signatures available for the Distinct operator are:

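public static IEnumerable<TSource> Distinct<TSource>(this IEnumerable<TSource> source);

public static IEnumerable<TSource> Distinct<TSource>(this IEnumerable<TSource> source, IEqualityComparer<TSource> comparer);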

Listing 6-3 demonstrates how to use the Distinct operator to remove duplicate entries from a collection. This example also demonstrates how to use the built-in string comparison types in order to perform various cultural case-sensitive and insensitive comparisons. The Console output from this example is shown in Output 6-2.

Listing 6-3. Example showing how to use the Distinct operator—this example also shows the various built-in string comparer statics—see Output 6-2

image

Output 6-2

image
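A minimal sketch of Distinct with and without a string comparer, using made-up names rather than the data from Listing 6-3:

```csharp
using System;
using System.Linq;

class Program
{
    static void Main()
    {
        string[] names = { "Peter", "Paul", "Mary", "peter", "paul", "mary", "Janet" };

        // Default comparer: strings are compared case-sensitively,
        // so "Peter" and "peter" count as different elements.
        Console.WriteLine(names.Distinct().Count());                                  // 7

        // Case-insensitive comparer: each name is returned once.
        Console.WriteLine(names.Distinct(StringComparer.OrdinalIgnoreCase).Count());  // 4
    }
}
```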

Except Operator

The Except operator produces the set difference between two sequences. It returns only the elements in the first sequence that don’t appear in the second sequence, using either the default EqualityComparer or a supplied EqualityComparer. It operates by first obtaining a distinct list of the elements in the second sequence, then iterating the first sequence and returning only the elements that do not appear in that distinct list. An ArgumentNullException is thrown if either collection is null when this operator is called.

The method signatures available for the Except operator are:

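public static IEnumerable<TSource> Except<TSource>(this IEnumerable<TSource> first, IEnumerable<TSource> second);

public static IEnumerable<TSource> Except<TSource>(this IEnumerable<TSource> first, IEnumerable<TSource> second, IEqualityComparer<TSource> comparer);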

Listing 6-4 shows the most basic example of using the Except operator. The Console output from this example is shown in Output 6-3.

Listing 6-4. The Except operator returns all elements in the first sequence, not in the second sequence—see Output 6-3

image

Output 6-3

image
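A short sketch of Except with made-up data. Note that the direction matters: swapping the two sequences gives a different result.

```csharp
using System;
using System.Linq;

class Program
{
    static void Main()
    {
        int[] first = { 1, 2, 3, 4 };
        int[] second = { 3, 5 };

        // Elements of first that do not appear in second.
        Console.WriteLine(string.Join(", ", first.Except(second)));  // 1, 2, 4

        // Reversing the operands gives the elements only in second.
        Console.WriteLine(string.Join(", ", second.Except(first)));  // 5
    }
}
```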

Intersect Operator

The Intersect operator produces a sequence of elements that appear in both collections. It operates by skipping any element in the first collection that cannot be found in the second collection using either the default EqualityComparer or a supplied EqualityComparer. An ArgumentNullException is thrown if either collection is null when this operator is called.

The method signatures available for the Intersect operator are:

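public static IEnumerable<TSource> Intersect<TSource>(this IEnumerable<TSource> first, IEnumerable<TSource> second);

public static IEnumerable<TSource> Intersect<TSource>(this IEnumerable<TSource> first, IEnumerable<TSource> second, IEqualityComparer<TSource> comparer);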

Listing 6-5 shows the most basic use of the Intersect operator. The Console output from this example is shown in Output 6-4.

Listing 6-5. Intersect operator example—see Output 6-4

image

Output 6-4

image
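A comparable sketch for Intersect, again with made-up data:

```csharp
using System;
using System.Linq;

class Program
{
    static void Main()
    {
        int[] first = { 1, 2, 3, 4 };
        int[] second = { 3, 4, 5, 4 };

        // Only elements found in both sequences are returned, each one once.
        var common = first.Intersect(second);

        Console.WriteLine(common.Count());            // 2
        Console.WriteLine(common.Contains(3));        // True
        Console.WriteLine(common.Contains(5));        // False
    }
}
```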

Union Operator

The Union operator returns the distinct elements from both collections. The result is similar to the Concat operator, except the Union operator will only return an equal element once, rather than the number of times that element appears in both collections. Duplicate elements are determined using either the default EqualityComparer or a supplied EqualityComparer. An ArgumentNullException is thrown if either collection is null when this operator is called.

The method signatures available for the Union operator are:

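public static IEnumerable<TSource> Union<TSource>(this IEnumerable<TSource> first, IEnumerable<TSource> second);

public static IEnumerable<TSource> Union<TSource>(this IEnumerable<TSource> first, IEnumerable<TSource> second, IEqualityComparer<TSource> comparer);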

Listing 6-1 demonstrated the subtle difference between the Union and Concat operators. Use the Union operator when you want each unique element returned only once (duplicates removed) and Concat when you want every element from both sequences.

Listing 6-6 demonstrates a useful technique of combining data from multiple source types by unioning (or concatenating, excepting, intersecting, or distincting for that matter) data from either a collection of Contact elements or CallLog elements based on a user’s partial input. This feature is similar to the incremental lookup offered by many smartphones, in which the user inputs either a name or phone number, and a drop-down displays recent numbers and searches the contacts held in storage for likely candidates. This technique works because of how .NET manages equality for projected anonymous types: two instances are equal when all of their property values are equal. The key to making it work as expected is to ensure that the projected fields in each anonymous type are identical in name, case, type, and order. If these conditions are satisfied, the anonymous types can be operated on by any of the set-based operators.

Listing 6-6 uses the sample data of Contact and CallLog types introduced earlier in this book in Table 2-1 and Table 2-2 with sample partial user-entered data of Ka and 7. The Console output from this example is shown in Output 6-5.

Listing 6-6. Anonymous types with the same members can be unioned and concatenated—see Output 6-5

image

Output 6-5

image
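The anonymous-type technique can be sketched as follows. The source data, property names, and the partial input "7" are invented stand-ins, not the Contact and CallLog sample data from Tables 2-1 and 2-2. Both projections use the same member names (Name, Number) in the same order, which is what allows Union to treat them as one type.

```csharp
using System;
using System.Linq;

class Program
{
    static void Main()
    {
        // Hypothetical stand-ins for the contact and call-log collections.
        var contacts = new[]
        {
            new { FullName = "Kate Smith", Phone = "555 0101" },
            new { FullName = "Bob Jones",  Phone = "555 0177" },
        };
        var calls = new[]
        {
            new { Dialed = "555 0177", Who = "Bob Jones" },
            new { Dialed = "555 0199", Who = "Unknown" },
        };

        string input = "7";   // partial user input

        // Project both sources into anonymous types with identical members.
        var fromContacts = contacts
            .Where(c => c.FullName.StartsWith(input) || c.Phone.Contains(input))
            .Select(c => new { Name = c.FullName, Number = c.Phone });

        var fromCalls = calls
            .Where(l => l.Dialed.Contains(input))
            .Select(l => new { Name = l.Who, Number = l.Dialed });

        // Bob Jones matches in both sources; Union returns him once,
        // while Concat returns both copies.
        Console.WriteLine(fromContacts.Union(fromCalls).Count());   // 1
        Console.WriteLine(fromContacts.Concat(fromCalls).Count());  // 2
    }
}
```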

Custom EqualityComparers When Using LINQ Set Operators

LINQ’s set operators rely on instances of EqualityComparer<T> to determine if two elements are equal. When no equality comparer is specified, the default equality comparer is used for the element type by calling the static property Default on the generic EqualityComparer type. For example, the following two statements are identical for the Distinct operator (and all of the set operators):

image
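A minimal sketch of that equivalence, using a made-up string array: calling Distinct with no comparer and calling it with EqualityComparer<string>.Default produce the same sequence.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    static void Main()
    {
        string[] names = { "a", "A", "a" };

        var implicitDefault = names.Distinct();
        var explicitDefault = names.Distinct(EqualityComparer<string>.Default);

        // Both calls use the same comparer, so the results match element for element.
        Console.WriteLine(implicitDefault.SequenceEqual(explicitDefault));  // True
        Console.WriteLine(implicitDefault.Count());                         // 2
    }
}
```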

For programming situations where more control is needed for assessing equality, a custom comparer can be written, or one of the built-in string comparisons can be used.

Built-in String Comparers

Listing 6-3 introduced an example that showed case-insensitive matching of strings using the Distinct operator. It simply passed in a static instance of a built-in comparer type using the following code:

image

In addition to the string comparer used in this example, there are a number of others that can be used for a particular circumstance. Table 6-1 lists the available built-in static properties that can be called on the StringComparer type to get an instance of that comparer.

Table 6-1. Built-in String Comparers

image
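The static properties exposed by StringComparer are CurrentCulture, CurrentCultureIgnoreCase, InvariantCulture, InvariantCultureIgnoreCase, Ordinal, and OrdinalIgnoreCase. The sketch below, with a made-up word list, runs Distinct once per comparer. For this particular data the case-sensitive comparers should report 3 and the case-insensitive ones 1, although culture-sensitive comparers can behave differently for other text.

```csharp
using System;
using System.Linq;

class Program
{
    static void Main()
    {
        string[] words = { "apple", "APPLE", "Apple" };

        // Each built-in comparer paired with its property name for display.
        var comparers = new (string Name, StringComparer Comparer)[]
        {
            ("CurrentCulture",             StringComparer.CurrentCulture),
            ("CurrentCultureIgnoreCase",   StringComparer.CurrentCultureIgnoreCase),
            ("InvariantCulture",           StringComparer.InvariantCulture),
            ("InvariantCultureIgnoreCase", StringComparer.InvariantCultureIgnoreCase),
            ("Ordinal",                    StringComparer.Ordinal),
            ("OrdinalIgnoreCase",          StringComparer.OrdinalIgnoreCase),
        };

        foreach (var (name, comparer) in comparers)
        {
            // Count how many elements survive Distinct under this comparer.
            Console.WriteLine($"{name}: {words.Distinct(comparer).Count()}");
        }
    }
}
```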

Building and Using a Custom EqualityComparer Type

In Chapter 4, in the “Specifying Your Own Key Comparison Function” section, you first saw the ability to customize how LINQ evaluates equality between objects by writing a custom equality comparer type. As an example we wrote a custom comparison type that resolved equality based on the age-old phonetic comparison algorithm, Soundex. The code for the SoundexEqualityComparer is shown in Listing 4-5, and in addition to being useful for the grouping extension methods, the same equality comparer can be used with the LINQ set operators. For example, Listing 6-7 shows how to use the Soundex algorithm to determine how many distinct phonetic names are present in a list of names. The code writes the following text to the Console window: Number of unique phonetic names = 4.

Listing 6-7. Using a custom equality comparer with the Distinct operator

image
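The shape of a custom comparer can be sketched with a deliberately simple rule. The FirstLetterComparer below is a hypothetical stand-in that treats two strings as equal when they start with the same letter, ignoring case. It is not the SoundexEqualityComparer from Listing 4-5, but it plugs into Distinct in the same way.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical comparer: strings are equal when their first letters match (ignoring case).
sealed class FirstLetterComparer : IEqualityComparer<string>
{
    public bool Equals(string x, string y)
    {
        if (x == null || y == null) return x == y;
        if (x.Length == 0 || y.Length == 0) return x.Length == y.Length;
        return char.ToUpperInvariant(x[0]) == char.ToUpperInvariant(y[0]);
    }

    // Equal strings must produce equal hash codes, so hash only the first letter.
    public int GetHashCode(string s)
    {
        return string.IsNullOrEmpty(s) ? 0 : char.ToUpperInvariant(s[0]).GetHashCode();
    }
}

class Program
{
    static void Main()
    {
        string[] names = { "Kim", "karen", "Bob", "Barbara", "Zoe" };

        int count = names.Distinct(new FirstLetterComparer()).Count();
        Console.WriteLine("Number of unique first letters = " + count);  // 3
    }
}
```

The important design point is that Equals and GetHashCode agree: the set operators use the hash code to find candidate matches before calling Equals.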

The HashSet<T> Class

HashSet<T> was introduced in .NET Framework 3.5 as part of the System.Collections.Generic namespace. HashSet is an unordered collection containing unique elements and provides a set of standard set operators such as intersection and union (plus many more). It has the standard collection operations Add (although this method returns a Boolean: true if the element was added, false if it already existed in the collection), Remove, and Contains, but because it uses a hash-based implementation, these operations locate an element directly rather than looping over the entire collection, as a lookup in List<T> does, for example (O(1) rather than O(n)).
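A brief sketch of the basic HashSet<T> operations, showing the Boolean returned by Add:

```csharp
using System;
using System.Collections.Generic;

class Program
{
    static void Main()
    {
        var set = new HashSet<string>();

        Console.WriteLine(set.Add("a"));       // True  (element was added)
        Console.WriteLine(set.Add("a"));       // False (already present, set unchanged)
        Console.WriteLine(set.Contains("a"));  // True
        Console.WriteLine(set.Remove("a"));    // True  (element was found and removed)
        Console.WriteLine(set.Count);          // 0
    }
}
```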

Although some of the operators on HashSet appear to overlap with the LINQ set operators Intersect and Union, the HashSet implementations modify the set they are called on rather than returning a new IEnumerable<T> collection, as the LINQ operators do. For this reason, the HashSet methods are named IntersectWith and UnionWith, and the LINQ operators remain available on a HashSet collection as Intersect and Union. This naming distinction avoids naming clashes and also allows the desired behavior to be chosen in a specific case.

HashSet implements the IEnumerable<T> pattern and therefore can be used in a LINQ query. Listing 6-8 demonstrates a LINQ query to find even numbers in a HashSet made by unioning two collections. The Console output from this example is shown in Output 6-6.

Listing 6-8. LINQ query over a HashSet collection—see Output 6-6

image

Output 6-6

image
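A minimal sketch of querying a HashSet with LINQ, using made-up numbers rather than the data from Listing 6-8. Because HashSet is unordered, the query sorts its results so the output order is predictable.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    static void Main()
    {
        // Build the set by unioning two collections; UnionWith modifies the set in place.
        var set = new HashSet<int>(new[] { 1, 2, 3, 4 });
        set.UnionWith(new[] { 4, 5, 6 });

        // HashSet<T> implements IEnumerable<T>, so query syntax works directly.
        var evens = from n in set
                    where n % 2 == 0
                    orderby n
                    select n;

        Console.WriteLine(string.Join(", ", evens));  // 2, 4, 6
    }
}
```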

The differences between the HashSet and LINQ operator support are listed here (as documented in the Visual Studio documentation), although LINQ-equivalent approximations are easy to construct as documented in Table 6-2 and implemented in Listing 6-9.

Table 6-2. Comparison of the LINQ Set Operators and the HashSet Type Methods

image

image

image

image

image

Listing 6-9. Approximate LINQ implementations of the operators in the HashSet type

image

image

image

image
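A few of these approximations can be sketched as extension methods. The method names below (IsSubsetOfLinq and so on) are invented for illustration and are not the book's Listing 6-9. Each one returns a result rather than modifying its source, and each is checked against the corresponding HashSet<T> method.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical LINQ approximations of some HashSet<T> methods.
static class SetApproximations
{
    // Every element of a also appears in b.
    public static bool IsSubsetOfLinq<T>(this IEnumerable<T> a, IEnumerable<T> b)
        => !a.Except(b).Any();

    // At least one element is shared.
    public static bool OverlapsLinq<T>(this IEnumerable<T> a, IEnumerable<T> b)
        => a.Intersect(b).Any();

    // Both sequences contain the same distinct elements.
    public static bool SetEqualsLinq<T>(this IEnumerable<T> a, IEnumerable<T> b)
        => !a.Except(b).Any() && !b.Except(a).Any();

    // Elements that appear in exactly one of the two sequences.
    public static IEnumerable<T> SymmetricExceptLinq<T>(this IEnumerable<T> a, IEnumerable<T> b)
        => a.Except(b).Union(b.Except(a));
}

class Program
{
    static void Main()
    {
        int[] a = { 1, 2, 3 };
        int[] b = { 1, 2, 3, 4, 5 };
        var set = new HashSet<int>(a);

        Console.WriteLine(a.IsSubsetOfLinq(b) == set.IsSubsetOf(b));  // True
        Console.WriteLine(a.OverlapsLinq(b) == set.Overlaps(b));      // True
        Console.WriteLine(a.SetEqualsLinq(b) == set.SetEquals(b));    // True

        set.SymmetricExceptWith(b);                                   // set now holds 4, 5
        Console.WriteLine(set.SetEquals(a.SymmetricExceptLinq(b)));   // True
    }
}
```

These versions enumerate their inputs more than once, so they are approximations in behavior, not in performance.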

Summary

Working with set data using LINQ is no harder than choosing the correct standard query operator. The existing HashSet<T> collection type can also be used, and the decision on which set approach suits your problem boils down to the following factors:

Use HashSet and its operators when

• Duplicate items are not allowed in the collections.

• Modifying the original collection is desired. These operators make changes to the original collection.

Use LINQ Operators when

• Duplicate items are allowed in the collections.

• Returning a new IEnumerable<T> is desired instead of modifying the original collection.

Having explained all of the built-in standard query operators in this and previous chapters, the next chapter looks at how to extend LINQ by building custom operators that integrate into the language just like the Microsoft-supplied operators.
