Chapter 8. Processing Massive Datasets with Parallel Streams – The Map and Collect Model

In Chapter 7, Processing Massive Datasets with Parallel Streams – The Map and Reduce Model, we introduced the concept of stream, the new Java 8 feature. A stream is a sequence of elements that can be processed in a parallel or sequential way. In this chapter, you will learn how to work with streams with the following topics:

  • The collect() method
  • The first example – searching data without indexing
  • The second example – a recommendation system
  • The third example – common contacts in a social network

Using streams to collect data

In Chapter 7, Processing Massive Datasets with Parallel Streams – The Map and Reduce Model, we made an introduction to streams. Let's remember their most important characteristics:

  • Streams' elements are not stored in the memory
  • Streams can't be reusable
  • Streams make a lazy processing of data
  • The stream operation cannot modify the stream source
  • Streams allow you to chain operations so the output of one operation is the input of the next one

A stream is formed by the following three main elements:

  • A source that generates stream elements
  • Zero or more intermediate operations that generate output as another stream
  • One terminal operation that generates a result that could be either a simple object, array, collection, map, or anything else

The Stream API provides different terminal operations, but there are two more significant operations for their flexibility and power. In Chapter 7, Processing Massive Datasets with Parallel Streams – The Map and Reduce Model, you learned how to use the reduce() method, and in this chapter, you will learn how to use the collect() method. Let's make an introduction to this method.

The collect() method

The collect() method allows you to transform and group the elements of the stream generating a new data structure with the final results of the stream. You can use up to three different data types: an input data type, the data type of the input elements that come from the stream, an intermediate data type used to store the elements while the collect() method is running, and an output data type returned by the collect() method.

There are two different versions of the collect() method. The first version accepts the following three functional parameters:

  • Supplier: This is a function that creates an object of the intermediate data type. If you use a sequential stream, this method will be called once. If you use a parallel stream, this method may be called many times and must produce a fresh object every time.
  • Accumulator: This function is called to process an input element and store it in the intermediate data structure.
  • Combiner: This function is called to merge two intermediate data structures into one. This function will be only called with parallel streams.

This version of the collect() method works with two different data types: the input data type of the elements that comes from the stream and the intermediate data type that will be used to store the intermediate elements and to return the final result.

The second version of the collect() method accepts an object that implements the Collector interface. You can implement this interface by yourself, but it's easier to use the Collector.of() static method. The arguments of this method are as follows:

  • Supplier: This function creates an object of the intermediate data type, and it works as seen earlier
  • Accumulator: This function is called to process an input element, transform it if necessary, and store it in the intermediate data structure
  • Combiner: This function is called to merge two intermediate data structures into one, and it works as seen earlier
  • Finisher: This function is called to transform the intermediate data structure into a final data structure if you need to make a final transformation or computation
  • Characteristics: You can use this final variable argument to indicate some characteristics of the collector you are creating

Actually, there's slight difference between the two versions. The three-param collect accepts a combiner, that is BiConsumer, and it must merge the second intermediate result into the first one. Unlike it, this combiner is BinaryOperator and should return the combiner. Therefore, it has the freedom to merge either the second inside the first or the first inside the second, or create a new intermediate result. There is another version of the of() method, which accepts the same arguments except the finisher; in this case, the finishing transformation is not performed.

Java provides you with some predefined collectors in the Collectors factory class. You can get those collectors using one of its static methods. Some of those methods are:

  • averagingDouble(), averagingInt(), and averagingLong(): This returns a collector that allows you to calculate the arithmetic mean of a double, int, or long function.
  • groupingBy(): This returns a collector that allows you to group the elements of a stream by an attribute of its objects generating a map where the keys are the values of the selected attribute and the values are a list of the objects that have a determined value.
  • groupingByConcurrent(): This is similar to the previous one except for two important differences. The first one is that it may work faster in the parallel but slower in the sequential mode than the groupingBy() method. The second and most important difference is that groupingByConcurrent() function is an unordered collector. The items in the lists are not guaranteed to be in the same order as in the stream. The groupingBy() collector on the other hand guarantees the ordering.
  • joining(): This returns a Collector factory class that concatenates the input elements into a string.
  • partitioningBy(): This returns a Collector factory class that makes a partition of the input elements based on the results of a predicate.
  • summarizingDouble(), summarizingInt(), and summarizingLong(): These return a Collector factory class that calculates summary statistics of the input elements.
  • toMap(): This returns a Collector factory class that allows you to transform input elements into a map based on two mapping functions.
  • toConcurrentMap(): This is similar to the previous one, but in a concurrent way. Without custom merger, toConcurrentMap() is just faster for parallel streams. As occurs with groupingByConcurrent(), this is an unordered collector too, whereas toMap() uses the encounter order to make the conversion.
  • toList():This returns a Collector factory class that stores the input elements into a list.
  • toCollection(): This method allows you to accumulate the input elements into a new Collection factory class (TreeSet, LinkedHashSet, and so on) in the encounter order. The method receives an implementation of the Supplier interface that creates the collection as a parameter.
  • maxBy() and minBy(): This returns a Collector factory class that produces the maximal and minimal element according to the comparator passed as a parameter.
  • toSet(): This returns a Collector that stores the input elements into a set.
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.145.63.62