Chapter 13. Programming for Multiple Cores

WHAT YOU WILL LEARN IN THIS CHAPTER:

  • The principal features of the Parallel Patterns Library

  • How to execute loop iterations in parallel

  • How to execute two or more tasks concurrently

  • How to manage shared resources in parallel operations

  • How to indicate and manage sections in your parallel code that must not be executed by more than one thread at a time

If your PC or laptop is relatively new, in all probability it has a processor chip with two or more cores. If it doesn't, you can skip this chapter and move on to the next chapter. Each core in a multi-core processor chip is an independent processing unit, so your computer can be executing code from more than one application at any given time. However, it is not limited to that. Having multiple cores or multiple processors provides you with the possibility of overlapping computations within a single application by having each processor handle a separate piece of the required calculations in parallel. Because at least some of the calculations in your application are overlapped, your application should execute faster.

This chapter will show you how you can use multiple cores within a single application. You will also get to use what you learned about the Windows API in the previous chapter in a significant parallel computation.

PARALLEL PROCESSING BASICS

Programming an application to use multiple cores or processors requires that you organize the computation differently from what you have been used to with serial processing. You must be able to divide the computation into a set of sub-tasks that can be executed independently of one another, and this is not always easy, or indeed possible. Sub-tasks that can run in parallel are usually referred to as threads.

Any interdependence between threads places serious constraints on parallel execution. Shared resources are a common limitation. Sub-tasks executing in parallel that access common resources, at best, can slow execution — potentially slower than if you programmed for serial execution — and, at worst, can produce incorrect results or cause the program to hang. A simple example is a shared variable that is accessed and updated by two or more threads. If one thread stores a result in the variable after another thread has already retrieved the value for updating, the first thread's result will be overwritten when the second thread stores its updated value, so the final result will be incorrect. The same kind of problem arises with updating files.

So, what are the typical characteristics of applications that may benefit from using more than one processor? First, these applications will involve operations that require a substantial amount of processor time, measured in seconds rather than milliseconds. Second, the heavy computation will involve a loop of some kind. Third, the computation will involve operations that can be divided into discrete but significant units of calculation that can be executed independently of one another.

INTRODUCING THE PARALLEL PATTERNS LIBRARY

The Parallel Patterns Library (PPL) that is defined in the ppl.h header provides you with tools for implementing parallel processing in your applications, and these tools are defined within the Concurrency namespace. The PPL defines three kinds of facilities for parallel processing:

  • Templates for algorithms for parallel operations

  • A class template for managing shared resources

  • Class templates for managing and grouping parallel tasks

In general, the PPL operates on a unit of work or task that is defined by a function object or, more typically, a lambda expression. You don't have to worry about creating threads of execution or allocating work to particular processors because in general the PPL takes care of this automatically. What you do have to do is to make sure your code is organized appropriately for parallel execution.

ALGORITHMS FOR PARALLEL PROCESSING

The PPL provides three algorithms for initiating parallel execution on multiple cores:

  • The parallel_for algorithm is the equivalent of a for loop that executes loop iterations in parallel.

  • The parallel_for_each algorithm executes repeated operations on an STL container in parallel.

  • The parallel_invoke algorithm executes a set of two or more independent tasks in parallel.

These are defined as templates, but I won't go into the details of the definitions of the templates themselves. You can take a look at the ppl.h header file, if you are interested. Let's take a closer look at how you use each of them.

Using the parallel_for Algorithm

The parallel_for algorithm executes parallel loop iterations on a single task specified by a function object or lambda expression. The loop iterations are controlled by an integer index that is incremented on each iteration until a given limit is reached, just like a regular for loop. The function object or lambda expression that does the work must have a single parameter; the current index value will be passed as the argument corresponding to this parameter on each iteration. Because of the parallel nature of the operation, the iterations will not be executed in any particular order.

There are two versions of the parallel_for algorithm. The first has the form:

parallel_for(start_value, end_value, function_object);

This executes the task specified by the third argument, with index values running from start_value to end_value-1, and the default increment of 1. The first two arguments must be of the same integer type. If they are not, you will get a compile-time error. As I said, because iterations are executed in parallel, they will not be executed in sequence. This implies that each execution of the function object or lambda expression must be independent of the others, and the result must not depend on executing the task in a particular sequence.

Here's a fragment that shows how the parallel_for algorithm works:

const size_t count(100000);
long long squares[count];
Concurrency::parallel_for(static_cast<size_t>(0), count, [&squares](size_t n)
                                     { squares[n] = (n+1)*(n+1); });

The computational task for each loop iteration is defined by the lambda expression, which is the third argument. The capture clause just accesses the squares array by reference. The function computes the square of n+1, where n is the current index value that is passed to the lambda, and stores the result in the nth element of the squares array. This will be executed for values of n, from 0 to count-1, as specified by the first two arguments. Note the cast for the first argument. Because count is of type size_t, the code will not compile without it. Because 0 is a literal of type int, without the cast, the compiler will not be able to decide whether to use type int or type size_t as a type parameter for the parallel_for algorithm template.

Although this fragment shows how the parallel_for algorithm can be applied, the increase in performance in this case may not be great, and indeed, you may find that things run more slowly than serial operations. The computation on each iteration in the example is very short, and inevitably there is some overhead in allocating the task to a processor on each iteration. The overhead may severely erode any reduction in execution time in computations that are relatively short. You should also not be misled by this simple example into thinking that implementing the use of parallel processing is trivial, and that you can just replace a for loop by a parallel_for algorithm. As you will see later in this chapter, all sorts of complications can arise in a more realistic situation.
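
You can check whether a particular loop is worth parallelizing by timing both versions. Here is a minimal sketch of my own that uses the Windows API high-resolution timer functions to compare a serial loop with the equivalent parallel_for; the element count and the work done per element are arbitrary choices for illustration, so whether the parallel version wins will depend on your machine — which is exactly the point.

#include <windows.h>     // For ::QueryPerformanceFrequency() and ::QueryPerformanceCounter()
#include <ppl.h>         // For Concurrency::parallel_for
#include <cmath>         // For std::sqrt()
#include <vector>
#include <iostream>

int main()
{
  const size_t count(5000000);                     // Arbitrary number of elements
  std::vector<double> results(count);

  LARGE_INTEGER frequency, start, end;
  ::QueryPerformanceFrequency(&frequency);         // Timer counts per second

  ::QueryPerformanceCounter(&start);               // Time the serial loop
  for(size_t n = 0 ; n < count ; ++n)
  {
    double x(static_cast<double>(n + 1));
    results[n] = std::sqrt(x*x*x);
  }
  ::QueryPerformanceCounter(&end);
  std::cout << "Serial loop:  "
            << static_cast<double>(end.QuadPart - start.QuadPart)/frequency.QuadPart
            << " seconds" << std::endl;

  ::QueryPerformanceCounter(&start);               // Time the same work with parallel_for
  Concurrency::parallel_for(static_cast<size_t>(0), count, [&results](size_t n)
    {
      double x(static_cast<double>(n + 1));
      results[n] = std::sqrt(x*x*x);
    });
  ::QueryPerformanceCounter(&end);
  std::cout << "parallel_for: "
            << static_cast<double>(end.QuadPart - start.QuadPart)/frequency.QuadPart
            << " seconds" << std::endl;
  return 0;
}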

Although with a parallel_for algorithm the index must always be ascending, you can arrange for descending index values inside the function that is being executed:

const size_t count(100000);
long long squares[count];
Concurrency::parallel_for(static_cast<size_t>(0), count,
                      [&squares, &count](size_t n)
                      {  squares[count-n-1] = (n+1)*(n+1);  });

Here, the squares of the index values passed to the lambda will be stored in descending sequence in the array.

The second version of the parallel_for algorithm has the following form:

parallel_for(start_value, end_value, increment, function_object);

Here, the third argument specifies the increment to use in stepping from start_value up to, but not including, end_value, so this provides for steps greater than 1. The increment must be positive because only ascending index values are allowed in a parallel_for algorithm. The first three arguments must all be of the same integer type. Here's a fragment using this form of the parallel_for algorithm:

size_t start(1), end(300001), step(3);
vector<double> cubes(end/step + 1);
Concurrency::parallel_for(start, end, step, [&cubes](size_t n)
                {  cubes[(n-1)/3] = static_cast<double>(n)*n*n;   });

This code fragment computes the cubes of integers starting at 1 and increasing in steps of 3 for all values below 300001, and stores the values in the cubes vector. Thus, the cubes of 1, 4, 7, 10, and so on, up to 299998, will be stored. Of course, because this is a parallel operation, the values will not necessarily be computed and stored in sequence.

Note that operations on STL containers are not thread-safe, so you must take great care when using them in parallel operations. Here, it is necessary to pre-allocate sufficient space in the vector to store the results of the operation. If you were to use the push_back() function to store values, concurrent calls would corrupt the vector and the code would be liable to crash.

Because parallel_for is a template, you need to take care to ensure that the start_value, end_value, and increment arguments are of the same type. It is easy to get it wrong if you mix literals with variables. For example, the following will not compile:

size_t end(300001);
vector<double> cubes(end/3 + 1);
Concurrency::parallel_for(1, end, 3, [&cubes](size_t n)
                {  cubes.push_back (static_cast<double>(n)*n*n);   });

This doesn't compile because the literals that are the first and third arguments are of type int, whereas end is of type size_t. The effect is that no compatible template instance can be found by the compiler. (Even if the types matched, calling push_back() here would run into the thread-safety problem described above.)
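
A version that does compile keeps all three control arguments as type size_t and stores each result by index in the pre-sized vector rather than calling push_back(); casting the literals is one way to achieve this:

size_t end(300001);
vector<double> cubes(end/3 + 1);
Concurrency::parallel_for(static_cast<size_t>(1), end, static_cast<size_t>(3),
                [&cubes](size_t n)
                {  cubes[(n-1)/3] = static_cast<double>(n)*n*n;   });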

Using the parallel_for_each Algorithm

The parallel_for_each algorithm works in a similar way to parallel_for, but the function operates on an STL container. The number of iterations to be carried out is determined by a pair of iterators. This algorithm is of the form:

Concurrency::parallel_for_each(start_iterator, end_iterator, function_object);

The function object specified by the third argument must have a single parameter of the type stored in the container that the iterators will access. This will enable you to access the object in the container that is referred to by the current iterator. Iterations run from start_iterator to end_iterator-1.

Here's a code fragment that illustrates how this works:

std::array<string, 6> people = {"Mel Gibson", "John Wayne", "Charles Chaplin",
                                "Clint Eastwood", "Jack Nicholson",  "George Clooney"};
Concurrency::parallel_for_each(people.begin(), people.end(), [](string& person)
{
  size_t space(person.find(' '));
  std::string first(person.substr(0, space));
  std::string second(person.substr(space + 1, person.length() - space - 1));
  person = second + ' ' + first;
});
std::for_each(people.begin(), people.end(), [](std::string& person)
{std::cout << person << std::endl; });

The parallel_for_each algorithm operates on the people array. Obviously, a real application would have more than 6 elements. The third argument specifies the function that does the work on each iteration. The parameter is a reference to type string because the elements in people are of type string. The function reverses the order of the names in each string. The last statement uses the STL for_each algorithm to list the contents of the people array. You can see that the PPL parallel_for_each algorithm has essentially the same syntax as the STL for_each algorithm.

Note that none of the parallel algorithms provide for premature termination of the operation. With a for loop, for example, you can use a break statement to terminate the loop, but this does not apply with the parallel algorithms. The only way to terminate the operation of a parallel algorithm before all concurrent operations have been completed is to throw an exception from within the function that defines the work to be done. This will stop all parallel operations immediately. Here's an example of how you can do that:

int findIndex(const std::array<string, 6>& people, const std::string& name)
{
  try
  {
    Concurrency::parallel_for(static_cast<size_t>(0), people.size(), [=](size_t n)
    {
      size_t space(people[n].find(' '));
      if(people[n].substr(0, space) == name)
        throw n;
    });
  } catch(size_t index) { return index;  }
  return -1;
}

This function searches the array from the previous fragment to find a particular name string. The parallel_for compares the name, up to the first space in the current people array element, with the second argument to the findIndex() function. If it's a match, the lambda expression throws an exception that will be the current index to the people array. This will be caught in the catch clause in the findIndex() function, and the value that was thrown will be returned. Throwing the exception terminates all current activity with the parallel_for operation. The findIndex() function returns −1 if the parallel_for operation terminates with no exception having been thrown, because this indicates that name was not found. Let's see if it works.
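
Here is how you might call it, assuming the people array still holds the names in their original order; the search name and the output statements are purely illustrative:

std::array<string, 6> people = {"Mel Gibson", "John Wayne", "Charles Chaplin",
                                "Clint Eastwood", "Jack Nicholson",  "George Clooney"};
int index(findIndex(people, "Clint"));          // Search for a first name
if(index >= 0)
  std::cout << "Found " << people[index] << " at index " << index << std::endl;
else
  std::cout << "Name not found." << std::endl;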

Using the parallel_invoke Algorithm

The parallel_invoke algorithm is defined by a template that enables you to execute up to 10 tasks in parallel. Obviously, there is no point in attempting to execute more tasks in parallel than you have processors on your computer. You specify each task to be executed by a function object, which, of course, can be a lambda expression, so the arguments to the parallel_invoke algorithm are the function objects specifying the tasks to be executed in parallel. Here's an example:

std::array<double, 500> squares;
std::array<double, 500> values;
Concurrency::parallel_invoke(
  [&squares]
{  for(size_t i = 0 ; i<squares.size() ; ++i)
      squares[i] = (i+1)*(i+1);
},
  [&values]
{  for(size_t i = 0 ; i<values.size() ; ++i)
     values[i] = std::sqrt(static_cast<double>((i+1)*(i+1)*(i+1)));
});

The arguments to the parallel_invoke algorithm are two different lambda expressions. The first stores the squares of successive integers in the squares array, and the second stores the square root of the cube of successive integers in the values array. These lambdas are completely independent of one another and therefore can be executed in parallel.

The examples and code fragments that I have used up to now have not been computations that really take advantage of parallel execution. Let's look at something that does.

A REAL PARALLEL PROBLEM

To explore the practicalities of parallel processing, and to see the advantages, we need a computation that is chunky enough to warrant parallel processing. We also need something we can use to explore some of the pitfalls and problems that can arise with parallel operations.

The problem I have chosen does not involve a lot of code but does involve doing some calculations with complex numbers. In case you have never heard of complex numbers or maybe you did know something about them at one time but have since forgotten, I'll first give you a brief explanation of what they are and how operations on them work — just enough to understand what the calculations are about. Then we will get into the code. Take my word for it: it's not difficult, and the results are interesting.

A complex number is a number of the form a + bi, where i is the square root of −1. Thus, 1 + 2i and 0.5 − 1.2i are examples of complex numbers. The a is referred to as the real part of the complex number, and the b that is multiplied by i is referred to as the imaginary part. Complex numbers are often represented in the complex plane, as Figure 13-1 illustrates.

FIGURE 13-1

To add two complex numbers, you just add the real parts and the imaginary parts separately to get the real and imaginary parts for the result. Thus, adding 1 + 2i and 0.5 − 1.2i produces (1 + 0.5) + (2 − 1.2)i, so the result is 1.5 + 0.8i.

Multiplying two complex numbers a + bi and c + di works like this:

a(c + di) + bi(c + di)

Remembering that i is the square root of −1, so i² is −1, you can expand the expression above to:

ac + adi + bci − bd

Thus, the final result is the complex number:

(ac − bd) + (ad + bc)i

Multiplying the two example numbers that I introduced at the beginning will therefore result in the number (0.5 + 2.4) + (−1.2 + 1.0)i, which is 2.9 − 0.2i. Multiplying a number a + bi by itself results in

(a² − b²) + 2abi

Now let's consider what the problem is about. The Mandelbrot set is a very interesting region of the complex plane that is defined by iterating the equation:

Zₙ₊₁ = Zₙ² + c   where Z₀ = c

for all points c over an area of the complex plane. By iterating, I mean that for any point c in the complex plane, you set Z₀ to c and use the equation to calculate Z₁ as c² + c. You then use Z₁ to calculate Z₂ using the equation, then Z₂ to calculate Z₃, and so on, repeating the operation for some large number of iterations. So how does this tell you which points are in the Mandelbrot set? If the distance from the origin (shown in Figure 13-1) of any Zₙ exceeds 2, then continuing the iterations will result in the point moving further and further away from the origin towards infinity, so you don't need to iterate it further. All points that escape like this are not in the Mandelbrot set. Any point that does not diverge in this way after a given large number of iterations belongs to the Mandelbrot set. Any point that starts out more than a distance 2 from the origin will not belong to the Mandelbrot set, so this gives us an idea of the area in the complex plane in which to look.

Here's a function that iterates the equation for the Mandelbrot set:

size_t IteratePoint(double zReal, double zImaginary, double cReal, double cImaginary)
{
  double zReal2(0.0), zImaginary2(0.0); // z components squared
  size_t n(0);
  for( ; n < MaxIterations ; ++n)      // Iterate equation Zn = Zn-1**2 + K
  {                                    // for Mandelbrot K = c and Z0 = c
    zReal2 = zReal*zReal;
    zImaginary2 = zImaginary*zImaginary;
    if(zReal2 + zImaginary2 > 4)        // If distance from origin > 2, point will
      break;                            // go to infinity so we are done
// Calculate next Z value
    zImaginary = 2*zReal*zImaginary + cImaginary;
    zReal = zReal2 - zImaginary2 + cReal;
  }
  return n;
}

The IteratePoint() function iterates the equation Zₙ₊₁ = Zₙ² + c, with the starting point, Z₀, specified by the first two parameters and the complex constant c specified by the last two parameters. MaxIterations is the maximum number of times to plug Z back into the equation, and it needs to be defined as a global constant. The for loop calculates a new Z, and if its distance from the origin is greater than 2, the loop is terminated and the current value of n is returned. The if expression actually compares the square of the distance from the origin to 4, to avoid having to execute the computationally expensive square root function. If the loop ends normally, then n will have the value MaxIterations, and this value will be returned.

So, maybe the for loop in the IteratePoint() function is a candidate for parallel operations? No, I'm afraid not. Each loop iteration requires the Z from the previous iteration to be available, so this is essentially a serial operation. However, there are other possibilities. To figure out the extent of the Mandelbrot set, we will have to call this function for every point in a region of the complex plane. In that context, we will have lots of potential for parallel computation.

In order to visualize the Mandelbrot set, we will display it as an image in the client area of a window; we will treat the pixels in the client area as though they are points in the complex plane. This implies that we will need a loop to test every pixel in the client area to see whether it's in the Mandelbrot set. In broad terms, the loop will look like this:

for(int y = 0 ; y < imageHeight ; ++y)             // Iterate over image rows
{
  for(int x = 0 ; x < imageWidth ; ++x)            // Iterate over pixels in a row
  {
    // Iterate current point x+yi...
  }
  // Display the yth row...
}

Each pixel in the image will correspond to a particular point in a region of the complex plane. The pixel coordinates run from 0 to imageWidth along a row and from 0 to imageHeight down the y-axis of the client area. We need to map the pixels' coordinates to the area of the complex plane we are interested in. A good region to explore to find the Mandelbrot set is the area with real parts — which will be the x-axis in the window — from −2.1 to +2.1 and imaginary parts — the y-axis in the window — from −1.3 to +1.3. However, the client area of a window may not have the same aspect ratio as this, so ideally, we should map the pixels to the complex plane so that we maintain the same scale on both axes. This will prevent the shape of the Mandelbrot set from being squashed or elongated. To do this, we can calculate factors to convert from pixel coordinates to the complex plane as follows:

// Client area axes
const double realMin(-2.1);       // Minimum real value
double imaginaryMin(-1.3);        // Minimum imaginary value
double imaginaryMax(+1.3);        // Maximum imaginary value
// Set maximum real so axes are the same scale
double realMax(realMin+(imaginaryMax-imaginaryMin)*imageWidth/imageHeight);

// Get scale factors to convert pixel coordinates
double realScale((realMax-realMin)/(imageWidth-1));
double imaginaryScale((imaginaryMax-imaginaryMin)/(imageHeight-1));

The realMax value is the maximum real value, and this is calculated so that the real and imaginary axes have the same scale. The realScale and imaginaryScale variables hold factors that we can use to multiply the x and y pixel coordinates, respectively, to convert them to complex plane values.
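
To show how the scale factors and IteratePoint() fit together, here is a sketch of the work for a single row of pixels. The iterationCounts vector is just a placeholder I have introduced for whatever your drawing code does with the results, y and imageWidth are assumed to come from the enclosing loops, and mapping y = 0 to imaginaryMax is one common way to put the top of the image at the top of the region:

// Sketch: iterate every pixel in row y of the image
std::vector<size_t> iterationCounts(imageWidth);      // Placeholder for the row's results
double cImaginary(imaginaryMax - y*imaginaryScale);   // Imaginary part for this row
for(int x = 0 ; x < imageWidth ; ++x)
{
  double cReal(realMin + x*realScale);                // Real part for this pixel
  // For the Mandelbrot set, Z0 = c, so pass the same values for Z and c
  iterationCounts[x] = IteratePoint(cReal, cImaginary, cReal, cImaginary);
  // A count of MaxIterations means the point belongs to the Mandelbrot set
}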

Let's put a working program together.

CRITICAL SECTIONS

A critical section is a contiguous sequence of statements that cannot be executed by more than one task at any given time. As you saw with the last example, concurrent execution of code that updates a device context will cause problems because a device context is not designed to be used in this way. Critical sections can arise in other contexts, too.

The PPL defines the critical_section class that you can use to identify sequences of statements that must not be executed in parallel.

Using critical_section Objects

There is only one constructor for the critical_section class, so creating an object is very simple:

Concurrency::critical_section cs;

This creates the object cs. A critical_section object is a mutex. The name mutex is a contraction of mutual exclusion object. A mutex allows multiple threads of execution to access a given resource, but only one at a time. It provides a mechanism for the thread to lock the resource that it is currently using. All other threads then must wait until the resource is unlocked before another thread can access it.

Locking and Unlocking a Section of Code

Once you have created a critical_section object, you can apply a lock to a section of code by calling the lock() member function as follows:

cs.lock();                   // Lock the statements that follow

The statements that follow this call to the lock() member function can be executed by only one thread at a time. All other threads are excluded until the unlock() method for the mutex is called, like this:

cs.lock();                   // Lock the statements that follow
// Code that must be single threaded...
cs.unlock();                 //  Release the lock

Once the unlock() method for the critical_section object has been called, another thread of execution may apply a lock to the code section and execute it.

If a task calls lock() for a critical_section object, and the code is already being executed by another task, the call to lock() will block. In other words, execution of the task trying to lock the code section cannot continue until the lock becomes available. In many situations this may be inconvenient. For example, there may be several different code sections controlled by other mutexes, and if one of those sections is not currently locked, your task could usefully execute it instead of waiting. In situations where you don't want the call to lock a section of code to block because there are other things you can do, you can use the try_lock() member function:

if(cs.try_lock())
{
  // Code that must be single threaded ...
  cs.unlock();          // Release the lock
}
else
  // Do something else...

The try_lock() function does not block if a lock cannot be obtained, but returns a bool value that is true if a lock is obtained, and false otherwise.
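
To make the mechanics concrete, here is a minimal sketch of my own, not part of the Mandelbrot program, that uses a critical_section object so that output from concurrent iterations of a parallel_for is not interleaved on the standard output stream; without the lock, pieces of output from different threads could be mixed together:

Concurrency::critical_section cs;
Concurrency::parallel_for(0, 20, [&cs](int n)
{
  cs.lock();                                   // Only one thread may write at a time
  std::cout << "Iteration " << n << " squared is " << n*n << std::endl;
  cs.unlock();                                 // Release the lock for the next thread
});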

I think we have enough additional capability to get over the problems in applying the parallel_for loop to the Mandelbrot set computation.

THE COMBINABLE CLASS TEMPLATE

The combinable class template is designed to help you with shared resources that can be segmented into thread-local variables and combined after the computation is complete. It offers advantages over using a mutex, like a critical_section object, in that you don't have to lock the shared resources and cause threads to wait. The way it works is that a combinable object can create duplicates of a shared resource, one for each thread that is to be executed in parallel. Each thread uses its own independent copy of the original shared resource. The only prerequisite is that you must be able to specify an operation that will combine the duplicates to form a single resource, once the parallel operation is complete. A very simple example of where combinable can help is a variable that is shared in a loop to accumulate a total. Consider the following fragment:

std::array<double, 100> values;
double sum(0.0);
for(size_t i = 0 ; i<values.size() ; ++i)
{
  values[i] = (i+1)*(i+1);
  sum += values[i];
}
std::cout << "Total = " << sum << std::endl;

You cannot make this loop parallel without doing something about sum because it is accessed and updated on each iteration. Allowing concurrent threads to do this is likely to corrupt the value of the result. Here's how the combinable class can help you execute the loop in parallel:

Concurrency::combinable<double> sums;
Concurrency::parallel_for(static_cast<size_t>(0),  values.size() ,
[&values, &sums](size_t i)
  {
    values[i] = (i+1)*(i+1);
    sums.local() += values[i];
  });
double sum = sums.combine([](double lSum, double rSum)
                             {return lSum + rSum;  });

The combinable object, sums, can create a local variable of type double for each concurrent thread. The template argument specifies the type of variable. Each concurrent thread uses its own local variable of the type specified, which you access by calling the local() function for sums in the lambda expression.

Once the parallel_for loop has executed, you combine the local variables that are owned by sums by calling its combine() function. The argument is a lambda expression that defines how any two local variables are to be combined. The lambda expression must have two parameters that are both of the same type, T, as you used for the combinable template argument, or both of type const reference to T. The lambda must return the result of combining its two arguments. The combine() function will use the lambda expression as often as necessary to combine all the local variables that were created, and return the result. In our case, this will be the sum of all the elements of values.

The combinable class also defines the combine_each() member function for combining local results that have a void return type. The argument to this function should be a lambda expression or a function object with a single parameter of type T or type const T&. With this function, it is up to you to assemble the combined result. Here's how you would use it to combine the local results for the previous fragment:

double sum(0.0);
sums.combine_each([&sum](double localSum)
                   { sum += localSum;  });

The lambda expression that is the argument to combine_each() will be called once for each local partial total owned by sums. Thus, you will accumulate the grand total in sum.

You can reuse a combinable object. To reset the value of the local variables for a combinable object, you just call its clear() function.
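
For instance, to run a second accumulation with the sums object from the previous fragments, you might write something like this sketch:

sums.clear();                                       // Discard the previous local totals
Concurrency::parallel_for(static_cast<size_t>(0), values.size(),
  [&values, &sums](size_t i)
  {  sums.local() += 2.0*values[i];  });            // Accumulate something different
double doubledSum = sums.combine([](double lSum, double rSum)
                                 {  return lSum + rSum;  });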

TASKS AND TASK GROUPS

You can use a task_group class object to manage the parallel execution of two or more tasks, where a task may be defined by a functor or a lambda expression. A task_group object is thread-safe, so you can use it to initiate tasks from different threads. If you only want to initiate parallel tasks from a single thread, then a structured_task_group object will be more efficient.

To execute a task in another thread, you first create your task_group object; then you pass the function object or lambda expression that defines the task to the run() function for the object. Here's a fragment that shows how this works:

std::array<int, 50> odds;
std::array<int, 50> evens;
int n(0);
std::generate(odds.begin(), odds.end(), [&n] { return ++n*2 - 1;  });
n = 0;
std::generate(evens.begin(), evens.end(), [&n] { return ++n*2;  });
int oddSum(0), evenSum(0);
Concurrency::task_group taskGroup;
taskGroup.run([&odds, &oddSum]
  { for(size_t i = 0 ; i<odds.size() ; ++i)
            oddSum += odds[i];
  });
for(size_t i = 0 ; i< evens.size() ; ++i)
  evenSum += evens[i];
taskGroup.wait();

The arrays odds and evens are initialized with sequential odd and even integers, respectively. The oddSum and evenSum variables are used to accumulate the sums of the elements in the two arrays. The call to the run() function for the taskGroup object initiates the task specified by the lambda expression on a separate thread from the current thread — in other words, on a separate processor. The lambda expression sums the elements of the odds array and accumulates the result in oddSum. The for loop that follows sums the elements of the evens array and will execute concurrently with the task group lambda. Calling the wait() function for the taskGroup object suspends execution of the current thread until all tasks initiated by the taskGroup object — just one in our case — are complete.

Note how the current thread of execution can continue once a task group task has been initiated. The current thread will only pause execution when you call the wait() function for the task_group object. Of course, if you have more than two processors, you can use the task_group object to initiate more than one task to execute concurrently with the current thread.
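
For example, assuming you have the cores to spare, you could hand both summations to the task group and leave the current thread free for other work; this is just a sketch reworking the fragment above:

Concurrency::task_group taskGroup;
taskGroup.run([&odds, &oddSum]                 // First task: sum the odd numbers
  { for(size_t i = 0 ; i < odds.size() ; ++i)
      oddSum += odds[i];
  });
taskGroup.run([&evens, &evenSum]               // Second task: sum the even numbers
  { for(size_t i = 0 ; i < evens.size() ; ++i)
      evenSum += evens[i];
  });
// The current thread is free to do other work here...
taskGroup.wait();                              // Suspend until both tasks are complete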

A structured_task_group object can only be used to initiate tasks from a single thread. Unlike with task_group objects, you cannot pass a lambda expression or function object directly to the run() function for a structured_task_group object. In this case, the argument specifying the task must be a task_handle object. task_handle is a class template where the type parameter is the type of the function object it encapsulates. It's easy to construct a task_handle object, though. Here's how the previous fragment can be coded to use a structured_task_group object to initiate a parallel task:

std::array<int, 50> odds;
std::array<int, 50> evens;
int n(0);
std::generate(odds.begin(), odds.end(), [&n] { return ++n*2 - 1;  });
n = 0;
std::generate(evens.begin(), evens.end(), [&n] { return ++n*2;  });
int oddSum(0), evenSum(0);
Concurrency::structured_task_group taskGroup;
auto oddsFun = [&odds, &oddSum]
  { for(size_t i = 0 ; i<odds.size() ; ++i)
            oddSum += odds[i];
  };
taskGroup.run(Concurrency::task_handle<decltype(oddsFun)>(oddsFun));
for(size_t i = 0 ; i< evens.size() ; ++i)
  evenSum += evens[i];
taskGroup.wait();

Here you create a function object, oddsFun, from a lambda expression. The auto keyword deduces the correct type for oddsFun from the lambda expression. The run() function for the taskGroup object requires the argument to be a task_handle object that encapsulates the task to be initiated. You create this by passing the oddsFun object as the argument, with the type argument for the task_handle template specified as the declared type of oddsFun. You use the decltype keyword to do this.

Note that the argument to the task_handle constructor must be a function object that has no parameters. Consequently, you cannot use a lambda expression here that requires arguments when it is executed. If you do need to use a lambda that requires parameters, you can usually wrap it in another lambda without parameters that calls your original lambda.

Another important consideration is that it is up to you to ensure that a task_handle object is not destroyed before the task associated with it has completed execution. This can come about if execution of the current thread, in which you create the task_handle object and initiate the parallel task execution, continues to a point where the destructor of the task_handle object is called while the parallel task is still in progress.
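
One way to keep yourself safe is to hold the task_handle in a named local variable so that it cannot be destroyed before the call to wait(); here is a sketch reworking the earlier fragment along those lines (oddsTask is my own name, not something defined by the library):

Concurrency::structured_task_group taskGroup;
auto oddsFun = [&odds, &oddSum]                  // Parameterless lambda; the data it needs
  { for(size_t i = 0 ; i < odds.size() ; ++i)    // comes in through the capture clause
      oddSum += odds[i];
  };
Concurrency::task_handle<decltype(oddsFun)> oddsTask(oddsFun);   // Named handle
taskGroup.run(oddsTask);                         // Run the task it encapsulates
for(size_t i = 0 ; i < evens.size() ; ++i)
  evenSum += evens[i];
taskGroup.wait();                                // oddsTask is still alive at this point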

SUMMARY

In this chapter, you have learned the basic elements provided by the Parallel Patterns Library for programming to use multiple processors in an application. The PPL is effective only if you have a substantial compute-intensive task that will run on a machine with multiple processors, but such tasks can crop up quite often. Many engineering and scientific applications come into this category, as do many image-processing operations, and you can find large-scale commercial data-processing tasks that may benefit from using the PPL.

WHAT YOU LEARNED IN THIS CHAPTER

TOPIC

CONCEPT

The parallel_for algorithm

The parallel_for algorithm provides the equivalent of a for loop with iterations executing in parallel. You specify the task that is to be executed on each iteration by a function object or a lambda expression.

The parallel_for_each algorithm

The parallel_for_each algorithm provides a parallel loop that processes the contents of an STL container using iterators.

The parallel_invoke algorithm

You use the parallel_invoke algorithm to execute 2 to 10 tasks in parallel. You specify a task by a function object or a lambda expression.

The combinable class

You can use a combinable class object to manage resources that are accessed by tasks executing concurrently.

The task_group class

You can use a task_group object to execute one or more tasks in a separate thread of execution. A task_group object is thread-safe and can be used to initiate tasks from different threads.

The structured_task_group class

You can use a structured_task_group object to execute one or more tasks in separate threads of execution. A structured_task_group object is not thread-safe, and all tasks must be initiated from the same thread. Each task must be identified using a task_handle object.

High-resolution timers

You can use the ::QueryPerformanceFrequency() and ::QueryPerformanceCounter() functions provided by the Windows API to implement a high-resolution timer capability in an application.
