Chapter 6

Where to Parallelize

What's in This Chapter?

Hotspot analysis using the Intel compiler

Hotspot analysis using the auto-parallelizer

Hotspot analysis using Amplifier XE

The purpose of parallelization is to improve the performance of an application. Performance can be measured either by how much time a program takes to run or by how much work a program can do per second. Within a program, it is the busy sections, or hotspots, that should be made parallel. The more the hotspots contribute to the overall run time of the program, the better the performance improvement you will obtain by parallelizing them.

Hotspot analysis is an important first step in the parallelism process. This chapter shows three different ways to identify hotspots in your code using Parallel Studio XE. Without carrying out Hotspot analysis, there is a danger that you will end up making little or no difference to your program's performance. The section “Hotspot Analysis Using the Auto-Parallelizer” includes some tips on how to help the auto-parallelizer do its job better.

A Note for Linux Users
Most of the text of this chapter uses the Windows version of the compiler options. You can use the option-mapping tool to find the equivalent Linux option. The following example finds the Linux equivalent of /Oy-:
map_opts -tl -lc -opts /Oy-
Intel(R) Compiler option mapping tool

mapping Windows options to Linux for C++

‘-Oy-’ Windows option maps to
  --> ‘-fomit-frame-pointer-’ option on Linux
  --> ‘-fno-omit-frame-pointer’ option on Linux
  --> ‘-fp’ option on Linux
The -t option sets the target OS and can be either l (or linux) or w (or windows).
The -l option sets the language and can be either c or f (or fortran). All the text after the -opts option is treated as options that should be converted. The option-mapping tool does not compile any code; it only prints the mapped options.
To use the option-mapping tool, make sure that the Intel compiler is in your path.

Different Ways of Profiling

You are already familiar with the four steps to parallelization (described in Chapter 3, “Parallel Studio XE for the Impatient”): analyze, implement, debug, and tune. It's now time to carry out the first of those steps, analyzing the hotspots in your code.

This book describes four ways of conducting a Hotspot analysis, the first three of which are covered in this chapter:

  • Using the Intel compiler's loop profiler and associated profile viewer
  • Letting the Intel compiler's auto-parallelizer help you find the hotspots
  • Using Amplifier XE
  • Performing a survey using Advisor (covered in Chapter 10, “Parallel Advisor Driven Design”)

Each approach has its merits, and you will probably grow to like a particular one. What you shouldn't do is guess where the hotspots are! If you do, you could end up wasting effort making code parallel with little or no return on your invested time.

Loops Are Not the Only Place to Parallelize
All the hotspot examples in this chapter use loop parallelism. Most of the time, you will find that you implement your parallelism effort at the loop level. However, other programming constructs also lend themselves to being made parallel, such as sequential code sections, recursive code, linked lists, and pipelines. These kinds of examples are explored in Chapter 7, “Implementing Parallelism.”
In this chapter the focus is on loop parallelism, but the Hotspot analysis techniques can be used for other programming patterns, as well.

The Example Application

The code in Listing 6.1 (at the end of this chapter) produces a black-and-white picture of a Mandelbrot fractal. The picture is stored as a PPM file and can be viewed with any PPM viewer. If you don't have a viewer, try IrfanView.

Listing 6.1 is split into the following files:

  • main.cpp — The entry point to the program
  • mandelbrot.cpp — Calculates the fractal
  • mandelbrot.h — Contains a number of defines and prototypes
  • ppm.cpp — Prints the fractal to a PPM file
  • wtime.c — A utility for measuring the application run time
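The source of wtime.c is not reproduced in this part of the chapter. As a rough illustration of what such a timing utility does, the following is a minimal C++ sketch built on std::chrono. The names wtime_sketch and time_phase are hypothetical and are not part of the example application.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical stand-in for the wtime() utility: returns the wall-clock
// seconds elapsed since the first call, as a double.
double wtime_sketch()
{
    using clock = std::chrono::steady_clock;
    static const clock::time_point origin = clock::now();
    const std::chrono::duration<double> elapsed = clock::now() - origin;
    return elapsed.count();
}

// Times a callable in the same way that main.cpp brackets each phase
// with a pair of wtime() calls.
template <typename Work>
double time_phase(Work work)
{
    const double start = wtime_sketch();
    work();
    const double stop = wtime_sketch();
    assert(stop >= start);  // steady_clock never runs backward
    return stop - start;
}
```

A monotonic clock is used here so that changes to the system time during a run cannot produce a negative interval.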

When you run the example application, it displays the following simple text on the screen:

Time to calc :…3.707
Time to print :…7.548
Time (Total) :…11.25

Figure 6.1 shows the default.ppm file generated by running the application and viewed using IrfanView.

Figure 6.1 The output of the Mandelbrot application
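For readers who have not met the PPM format before, the following sketch writes a grayscale image as a plain-text PPM. It assumes the "P3" text variant with a maximum value of 255; the chapter's own ppm.cpp may use a different variant, and the function names here are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Writes a grayscale image as a plain-text (P3) PPM stream.
// pixels holds width*height intensities in the range 0..255, row by row.
void write_ppm_sketch(std::ostream& out, const std::vector<int>& pixels,
                      int width, int height)
{
    assert(width > 0 && height > 0);
    assert(pixels.size() == static_cast<std::size_t>(width) * height);

    // Header: magic number, image dimensions, maximum channel value.
    out << "P3\n" << width << ' ' << height << "\n255\n";
    for (int row = 0; row < height; ++row) {
        for (int col = 0; col < width; ++col) {
            const int v = pixels[static_cast<std::size_t>(row) * width + col];
            // A gray pixel repeats the same value for red, green, and blue.
            out << v << ' ' << v << ' ' << v << '\n';
        }
    }
}

// Convenience wrapper that returns the PPM text as a string.
std::string ppm_text_sketch(const std::vector<int>& pixels, int width, int height)
{
    std::ostringstream out;
    write_ppm_sketch(out, pixels, width, height);
    return out.str();
}
```

Passing a std::ofstream instead of a string stream would write the same text to a file such as default.ppm.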


Table 6.1 shows the results of running the program built with the Intel compiler, using the options /O2 (optimize for speed) and /Qipo (enable interprocedural optimization). The results are the best of five runs, on an Intel Xeon Workstation with an Intel Xeon CPU, X5680 @ 3.33 GHz (two processors, supporting a total of 24 hardware threads).

Table 6.1 Time Taken to Run the Example Application

Function       Time (seconds)
Calculating    3.433
Printing       2.206
Total          5.638
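The figures in Table 6.1 give a feel for how much can be gained by parallelizing only the calculation. The sketch below applies Amdahl's law, which is a standard bound rather than something taken from this chapter; the function names are hypothetical.

```cpp
#include <cassert>

// Amdahl's law: upper bound on the overall speedup when a fraction
// parallel_fraction of the run time is spread over `threads` threads
// and the remainder stays serial.
double amdahl_speedup(double parallel_fraction, int threads)
{
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(threads >= 1);
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / threads);
}

// Fraction of the Table 6.1 run spent calculating: 3.433 s out of 5.638 s.
double table_6_1_calc_fraction()
{
    return 3.433 / 5.638;
}
```

With 24 hardware threads, amdahl_speedup(table_6_1_calc_fraction(), 24) comes out at roughly 2.4. The serial printing phase caps the overall gain at a little over 2.5 however many threads are used, which is why the size of a hotspot relative to the whole run matters.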

Activity 6-1: Building the Example Application
In this activity you build and run the Mandelbrot program.
1. Copy the source code in Listing 6.1, placing the code for each source file in a separate file.
2. Open an Intel Parallel Studio XE command prompt.
3. Build the program with the following command:
icl  /O2 /Qipo wtime.c main.cpp mandelbrot.cpp ppm.cpp -o 6-1.exe
4. Run the program you have just created and record the time taken.
5. Examine the generated default.ppm file with a PPM viewer.

Instructions for Linux Users

All the activities in this chapter can be carried out on a Linux platform, but you'll need to use the Linux compiler icc instead of icl. You will also need to find the equivalent Linux compiler options by following the instructions in the section “A Note for Linux Users.”

Sourcing the Compiler and Amplifier XE

To make the Parallel Studio XE tools available from a shell, source the following scripts (or add the commands to your ~/.bash_profile):
source /opt/intel/composerxe/bin/ intel64
source /opt/intel/vtune_amplifier_xe/
source /opt/intel/inspector_xe/
This assumes you've installed Parallel Studio XE in the default location.

Viewing the PPM File

Your Linux system should have a default PPM viewer installed, such as gthumb, eog, or gwenview.

Hotspot Analysis Using the Intel Compiler

A well-kept secret is that the Intel compiler has its own profiler and viewer. These are different products from Amplifier XE and rely on the compiler instrumenting your code.

With the profiler and viewer you can:

  • Profile functions
  • Profile loops
  • View the output in a standalone viewer
  • Read the results from a text file

Profiling Steps

Figure 6.2 shows the steps for profiling an application:

1. Compile the source code using the /Qprofile-functions and /Qprofile-loops options. The compiler instruments each loop and each function with extra code that will track each time they are used.
2. Run the program. This produces a text file for each profile (having the .dump extension) and an XML file.
3. View the results with the command loopprofileviewer, passing it the name of the XML file that has just been generated.
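To make step 1 more concrete, the following sketch shows by hand the kind of bookkeeping that instrumentation adds around a loop. It is only a conceptual model: the code the compiler actually generates is not shown in this chapter, and LoopStats and profiled_loop are hypothetical names.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical per-loop record, loosely modeled on what a loop profiler reports.
struct LoopStats {
    std::uint64_t entries = 0;     // how many times the loop was entered
    std::uint64_t iterations = 0;  // total iterations across all entries
    double seconds = 0.0;          // accumulated wall-clock time
};

// Runs body(i) for i in [0, count) and records counts and elapsed time.
template <typename Body>
void profiled_loop(LoopStats& stats, int count, Body body)
{
    assert(count >= 0);
    const auto start = std::chrono::steady_clock::now();
    ++stats.entries;
    for (int i = 0; i < count; ++i) {
        body(i);
        ++stats.iterations;  // extra work on every iteration
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    stats.seconds += elapsed.count();
}
```

The counter updates and clock reads are extra work that the uninstrumented program does not do, which is one reason a profiled build runs more slowly.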

If you do not want to use the profile viewer or the XML, you can read the results from the .dump file. You can disable the generating of an XML file by setting the INTEL_LOOP_PROF_XML_DUMP environment variable to zero. Table 6.2 lists the options for controlling the profiling.

Inlining: Where Are My Symbols?
When doing a Hotspot analysis with interprocedural optimization (IPO) or inlining enabled, some functions end up being inlined and are not visible in the Hotspot analysis. Here are three different strategies you can use to get better visibility:
  • Don't use IPO. You can disable it with the compiler option /Qipo-.
  • Disable inlining using the /Ob0 or /Ob1 option.
  • Use the /Qopt-report-phase ipo_inl option to get a list of inlined functions so that you can manually reconstruct the call tree.
Note that the first two options improve visibility but may have a detrimental effect on performance.

Figure 6.2 Using the Intel compiler to find the hotspots


Table 6.2 Profiling Options and Their Arguments

Option                          Arguments
/Qprofile-functions             None
/Qprofile-loops:<arg>           inner, outer, or all
/Qprofile-loops-report:<arg>    1 (times) or 2 (times and counts)

An Example Session

Taking the Mandelbrot program, which by now you should be familiar with, here is a description of the profiling steps and the output generated. You can try this for yourself in Activity 6-2.

1. The program is compiled with optimization level /O2:
icl /Zi /O2 /Qipo wtime.c main.cpp  mandelbrot.cpp  ppm.cpp -o m1.exe  
   /Qprofile-functions  /Qprofile-loops:all /Qprofile-loops-report:2
The /Qprofile-functions option tells the compiler to profile the functions. The /Qprofile-loops:all option tells the compiler to profile both inner and outer loops. The /Qprofile-loops-report option selects the level of detail the report should contain; specifying 2 tells the compiler to report loop times and iteration counts.
2. Running the program gives the usual output:
Time to calc :…3.707
Time to print :…7.548
Time (Total) :…11.25
When the program has finished running, the directory will contain the following files:
C:\dv\CH6>dir /b
The names of the XML and dump files are augmented with a time stamp.
3. To call the viewer, the name of the XML file is passed in:
loopprofileviewer  loop_prof_1317923290.xml

(Linux users: or loopprofileviewer.csh)
Figure 6.3 shows the results displayed in the viewer. The top set of results is the function profile, and the bottom set is the loop profile. You can sort the results by clicking at the tops of the columns. There is also a facility for filtering what is displayed by threshold. For example, you can choose to display only the top 10 percent of hotspots.
Table 6.3 shows the results of profiling the Mandelbrot program m1.exe with inlining enabled. The biggest hotspots are the three loops at the top of the table. Time refers to the time the loop takes, including any function calls. Self time refers to the time the loop takes, excluding any called functions.
The first loop at mandelbrot.cpp:19 is reported as being in the main function, but this is not true. The cause of the apparent error is that the function in which the loop resides has been inlined. Using the options /Qopt-report-phase ipo_inl and /Qopt-report-routine:main shows that the nested function calls to CalcMandelbrot(), SetZ(), and Mandelbrot() have all been inlined:
-> INLINE: ?Mandelbrot@@YAXXZ(2905) (isz = 71) (sz = 74 (31+43))
    -> INLINE: ?SetZ@@YAXHHMM@Z(2906) (isz = 51) (sz = 62 (19+43))
      -> INLINE: ?CalcMandelbrot@@YAMMM@Z(2907) (isz = 26) (sz = 36 (17+19))
4. Rebuilding the application with inlining disabled (using the /Ob0 option) improves visibility but has a huge impact on the WriteMandelBrot() function. Instead of taking less than 4 seconds to complete, it now takes more than 40 seconds. Table 6.4 shows the loop analysis with inlining disabled.
5. The next thing to decide is which loop should be made parallel. Two criteria are important:
  • There should be a decent number of iterations of the loop.
  • The individual loops should do a reasonable amount of work.
As shown in Table 6.4, two loops have a large number of iterations: ppm.cpp:12 and mandelbrot.cpp:32. Both have a self time of around 1 percent, which translates to about a third of a second; this is plenty of work to consider making parallel. You can view the exact value in the loopprofileviewer.
There are other considerations, such as loop dependencies, to take into account when it comes to implementing the parallelism. At this stage, however, the only task is to identify the hotspots.

Figure 6.3 The standalone loop-profiling viewer


Table 6.3 Results of the Loop Profiling with Inlining Enabled


Table 6.4 Results of the Loop Profiling with Inlining Disabled


Overhead Introduced by Profiling

Using the profiling options of the compiler adds an overhead to the run time. Table 6.5 records the time taken for each type of profiling. On the Mandelbrot program, with all the profiling options enabled, the program takes about twice as long to run as when no profiling is carried out.

Pros and Cons of Profiling with the Intel Compiler
  • Pros
    • Easy to use
    • Everything you need is available with the compiler, including a standalone viewer
    • Profiles loops as well as functions
  • Cons
    • Very basic functionality
    • Requires code to be instrumented, introducing a compile-time and a runtime overhead, which can be significant
    • No call tree, so you have to construct the call stack manually
    • No comparison facility

Table 6.5 Time Taken to Run the Example Application

Type of Profiling                             Time (seconds)    Speedup
No profiling                                  5.638             1
Functions                                     7.953             0.71
Functions and outer loops (time)              10.68             0.53
Functions and outer loops (time and count)    10.86             0.52
Functions and all loops (time)                10.98             0.51
Functions and all loops (time and count)      11.25             0.50
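The Speedup column in Table 6.5 is the unprofiled time divided by the profiled time, so values below 1 indicate a slowdown. The following sketch reproduces that arithmetic; the helper names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Speedup relative to a baseline run: values below 1 mean a slowdown.
double speedup(double original_seconds, double new_seconds)
{
    assert(original_seconds > 0.0 && new_seconds > 0.0);
    return original_seconds / new_seconds;
}

// Rounds to two decimal places, matching the precision used in Table 6.5.
double round2(double value)
{
    return std::round(value * 100.0) / 100.0;
}
```

For example, round2(speedup(5.638, 7.953)) gives the 0.71 shown for function profiling, and round2(speedup(5.638, 11.25)) gives the 0.50 shown for the most detailed setting.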


Activity 6-2: Using the Compiler's Loop Profiler
In this activity you use the Intel compiler to instrument the Mandelbrot program and then find the busiest hotspots using the loopprofileviewer.
1. Make sure you have carried out Activity 6-1.
2. Rebuild the application, adding the /Zi option to generate debug information, and the /Qprofile options so that the compiler instruments the code:
icl /Zi /O2 /Qipo wtime.c main.cpp  mandelbrot.cpp  ppm.cpp -o 6-2.exe 
    /Qprofile-functions  /Qprofile-loops:all /Qprofile-loops-report:2
3. Run the program you have just created and record the time taken.
4. Start the loopprofileviewer from the command line, and browse to the XML file that has just been generated.

Dealing with the Lack of Symbol Visibility

One of the difficulties of profiling an optimized application is that the compiler will inline some function calls.
5. Repeat steps 2 to 4, adding the option /Ob0 to the end of the build options.
6. Repeat steps 2 to 4 again, but this time use the following options:
icl /Zi /O2 /Qipo wtime.c main.cpp  mandelbrot.cpp  ppm.cpp -o 6-2.exe 
   /Qprofile-functions  /Qprofile-loops:all 
/Qprofile-loops-report:2  /Qopt-report-phase ipo_inl 
7. Look at the report the compiler prints to the screen. This should help you to identify which functions have been inlined.

Instructions for Linux Users

Refer to the “Instructions for Linux Users” section in Activity 6-1 before carrying out this activity.

Hotspot Analysis Using the Auto-Parallelizer

The Intel compiler has an auto-parallelizer that can automatically add parallelism to loops. By default, the auto-parallelizer is disabled, but you can enable it with the /Qparallel option. Some developers use this feature to give hints on where best to parallelize their code.

The auto-parallelizer does four things:

  • Finds loops that could be candidates for making parallel
  • Decides if there is a sufficient amount of work done to justify parallelization
  • Checks that no loop dependencies exist
  • Appropriately partitions any data between the parallelized code

Profiling Steps

Figure 6.4 shows the steps for profiling with the help of the auto-parallelizer:

1. Compile the sources with the /Qparallel option. To get superior results, it's always best to enable interprocedural optimization (/Qipo). The option /Qpar-report2 instructs the compiler to generate a parallelization report, listing which loops were made parallel.
2. Look at the results from the compiler and make a note of any lines that were successfully parallelized.
3. Add your own parallel constructs to the identified loops.
4. Rebuild the application without the /Qparallel option.

Figure 6.4 Using the auto-parallelizer to find hotspots


You might ask, “Why not just accept the results of the parallelizer?” The following are two of the common reasons:

  • The auto-parallelizer (at the time of writing) uses OpenMP. Many developers prefer to use a more composable parallelism, such as that provided with Cilk Plus or Threading Building Blocks. In this context, “composability” refers to how well a parallel model can be mixed with other models.
  • Some developers don't like relying on automatic features. They prefer to have more control over where and when threading is implemented.

An Example Session

Here's an example session of finding hotspots with the auto-parallelizer. You can try this out for yourself in Activity 6-3.

1. The serial version of the code is run so that you have some results to compare against:
C:\>serial.exe
Time to calc :…3.667
Time to print :…2.311
Time (Total) :…5.978
2. The Mandelbrot application is then built with auto-parallelism enabled (/Qparallel). The optimization level must be at least /O1 to engage the auto-parallelizer:
icl /Zi /O2  wtime.c main.cpp  mandelbrot.cpp  ppm.cpp -o m1.exe  
/Qparallel /Qipo /Qpar-report2
The compiler will report on every loop it finds, including the header files, so the screen will get filled with messages. Here are the ones related to the source code:
main.cpp(14):(col.3) remark: LOOP WAS AUTO-PARALLELIZED
main.cpp(14):(col.3) remark: loop was not parallelized: insufficient inner loop
main.cpp(14):(col.3) remark: loop was not parallelized: existence of parallel dependence
ppm.cpp(11):(col.3) remark: loop was not parallelized: existence of parallel dependence
ppm.cpp(12):(col.5) remark: loop was not parallelized: existence of parallel dependence
As an experiment, running the parallelized code shows that the time taken to do the calculations is much better than the 3.667 seconds that was previously achieved without parallelism:
Time to calc :…0.596
Time to print :…2.272
Time (Total) :…2.868
3. The parallelized loop reported at line 14 of main.cpp is examined. The first thing you will discover is that there is no loop, but rather a call to Mandelbrot()!
main.cpp 12: std::cout << "calculating…" << std::endl;
main.cpp 13:   double start = wtime();
main.cpp 14:   Mandelbrot();
main.cpp 15:   double mid = wtime();
The loop in question is in the Mandelbrot() function in mandelbrot.cpp, but it has been inlined by the use of the option /Qipo:
mandelbrot.cpp 27:  void Mandelbrot ()
mandelbrot.cpp 28:  {
mandelbrot.cpp 29:    float xinc = (float)deltaX/(maxI-1);
mandelbrot.cpp 30:    float yinc = (float)deltaY/(maxJ-1);
mandelbrot.cpp 31:    for (int i=0; i<maxI; i++) {
mandelbrot.cpp 32:      for (int j=0; j<maxJ; j++) {
mandelbrot.cpp 33:      SetZ(i, j, xinc, yinc);
mandelbrot.cpp 34:      }
mandelbrot.cpp 35:    }
mandelbrot.cpp 36:  }
To make the code parallel using the Cilk Plus method, replace the outer for(..) with cilk_for() and add the Cilk include file to the top of the mandelbrot.cpp file:
mandelbrot.cpp  0:  #include "mandelbrot.h"
mandelbrot.cpp  1:  #include <cilk/cilk.h>
mandelbrot.cpp 30:    float yinc = (float)deltaY/(maxJ-1);
mandelbrot.cpp 31:    cilk_for (int i=0; i<maxI; i++) {
mandelbrot.cpp 32:      for (int j=0; j<maxJ; j++) {
4. Building and running the program gives a better performance improvement than with the auto-parallelism:
icl /Zi /O2  wtime.c main.cpp  mandelbrot.cpp  ppm.cpp -o myparallel.exe /Qipo

Time to calc :…0.2475
Time to print :…2.178
Time (Total) :…2.426
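For comparison with the cilk_for version, the same outer-loop placement can be sketched with OpenMP, the model the auto-parallelizer itself uses. This is a substitute technique, not the chapter's code: escape_count and fill_grid are hypothetical stand-ins for the example's functions, and the pragma only takes effect in an OpenMP-enabled build (for example, /Qopenmp with the Intel compiler on Windows). Without OpenMP the pragma is ignored and the loop runs serially with the same result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-pixel work: counts iterations of z = z*z + c until
// |z|^2 exceeds 4 or the iteration cap is reached.
int escape_count(float cr, float ci, int max_iterations)
{
    float zr = 0.0f, zi = 0.0f;
    int n = 0;
    while (n < max_iterations && zr * zr + zi * zi <= 4.0f) {
        const float next_zr = zr * zr - zi * zi + cr;
        zi = 2.0f * zr * zi + ci;
        zr = next_zr;
        ++n;
    }
    return n;
}

// Fills a rows x cols grid. The outer loop is the one made parallel,
// mirroring the cilk_for placement in the text.
std::vector<int> fill_grid(int rows, int cols, int max_iterations)
{
    assert(rows > 1 && cols > 1);
    std::vector<int> grid(static_cast<std::size_t>(rows) * cols);
    #pragma omp parallel for
    for (int i = 0; i < rows; ++i) {
        for (int j = 0; j < cols; ++j) {
            const float cr = -2.0f + 3.0f * i / (rows - 1);
            const float ci = -1.0f + 2.0f * j / (cols - 1);
            // Each iteration writes a distinct element, so rows are independent.
            grid[static_cast<std::size_t>(i) * cols + j] =
                escape_count(cr, ci, max_iterations);
        }
    }
    return grid;
}
```

Parallelizing the outer loop rather than the inner one gives each thread a whole row of work at a time, which keeps the scheduling overhead small relative to the work done.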

Programming Guidelines for Auto-Parallelism

Although this chapter is about using the auto-parallelizer to find hotspots, this is a good time to mention how you can help the auto-parallelizer to do its job better. For auto-parallelism to succeed, you must follow certain guidelines:

  • The loop must be countable at compile time. Try to use constants where possible.
  • There must be no data dependencies between loop iterations.
  • Avoid placing constructs in loop bodies that make it hard for the compiler to analyze dependencies (for example, function calls, pointers with ambiguous indirection to globals, and so on).
  • Don't use the option /Od (or /Zi) on its own. Auto-parallelism will work only at optimization levels /O1 or greater.
  • Use IPO (/Qipo). IPO gets applied before auto-parallelism and can improve the chance of the code being made parallel.
  • Try to help the compiler by using the #pragma parallel option. (See the section “Using #pragma parallel.”)
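The first two guidelines can be illustrated with three small loops. This is a sketch with hypothetical names; whether a given compiler version actually parallelizes the first loop also depends on its own work-threshold heuristics.

```cpp
#include <cassert>

const int kCount = 1024;  // trip count known at compile time

// Good candidate: fixed trip count, and iteration i touches only element i.
void scale_all(float (&data)[kCount], float factor)
{
    for (int i = 0; i < kCount; ++i)
        data[i] *= factor;
}

// Poor candidate: the trip count depends on the data (early exit), so the
// loop is not countable before it runs.
int find_first_negative(const float (&data)[kCount])
{
    int i = 0;
    while (i < kCount && data[i] >= 0.0f)
        ++i;
    assert(i <= kCount);
    return i;  // kCount means "not found"
}

// Poor candidate: iteration i reads the result of iteration i-1.
void running_sum(float (&data)[kCount])
{
    for (int i = 1; i < kCount; ++i)
        data[i] += data[i - 1];
}
```

Only scale_all satisfies both guidelines: its iterations can run in any order, whereas the other two loops either cannot be counted in advance or must run in sequence.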

Additional Options

Table 6.6 lists other options that you can use. Refer to the compiler help for more information.

Table 6.6 Some Auto-Parallelizer Options

Option                    Description
/Qpar-affinity            Specifies thread affinity
/Qpar-num-threads         Specifies the number of threads to use in a parallel region
/Qpar-report              Controls the diagnostic information reported by the auto-parallelizer
/Qpar-runtime-control     Generates code to perform runtime checks for loops that have symbolic loop bounds
/Qpar-schedule            Specifies a scheduling algorithm or a tuning method for loop iterations
/Qpar-threshold           Sets a threshold for the auto-parallelization of loops
/Qparallel                Tells the auto-parallelizer to generate multithreaded code for loops that can be safely executed in parallel
/Qparallel-source-info    Enables or disables source location emission when OpenMP or auto-parallelization code is generated
/Qpar-adjust-stack        Tells the compiler to generate code to adjust the stack size for a fiber-based main thread

Helping the Compiler to Auto-Parallelize

To ensure correct code generation, the compiler treats any assumed dependencies as if they were proven dependencies, which prevents any auto-parallelization. The compiler will always assume a dependency where it cannot prove that it is not a dependency. However, if the programmer is certain that a loop can be safely auto-parallelized and any dependencies can be ignored, the compiler can be informed of this in several ways.
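Pointer aliasing is a common source of assumed dependencies. In the following sketch (shift_add is a hypothetical function), the loop is independent when the two pointers refer to different arrays, but carries a dependency from one iteration to the next when they refer to the same array. A compiler that sees only this function cannot tell which case applies, so it has to assume the worst.

```cpp
#include <cassert>
#include <cstddef>

// Each dst[i] is computed from src[i - 1]. Whether the iterations are
// independent depends entirely on whether dst and src overlap.
void shift_add(float* dst, const float* src, std::size_t n)
{
    assert(dst != nullptr && src != nullptr);
    for (std::size_t i = 1; i < n; ++i)
        dst[i] = src[i - 1] + 1.0f;
}
```

Called with two separate zero-filled arrays of four elements, the function leaves dst as 0, 1, 1, 1. Called with the same array for both arguments, it produces 0, 1, 2, 3, because each iteration now reads the value written by the previous one. Running the iterations in parallel would be safe in the first case and wrong in the second.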

Using #pragma parallel

Used immediately before a loop, the #pragma parallel option instructs the compiler to ignore any assumed loop dependencies that would prevent correct auto-parallelization. It complements, but does not replace, the fully automatic approach; the loop will still not be parallelized if the compiler can prove that any dependencies exist.

Any loop being parallelized must conform to the for-loop style of an OpenMP work-sharing construct. The pragma can be used by itself or in conjunction with a selection of clauses, such as private, which acts in a similar way to the clauses used in the OpenMP method.

Currently, the clauses include the following:

  • always [assert], which overrides the compiler heuristics that determine whether parallelizing a loop would increase performance. Using this clause forces the compiler to parallelize if it can, even if it considers that doing so might not improve performance. Adding assert causes the compiler to generate an error if it considers that the loop cannot be parallelized.
  • private( var1[ :expr1][, var2[ :expr2] ] … ), where var is a scalar or array variable. When parallelizing a loop, private copies of each variable are created for each thread. expr is an optional expression used for array or pointer variables, which evaluates to an integer number giving the number of array elements. If expr is absent, the rules are the same as those used in the OpenMP method, and all the array elements are privatized. If expr is present, only that number of elements of the array are privatized. Multiple private clauses are merged as a union.
  • lastprivate( var1[ :expr1][, var2[ :expr2] ] … ), where var and expr are the same as for private. Private copies of each variable are used within each thread created by the parallelization, as in the private clause; however, the values of the copies within the final iteration of the loop are copied back into the variables when the parallel region is left.

Following is an example of using #pragma parallel:

(41)    #pragma parallel private(b)
(42)    for( i=0; i<MAXIMUS; i++ )
(43)    {
(44)        if( a[i] > 0 )
(45)        {
(46)            b = a[i];
(47)            a[i] = 1.0/a[i];
(48)        }
(49)        if( a[i] > 1 )a[i] += b;
(50)    }

This results in the loop being both vectorized and parallelized, with the following messages:

C:\Test.cpp(42): (col. 4) remark: LOOP WAS AUTO-PARALLELIZED.
C:\Test.cpp(42): (col. 4) remark: LOOP WAS VECTORIZED.

Using #pragma noparallel

You can use the #pragma noparallel option immediately before a loop to stop it from being auto-parallelized.

Note that both #pragma parallel and #pragma noparallel are ignored unless the /Qparallel option is set.

Pros and Cons of Profiling with the Auto-Parallelizer
  • Pros
    • Easy to carry out
    • Quickly helps you spot the right places to parallelize
    • Auto-parallelized loop can be compared with your own manually implemented parallelism
  • Cons
    • Can easily be confounded by nontrivial code
    • Difficult to identify loops when IPO is enabled


Activity 6-3: Using the Auto-Parallelizer to Help Find Hotspots
In this activity you enable the Intel compiler's auto-parallelizer and use the location of the successfully parallelized loops to add your own parallel code.
1. Make sure you have carried out Activity 6-1. You will need the results of running the application to compare with the results in this activity.
2. Rebuild the application, adding the /Qparallel option to enable auto-parallelism, and the /Qpar-report2 option to tell the compiler to generate a report:
icl /Zi /O2 /Qipo wtime.c main.cpp  mandelbrot.cpp  ppm.cpp -o 6-3.exe 
   /Qparallel /Qpar-report2
3. Examine the messages from the compiler. You should find that one of the loops has been auto-parallelized.
4. Run 6-3.exe. Calculate the speedup compared to 6-1.exe, which you created in Activity 6-1. The application should be faster.
You can calculate the speedup using the following formula. Original time is the time taken by 6-1.exe, and new time is the time taken by 6-3.exe. A value greater than 1 means the new program is faster.
 speedup = original time / new time
5. Add a cilk_for and an include to the loop that the auto-parallelizer has identified:
#include <cilk/cilk.h>
 cilk_for (…etc ) {
6. Rebuild the application using the following options. Note that auto-parallelism is no longer enabled.
icl /Zi /O2 /Qipo wtime.c main.cpp  mandelbrot.cpp  ppm.cpp -o 6-3b.exe
7. Run the program and calculate the speedup.

Instructions for Linux Users

Refer to the “Instructions for Linux Users” section in Activity 6-1 before carrying out this activity.

Hotspot Analysis with Amplifier XE

The Hotspot analysis used in Amplifier XE helps you to identify the most time-consuming source code. Hotspot analysis also collects stack and call tree information. The analysis can be used to launch an application/process or attach to a running program/process.

Conducting a Default Analysis

The steps for conducting a Hotspot analysis with Amplifier XE were described in Chapter 3.

To get the best view of the application in Amplifier XE, disable inlining by using the /Ob0 or /Ob1 compiler option. The /Ob0 option disables all inlining, whereas the /Ob1 option inlines only code that has been marked with the keywords inline, __inline__, __forceinline, or __inline, or that is a member function defined within a class declaration. (See online help for more information on these keywords.) Figure 6.5 shows the summary page of two Hotspot analysis sessions: one with inlining enabled (a) and one without (b). You can see that when inlining is disabled, the symbol names of the different functions become available.

Figure 6.5 Analysis summary with and without inlining


Finding the Right Loop to Parallelize

At the time of writing, Amplifier XE does not have a loop profiler, so you have to manually traverse up the call stack of a hotspot to find the best place to add parallelism. Figures 6.6 through 6.9 show screenshots of doing such a traversal. Clicking on the hotspot in Figure 6.6 displays the source view of the hotspot (Figure 6.7).

Figure 6.6 Bottom-up view of the hotspots


Figure 6.7 Source code view of the hotspots


By double-clicking the stack pane on the right (see Figures 6.7 and 6.8), it is possible to traverse up the call stack until an appropriate place to add the parallelism is found, as in Figure 6.9.

Activity 6-4: Conducting a Hotspot Analysis with Amplifier XE
In this activity you carry out a Hotspot analysis on the Mandelbrot program with Amplifier XE.
1. Make sure you have carried out Activity 6-1.
2. Rebuild the application, adding the /Zi flag to generate debug information:
icl  /O2 /Qipo /Zi wtime.c main.cpp mandelbrot.cpp ppm.cpp -o 6-4.exe
3. Start an Amplifier XE GUI from the command line:
4. Create a new project named Chapter 6.
a. Select File ⇒ New ⇒ Project.
b. In the Project Properties dialog, make sure the Application Field points to your Mandelbrot application.
5. Carry out a Hotspot analysis by selecting File ⇒ New ⇒ Hotspot Analysis.

Dealing with the Lack of Symbol Visibility

You've already seen in the previous activities that functions disappear because of compiler inlining. Adding the /Ob1 option to the build improves visibility.
6. Repeat steps 2 to 5, using the following compiler options. You should notice an improvement in what you see.
icl  /O2 /Qipo /Zi /Ob1 wtime.c main.cpp mandelbrot.cpp ppm.cpp 
    -o 6-4.exe

Traversing Up the Call Stack

7. From the bottom-up view, double-click the largest hotspot. The source view should be displayed.
8. In the stack pane (on the right of the source view), manually trace back up the call stack (by double-clicking the call stack entries) until you find code that has a loop in it.
You should be able to find the best place to add parallelism by doing this manual stack traversal.

Instructions for Linux Users

Refer to the section “Instructions for Linux Users” in Activity 6-1.

Figure 6.8 Source code view, one stack up


Figure 6.9 Source code view, two stacks up


Large or Long-Running Applications

In very large or long-running projects, the amount of data collected may grow to an unmanageable size. The postprocessing of the collected data (finalization) and opening and viewing very large data sets can become very sluggish and almost impractical to use.

Reducing the Size of Data Collected

Some strategies for reducing the amount of data collected include:

  • Adjust the duration time estimate. Amplifier XE reduces the number of samples it collects on very long runs. You can change the duration time estimate from “under 1 minute” to “over 3 hours,” with some intermediate values as well.
  • Automatically stop collection after a short period of time (for example, 30 seconds).
  • Modify the data-collection limit. The default is 100MB.
  • Use the Pause and Resume APIs to limit when data is collected.

The first three items in the list are all configurable from the Project Properties dialog (see Figure 6.10), which you can access from the Amplifier XE menu File ⇒ Properties.

Figure 6.10 The Project Properties page


Using the Pause and Resume APIs

You can insert calls to the Pause and Resume APIs in your application to pause and resume data collection, respectively. By doing this you can reduce the amount of data that is collected. These APIs have to be used with caution, especially when analyzing threaded code, because important events may be missed, leading to a meaningless analysis.

The following code snippet shows how to use the __itt_pause() and __itt_resume() functions in the Mandelbrot program:

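#include "ittnotify.h"

int main()
{
  std::cout << "calculating…" << std::endl;
  double start = wtime();
  __itt_resume();   // collect data for the calculation (placement assumed for illustration)
  Mandelbrot();
  __itt_pause();    // stop collecting before the printing phase
  double mid = wtime();

  std::cout << "printing…" << std::endl;
  // The PPM-writing call goes here; no data is collected for it.
  double end = wtime();
  // Timing output as in Listing 6.1.
}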

Once this code is inserted, any Hotspot analysis should be started by clicking the Start Paused button rather than the Start button.

To use the APIs, include the ittnotify.h header file. If you get an unresolved symbol at link time, you may have to add the libittnotify.lib library, which you can find in the Amplifier XE\lib32 or Amplifier XE\lib64 folders. Use the lib64 version if you are building a 64-bit application; otherwise, use the lib32 version.
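One way to reduce the risk of leaving collection switched on by mistake is to tie the resume and pause calls to a scope. The following is a design sketch only. CollectionScope, resume_collection, and pause_collection are hypothetical names, and the two helper functions are stand-ins that let the sketch build without Amplifier XE; in a real build they would simply call __itt_resume() and __itt_pause() from ittnotify.h.

```cpp
#include <cassert>

// Hypothetical stand-ins so the sketch builds without Amplifier XE.
// In a real build these would forward to __itt_resume() and __itt_pause().
static bool g_collecting = false;
inline void resume_collection() { g_collecting = true; }
inline void pause_collection() { g_collecting = false; }

// Collects data only while an object of this type is alive, and pauses
// again even if the enclosed code leaves the scope early.
class CollectionScope {
public:
    CollectionScope()
    {
        assert(!g_collecting);  // this sketch does not support nesting
        resume_collection();
    }
    ~CollectionScope() { pause_collection(); }
    CollectionScope(const CollectionScope&) = delete;
    CollectionScope& operator=(const CollectionScope&) = delete;
};
```

Declaring a CollectionScope object at the top of a block that contains the call to Mandelbrot() would limit collection to the calculation phase, provided the analysis is started paused.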

Table 6.7 shows the difference in the size of data that is collected when doing a normal Hotspot analysis versus doing one with the Pause and Resume APIs. As you can see, there is a significant saving in the amount of data collected.

Table 6.7 Amount of Data Collected when Profiling with and without the Pause and Resume APIs

Method                  Data Size
Without pause/resume    253.9k
With pause/resume       172.0k

Pros and Cons of Profiling with Amplifier XE
  • Pros
    • Very small profiling overhead
    • Easy to traverse the call stack
    • No special build needed, other than providing debug symbols
    • Multiple options for collection and viewing
    • Results can be compared
  • Cons
    • No loop profiler
    • No call graph (but see the comments on manual call stack traversing in the section “Finding the Right Loop to Parallelize”)

Source Code

The source code in Listing 6.1 consists of several files and is used in the hands-on activities.

Listing 6.1: main.cpp

#include <fstream>
#include <iostream>
#include <iomanip>
#include "mandelbrot.h"

// Global image buffers shared with mandelbrot.cpp and ppm.cpp
float zr[maxI][maxJ],zi[maxI][maxJ];
float zcolor[maxI][maxJ];
extern "C" double wtime();

int main()
{
  std::cout << "calculating…" << std::endl;
  double start = wtime();
  Mandelbrot();        // compute the image (reported as "calc")
  double mid = wtime();

  std::cout << "printing…" << std::endl;
  WriteMandlebrot();   // write the image to default.ppm (reported as "print")
  double end = wtime();

  std::cout << "Time to calc :…"<< std::setprecision(4)
     << mid-start <<std::endl;
  std::cout << "Time to print :…" << end-mid <<std::endl;
  std::cout << "Time (Total) :…" << end-start <<std::endl;
  return 0;
}

code snippet Chapter6\main.cpp

#include "mandelbrot.h"

// Iterate z = z*z + c for c = (r, i) and return a color value
float CalcMandelbrot(float r,float  i)
{
  float zi = 0.0;
  float zr = 0.0;

  int itercount = 0;

  float maxit = (float)maxIteration;
  while(1) {
    itercount++;   // without this increment the loop never ends for points inside the set
    float temp = zr * zi;
    float zr2 = zr * zr;
    float zi2 = zi * zi;
    zr = zr2 - zi2 + r;
    zi = temp + temp + i;
    // the point escaped: color it by how quickly it did so
    if (zi2 + zr2 > maxThreshold)
      return (float)256*itercount/maxit;
    // iteration limit reached: treat the point as inside the set
    if (itercount > maxIteration)
      return (float)1.0;
  }
  return 1;
}

// Map pixel (i, j) to a point in the complex plane and store its color
void SetZ( int i, int j, float xinc, float yinc )
{
  zr[i][j] = (float) -1.0*deltaX/2.0 + xinc * i;
  zi[i][j] = (float) 1.0*deltaY/2.0 - yinc * j;
  zcolor[i][j] = CalcMandelbrot(zr[i][j], zi[i][j] ) /1.0001;
}

void Mandelbrot ()
{
  float xinc = (float)deltaX/(maxI-1);
  float yinc = (float)deltaY/(maxJ-1);
  for (int i=0; i<maxI; i++) {
    for (int j=0; j<maxJ; j++) {
      SetZ(i, j, xinc, yinc);
    }
  }
}

code snippet Chapter6\mandelbrot.cpp

#ifndef _MANDLE_H_
#define _MANDLE_H_
const int factor = 8;
const int maxThreshold = 96;
const int maxIteration = 500;
const int maxI = 1024 * factor;
const int maxJ = 1024 * factor;
const float deltaX = 4.0;
const float deltaY = 4.0;

extern float zr[maxI][maxJ],zi[maxI][maxJ];
extern float zcolor[maxI][maxJ];

void Mandelbrot ();
void WriteMandlebrot();

#endif

code snippet Chapter6\mandelbrot.h

#include <fstream>
#include "mandelbrot.h"

// write to a PPM file.
void WriteMandlebrot()
{
  std::ofstream ppm_file("default.ppm");
  ppm_file << "P6 " << maxI << " " << maxJ << " 255" << std::endl;

  unsigned char red, green, blue; // BLUE - did minimal work
  for (int i=0; i<maxI; i++) {
    for (int j=0; j<maxJ; j++) {
      float color = (float)zcolor[i][j] ;
      float temp = color;
      if (color >= .99999)
      {
        red = 255 ; green = 255; blue = 255;   // white pixel
      }
      else
      {
        red = 0 ; green = 0; blue = 0;         // black pixel
      }
      // write to PPM file
      ppm_file << red  << green << blue;
    }
  }
}

code snippet Chapter6\ppm.cpp

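#ifdef _WIN32
#include <windows.h>
double wtime()
{
  LARGE_INTEGER ticks;
  LARGE_INTEGER frequency;
  QueryPerformanceCounter(&ticks);        /* current counter value */
  QueryPerformanceFrequency(&frequency);  /* counter ticks per second */
  return (double)(ticks.QuadPart/(double)frequency.QuadPart);
}
#else
#include <sys/time.h>
#include <sys/resource.h>
double wtime()
{
  struct timeval time;
  struct timezone zone;
  gettimeofday(&time, &zone);
  return time.tv_sec + time.tv_usec*1e-6;
}
#endif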

code snippet Chapter6\wtime.c
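
If a C++11 compiler is available, the same kind of wall-clock measurement can be made without separate Windows and Linux branches. The following sketch is an alternative to the wtime() function in Listing 6.1, not the listing's own implementation, and the name wtime_chrono is hypothetical.

```cpp
#include <chrono>

// Hypothetical C++11 alternative to wtime(): returns the number of seconds
// elapsed since the first call, measured with a monotonic clock.
double wtime_chrono() {
  using clock = std::chrono::steady_clock;
  static const clock::time_point origin = clock::now();
  return std::chrono::duration<double>(clock::now() - origin).count();
}
```

As with wtime(), only the difference between two readings is meaningful, so main.cpp could subtract values in exactly the same way.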


Summary

This chapter described several methods of finding hotspots within an application. In practice, you would probably want to use a combination of the methods to get the best results. Identifying the hotspots is essential if you want to avoid wasted effort when parallelizing existing code.

It is very easy to apply parallelism at every opportunity you see within the code, for example, at every loop. However, many of these loops may not be invoked often enough, or do enough work, to make the effort of parallelizing them worthwhile. Some loops that are tempting to parallelize may contribute very little to the overall run time.

Finding the parallelization opportunities within your code is the goal of Hotspot analysis. It is an essential first step in adding parallelism to your code. Without this knowledge of your program, you are in danger of making code parallel without seeing any improvement in performance.

Having found the hotspots, the next steps are to implement the parallelism, check for errors, and, finally, tune the threaded application. The next chapter shows how to use different programming models to implement parallelism.

