Chapter 7. Application profiling and tuning

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Application profiling and tuning

In this chapter, we use some of the HPC applications to illustrate how to employ the PEDE and HPCT tools for profiling and tuning.

Usually people try to locate the hotspots of an application and then optimize it. There are some points that are usually considered for an HPC application profiling and tuning, such as the synchronization, load balance, communication overhead and the memory, file I/O issues, along with the real computation effort. Consequently to work on an existing HPC application for optimization, the first step is profiling to recognize the hot spot, Xprofile, or gprof to provide for this function. Then other tools for different profiling and tuning purpose can be utilized, such as the HPCT from IBM.

A hyper HPC program sPPM [S. Anderson et al], which is written by C and Fortran utilizing MPI and OpenMP for parallel programming is described. The contents describe the procedure to locate the hotspot in this program and analyze the possible reason based on the profiling data, followed by possible tuning hints.

The following topics are discussed in this chapter:

•Profiling and tuning hints for sPPM

•Profiling and analyzing Gadget 2

•Analyzing a ScaLAPACK routine using the HPCT Toolkit

7.1 Profiling and tuning hints for sPPM

This section introduces tuning for sPPM.

7.1.1 A glance at the application sPPM

Look for information about the sPPM application at:

http://www.lcse.umn.edu/research/sppm/README.html

The experiment for sPPM occurs based on Power Linux. This program computes a 3-D hydrodynamics problem on a uniform mesh using a simplified version of the PPM (Piecewise Parabolic Method) code.

The coordinates are -1:1 in x and y, and 0:zmax in z, where zmax depends on the overall aspect ratio prescribed. A plane shock traveling up the +z axis encounters a density discontinuity, at which the gas becomes denser. The shock is carefully designed to be simple, but strong, about Mach 5. The gas initially has a density of 0.1 ahead of the shock. At over 5dz at the discontinuity, it changes to 1.0.

This version of sPPM is implemented primarily in Fortran with system-dependent calls in C. All message passing is done with MPI. Only a small set of MPI routines are actually required:

•MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Finalize

•MPI_Wtime, MPI_IRecv, MPI_Isend

•MPI_Wait, MPI_Allreduce, MPI_Request_free

The parallelization strategy within an SMP is based on spawning extra copies of the parent MPI process. This is done with the sproc routine, which is similar but not identical to the standard UNIX fork call. It differs in that sproc causes all static (for example, not stack based) storage to be implicitly shared between processes on a single SMP node. This storage is always declared in Fortran with either SAVE or COMMON.

Synchronization is accomplished by examining shared variables and by use of UNIX System V semaphore calls. (See twiddle.m4 and cstuff.c.) In addition, a fast-vendor supplied implementation of an atomic fetch and add (test_and_add) is used.

7.1.2 Use gprof to view the profiling data

As described in advance, various approaches can be applied to get different types of profiling data. To get the cpu time profiling data of a program, we must compile the source code with -p -pg. After execution, for each MPI task, a gmon.out file is generated, making it possible to use gprof to generate data issuing the command:

-bash-4.1$ gprof ../sppm gmon.out > prof.txt

In this command:

•sppm is the executable

•gmon.out is the raw profiling data

•the prof.txt contains the readable profiling data

Specifically for the sppm application, the output is shown in Figure 7-1 on page 143.

Figure 7-1 The CPU time profiling for sPPM

So from the flat profiling result, the function sppm, difuze and interf took over 80% of the time. Besides, these functions are invoked millions of times, so it is useful if we do optimization for these functions. Because if any slight benefit is brought in, it can be multiplied with millions due to the amount of calls. However it is possible that it is hard to get these functions optimized because for each single call, it only takes few times, which means that these function can be highly optimized already. For the function _vrec_P7, it is impossible to do anything with it because it is a built-in function of the compiler library. Thus it is still worthwhile to look at the functions named as hydyz, hydzy, hydxy, dydyx, hydzz. So let us look further into the call graph profiling data. We describe the details for each function call. See Figure 7-2 on page 144.

Figure 7-2 Detailed call graph

Based on the call graph, we can also ascertain that difuze and interf are called by sppm, and it does not have any further children to call. So it is a good choice to start to optimize the code of difuze and interf. Also some other functions must be noticed. Figure 7-3 gives two of the examples.

Figure 7-3 Detailed call graph for hydyz and dydxy

In addition, tools can help to figure out what is the potential problem in some of the hotspots if it is hard to sort out by inspection of the code. Usually the hotspots are either caused by heavy computation with lots of “load and store” instructions or the communication took plenty of time along with load balance issues. Hence in the later sections, possible causes of the hotspots are analyzed.

7.1.3 Using binary instrumentation for MPI profiling

We presented the general steps about binary instrumentation in the previous chapters. However since it is desired to profile all MPI communications for the program, we choose all of the source code file in the instrumentation view, as shown in Figure 7-4.

Figure 7-4 Instrumentation panel

After the instrumented binary is generated, the default profiling setting is sufficient for information that we need to generate trace files only for the application tasks with the minimum, maximum, and median MPI communication time. This is also true for task zero (people might use task), if task zero is not the task with minimum, maximum, or median MPI communication time. If you need trace files generated for additional tasks, make sure the necessary options under the HPC Toolkit menu are checked during the profiling configuration phase. Refer to Chapter 6, “Parallel Environment Developer Edition tools” on page 95 for the details of each options.

Figure 7-5 Options for MPI profiling

Depending on the number of tasks in your application, make sure that the Maximum trace rank and Limit trace rank are set correctly. If your application executes many MPI function calls, you might need to set the value of Maximum trace events to a higher number than the default 30,000 MPI function calls. Refer to Figure 7-5.

Chapter 6, “Parallel Environment Developer Edition tools” on page 95 explains how to do the launching configuration with an application; therefore, using the same setting, just click Pprofile, and the program runs correctly. Synchronize your project for the generated profiling data. You can now open them with HPCT after you right-click the files. The elapsed cpu time for the MPI function calls is displayed in the summary section. Under that, notice the details of each MPI function call in each source code file and further down to each function that invoked MPI routines. For sPPM, we got the summary data in Figure 7-6.

Figure 7-6 sPPM mpi profiling data

We can see the most time consumable MPI routine is MPI_Wait, technically each asychronized MPI_Isend and MPI_Irecv might have one MPI_Wait, so the column of Count illustrates the MPI_Wait has the most frequent invoking, so the wall clock is the highest one. However the communication overhead might not a big issue in this project if comparing with the entire execution time (near 10 minutes). It is still interesting that based on the result shows here, the assumption is the work load on each MPI tasks is not well balanced; otherwise, the MPI_Wait routine only takes a short time. If we look at the MPI trace picture, which can be generated by opening the .mpt file through the mpi trace menu in Figure 7-7 on page 147, we can see that the MPI_wait waits for a longer time with mpi rank 4, 5, 6, 7, which means these rank can get less workload than rank 0, 1, 2, 3. Despite that workload issue, the communication happened many times, it is possible to use depth halo with more boundary data layers to reduce the communication frequency. Another point is it is worthwhile to think about if there is a possible way to reduce the total transferred data size, which can be done with altering the major algorithm.

Figure 7-7 MPI trace picture for sPPM

7.1.4 Use OpenMP profiling to identify if the workload for threads

Because the sPPM code utilizes OpenMP threads for some of the hotspots, we can also profile it to see if there is any load balance issue. Refer to “OpenMP profiling” on page 115 for details about how to get your program instrumented for OpenMP profiling. Here, we choose to profile all of the OpenMP regions listed in Figure 7-8 on page 148. Because of the limitation that it is not supported to profile nested OpenMP parallel region, some of the OpenMP parallel regions cannot get profiling data.

Figure 7-8 Instrument all the OpenMP parallel regions for sPPM

After the execution of the instrumented binary, the OpenMP profiling data was generated. Open the .viz files. The summary data is shown in Figure 7-9.

Figure 7-9 OpenMP profiling data

Double-click the most time consumable region. The tool navigates you to which line the code is in the source file. So it is easily located where the OMP issue might be. By further inspecting the code, in this parallel region we have many single and master directives that made the code fragment in serial. So to get better performance, it is necessary to think about how to get those directives less. However based on Figure 7-8 and Figure 7-9, notice that there is no profiling data for the runhyd3.F source code because there are nested OMP regions. As mentioned before, under this circumstance, HPCT does not support OpenMP profiling currently.

Figure 7-10 hotspot in OpenMP code

In a nut shell, the HPC Toolkit, which is part of the Parallel Environment Developer Edition, can generate profiling data for HPC application developers quickly (Figure 7-10). Based on this profiling data, it is easier to get an idea about how to optimize an existing HPC application without understanding some scientific theory. However there are some other good tools in HPCT with PEDE that are not mentioned here, for example, the I/O profiling, due to sPPM, it does not have heavy I/O operations.

7.2 Profiling and analyzing Gadget 2

Gadget 2 is a freely available application for cosmological simulations widely used in the HPC area, which can be used to address several problems, such as colliding and merging galaxies, dynamics of the gaseous intergalactic medium, or to address star formation and its regulation by feedback processes. Further information about the project along with source code is at:

http://www.mpa-garching.mpg.de/gadget/

This study case takes Gadget2 to:

•Show how to configure a project that requires special build set up.

•Show how to use some basic Eclipse editor features to browse the source code.

•Show how the tools drill-down the application source code looking for performance enhancements opportunities.

Importing Gadget 2 as an existing Makefile application

The Gadget 2 source code was downloaded and uncompressed in the remote Linux on Power cluster node. Next, it was imported in the PEDE Eclipse environment as a synchronized project (see Figure 7-11). The rationale behind our choices are:

•Although the majority of the Gadget’s source code is written in C, we chose to import it as a Synchronized Fortran project because it also contains a small portion of Fortran code and by doing that we ensured that Eclipse handles both Fortran and C/C++ languages properly.

•We imported the project using the empty Makefile project template because Gadget relies on the GNU Makefile build system.

•In one of the steps to import the project, you need to set the connection to the remote node holding the application source code. We managed to create a new connection since there was not one for the PowerLinux node (see Figure 7-12 on page 151).

•We chose --Other Toolchain-- from the Local Toolchain field because PEDE Eclipse was running in a Windows XP notebook, so there was no use in choosing a local toolchain.

Figure 7-11 Creating a synchronized project of Gadget 2

Figure 7-12 Creating a connection with remote PowerLinux cluster node

Configured build settings

The Gadget 2 source code comes from some examples of simulations that are built by different makefiles located in the Gadget2 folder (see Figure 7-13). We chose the Cosmological formation of a cluster of galaxies example for this section, which is built by the Makefile.cluster makefile. So it was needed to change the default settings to properly build the project:

1. Change the default build directory to Gadget2 in the gadget project properties (see Figure 7-14 on page 152).

2. Create a new make target that instructs Eclipse to run make -f Makefile.cluster from the build directory (see Figure 7-15 on page 152). For further information about management of makefile targets, consult the Eclipse help in the menu bar: Help → Help Contents → C/C++ Development User Guide → Tasks → Building projects → Creating a Make target.

Figure 7-13 Gadget 2 folders hierarchy

Figure 7-14 Changing default build directory

Figure 7-15 Creating a new make target

The project is now built using the cluster target instead of default all. In Figure 7-16 on page 153, the right-column box, Make Target, has an icon for the cluster target where we just double-click it to trigger a fresh build.

Figure 7-16 New created cluster make target

There is another configuration that many users do not realize is important regarding include paths and preprocessor symbols. If they are not properly set, some features, such as code search and completion, will not work perfectly because the Eclipse C/C++ parse does not understand application source code content completely (Figure 7-17).

Figure 7-17 Gadget 2 preprocessor symbols

More details about how to set Include paths and preprocessor symbols are at the menu: Help → Help Contents → C/C++ Development User Guide → Tasks → Building projects → Adding Include paths and symbols.

Creating a run launcher configuration

The next task was to create a run launcher configuration, which often served as a template for other launch configurations, such as debugger and IBM HPC Toolkit tools.

Our newly created configuration leveraged IBM Parallel Environment (PE) as the execution environment, as shown in Figure 7-18. However, any of the several other execution environments were applicable, and integration is provided by PEDE.

Figure 7-18 Set execution environment

Other tabs in the run launcher configuration were properly fulfilled:

•In the Application tab (Figure 7-19 on page 155): It set the gadget2 executable path.

•In the Arguments tab (Figure 7-20 on page 155): It set the input control file for the
Gadget2 simulation as an argument of the executable. It also changed the default working directory because the input control file makes references to others relative to the Gadget2 folder.

•In the Environment tab (Figure 7-21 on page 156): It was set to a runtime variable that otherwise would fail gadget 2 run. For example, LD_LIBRARY_PATH was exported in runtime because the application depends on third-party shared library and not in a default path.

Figure 7-19 Run launcher configuration: Set application path

Figure 7-20 Run launcher configuration: Set gadget2 arguments

Figure 7-21 Run launcher configuration: Set environment variables

Searching for hot spots using Xprof

We chose to use the Xprof tool (refer to “X Windows Performance Profiler” on page 130) as the first tool toward performance analysis of Gadget 2 because it allows discovery tasks and methods that most impact overall execution time.

The application is built with the required -pg flag, executed with the previous mentioned run configuration to get gprof data files and then loaded into Xprof (see Figure 7-22 on page 157).

Figure 7-22 Xprof main view

Switching to flat report (menu Report → Flat Profile) enabled us to visualize a rank of methods with the most CPU-intensive on top. Figure 7-23 on page 158 shows that 69.6% of (cumulated) time is spent in a single method named force_treeevaluate_shortrange. The force_treeevaluate_shortrange seemed to be a good candidate for our investigation, but before taking a look at this code we preferred to seek more information.

Figure 7-23 Xprof flat report

After double-clicking force_treeevaluate_shortrange in the flat report, Xprof pointed out the major rectangle in the bottom left of Figure 7-22 on page 157. Graphically, its dimensions represent two measurements:

•The width represents how many times a method was called by its parents.

•The height represents the amount of execution time a method spent in its own code, for example, excluding time spent in its children.

To zoom in the call graph, we applied a filter (menu Filter → uncluster Function) that resulted in Figure 7-24 on page 159.

Figure 7-24 Xprof main view with uncluster functions

The force_treeevaluate_shortrange’s rectangle stands out on both width and height proportions. Moreover, Figure 7-25 shows that the total amount of seconds in the self and self+desc fields are equal; therefore, none of the execution time was spent in children methods.

Figure 7-25 force_treeevaluate_shortrange: Sef+desc versus self time

In fact, force_treevaluate_shortrange was an obvious hot spot, but it was also needed to investigate its parent’s methods. If you right-click the rectangle, and select menu options like in Figure 7-26 on page 160, a chain of parents is displayed. The force_treeevaluate_shortrange method has only one parent method, as shown in Figure 7-27 on page 161.

Figure 7-26 Show ancestor functions only

Figure 7-27 force_treeevaluate_shortrange ancestor functions

Searching for hot spot method declaration

After identified force_treeevaluate_shortrange is identified as a hot spot and gravity_tree has its own ancestor method, it is time to search for locations of their declaration into the Gadget 2 source code. Although Xprof provides facilities to open and browse application’s code, it was preferred to use the project already imported into Eclipse to take advantage of other tools of PEDE for further analysis and to eventually make use of Eclipse’s code editor rich capabilities.

The search window shown in Figure 7-28 on page 162 can be opened by clicking the C/C++ language option (menu Search → C/C++). The search result for force_treeevaluate_shortrange method is shown in Figure 7-29 on page 162.

Select the occurrence under forcetree.c, double-click, and its source code opens in the Eclipse editor. We also opened the occurrence under gravtree.c, which turned out to be the only location where force_treeevaluate_shortrange is called.

Figure 7-28 Searching for force_treeevaluate_shortrange

Figure 7-29 Opening force_treeevaluate_shortrange implementation

Further information about the Eclipse search in the C/C++ source code is at: Help → Help contents → C/C++ Development User Guide → Tasks → Searching the CDT.

7.3 Analyzing a ScaLAPACK routine using the HPCT Toolkit

ScaLAPACK is a library of high-performance linear algebra routines for parallel distributed memory machines. ScaLAPACK solves dense and banded linear systems, least squares problems, eigenvalue problems, and singular value problems.

In this section, we analyze the performance of the ScaLAPACK routine PDGETRF, which performs an LU factorization of a dense real matrix. This calculation is similar to that used in the well-known Linpack Benchmark.

7.3.1 The Structure of ScaLAPACK

ScaLAPACK is a continuation of the LAPACK project, which contains high-performance linear algebra software for shared memory computers. LAPACK itself is built on top of the well-know Basic Linear Algebra Subprograms (BLAS). The BLAS routines are a set of kernels for performing low-level vector and matrix operation. Since the interfaces to the BLAS routines have been defined to be a de facto standard, computer vendors can provide highly-tuned versions of the BLAS routines, designed to perform optimally on a given architecture.

ScaLAPACK extended the idea of the BLAS and defines a set of BLACS. In this way, high performance can be achieved by confining machine-specific optimizations to the BLAS and BLACS libraries.

The high-level structure of the ScaLAPACK and related library is shown in Figure 7-30.

Figure 7-30 ScaLAPACK structure

The relevant components of the directory structure of the ScaLAPACK distribution are:

SCALAPACK

lapack-3.4.1

BLAS

SRC
scalapack-2.0.2

BLACS

PBLAS

SRC

TESTING

LIN

EXAMPLE

The ScaLAPACK library contains an extensive suite of test programs in the scalapack-2.0.2/TESTING directory. Typically these test programs test an individual routine with many different input parameters to test the correctness of the installation. We modified the top-level driver pdludriver.f, used it to build the test program xdlu and to increase the value of the parameter TOTMEM, so that larger problem sizes are handled. We also modified the data file that drives the xdlu program to solve a single large problem, rather than many small ones.

7.3.2 The build process

Both LAPACK and ScaLAPACK have a top-level Makefile, which defines all the build targets and the dependencies. Each of these Makefiles includes a file (make.inc and SLmake.inc, respectively) that defines the compiler options to be used on a particular platform.

Within each directory, we run the following make commands:

lapack-3.4.1

make blaslib

make lapacklib

scalapack-2.0.2

make lib

make scalapackexe

Within the corresponding Eclipse/PTP project, we define these targets by right-clicking the lapack-3.4.1 and scalapack-2.0.2 directories in the Project Explorer window and selecting Make Targets → Build (Alt-F9) → Add.

7.3.3 CPU profiling

The first step in CPU profiling is to perform a CPU hot-spot analysis. To do this, we add the flags -g -pg to lapack-3.4.1/make.inc and scalapack-2.0.2/SLmake.inc, and rebuild the package.

Running the program on a single node:

poe ./xdlu -procs 16

The run generates a series of gmon.out files in the format used by the standard UNIX utility gprof. On the Power architecture, running either Linux or AIX, the Xprof utility can be used to explore these profiles:

source /opt/ibmhpc/ppedev.hpct/env_sh

Xprof ./xdlu prof*/gmon*

The following window appears as shown in Figure 7-31 on page 165.

Figure 7-31 Profile report

We run View → Uncluster functions so that the program call tree is no longer enclosed in the box associated with the program code. This changes the appearance of the window, as shown in Figure 7-32 on page 166.

Figure 7-32 Profile report uncluster view

Next we examine the traditional gprof “flat profile” by selecting Report → Flat Profile (Figure 7-33 on page 167).

Figure 7-33 Flat profile report view

This shows that the overwhelming majority of time, 92.5%, is spent in the Level 3 BLAS routine DGEMM, which is a matrix-matrix multiplication routine. At this point, the obvious course of action is to replace the Fortran BLAS library, distributed with the LAPACK library, with calls to vendor-specific tuned BLAS - ESSL for the IBM Power architecture, or the MKL library for the Intel x86_64 architecture. However, we continue working with the Fortran DGEMM BLAS routine to demonstrate the other tools in the HPCT suite.

It is unusual in real-world applications to find one routine consuming such a high proportion of CPU time, and it is easy to locate the rectangle corresponding to DGEMM in the call graph. Normally, from the flat profile view, we select the line corresponding to DGEMM, and select Tools → Locate in the graph to show the DGEMM rectangle, as shown in Figure 7-34 on page 168.

Figure 7-34 DGEMM rectangle

The rectangle is almost exactly square, showing a large amount of time spent in DGEMM, where all of it is spent in the routine itself and nothing in its children. By right-clicking the DGEMM rectangle and selecting display source code, a window is displayed that shows the source code of the DGEMM routine. Because all of the routines were compiled with the XL compilers’ -qfullpath flag, the tool can locate the source code immediately, without t specifying a path to the source directory.

By scrolling down the window, we find the lines of code where the time is being spent, as shown in Example 7-1.

Example 7-1 Lines of the program where the running time is spent

no. ticks

line per line source code

303 2 DO 90 J = 1,N

304 IF (BETA.EQ.ZERO) THEN

305 DO 50 I = 1,M

306 C(I,J) = ZERO

307 50 CONTINUE

308 ELSE IF (BETA.NE.ONE) THEN

309 DO 60 I = 1,M

310 C(I,J) = BETA*C(I,J)

311 60 CONTINUE

312 END IF

313 18 DO 80 L = 1,K

314 129 IF (B(L,J).NE.ZERO) THEN

315 TEMP = ALPHA*B(L,J)

316 6 DO 70 I = 1,M

317 24093 C(I,J) = C(I,J) + TEMP*A(I,L)

318 70 CONTINUE

319 END IF

320 80 CONTINUE

321 90 CONTINUE

Each tick is a sample of 1/100th second so that 24093 ticks associated with line 317 corresponds to 240.93 seconds of CPU time associated with the line of code.

7.3.4 HPM profiling

The “DO 70” loop in Example 7-1 on page 168 is written so that the arrays are accessed in column major order, which is typically good for Fortran code. However, better performance can usually be achieved by unrolling DO-loops, so that more use is made of each array element that is loaded into cache before that element is discarded.

Such optimizations are typical of the customized BLAS routines available in the IBM ESSL and the Intel MKL libraries. In this case, we use compiler optimizations to show how different levels of optimization can affect the performance of this loop.

In general, hardware performance counters are extremely hard to interpret in absolute terms without detailed knowledge of both the underlying hardware and the application. However, in tuning an application, it is often possible to get some useful information by examining performance counters before and after tuning occurs.

For ScaLAPACK, we instrumented the xdlu executable for gathering HPM information using the command:

hpctInst -dhpm ./xdlu

This command creates the file xdlu.inst. Note that in general it is better to instrument just a few key areas of the code because the full HPM instrumentation adds significantly to the overall run time. We therefore instrumented xdlu using Eclipse/PTP, adding instrumentation ONLY to the call to pdgemm. We choose this, rather than dgemm itself, because this section of code remains unchanged even though next we completely replace the dgemm routine.

First, we run the application using the default HPM Event group. For each task, we obtain a .viz file and a .txt file. First, we examine one of the .txt files and search for the section labelled pdgemm, as shown in Figure 7-35 on page 170.

Instrumented section: 1 - Label: pdgemm

process: 3276958, thread: 1

file: pdgemm_.c, lines: 259 <--> 506

Context is process context.

No parent for instrumented section.

Inclusive timings and counter values:

Execution time (wall clock time) : 43.9370487332344 seconds

Initialization time (wall clock time): 0.19987041503191 seconds

Overhead time (wall clock time) : 0.0135871283710003 seconds

PM_FPU_1FLOP (FPU executed one flop instruction) : 17912567

PM_FPU_FMA (FPU executed multiply-add instruction) : 35711748287

PM_FPU_FSQRT_FDIV (FPU executed FSQRT or FDIV instruction) : 0

PM_FPU_FLOP (FPU executed 1FLOP, FMA, FSQRT or FDIV instruction) : 35729660854

PM_RUN_INST_CMPL (Run instructions completed) : 190402795098

PM_RUN_CYC (Run cycles) : 184265923937

Utilization rate : 99.759 %

Instructions per run cycle : 1.033

Total scalar floating point operations : 71441.409 M

Scalar flop rate (flops per WCT) : 1625.995 Mflop/s

Scalar flops per user time : 1629.925 Mflop/s

Algebraic floating point operations : 71441.409 M

Algebraic flop rate (flops per WCT) : 1625.995 Mflop/s

Algebraic flops per user time : 1629.925 Mflop/s

Scalar FMA percentage : 99.975 %

Scalar percentage of peak performance : 9.693 %

Figure 7-35 pdgemm output

We note two specific values here: Scalar percentage of peak performance, which is 9.963%, and Algebraic flob rate (flops per WCT), which is 1625.925Mflop/s.

To show the difference, we can potentially tune the dgemm kernel optimally, link the code with the IBM ESSL (Engineering and Scientific Subroutine Library), and repeat the experiment. In this case, we obtain the following information for the pdgemm section, as shown in Figure 7-36 on page 171.

Instrumented section: 1 - Label: pdgemm

process: 3670092, thread: 1

file: pdgemm_.c, lines: 259 <--> 506

Context is process context.

No parent for instrumented section.

Inclusive timings and counter values:

Execution time (wall clock time) : 19.9932569861412 seconds

Initialization time (wall clock time): 0.202705550938845 seconds

Overhead time (wall clock time) : 0.0138184614479542 seconds

PM_FPU_1FLOP (FPU executed one flop instruction) : 18123119

PM_FPU_FMA (FPU executed multiply-add instruction) : 35711748287

PM_FPU_FSQRT_FDIV (FPU executed FSQRT or FDIV instruction) : 0

PM_FPU_FLOP (FPU executed 1FLOP, FMA, FSQRT or FDIV instruction) : 35729871406

PM_RUN_INST_CMPL (Run instructions completed) : 61851849877

PM_RUN_CYC (Run cycles) : 83586326217

Utilization rate : 99.446 %

Instructions per run cycle : 0.740

Total scalar floating point operations : 71441.620 M

Scalar flop rate (flops per WCT) : 3573.286 Mflop/s

Scalar flops per user time : 3593.178 Mflop/s

Algebraic floating point operations : 71441.620 M

Algebraic flop rate (flops per WCT) : 3573.286 Mflop/s

Algebraic flops per user time : 3593.178 Mflop/s

Scalar FMA percentage : 99.975 %

Scalar percentage of peak performance : 21.368 %

Figure 7-36 pdgemm new output

We now see that the Scalar percentage of peak performance increased to 21.368%, and the Algebraic flop rate (flops per WCT) increased to 3573.286Mflop/s.

The DGEMM routine in ESSL was carefully constructed to make much effective use of the memory hierarchy, so that as much use is made of data loaded into the caches, close to the processor, before calculation needs to be delayed by accessing memory. We can see this by looking at a counter group that lists details of memory access. When we run the Fortran version, and look at group 11, we see the values for the counters shown in Figure 7-37 on page 172.

PM_DATA_FROM_MEM_DP (Data loaded from double pump memory) : 0

PM_DATA_FROM_DMEM (Data loaded from distant memory) : 0

PM_DATA_FROM_RMEM (Data loaded from remote memory) : 10345114

PM_DATA_FROM_LMEM (Data loaded from local memory) : 11189436

PM_RUN_INST_CMPL (Run instructions completed) : 191101049438

PM_RUN_CYC (Run cycles) : 185020305804

Figure 7-37 Memory counter group details

This can be contrasted with values obtained from the ESSL version of DGEMM, as shown in Figure 7-38.

PM_DATA_FROM_MEM_DP (Data loaded from double pump memory) : 0

PM_DATA_FROM_DMEM (Data loaded from distant memory) : 0

PM_DATA_FROM_RMEM (Data loaded from remote memory) : 2878806

PM_DATA_FROM_LMEM (Data loaded from local memory) : 2243331

PM_RUN_INST_CMPL (Run instructions completed) : 62033373737

PM_RUN_CYC (Run cycles) : 83109304433

Figure 7-38 Values from the ESSL version of DGEMM

We can see clearly from these statistics that the ESSL DGEMM routine loads far less data from memory.

Raw performance counters can also be examined using the peekperf GUI, as shown in Figure 7-39 on page 173.

Figure 7-39 Raw performance counters shown using the peekperf GUI

7.3.5 MPI profiling

We next perform MPI profiling, which can be enabled by linking with the libmpitrace library. We access the PEDE environment variables source /opt/ibmhpc/ppedev.hpct/env_sh and add the following syntax to the link step in the file TESTING/LIN/Makefile:

-L$(IHPCT_BASE)/lib64 -lmpitrace

Alternatively, we can use dynamic instrumentation to add tracing to a prebuilt executable:

hpcInst -dmpi xdlu

Although we do not modify the source code to alter the communications patterns, we can still influence the way the code performs by modifying some of the parameters in the input file.

One of the key parameters that affects the performance of ScaLAPACK routines is the block size, given by MB in the input file. If the block size in linear algebra routines is too small, excessive amounts of communication with small messages is generated. On the other hand, if the block size is too large, the blocks will be too large to fit into the processor cache, so that the reduction in communications is offset by the reduced computational efficiency.

To illustrate this, we show the textual MPI profile information and the graphical display of MPI calls for the two cases where NB=32 (which is a reasonably good value for the Fortran code), and NB=256 (which is too large). In each case, we request that information be collected for all tasks, and all events are also collected to the trace file. To achieve this, we set the following environment variables:

export OUTPUT_ALL_RANKS=yes

export TRACE_ALL_TASKS=yes

export TRACE_ALL_EVENTS=yes

We found initially that the default number of events to trace, 30,000, was too small, and se we increased it as follows:

export MAX_TRACE_EVENTS=500000

Examining the file hpct_0_0.mpi.txt, the trace from task 0, for NB=32 is shown in Figure 7-40.

-----------------------------------------------------------------

MPI Routine #calls avg. bytes time(sec)

-----------------------------------------------------------------

MPI_Comm_size 1 0.0 0.0000

MPI_Comm_rank 3 0.0 0.0000

MPI_Send 3428 32055.3 0.101

MPI_Rsend 6 8.0 0.0000

MPI_Isend 47755 2307.9 0.249

MPI_Recv 56648 3861.0 6.608

MPI_Testall 47755 0.0 0.048

MPI_Waitall 473 0.0 0.000

MPI_Bcast 9341 15476.4 1.248

MPI_Barrier 2 0.0 0.020

MPI_Reduce 789 1270.8 0.207

MPI_Allreduce 37 5.9 0.004

-----------------------------------------------------------------

total communication time = 8.485 seconds.

total elapsed time = 56.336 seconds.

Figure 7-40 hpct_0_0.mpi.txt output file

The bottom of this file is shown in Figure 7-41.

Communication summary for all tasks:

minimum communication time = 8.485 sec for task 0

median communication time = 9.411 sec for task 4

maximum communication time = 10.226 sec for task 14

Figure 7-41 hpct_0_0.mpi.txt bottom section

The corresponding sections of the task 0 trace file when NB=256 are shown in Figure 7-42 on page 175.

-----------------------------------------------------------------

MPI Routine #calls avg. bytes time(sec)

-----------------------------------------------------------------

MPI_Comm_size 1 0.0 0.0000

MPI_Comm_rank 3 0.0 0.000

MPI_Send 515 411610.2 0.442

MPI_Rsend 6 8.0 0.000

MPI_Isend 45691 2367.4 0.223

MPI_Recv 52190 6067.0 16.339

MPI_Testall 45691 0.0 0.046

MPI_Waitall 128 0.0 0.000

MPI_Bcast 6555 626.8 0.126

MPI_Barrier 2 0.0 0.031

MPI_Reduce 237 4286.7 0.860

MPI_Allreduce 37 5.9 0.002

-----------------------------------------------------------------

total communication time = 18.069 seconds.

total elapsed time = 97.995 seconds.

Figure 7-42 Task 0, NB = 256

The communication summary of all tasks is shown in Figure 7-43.

Communication summary for all tasks:

minimum communication time = 14.504 sec for task 9

median communication time = 18.069 sec for task 0

maximum communication time = 24.945 sec for task 15

Figure 7-43 Communication summary

A summary of MPI communications can be obtained by loading the .viz files into the peekperf GUI, as shown in Figure 7-44 on page 176.

Figure 7-44 .viz file into the peekperf GUI

It is also instructive to examine the communications patterns for both cases. We examine the .mpt file using the peekview GUI. In both cases, we present two views: the first is zoomed in to show approximately one second of execution, and the second zooms in further to show approximately 0.2 seconds.

For the NB=32 example, the second view is shown in Figure 7-45 on page 177.

Figure 7-45 NB case second view

For the NB=32 case, the 0.2 second view is shown in Figure 7-46.

Figure 7-46 NB=32 case 0.2 view

For the NB=256 case, the one second view is shown in Figure 7-47 on page 178.

Figure 7-47 Communication phase

We see that the communications phase lasts nearly the full second in this case. The 0.2 second view, zooming in on the busy part, is shown in Figure 7-47.

Figure 7-48 shows the peekperf trace view.

Figure 7-48 Peekperf trace view

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Chapter 7. Application profiling and tuning

Create new playlist

Sign In

Sign Up

Table of Contents for
Chapter 7. Application profiling and tuning