Chapter 3

Profiling performance of hybrid applications with Score-P and Vampir

Guido Juckeland*; Robert Dietrich†
*Department of Information Services and Computing, Helmholtz-Zentrum Dresden-Rossendorf (HZDR), Dresden, Germany
†Technische Universität Dresden, Dresden, Germany

Abstract

The purpose of this chapter is to familiarize the reader with the concept of evolutionary performance improvement and the tools involved when adding other parallelization paradigms to OpenACC applications. Such hybrid applications can suffer from a number of performance bottlenecks and a holistic picture of all activities during the application run can shed light on how to improve the overall performance.

At the end of this chapter the reader will have a basic understanding of:

 Terminology and methods for performance analysis of hybrid applications (e.g., MPI + OpenACC)

 How to modify hybrid applications to generate and record performance data

 How to visualize and analyze the performance data to make educated choices on application improvement

 Common pitfalls when extending an existing parallel application to include another parallelization paradigm

Keywords

MPI; CUDA; Profiling; Tracing; Sampling; Events; Score-P; Vampir; Performance analysis

OpenACC aims at providing a relatively easy and straightforward way to describe parallelism on platforms with hardware accelerators. By design, it is also an approach for porting legacy High Performance Computing (HPC) applications to novel architectures. Legacy and new applications that exceed the capabilities of a single node can use the Message Passing Interface (MPI) for internode communication and coarse work distribution. The combined OpenACC and MPI applications are referred to as hybrid applications. It is also possible to combine OpenACC with OpenMP on the host side to utilize all resources of a compute node, or even to use all three levels of parallelism (MPI, OpenACC, and OpenMP) concurrently. Tuning application performance for a single parallelization paradigm is already challenging; adding a second or third level of parallelism introduces a whole new layer of potential performance issues due to the interaction of all parallelization paradigms. Profile-guided development can also cover these hybrid computation scenarios when the right profiling tools are used.

Profiling tools from compiler or accelerator vendors are usually limited to the scheme their product addresses, e.g., only OpenACC and/or CUDA/OpenCL activity. Nearly all vendor tools are unable to record MPI activity, leaving the programmer in the dark about how well hybrid applications perform across all aspects of the application's parallelism. Research-based performance tools cover this gap. HPCToolkit, TAU, and Score-P are the most prominent third-party profiling tools that also offer hardware accelerator support. Of the three, Score-P covers the most parallelization paradigms, can record the most concurrent activity, and, as a result, can provide the most complete performance picture even for very complex applications. Therefore, Score-P is used as the example performance recording tool for this chapter. Vampir is used for visualizing the performance data since it is by far the most capable trace visualizer and profile generator.

You can get Score-P at http://www.score-p.org and Vampir at http://www.vampir.eu.

Performance Analysis Techniques and Terminology

While the term “profiling” is commonly used to describe application performance analysis as a whole, it technically only refers to a subset of the performance measurement and analysis techniques. A formal comparison of the techniques is shown in Fig. 1.

Fig. 1 Formal categorization of performance analysis techniques and their relation (Ilsche, Schuchart, Schöne, & Hackenberg, 2014).

Performance measurement and visualization is composed of three steps: data acquisition, recording, and presentation. The performance monitor's task is to capture the behavior of the application. To do so, it can either interrupt the unmodified application and "pull" out information about what the application was doing when it was interrupted (sampling), or the application itself can be modified to "push" information about its activity to the performance monitor (event-based instrumentation). Data recording can either immediately summarize all data or fully log all activity as time-stamped entries. The full log file can be presented as a timeline or as a profile for any arbitrary time interval, while the summarized data can only be presented as a profile of the whole application run.

All analysis techniques have advantages and disadvantages. Sampling, which interrupts the application at a fixed sampling rate, causes a constant perturbation of the run time of the measured application; its measurement accuracy, however, depends on the sampling frequency. Event-based instrumentation records all targeted activity, i.e., all functions or code segments that have been augmented either manually or via the compiler with event triggers, but the run time perturbation depends on the event rate, which is not known at compile time and can slow the application down by orders of magnitude in a worst-case scenario. Immediate summarization maintains a low memory footprint for the performance monitor and does not lead to additional input/output (I/O), but it drops the temporal context of the recorded activity, whereas for logging it is the other way around. Profiles immediately show activities according to their impact on the application (typically the run time distribution over the functions) while omitting the temporal context. Timelines are again just the opposite; they show the temporal evolution of a program while making it harder to isolate the most time consuming activity right away.

Event-based tracing is able to record the "full picture" of all activity during an application run, albeit at the price of a potentially very high run time perturbation. Fortunately, a combination of the introduced techniques can be used to generate as much information as necessary at the lowest possible overhead. Score-P offers all of the presented performance analysis techniques, even concurrently, and as such is unique in its capabilities as a performance monitor.

Evolutionary Performance Improvement

Examples in this book show that profile-driven development can be used to constantly improve an application’s performance using OpenACC by offloading more and more activity and optimizing data transfers. Optimizing hybrid applications follows a similar pattern as illustrated in Fig. 2.

Fig. 2 The evolutionary application performance improvement cycle.

The performance optimization cycle starts with preparation of the application, followed by the actual measurement, and the analysis of the performance data. Based on this data the programmer will try to mitigate performance issues and start the whole process over again.

The remainder of the chapter explains how the first three steps of the performance optimization cycle are carried out with Score-P and Vampir, using a particle-in-cell simulation accelerated with CUDA in which the CUDA part could easily be substituted with an OpenACC implementation yielding the same results. Furthermore, various optimization steps are introduced that also highlight more generally applicable performance tuning options.

A Particle-in-Cell Simulation of a Laser Driven Electron Beam

Particle-in-cell codes simulate the movement of particles in electromagnetic fields by partitioning the simulation domain into a grid (the cells) while maintaining the particles as freely moving entities. The specific simulation (Burau et al., 2010) used for the performance study in this chapter describes how a very high energy laser pulse enters a hydrogen gas and, in its wake field, accelerates electrons to generate an electron beam traveling at almost the speed of light, without the need for a considerably larger conventional particle accelerator.

The actual simulation runs through discrete time steps, where each step involves four phases as shown in Fig. 3. First, the Lorentz force (F) acting on each particle as a result of the electric (E) and magnetic (B) fields is computed. Next, the particles are moved according to those computed forces. Since moving charged particles represent an electric current (J), those currents are computed next. Lastly, the computed current influences the electric and magnetic fields, which need to be updated before the cycle starts over. The duration of a simulated time step is chosen such that no particle can travel farther than one cell in one time step.
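The chapter does not spell out the underlying equations; for reference, the standard electromagnetic relations behind these four phases are (in SI units, with q the particle charge and v its velocity):

\[ \mathbf{F} = q\,(\mathbf{E} + \mathbf{v} \times \mathbf{B}) \]

\[ \frac{\partial \mathbf{B}}{\partial t} = -\nabla \times \mathbf{E}, \qquad \frac{\partial \mathbf{E}}{\partial t} = c^2\, \nabla \times \mathbf{B} - \frac{\mathbf{J}}{\varepsilon_0} \]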

Fig. 3 Phases of the PIConGPU algorithm for a single time step.

PIConGPU originates from a proof-of-concept written by a high school student during an internship at HZDR. It was a single Graphics Processing Unit (GPU) CUDA implementation that ran orders of magnitude faster than any other PIC code. Since then the application has been ported to multiple GPUs and the code has moved from CUDA C to C++11. Using the steps described in this chapter, the overall performance has been improved even further. Functionality of PIConGPU that can be reused by other applications has been extracted into external libraries, so that other particle-mesh simulations can benefit from the key components that make PIConGPU so fast. The source code of PIConGPU is available at https://github.com/ComputationalRadiationPhysics/picongpu/.

Preparing the Measurement Through Code Instrumentation

In order to acquire very detailed performance data, the source code of the application under test needs to be modified to push events to a performance monitor. This process is called instrumentation. Score-P typically uses compiler instrumentation, which means the compiler is invoked with additional options to generate callbacks for all function entries and exits, which are then handled by Score-P. These callbacks are the events that were introduced previously. This of course requires that the compiler supports the injection of such callbacks, which most current compilers do.
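As an illustration of what such compiler-inserted callbacks look like (this is the generic GCC mechanism enabled by -finstrument-functions, not Score-P's actual implementation), a monitor only has to provide the following two hooks and record a time-stamped event in each:

/* Sketch of compiler-inserted entry/exit hooks (GCC: -finstrument-functions). */
#include <stdio.h>

/* Keep the hooks themselves uninstrumented to avoid infinite recursion. */
void __cyg_profile_func_enter(void *fn, void *call_site)
    __attribute__((no_instrument_function));
void __cyg_profile_func_exit(void *fn, void *call_site)
    __attribute__((no_instrument_function));

void __cyg_profile_func_enter(void *fn, void *call_site) {
    fprintf(stderr, "enter %p (called from %p)\n", fn, call_site);  /* "push" an event */
}

void __cyg_profile_func_exit(void *fn, void *call_site) {
    fprintf(stderr, "exit  %p\n", fn);                              /* "push" an event */
}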

All parallelization paradigms (MPI, OpenMP, Pthreads, OpenACC, CUDA, OpenCL, OpenSHMEM, or any combination) are instrumented automatically using provided performance tool interfaces, wrapper libraries, or source-to-source transformation. As a result, Score-P can record all activity directly without manual changes to the source code of the application under test.

In order to invoke the compiler instrumentation, Score-P provides compiler wrappers for most of the common compilers. These wrappers will add the correct flags to insert the necessary callbacks through the compiler. A generic wrapper script called scorep provides access to these wrappers. It is used as shown in Fig. 4.

Fig. 4 Makefile modification to enable the Score-P compiler wrappers.
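A minimal sketch of the kind of Makefile change Fig. 4 depicts is shown below; the compiler and variable names are hypothetical, and the essential step is simply prefixing every compile and link command with the scorep wrapper:

# Makefile sketch (hypothetical names): prefix the existing compile/link
# commands with the scorep wrapper; everything else stays unchanged.
CXX  = scorep mpic++      # was: CXX  = mpic++
NVCC = scorep nvcc        # was: NVCC = nvcc
LINK = scorep mpic++      # the wrapper is also used for the final link step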

In the linking step it might be necessary to explicitly specify the accelerator paradigms in case the Score-P wrapper does not detect them correctly. For the PIConGPU example, the Score-P MPI compiler wrapper that is used for linking the whole application must be informed that it is actually targeting a CUDA application, so that it links the corresponding CUDA monitoring plugin into the application to record CUDA activity. The upcoming Score-P 3.0 release will also include OpenACC event recording through the OpenACC performance tool API. This requires a compiler that supports the API and may also require passing the --openacc flag to the Score-P compiler wrapper.
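A hedged example of such a link command is shown below; the object files are hypothetical, and the exact option spellings should be verified against the installed version's scorep --help:

# Tell the wrapper at link time that this is a CUDA application
scorep --cuda mpic++ -o picongpu main.o kernels.o -lcudart

# With Score-P 3.0 and later, OpenACC event recording can be requested as well
scorep --openacc --cuda mpic++ -o picongpu main.o kernels.o -lcudart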

The above example illustrates the general principle of source code instrumentation. For PIConGPU the instrumentation is slightly more complex. First of all, one does not want to record functions such as constructors or destructors; thus, they are compiled without instrumentation. In addition, the option of providing a filter list to the compiler instrumentation is used to exclude a number of further functions from instrumentation. With respect to the general behavior, it is important to capture the four phases of each iteration with as little run time overhead as possible; the underlying details can be added on demand.
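A sketch of what such a filter list can look like is shown below; the region patterns are hypothetical. The file can be applied at measurement time by pointing the SCOREP_FILTERING_FILE environment variable at it and, depending on the compiler, also at instrumentation time:

# picongpu.filter -- exclude uninteresting regions from the measurement
SCOREP_REGION_NAMES_BEGIN
  EXCLUDE
    *Constructor*
    *Destructor*
    std::*
SCOREP_REGION_NAMES_END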

Score-P 2.0 introduced the capability to acquire application activity through sampling. In order to use this feature, the application still needs to be instrumented so that the activity of the parallelization libraries is recorded using events, and external tools are required for acquiring the call stack.

Recording Performance Information During the Application Run

An instrumented application launches the Score-P performance monitor automatically when the first instrumented event is executed. The performance monitor is configured through a number of environment variables. In order to minimize the run time perturbation, the default settings of Score-P produce an event-based profile that does not include any accelerator activity. To set up Score-P to record all relevant activity for the PIConGPU example, the environment variables shown in Fig. 5 are used. In order to enable tracing of OpenACC API activity using the upcoming Score-P 3.0 release, the environment variable SCOREP_OPENACC_ENABLE needs to be set to "yes." The available options are described in the Score-P documentation.

Fig. 5 Score-P environment variables to enable MPI + CUDA tracing.
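A sketch of the kind of settings Fig. 5 contains is shown below; the variable names follow the Score-P documentation, while the buffer size is an arbitrary example value:

export SCOREP_ENABLE_PROFILING=true   # keep the summarized profile
export SCOREP_ENABLE_TRACING=true     # additionally write a full event trace
export SCOREP_CUDA_ENABLE=yes         # record CUDA activity (API, kernels, memory copies)
export SCOREP_TOTAL_MEMORY=256M       # per-process measurement buffer (example value)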

After the program has executed, the current working directory contains a new subdirectory named scorep-<timestamp_of_program_run>. Inside this subdirectory you find both the profile (profile.cubex) and the trace file (traces.otf2), as shown in Fig. 6.

Fig. 6 Contents of the scorep-* subdirectory after a successful combined profile and trace run.

The generated profile is a very valuable first step toward understanding the overall application behavior. For applications where the event rate is unknown or expected to be very large, it is generally recommended to generate only a profile first and to extend tracing only to the needed levels of information, either through limited instrumentation or through record filtering. The profiles (profile.cubex) can be visualized with the viewer "cube." An example view is shown in Fig. 7.
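For example, assuming a hypothetical result directory name, profile and trace can be opened as follows (the cube and vampir commands must be available in the PATH):

cube   scorep-20160607_1430_12345/profile.cubex   # interactive profile browser
vampir scorep-20160607_1430_12345/traces.otf2     # interactive trace browser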

Fig. 7 Screenshot of the cube profile viewer (Geimer et al., 2010).

The cube profile viewer shows the analysis of the application run with respect to multiple metrics (left-most column). In the middle the currently selected metric is split up amongst the functions of the program. In the right column the selected metric for the selected function is split up among all processes and threads of the application run.

The trace file allows for an in-depth analysis of both the profile data as well as the temporal context. How this can be used to optimize our PIConGPU application is shown in the following section. A more detailed explanation of all Score-P settings can be found in the manual that is part of the installation.

Looking at a First Parallel PIConGPU Implementation

As a next step, the trace file, traces.otf2, is opened with Vampir, as shown in Fig. 8. The trace thumbnail (top right) shows that only 0.2 s out of the whole application run are selected, and the repetitive pattern suggests that about 2.5 iteration steps of the simulation are shown. In the middle, the master timeline with the color-coded activities shows both the host processes with their MPI activity (Process 1–4) and the corresponding CUDA contexts (Thread 1/1–4). The legend in the lower right shows the meaning of the colors. The black lines between processes represent MPI messages that are exchanged; the ones between a process and a thread are CUDA memory copies. It can be seen that MPI activity dominates the program execution while the actual CUDA activity occupies only a small fraction of the time.

Fig. 8 Vampir displaying the trace of a first naive MPI parallelization of PIConGPU.

The various displays for the performance data can be selected through the toolbar icons on the top left. There are two groups of displays: the timeline displays and the statistics displays. The timeline displays show the activity along its temporal evolution in the horizontal direction. In this group is the master timeline, which shows the color-coded activity of all event streams (which can be processes, threads, or CUDA streams).

The color code is by default based on the function groups as shown in the function legend display in the lower right. The process timeline shows the calling context over time for one event stream. The counter timeline displays the values of a performance counter associated with one event stream. This can be e.g., a PAPI counter, but also derived counters such as memory allocations or anything that was created with the Vampir metric editor. The temporal context of a counter for all event streams is shown in the performance radar.

The second display group contains the statistics displays which present various profile data. The function summary shows the distribution of run time or number of invocations among functions or function groups. The message summary provides statistics about all data transfers (between host and device or between MPI processes). The other displays are the process summary, the message matrix, the I/O summary and the call tree. A detailed explanation of all Vampir features can be found in its manual which is part of the installation.

The most important feature of Vampir, which sets it apart from standard profiling tools, is the ability to zoom into any arbitrary time interval of the program execution timeline. Only an excerpt of the trace is shown in Fig. 8, as indicated in the thumbnail view in the upper right (the selected interval is delimited by the two black bars). All displays, including the statistics, are always updated to the currently displayed time interval. As a result, Vampir enables profiling of any phase of the application and at extremely high temporal resolution. All displays of Vampir can be configured with respect to how they present their information by right-clicking into the display. Left-clicking on anything in any display brings up a context view that shows details about the selected item (such as start/end time, duration, function name, file name, line number, message size, etc.).

The trace file shown in Fig. 8 depicts the run of a first attempt at parallelizing the single-GPU PIConGPU code using MPI to increase the simulation area (weak scaling). The trace shows a typical problem of such a first parallelization attempt: sequential task execution. In this case one can see that both the MPI activity and the CUDA memory copies dominate the host execution time while GPU utilization is rather low.

The high portion of MPI activity is due to very long message transfer times for rather small messages (which can be analyzed using the message summary). It turns out that PIConGPU computes so fast on the GPU that the MPI transfers cannot be hidden. Since the messages are small, the only option for reducing the transfer times is to reduce the transfer latency. In this case the 1 Gbit/s Ethernet in the development cluster was augmented with an InfiniBand network to provide significantly faster MPI communication.

Another issue is the rather long phase of synchronous CUDA memory copies (the bar of the host processes right before a set of kernels is started on the device). Here, better interleaving of compute and data transfer will increase the overall throughput. The very low GPU utilization is not an issue by itself, but rather an effect of the host not being able to supply enough work to the GPU as it is waiting for data from others or busy with other work.

Freeing Up the Host Process

The next, improved version of PIConGPU addresses the identified issues and introduces an additional Pthread per process into which all MPI communication activity is offloaded (Thread 1–4:2). As shown in Fig. 9, this frees up the host process to launch work on the GPU as soon as all required data are available, and to retrieve data for communication with neighboring processes as early as possible. The overall GPU utilization has improved, also due to the reduced message latency of the InfiniBand fabric.

Fig. 9 A first improved version of PIConGPU.
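A minimal sketch of this pattern, i.e., moving the (potentially blocking) MPI communication into a dedicated host thread so that the main thread can keep feeding the GPU, is shown below. The function and buffer names are hypothetical and do not reflect PIConGPU's actual implementation, and MPI must be initialized with a sufficient thread support level (e.g., via MPI_Init_thread):

// Sketch: dedicated communication thread (hypothetical names, C++11 + MPI).
#include <mpi.h>
#include <thread>

// Hypothetical GPU work; in PIConGPU these would launch CUDA kernels.
void launchKernelsForInnerDomain()  { /* kernels that need no ghost data */ }
void launchKernelsForBorderDomain() { /* kernels that need the received ghost cells */ }

void exchangeGhostCells(double* sendBuf, double* recvBuf, int count, int neighbor) {
    // A blocking exchange only blocks the communication thread, not the GPU work.
    MPI_Sendrecv(sendBuf, count, MPI_DOUBLE, neighbor, 0,
                 recvBuf, count, MPI_DOUBLE, neighbor, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

void timeStep(double* sendBuf, double* recvBuf, int count, int neighbor) {
    // Communication runs concurrently with kernel launches on the main thread.
    std::thread comm(exchangeGhostCells, sendBuf, recvBuf, count, neighbor);

    launchKernelsForInnerDomain();   // work that does not depend on the exchange

    comm.join();                     // ghost data has arrived
    launchKernelsForBorderDomain();  // work that consumes the ghost cells
}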

Optimizing GPU Kernels

Now that the GPU is busy most of the time, the question arises if the GPU computing time can be reduced. Using the function summary to display only the functions of the CUDA group as shown in Fig. 9, it can be seen that the dominating kernel is called “moveParticles” followed by “cptCurrent.”

What both kernels have in common is that they need to traverse the particle list, first to accumulate the contributions to the aggregate electrical current of a cell (cptCurrent) and then to update the positions of the particles (moveParticles). It turns out that the data structure used, a linked list of C structs representing the particles (storing position, velocity, and charge) that was carried over from the original PIC Central Processing Unit (CPU) implementation, does not work well on a GPU, which requires coalesced memory accesses by neighboring CUDA threads. The particle data structure was therefore changed to a list of structs of arrays of 256 floats, and the performance improved dramatically, as shown in Fig. 10. This is also due to changing the MPI communication routines from synchronous to asynchronous calls.

Fig. 10 PIConGPU using the optimized particle data structure and asynchronous MPI communication.
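A sketch of this data layout change is shown below; the member names and the frame size of 256 are illustrative only, as PIConGPU's actual frame data structure is more elaborate:

// Original CPU-style layout: linked list of per-particle structs (array of structures).
// Neighboring CUDA threads access addresses that are far apart -> uncoalesced loads.
struct ParticleNode {
    float pos[3], vel[3], charge;
    ParticleNode* next;
};

// GPU-friendly layout: a list of fixed-size frames, each a structure of arrays.
// Thread i of a warp reads pos_x[i], so the accesses are contiguous and coalesced.
constexpr int FRAME_SIZE = 256;
struct ParticleFrame {
    float pos_x[FRAME_SIZE], pos_y[FRAME_SIZE], pos_z[FRAME_SIZE];
    float vel_x[FRAME_SIZE], vel_y[FRAME_SIZE], vel_z[FRAME_SIZE];
    float charge[FRAME_SIZE];
    ParticleFrame* next;   // frames can still be chained per cell
};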

Adding GPU Task Parallelism

When zooming into a host-device pair in Fig. 10, it turns out that there is quite some lag between some kernel launches and the start of their execution. Furthermore, there are still times when the GPU is idle due to synchronous CUDA memory copies. The introduction of asynchronous GPU activity using CUDA streams enables PIConGPU to push more work to the GPU and let the GPU figure out the best way to process it. The result can be seen in Fig. 11. Each host process now uses multiple CUDA streams (in this case five streams per GPU): one as a target for CUDA memory copies, the remainder to offload concurrent work.

Fig. 11 PIConGPU at large scale, running on 512 nodes of OLCF’s Titan.

In order to achieve this extremely high level of concurrency on the GPU, PIConGPU implements an internal event system that can automatically trigger activity and map data dependencies. This event system is translated into CUDA events, so that a kernel can be launched even if its input data is still being transferred. As a result, PIConGPU is able to scale to a very large number of GPUs while still maintaining a very high GPU utilization. PIConGPU and its descendants frequently run on OLCF's Titan using the whole system, and PIConGPU was a Gordon Bell Award finalist in 2013 (Bussmann et al., 2013).
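The underlying stream-and-event pattern can be sketched in plain CUDA as follows; the kernel and buffer names are hypothetical, error checking is omitted, and the host buffer should be pinned (cudaMallocHost) for the copy to be truly asynchronous:

#include <cuda_runtime.h>

__global__ void moveParticles(float* particles) { /* ... */ }
__global__ void cptCurrent(float* particles, const float* ghost) { /* ... */ }

void runStep(float* d_particles, float* d_ghost, const float* h_ghost,
             size_t ghostBytes, dim3 grid, dim3 block) {
    cudaStream_t copyStream, computeStream;
    cudaEvent_t  copyDone;
    cudaStreamCreate(&copyStream);
    cudaStreamCreate(&computeStream);
    cudaEventCreate(&copyDone);

    // Upload the ghost-cell data asynchronously on the copy stream ...
    cudaMemcpyAsync(d_ghost, h_ghost, ghostBytes, cudaMemcpyHostToDevice, copyStream);
    cudaEventRecord(copyDone, copyStream);

    // ... while independent work already runs on the compute stream.
    moveParticles<<<grid, block, 0, computeStream>>>(d_particles);

    // The dependent kernel waits on the device (not on the host) for the copy.
    cudaStreamWaitEvent(computeStream, copyDone, 0);
    cptCurrent<<<grid, block, 0, computeStream>>>(d_particles, d_ghost);

    cudaStreamSynchronize(computeStream);
    cudaEventDestroy(copyDone);
    cudaStreamDestroy(copyStream);
    cudaStreamDestroy(computeStream);
}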

Investigating OpenACC Run Time Events With Score-P and Vampir

Compilers and runtimes have a certain freedom in implementing OpenACC directives. Therefore, it is important to review how directives are translated and executed. For example, a kernels directive can trigger the device initialization, device memory allocations, and data transfers without these operations being specified explicitly. The profiling interface introduced with OpenACC 2.5 defines a set of events that unveil details on the implementation and execution of OpenACC directives. It enables tools such as Score-P to measure the duration of OpenACC regions, expose waiting time and offloading overhead on the host, and track memory allocations on the accelerator. Further accelerator events, such as the begin and end of GPU kernels and CPU-GPU data transfers, can be gathered with the CUPTI interface for CUDA targets or via OpenCL library wrapping (Dietrich & Tschüter, 2015). OpenACC events relate low-level accelerator activity to the application's source code. They are marked as implicit or explicit and, depending on their type, they can also provide information such as the variable name of a data transfer or the kernel name of a launch operation.

Fig. 12 shows the Vampir visualization of an execution interval of an application that uses MPI, OpenMP, and OpenACC. In the selected interval two MPI processes execute the same program regions, each running two OpenMP threads and driving one GPU with two CUDA streams. Accelerator activities are triggered asynchronously, but the host spends most of its time waiting for their completion. For example, the call tree on the right shows that the update construct triggers a data download from the accelerator to the host and a wait operation. OpenACC and OpenMP regions are annotated with the file name and line number to allow correlation with the source code.

Fig. 12 OpenACC run time events expose the execution of OpenACC directives and relate accelerator activities with the program’s source code (Dietrich, Juckeland, & Wolfe, 2015).

Summary

While PIConGPU is a specific example, the identified performance bottlenecks are genuine and the presented solutions can be applied to other applications as well. Whether the accelerator is programmed using CUDA (as done in PIConGPU) or OpenACC makes no difference; the applied improvements are available in both paradigms or refer to the underlying MPI activity.

Lessons learned:

 Performance analysis needs to be an integral part of every (parallel and especially hybrid) program development effort in order to utilize the available resources as efficiently as possible.

 Sample-based profiling offers a first glimpse at potential hot spots in the program execution with very low run time overhead.

 Event based tracing provides the full picture of all concurrent activity during the program execution. The log level should be selected carefully in order to not overload the I/O subsystem.

 Interactive navigation through a trace file with the possibility for intermittent profiling of various application phases provides the application developer with a better understanding of what the program is actually doing at any point in time.

 Asynchronous activity for both MPI and accelerator activity is the key to high performance.

References

Burau H., Widera R., Hönig W., Juckeland G., Debus A., Kluge T., et al. PIConGPU: A fully relativistic particle-in-cell code for a GPU cluster. IEEE Transactions on Plasma Science. 2010;38(10):2831–2839.

Bussmann M., Burau H., Cowan T.E., Debus A., Huebl A., Juckeland G., et al. Radiative signatures of the relativistic Kelvin-Helmholtz instability. In: Proceedings of SC13: International conference for high performance computing, networking, storage and analysis (pp. 5:1–5:12); Denver, CO: ACM; 2013.

Dietrich R., Juckeland G., Wolfe M. OpenACC programs examined: A performance analysis approach. In: 44th international conference on parallel processing (ICPP); Beijing: IEEE; 2015:310–319.

Dietrich R., Tschüter R. A generic infrastructure for OpenCL performance analysis. In: 8th International conference on intelligent data acquisition and advanced computing systems (IDAACS); Warsaw: IEEE; 2015:334–341.

Geimer M., Wolf F., Wylie B.J., Abraham E., Beckert D., Mohr B. The Scalasca performance toolset architecture. Concurrency and Computation: Practice and Experience. 2010;22(6):702–719.

Ilsche T., Schuchart J., Schöne R., Hackenberg D. Combining instrumentation and sampling for trace-based application performance analysis. In: Proceedings of the 8th international workshop on parallel tools for high performance computing; Stuttgart: Springer International Publishing; 2014:123–136.
