[Plot: GB per node against render nodes; series: 1/32 Memory, 1/8 Memory, 1/4 Memory, 1/2 Memory, 1/1 Memory.]
FIGURE 4.7: Total memory used per node for the RM data set. Image
source: Ize et al. [15].
ancer sharing the node with the display process. The render nodes thus have
a limited amount of work to perform and the display process quickly becomes
the bottleneck. Figure 4.8 shows that when using one thread to receive the
pixels and another to copy the pixels from the receive buffer to the image, the
maximum frame rate is 55fps, no matter how many render threads and nodes
are being used. Using two threads to copy the pixels to the image results in up
to a 1.6× speedup, and three copy threads improve performance by up to 2.3×,
but additional copy threads offer no further benefit. Copying is thus the
bottleneck until three copy threads are used, after which the receive thread
becomes the bottleneck, since it cannot receive the pixels fast enough to keep
the copy threads busy. When around 18 threads are being
used, be they on a few nodes or many nodes, the system can obtain a frame
rate of 127fps. This is exactly the expected maximum of 127fps given by the
amount of time it takes to transmit all the pixel data across the InfiniBand
interconnect. More render threads result in lower performance because the MPI
implementation is unable to keep up with the large volume of communication.
To achieve these results, the MPI implementation must be tuned to use more
RDMA buffers and to turn off shared receive queues (SRQ); otherwise, the system
can still achieve the same maximum frame rate of about 127fps with 17 render
cores, but beyond that point adding more cores causes performance to drop off
quickly, with 384 render cores (48 render nodes) being 2× slower. However, this
is a moot point since faster frame rates would offer no tangible benefit.
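The bottleneck argument above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' code: it takes the 55fps single-copy-thread rate, the "up to" speedups of 1.6× and 2.3×, and the 127fps interconnect limit from the text, and assumes (as a simplification) that the speedups are reached exactly and that the receive thread imposes a hard cap. The names `display_fps`, `BASE_FPS`, `RECEIVE_CAP_FPS`, and `COPY_SPEEDUP` are illustrative.

```python
# Figures quoted in the text; the model that combines them is an assumption.
BASE_FPS = 55.0           # maximum frame rate with one copy thread
RECEIVE_CAP_FPS = 127.0   # limit set by transmitting the pixels over InfiniBand
COPY_SPEEDUP = {1: 1.0, 2: 1.6, 3: 2.3}  # "up to" speedups per copy-thread count


def display_fps(copy_threads: int) -> float:
    """Estimate the display-side frame rate for a given number of copy threads.

    The copy stage scales with the quoted speedups up to three threads; beyond
    that the text reports no further benefit, so the speedup is held at 2.3x.
    The result is capped by the rate at which the receive thread can take in
    pixels.
    """
    if copy_threads < 1:
        raise ValueError("at least one copy thread is required")
    speedup = COPY_SPEEDUP[min(copy_threads, 3)]
    return min(BASE_FPS * speedup, RECEIVE_CAP_FPS)


if __name__ == "__main__":
    for n in range(1, 8):
        print(f"{n} copy thread(s): about {display_fps(n):.1f} fps")
```

Under these assumptions, two copy threads give roughly 88fps and three give roughly 126.5fps, which is consistent with the statement that the receive thread, limited to about 127fps, takes over as the bottleneck once three copy threads are in use.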
[Plot: axis: render cores (0 to 400); legend: 2C, 3C, 4C, 5C, 6C, 7C, 6C+LB.]
FIGURE 4.8: Display process scaling: frame rate using varying numbers of
render cores to render a trivial scene when using one copy thread (1C) to seven
copy threads (7C), and when using six copy threads with the load balancer
process on the same node (6C+LB). Note that performance is not enhanced
beyond three copy threads. Image source: Ize et al. [15].
Modern graphics cards can produce four megapixel images at 60fps. As this
image size is roughly twice the HD image size, in our system the maximum
frame rate would halve to about 60fps. Higher resolutions than 4 megapixels
are usually achieved with a display wall consisting of a cluster of nodes driving
multiple screens. In this case, the maximum frame rate will be given not by
the time to transmit the entire image, but by the time it takes each
display node to receive its share of the image. Assuming each node renders 4
megapixels, and the load balancing and rendering continue to scale, the frame
rate will thus stay at 60fps, regardless of the resolution of the display wall.
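The scaling argument in this paragraph can be sketched as follows. This is a rough model under stated assumptions, not a measurement: it assumes that "HD" means 1920×1080 pixels, that the 127fps limit applies at that size, and that the time to receive an image grows linearly with its pixel count. The function names `max_fps` and `wall_max_fps` are hypothetical.

```python
# Assumptions: HD is taken to be 1920x1080, and receive time is assumed to be
# proportional to the number of pixels a display node must receive.
HD_PIXELS = 1920 * 1080
HD_MAX_FPS = 127.0  # interconnect-limited maximum quoted in the text


def max_fps(pixels_per_display_node: float) -> float:
    """Bandwidth-limited frame rate for one display node."""
    if pixels_per_display_node <= 0:
        raise ValueError("pixel count must be positive")
    return HD_MAX_FPS * HD_PIXELS / pixels_per_display_node


def wall_max_fps(total_pixels: float, display_nodes: int) -> float:
    """Frame rate of a display wall whose image is split evenly across nodes."""
    if display_nodes < 1:
        raise ValueError("at least one display node is required")
    return max_fps(total_pixels / display_nodes)


if __name__ == "__main__":
    print(f"single 4-megapixel display: about {max_fps(4e6):.0f} fps")
    for nodes in (1, 4, 16):
        total = 4e6 * nodes  # each node shows 4 megapixels
        print(f"{nodes} node(s), {total / 1e6:.0f} MP wall: "
              f"about {wall_max_fps(total, nodes):.0f} fps")
```

In this model the frame rate depends only on the pixels each display node receives, so a wall built from 4-megapixel nodes stays near 60fps however many nodes are added, provided load balancing and rendering continue to scale as the text assumes.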
Since three copy threads are able to keep up with the receiving thread, and
the load balancer process is also running on the same node, there are three
unused cores on the tested platform. If data is replicated across the nodes
then these three cores can be used for a render process. This render process
will also benefit from being able to use the higher-speed shared memory for its
MPI communication with the display and load balancer instead of the slower
InfiniBand. However, if DC is required, then it will not be possible to run
any render processes on the same node, since those render processes would
compete with the display and load balancer for scarce network bandwidth.
This would saturate the network port much more quickly and result in much
lower maximum frame rates.
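The count of three unused cores can be reproduced with a small budget calculation. This is a sketch under assumptions: the core count per node is inferred from the 384 render cores on 48 render nodes mentioned earlier, the display node is assumed to have the same hardware, and each thread or process is assumed to occupy one core. The function name `unused_cores` is illustrative.

```python
# Assumption: the display node has the same core count as the render nodes,
# inferred from "384 render cores (48 render nodes)".
CORES_PER_NODE = 384 // 48


def unused_cores(cores_per_node: int,
                 receive_threads: int = 1,
                 copy_threads: int = 3,
                 load_balancer_processes: int = 1) -> int:
    """Cores left over on the display node, assuming one core per thread."""
    used = receive_threads + copy_threads + load_balancer_processes
    if used > cores_per_node:
        raise ValueError("more threads than cores on the node")
    return cores_per_node - used


if __name__ == "__main__":
    print(f"cores per node: {CORES_PER_NODE}")
    print(f"cores free for a render process: {unused_cores(CORES_PER_NODE)}")
```

With one receive thread, three copy threads, and the load balancer, the budget leaves three cores, matching the figure given in the text.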
With modern hardware and software, the described system can ray trace
massive models at real-time frame rates on a cluster and even show interactive
to real-time rates when rendering distributed geometry using a small cache.
The system is one to two orders of magnitude faster than previous cluster ray
tracing implementations, which used both slower hardware and algorithms [11,
39], or had equivalent hardware but did not scale to as many nodes or to high
frame rates [5]. Compared to compositing approaches, the system can achieve
about a 4× improvement in the maximum frame rate over the state of the
art [18] for same-size non-empty images, and can also handle advanced
shading effects for improved visualization.
4.6 Conclusion
Parallel rendering methods for generating images from visualizations are
an important area of research. In this chapter, a general framework for par-
allel rendering was presented and applied to both geometry rendering and
volume rendering. In the future, as HPV moves into the exascale regime,
parallel rendering methods will likely become more important, because in situ
methods require parallel rendering and the send-image method of parallel
display will scale better than the send-geometry method. It is anticipated that
GPUs will become integrated into compute nodes, which offers another avenue
for parallel rendering in HPV.
