The calculations required for visualizing MOs are computationally demanding, and existing quantum chemistry visualization programs are fast enough to interactively compute MOs only for small molecules on a relatively coarse lattice. At the time of this writing, only VMD and MacMolPlt support multicore CPUs, and only VMD uses GPUs to accelerate MO computations. A great opportunity exists to improve upon the capabilities of existing tools in terms of interactivity, visual display quality, and scalability to larger and more complex molecular systems.
1.3.1. Mathematical Background
In this section we provide a short introduction to MOs, basis sets, and their underlying equations. Interested readers are directed to computational chemistry texts and review articles for further details [6], [7]. Quantum chemistry packages solve the electronic Schrödinger equation $H\Psi = E\Psi$ for a given system, where $H$ is the Hamiltonian operator and $E$ is the system energy. Molecular orbitals are the eigenfunctions $\Psi_\nu$ of this equation, in terms of which the molecular wavefunction $\Psi$ is expressed. The wavefunction determines molecular properties; for instance, the one-electron density is $\rho(\mathbf{r}) = |\Psi(\mathbf{r})|^2$. Visualizing the molecular orbitals resulting from quantum chemistry calculations requires evaluating the wavefunction on a 3-D lattice so that isovalue surfaces can be computed and displayed. With minor modifications, the algorithms and approaches we present for evaluating the wavefunction can be adapted to compute other molecular properties such as charge density, the molecular electrostatic potential, or multipole moments.
Each MO $\Psi_\nu$ can be expressed as a linear combination over a set of $K$ basis functions $\Phi_\kappa$,

$$\Psi_\nu = \sum_{\kappa=1}^{K} c_{\nu\kappa}\,\Phi_\kappa, \tag{1.1}$$

where the $c_{\nu\kappa}$ are coefficients contained in the quantum chemistry calculation output files, and used as input for our algorithms. The basis functions used by the vast majority of quantum chemical calculations are atom-centered functions that approximate the solution of the Schrödinger equation for a single hydrogen atom with one electron, so-called atomic orbitals. For increased computational efficiency, Gaussian type orbitals (GTOs) are used to model the basis functions, rather than the exact solutions for the hydrogen atom:

$$\Phi^{\mathrm{GTO}}_{\zeta ijk}(\mathbf{R}) = N_{\zeta ijk}\, x^i y^j z^k\, e^{-\zeta R^2}. \tag{1.2}$$
The exponential factor $\zeta$ is defined by the basis set; the exponents $i$, $j$, and $k$ are used to modulate the functional shape; and $N_{\zeta ijk}$ is a normalization factor that follows from the basis set definition. The distance from a basis function's center (nucleus) to a point in space is represented by the vector $\mathbf{R} = \{x, y, z\}$ of length $R = |\mathbf{R}|$.
The exponential term in Eq. 1.2 determines the radial decay of the function. Composite basis functions known as contracted GTOs (CGTOs) are composed of a linear combination of $P$ individual GTO primitives in order to accurately describe the radial behavior of atomic orbitals,

$$\Phi^{\mathrm{CGTO}}_{ijk}(\mathbf{R}) = \sum_{p=1}^{P} c_p\, \Phi^{\mathrm{GTO}}_{\zeta_p ijk}(\mathbf{R}). \tag{1.3}$$

The set of contraction coefficients $\{c_p\}$ and associated exponents $\{\zeta_p\}$ defining the CGTO are contained in the quantum chemistry simulation output.
CGTOs are classified into different shells based on the sum $l = i + j + k$ of the exponents of the $x$, $y$, and $z$ factors. The shells are designated by the letters s, p, d, f, and g for $l = 0, 1, 2, 3, 4$, respectively; we explicitly list here the most common shell types but note that higher-numbered shells are occasionally used. The set of indices for a shell is also referred to as the angular momenta of that shell. We establish an alternative indexing of the angular momenta based on the shell number $l$ and a systematic index $m$ over the possible combinations satisfying $l = i + j + k$, where

$$M_l = \frac{(l+1)(l+2)}{2}$$

counts the number of combinations and $m = 0, \ldots, M_l - 1$ references the set $\{(i, j, k) : i + j + k = l\}$. For example, a d shell ($l = 2$) has $M_2 = 6$ angular momenta: $(2,0,0)$, $(1,1,0)$, $(1,0,1)$, $(0,2,0)$, $(0,1,1)$, and $(0,0,2)$.
The linear combination defining the MO $\Psi_\nu$ must also sum contributions from each of the $N$ atoms of the molecule and the $L_n$ shells of each atom $n$. The entire expression, now described in terms of the data output from a QM package, for an MO wavefunction evaluated at a point $\mathbf{r}$ in space then becomes

$$\Psi_\nu(\mathbf{r}) = \sum_{n=1}^{N} \sum_{l=0}^{L_n-1} \sum_{m=0}^{M_l-1} c_{\nu nlm}\, \Phi^{\mathrm{CGTO}}_{nlm}(\mathbf{R}_n), \tag{1.4}$$

where we have replaced $c_{\nu\kappa}$ by $c_{\nu nlm}$, with the vectors $\mathbf{R}_n = \mathbf{r} - \mathbf{r}_n$ connecting the position $\mathbf{r}_n$ of the nucleus of atom $n$ to the desired spatial coordinate $\mathbf{r}$. We have dropped the subscript $p$ from the set of contraction coefficients $\{c\}$ and exponents $\{\zeta\}$ with the understanding that each CGTO requires an additional summation over the primitives, as expressed in Eq. 1.3.
The normalization factor $N_{\zeta ijk}$ in Eq. 1.2 can be factored into a first part $\eta_{\zeta l}$ that depends on both the exponent $\zeta$ and the shell type $l = i + j + k$, and a second part $\eta_{ijk}$ ($= \eta_{lm}$ in terms of our alternative indexing) that depends only on the angular momentum,

$$N_{\zeta ijk} = \left(\frac{2\zeta}{\pi}\right)^{3/4} \sqrt{(8\zeta)^{\,l}} \cdot \sqrt{\frac{i!\, j!\, k!}{(2i)!\,(2j)!\,(2k)!}} = \eta_{\zeta l}\, \eta_{ijk}. \tag{1.5}$$

The separation of the normalization factor in Eq. 1.5 allows us to factor the summation over the primitives from the summation over the array of wavefunction coefficients. Combining Eqs. (1.2), (1.3), and (1.4) and rearranging terms gives

$$\Psi_\nu(\mathbf{r}) = \sum_{n=1}^{N} \sum_{l=0}^{L_n-1} \left[\, \sum_{p=1}^{P_{nl}} c'_p\, e^{-\zeta_p R_n^2} \right] \sum_{m=0}^{M_l-1} c'_{\nu nlm}\, \omega_{lm}(\mathbf{R}_n). \tag{1.6}$$

We define $\omega_{lm} = x^i y^j z^k$ using our alternative indexing over $l$ and $m$ explained above. Both data storage and operation count can be reduced by defining $c'_p = \eta_{\zeta_p l}\, c_p$ and $c'_{\nu nlm} = \eta_{lm}\, c_{\nu nlm}$. The number of primitives $P_{nl}$ depends on both the atom $n$ and the shell number $l$.
Figure 1.2
shows the organization of the basis set and wavefunction coefficient arrays listed for a small example molecule.
1.3.2. GPU Molecular Orbital Algorithms
Visualization of MOs requires the evaluation of the wavefunction on a 3-D lattice, which can be used to create 3-D isovalue surface renderings, 3-D height field plots, or 2-D contour line plots within a plane. Since wavefunction amplitudes diminish to zero beyond a few Angstroms (due to the radial decay of exponential basis functions), the boundaries of the 3-D lattice have only a small margin beyond the bounding box containing the molecule of interest.
The MO lattice computation is heavily data dependent, consisting of a series of nested loops that evaluate the primitives composing CGTOs and the angular momenta for each shell, with an outer loop over atoms. Since the number of shells can vary by atom and the number of primitives and angular momenta can be different for each shell, the innermost loops traverse variable-length coefficient arrays
containing CGTO contraction coefficients and exponents, and the wavefunction coefficients. Quantum chemistry packages often produce output files containing redundant basis set definitions for atoms of the same species; such redundancies are eliminated during preprocessing, resulting in more compact data and enabling more effective use of the fast but limited-capacity on-chip GPU memory systems that hold the key coefficient arrays. The memory access patterns that occur within the inner loops of the GPU algorithms can be optimized to achieve peak performance by carefully sorting and packing coefficient arrays as a preprocessing step. Each of the coefficient arrays is sorted on the CPU so that array elements are accessed in a strictly consecutive pattern. The pseudo-code listing in Algorithm 1 summarizes the performance-critical portion of the MO computation described by Eq. 1.6.
The GPU MO algorithm decomposes the 3-D lattice into a set of 2-D planar slices, which are computed independently of each other. In the case of a single-GPU calculation, a simple
for
loop processes the slices one at a time until they are all completed. For a multi-GPU calculation, the set of slices is dynamically distributed among the pool of available GPUs. Each of the GPUs requests a slice index to compute, computes the assigned slice, and stores the result at the appropriate location in host memory.
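The following is a minimal sketch of this dynamic distribution scheme, assuming a hypothetical helper compute_mo_slice() that launches the MO kernel for one slice on the currently selected device; a shared atomic counter serves as the slice queue:

    #include <atomic>
    #include <thread>
    #include <vector>
    #include <cuda_runtime.h>

    // Hypothetical helper: launches the MO kernel for one planar slice
    // on the current device and stores the result in host memory.
    void compute_mo_slice(int slice, float *host_lattice);

    void compute_lattice(int num_gpus, int num_slices, float *host_lattice) {
        std::atomic<int> next_slice{0};    // shared queue of slice indices
        std::vector<std::thread> workers;
        for (int dev = 0; dev < num_gpus; dev++) {
            workers.emplace_back([&, dev] {
                cudaSetDevice(dev);        // bind this host thread to one GPU
                for (;;) {
                    int slice = next_slice.fetch_add(1);  // request a slice index
                    if (slice >= num_slices)
                        break;             // all slices have been assigned
                    compute_mo_slice(slice, host_lattice);
                }
            });
        }
        for (auto &w : workers)
            w.join();
    }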
Each planar slice computed on a GPU is decomposed into a 2-D CUDA grid consisting of fixed-size 8 × 8 thread blocks. As the size of the 3-D lattice increases, the number of planar slices increases, and the number of thread blocks in each CUDA grid increases accordingly. Each thread is responsible for computing the MO at a single lattice point. For lattice dimensions that cannot be evenly divided by the thread block dimensions or the memory coalescing size, padding elements are added (to avoid unnecessary branching or warp divergence). The padding elements are computed just as the interior lattice points are, but the results are discarded at the end of the computation.
Figure 1.3
illustrates the multilevel parallel decomposition strategy and required padding elements.
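A minimal sketch of this decomposition follows, with hypothetical kernel and variable names; each lattice dimension is rounded up to a multiple of the 8 × 8 block size so that every thread block is fully populated:

    #include <cuda_runtime.h>

    // Illustrative kernel: each thread computes the MO value at one
    // lattice point of a planar slice (the MO arithmetic is omitted).
    __global__ void mo_slice_kernel(float *slice, int padded_numx) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        // Padding threads compute just like interior lattice points;
        // the output array is allocated at the padded size, and padded
        // results are simply ignored by the host.
        slice[y * padded_numx + x] = 0.0f;  // placeholder for the Eq. 1.6 sum
    }

    void launch_slice(float *d_slice, int numx, int numy) {
        dim3 block(8, 8, 1);
        // Round each lattice dimension up to a multiple of the block size.
        dim3 grid((numx + block.x - 1) / block.x,
                  (numy + block.y - 1) / block.y, 1);
        int padded_numx = grid.x * block.x;
        mo_slice_kernel<<<grid, block>>>(d_slice, padded_numx);
    }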
Algorithm 1. Calculate MO value Ψν(r) at lattice point r.

     1: Ψν ⇐ 0.0
     2: ifunc ⇐ 0 {index the array of wavefunction coefficients}
     3: ishell ⇐ 0 {index the array of shell numbers}
     4: for n = 1 to N do {loop over atoms}
     5:   (x, y, z) ⇐ r − rn {rn is the position of atom n}
     6:   R² ⇐ x² + y² + z²
     7:   iprim ⇐ atom_basis[n] {index the arrays of basis set data}
     8:   for l = 0 to num_shells_per_atom[n] − 1 do {loop over shells}
     9:     ΦCGTO ⇐ 0.0
    10:     for p = 0 to num_prim_per_shell[ishell] − 1 do {loop over primitives}
    11:       c′p ⇐ basis_c[iprim]
    12:       ζp ⇐ basis_zeta[iprim]
    13:       ΦCGTO ⇐ ΦCGTO + c′p · e^(−ζp R²)
    14:       iprim ⇐ iprim + 1
    15:     end for
    16:     for all 0 ≤ i ≤ shell_type[ishell] do {loop over angular momenta}
    17:       jmax ⇐ shell_type[ishell] − i
    18:       for all 0 ≤ j ≤ jmax do
    19:         k ⇐ jmax − j
    20:         c′ ⇐ wavefunction[ifunc]
    21:         Ψν ⇐ Ψν + c′ · ΦCGTO · x^i y^j z^k
    22:         ifunc ⇐ ifunc + 1
    23:       end for
    24:     end for
    25:     ishell ⇐ ishell + 1
    26:   end for
    27: end for
In designing an implementation of the MO algorithm for the GPU, one must take note of a few key attributes of the algorithm. Unlike simpler forms of spatially evaluated functions that arise in molecular modeling, such as Coulombic potential kernels, the MO algorithm involves a comparatively large number of floating-point operations per lattice point, and it also involves reading operands from several different arrays. Since the MO coefficients that must be fetched depend on the atom type, basis set, and other factors that vary due to the data-dependent nature of the algorithm, the control flow complexity is also quite a bit higher than for many other algorithms. For example, the bounds of the loops on lines 10 and 16 of Algorithm 1 both depend on the shell being evaluated. The MO algorithm makes heavy use of exponentials, which are mapped to the dedicated exponential arithmetic instructions provided by most GPUs. The cost of evaluating $e^x$ by calling the CUDA routines expf() or __expf(), or of evaluating $2^x$ via exp2f(), is much lower than on the CPU, yielding a performance benefit well beyond what would be expected purely as a result of effectively using the massively parallel GPU hardware.
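As an illustration (a hypothetical device function, not the VMD source), the contracted sum over primitives on lines 10 to 15 of Algorithm 1 might use the fast exponential intrinsic as follows:

    #include <cuda_runtime.h>

    // Illustrative device function: accumulate the radial part of one
    // CGTO (lines 10-15 of Algorithm 1) for squared distance r2.
    // prim[p] packs the contraction coefficient c′_p in .x and the
    // exponent ζ_p in .y (a hypothetical layout).
    __device__ float cgto_radial(const float2 *prim, int nprim, float r2) {
        float contracted = 0.0f;
        for (int p = 0; p < nprim; p++) {
            // __expf() maps to the GPU special function units; expf()
            // is more accurate but slower, and exp2f() can be used by
            // rescaling the exponent by log2(e).
            contracted += prim[p].x * __expf(-prim[p].y * r2);
        }
        return contracted;
    }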
Given the high performance of the various exponential routines on the GPU, the foremost consideration for achieving peak performance is attaining sufficient operand bandwidth to keep the GPU
arithmetic units fully occupied. The algorithms we describe achieve this through careful use of the GPU's fast on-chip caches and shared memory. The MO algorithm's inner loops read varying numbers of coefficients from several different arrays. The overall size of each of the coefficient arrays depends primarily on the size of the molecule and the basis set used. The host application code dispatches the MO computation using one of several GPU kernels depending on the size of the MO coefficient arrays and the capabilities of the attached GPU devices. Optimizations that were applied to all of the kernel variants include precomputation of common factors and specialization and unrolling of the angular momenta loops (lines 16 to 24 of
Algorithm 1
Rather than processing the angular momenta with loops, a switch statement is used that can process all of the supported shell types with completely unrolled loop iterations, as exemplified in the abbreviated source code shown in Figure 1.4.
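An abbreviated sketch in the spirit of Figure 1.4 follows; the names (S_SHELL, wave_f, contracted) are illustrative rather than the exact VMD identifiers:

    // Unrolled angular momenta evaluation for s, p, and d shells; higher
    // shell types follow the same pattern. value accumulates Ψν,
    // contracted holds the CGTO radial sum, and (x, y, z) is the
    // atom-relative lattice point position.
    enum { S_SHELL, P_SHELL, D_SHELL };

    __device__ float angular_sum(int shell_type, const float *wave_f,
                                 int &ifunc, float contracted,
                                 float x, float y, float z) {
        float value = 0.0f;
        switch (shell_type) {
        case S_SHELL:
            value += wave_f[ifunc++] * contracted;            // (0,0,0)
            break;
        case P_SHELL:
            value += wave_f[ifunc++] * contracted * x;        // (1,0,0)
            value += wave_f[ifunc++] * contracted * y;        // (0,1,0)
            value += wave_f[ifunc++] * contracted * z;        // (0,0,1)
            break;
        case D_SHELL:
            value += wave_f[ifunc++] * contracted * x * x;    // (2,0,0)
            value += wave_f[ifunc++] * contracted * x * y;    // (1,1,0)
            value += wave_f[ifunc++] * contracted * y * y;    // (0,2,0)
            value += wave_f[ifunc++] * contracted * x * z;    // (1,0,1)
            value += wave_f[ifunc++] * contracted * y * z;    // (0,1,1)
            value += wave_f[ifunc++] * contracted * z * z;    // (0,0,2)
            break;
        // f and g shells omitted for brevity.
        }
        return value;
    }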
Constant Cache
When all of the MO coefficient arrays (primitives, wavefunctions, etc.) will fit within 64kB, they can be stored in the fast GPU constant memory. GPU constant memory is cached and provides near-register-speed access when all threads in a warp access the same element at the same time. Since all of the threads in a thread block must process the same basis set coefficients and CGTO primitives, the constant cache kernel source code closely follows the pseudo-code listing in
Algorithm 1.
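A minimal sketch of the constant-memory setup (hypothetical array names and capacities; the real kernels size these arrays so their aggregate fits the 64kB limit) is:

    #include <cuda_runtime.h>

    #define MAX_CONST_WAVEF 4096   // hypothetical capacities chosen so the
    #define MAX_CONST_PRIM  1024   // arrays fit within 64kB in aggregate

    __constant__ float  const_wave_f[MAX_CONST_WAVEF]; // wavefunction coefficients
    __constant__ float2 const_basis[MAX_CONST_PRIM];   // (c′_p, ζ_p) pairs

    // Upload the per-molecule coefficient arrays once; every thread in
    // a warp then reads the same element at near-register speed.
    void upload_coefficients(const float *wave_f, int nfunc,
                             const float2 *basis, int nprim) {
        cudaMemcpyToSymbol(const_wave_f, wave_f, nfunc * sizeof(float));
        cudaMemcpyToSymbol(const_basis, basis, nprim * sizeof(float2));
    }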
Tiled-Shared Memory
If the MO coefficient arrays exceed 64kB in aggregate size, the host application code dispatches a GPU kernel that dynamically loads coefficients from global memory into shared memory as needed, acting
as a form of software-managed cache for arbitrarily complex problems. By carefully sizing shared memory storage areas (tiles) for each of the coefficient arrays, one can place the code for loading new coefficients outside of performance-critical loops, which greatly reduces overhead. Within the innermost loops, the coefficients in a shared memory tile are read by all threads in the same thread block, reducing global memory accesses by a factor of 64 (for 8 × 8 thread blocks). For the outer loops (over atoms, basis set indices, the number of shells per atom, and the primitive count and shell type data in the loop over shells) several coefficients are packed together with appropriate padding into 64-byte memory blocks, guaranteeing coalesced global memory access and minimizing the overall number of global memory reads. For the innermost loops, global memory reads are minimized by loading large tiles immediately prior to the loop over basis set primitives and the loop over angular momenta, respectively. Tiles must be sized to a multiple of the 64-byte memory coalescing size for best memory performance, and power-of-two tile sizes greatly simplify shared memory addressing arithmetic. Tiles must also be sized large enough to provide all of the operands consumed by the loops that follow the tile-loading logic.
Figure 1.5
illustrates the relationship between coalesced memory block sizes, the portion of a loaded array that will be referenced during subsequent innermost loops, and global memory padding and unreferenced data that exist to simplify shared memory addressing arithmetic and to guarantee coalesced global memory accesses.
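The tile-loading logic can be sketched as follows (hypothetical names; the global array is assumed padded to a multiple of the tile size), with all 64 threads of an 8 × 8 block cooperating on each coalesced load:

    #define TILESIZE 256  // power of two, multiple of the 64-byte coalescing size

    __global__ void mo_tiled_kernel(const float2 *g_basis, int nprim_total) {
        __shared__ float2 s_basis[TILESIZE];  // software-managed cache tile

        // Linearized thread index within the 8x8 block (0..63).
        int tid = threadIdx.y * blockDim.x + threadIdx.x;

        for (int base = 0; base < nprim_total; base += TILESIZE) {
            // Load the next tile outside the performance-critical loop;
            // consecutive threads read consecutive elements (coalesced).
            for (int i = tid; i < TILESIZE; i += 64)
                s_basis[i] = g_basis[base + i];
            __syncthreads();

            // ... the inner loops over primitives read s_basis[] here,
            // broadcasting each coefficient to all threads in the block ...
            __syncthreads();  // tile fully consumed; safe to overwrite
        }
    }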
Hardware Global Memory Cache
NVIDIA recently released a new generation of GPUs based on the “Fermi” architecture that incorporate both L1 and L2 caches for global memory. The global memory cache in Fermi-based GPUs enables a comparatively simple kernel that uses only global memory references to run at nearly the speed of the highly tuned constant cache kernel; it outperforms the tiled-shared memory kernel due to the reduction in arithmetic operations encountered within the inner two loop levels of the kernel. The hardware cache kernel can operate on any problem size, with an expected graceful degradation in performance up until the point where the problem size exceeds the cache capacity, at which point it may begin to perform slower than uncached global memory accesses if cache thrashing starts to occur. Even in the situation where the problem size exceeds cache capacity, the strictly consecutive memory access patterns employed by the kernel enable efficient broadcasting of coefficients to all of the threads in the thread block.
Zero-Copy Host-Device I/O
One performance optimization that can be paired with all of the other algorithms is the use of CUDA “host-mapped memory.” Host-mapped memory allocations are areas of host memory that are made directly accessible to the GPU through on-demand, transparent initiation of PCI-Express transfers between the host and GPU. In this way, GPU kernels can directly access host memory buffers, eliminating the need for explicit host-GPU memory transfer operations and intermediate copy operations. Since the PCI-e transfers incur significant latency and the PCI-e bus provides only a fraction of the peak bandwidth of the on-board GPU global memory, this technique is a net win only when the host-side buffers are read from or written to only once during kernel execution. In the case of the MO kernels, the output lattice resulting from the kernel computation can be written directly to the host output buffer, enabling output memory writes to be fully overlapped with kernel execution.
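A sketch of the host-mapped output buffer setup, using standard CUDA runtime calls (buffer names are illustrative):

    #include <cuda_runtime.h>

    float *host_lattice = 0;  // host-mapped output buffer
    float *dev_lattice  = 0;  // device-side alias of the same memory

    void alloc_zero_copy_lattice(size_t bytes) {
        // Must be set before the CUDA context is created.
        cudaSetDeviceFlags(cudaDeviceMapHost);
        // Allocate pinned host memory that the GPU can access directly.
        cudaHostAlloc((void **)&host_lattice, bytes, cudaHostAllocMapped);
        // Obtain the device pointer the kernel uses for output writes;
        // stores traverse PCI-e directly into host memory, overlapping
        // with kernel execution.
        cudaHostGetDevicePointer((void **)&dev_lattice, host_lattice, 0);
    }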
Just-in-Time Kernel Generation
Since
Algorithm 1
is very data dependent, we observe that most instructions for loop control and conditional execution could be eliminated for a given molecule by generating a molecule-specific kernel at runtime; this presents a significant optimization opportunity. The kernel is generated when a molecule is initially loaded and may then be reused. The generation and just-in-time (JIT) compilation of kernels at runtime has associated overhead that must be considered when determining how much code to convert from data-dependent form into a fixed sequence of operations. The GPU MO kernel is dynamically generated by emitting the complete arithmetic sequence normally performed by looping over shells, primitives, and angular momenta for each atom type. This on-demand kernel generation scheme eliminates the overhead associated with loop control instructions (greatly increasing the arithmetic density of the resulting kernel) and allows the GPU to perform much closer to its peak floating-point arithmetic rate. At present, CUDA lacks a mechanism for runtime compilation of C-language source code, but provides a mechanism for runtime compilation of the PTX intermediate pseudo-assembly language through a driver-level interface. OpenCL explicitly allows dynamic kernel compilation from C-language source.
To evaluate the dynamic kernel generation technique with CUDA, we implemented a code generator within VMD and then saved the dynamically generated kernel source code to a file. The standard batch-mode CUDA compilers were then used to recompile VMD incorporating the generated CUDA kernel. We have also implemented an OpenCL code generator that operates in much the same way, but the kernel can be compiled entirely at runtime so long as the OpenCL driver supports online compilation. One significant complication with implementing dynamic kernel generation for OpenCL is the need to handle a diversity of target devices that often have varying preferences for the width of vector types, work-group sizes, and other parameters that can impact the structure of the kernel. For simplicity of the present discussion, we present the results for the dynamically generated CUDA kernel only.
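A toy sketch of such a generator (hypothetical; the VMD generator is more elaborate) emits the fully unrolled angular momenta arithmetic for one shell as kernel source text:

    #include <string>

    // Build the monomial factor x^i y^j z^k as source text.
    static std::string monomial(int i, int j, int k) {
        std::string m;
        for (int c = 0; c < i; c++) m += " * x";
        for (int c = 0; c < j; c++) m += " * y";
        for (int c = 0; c < k; c++) m += " * z";
        return m;
    }

    // Emit the completely unrolled angular momenta sum for one shell,
    // replacing the loops on lines 16-24 of Algorithm 1 with a fixed
    // sequence of operations in the generated kernel source.
    std::string emit_angular_sum(int shell_type) {
        std::string src;
        for (int i = 0; i <= shell_type; i++)
            for (int j = 0; j <= shell_type - i; j++) {
                int k = shell_type - i - j;
                src += "  value += wave_f[ifunc++] * contracted"
                     + monomial(i, j, k) + ";\n";
            }
        return src;
    }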
The source code for these algorithms is available free of charge, and they are currently implemented in the molecular visualization and analysis package VMD [1], [2].