Chapter 13

MPC

An effective floating-point compression algorithm for GPUs^a

A. Yang; J. Coplin; H. Mukka; F. Hesaaraki; M. Burtscher    Texas State University, San Marcos, TX, United States

Abstract

Because of their high peak performance and energy efficiency, massively parallel accelerators such as graphics processing units (GPUs) are quickly spreading in high-performance computing, where large amounts of floating-point data are processed, transferred, and stored. Such environments can greatly benefit from data compression if it is done quickly enough. Unfortunately, most conventional compression algorithms are unsuitable for highly parallel execution. In fact, how to design good compression algorithms for massively parallel hardware is still not fully understood. To remedy this situation, we studied millions of lossless compression algorithms for single- and double-precision floating-point values built exclusively from easily parallelizable components. We analyzed the best of these algorithms, explained why they compress well, and derived the Massively Parallel Compression algorithm from them. This algorithm requires little internal state, achieves heretofore unreached compression ratios (CRs) on several data sets, and roughly matches the best CPU-based algorithms in CR while outperforming them by one to two orders of magnitude in throughput and energy efficiency.

Keywords

Lossless data compression; Floating-point compression; Algorithm design; GPUs

Acknowledgments

This work was supported by the US National Science Foundation under Grants 1141022, 1217231, 1406304, and 1438963, a REP grant from Texas State University, and grants and gifts from NVIDIA Corporation. The authors acknowledge the Texas Advanced Computing Center for providing some of the HPC resources used in this study.

1 Introduction

High-performance computing (HPC) applications often process and transfer large amounts of floating-point data. For example, many simulations exchange data between compute nodes and with mass-storage devices after every time step. Most HPC programs retrieve and store large data sets, some of which may have to be sent to other locations for additional processing, analysis, or visualization. Moreover, scientific programs often save checkpoints at regular intervals.

Compression can reduce the amount of data that needs to be transmitted and/or stored. However, if the overhead lowers the effective throughput, compression will not be used in the performance-centric HPC domain. Hence the challenge is to maximize the compression ratio (CR) while meeting or exceeding the available transfer bandwidth. In other words, the compression and decompression have to be done in real time. Furthermore, the compression should be lossless and single pass. Intermediate program results that are exchanged between compute nodes, for example, generally cannot be lossy. A single-pass algorithm is needed so that the data can be compressed and decompressed in a streaming fashion as they are being generated and consumed, respectively.

Some compression algorithms are asymmetric, meaning that compression takes much longer than decompression. This is useful in situations where data are compressed once and decompressed many times. However, this is not the case for checkpoints, which are almost never read, nor for intermediate program results that are compressed to boost the transmission speed between compute nodes, between accelerators and hosts, or between compute nodes and storage devices. Thus we focus on symmetric algorithms in this chapter.

Massively parallel graphics processing units (GPUs) are quickly spreading in HPC environments to accelerate calculations, as GPUs not only provide much higher peak performance than multicore CPUs but are also more cost- and energy-efficient. However, utilizing GPUs for data compression is difficult because compression algorithms typically compress data based on information from previously processed words. This makes compression and decompression data-dependent operations that are difficult to parallelize. Thus most of the relatively few parallel compression approaches from the literature simply break the data up into chunks that are compressed independently using a serial algorithm. However, this technique is not suitable for massively parallel hardware. First, the data would have to be broken up into hundreds of thousands of small chunks, at least one per thread, thus possibly losing much of the history needed to compress the data well. Second, and more importantly, well-performing serial compression algorithms generally require a large amount of internal state (e.g., predictor tables or dictionaries), making it infeasible to run tens of thousands of them in parallel. As a consequence, the research community does not yet possess a good understanding of how to design effective compression algorithms for massively parallel machines.

At a high level, most data compression algorithms comprise two main steps, a data model and a coder. Roughly speaking, the goal of the model is to predict the data. The residual (i.e., the difference) between each actual value and its predicted value will be close to zero if the model is accurate for the given data. This residual sequence of values is then compressed with the coder by mapping the residuals in such a way that frequently encountered values or patterns produce shorter output than infrequently encountered data. The reverse operations are performed to decompress the data. For instance, an inverse model takes the residual sequence as input and regenerates the original values as output.

To systematically search for effective compression algorithms that are amenable to massive parallelism, we synthesized a large number of compressors and their corresponding decompressors using the following approach. We started with a detailed study of previously proposed floating-point compression algorithms [2–14]. Short overviews of these algorithms as well as additional related work are available elsewhere [1]. We then broke each algorithm down into its constituent parts, rejected all parts that could not be parallelized well, and generalized the remaining parts as much as possible. This yielded a number of algorithmic components for building data models and coders. We then implemented each component using a common interface, that is, each component can be given a block of data as input, which it transforms into an output block of data. This makes it possible to chain the components, allowing us to generate a vast number of compression-algorithm candidates from a given set of components. Note that each component comes with an inverse that performs the opposite transformation. Thus, for any chain of components, which represents a compression algorithm, we can synthesize the matching decompressor. Fig. 1 illustrates this approach using the example of the four components named LNV6s, BIT, LNV1s, and ZE that make up the 6D version of our Massively Parallel Compression (MPC) algorithm.

Fig. 1 The four chained components that make up the 6D MPC compression algorithm along with the corresponding four inverse components that make up the decompression algorithm.
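
In code, such a common interface and the resulting chaining might look like the following C++ sketch; the class and function names are ours for illustration, not the actual MPC source.

#include <cstdint>
#include <vector>

// Hypothetical common interface: every component turns an input block
// into an output block, and its inverse undoes the transformation.
struct Component {
  virtual std::vector<uint64_t> apply(const std::vector<uint64_t>& in) const = 0;
  virtual std::vector<uint64_t> inverse(const std::vector<uint64_t>& in) const = 0;
  virtual ~Component() {}
};

// A compression algorithm is a chain of components ...
std::vector<uint64_t> compress(const std::vector<const Component*>& chain,
                               std::vector<uint64_t> block) {
  for (const Component* c : chain) block = c->apply(block);
  return block;
}

// ... and the matching decompressor applies the inverses in reverse order.
std::vector<uint64_t> decompress(const std::vector<const Component*>& chain,
                                 std::vector<uint64_t> block) {
  for (auto it = chain.rbegin(); it != chain.rend(); ++it)
    block = (*it)->inverse(block);
  return block;
}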

We used exhaustive search to determine the most effective compression algorithms that can be built from the available components. Limiting the components to those that can exploit massively parallel hardware guarantees that all of the compressors and decompressors synthesized in this way are, by design, GPU-friendly.

Because floating-point computations are prevalent on highly parallel machines and floating-point data tend to be difficult to compress, we decided to target this domain. In particular, we implemented 24 highly parallel components in the Compute Unified Device Architecture (CUDA) C++ programming language for GPUs and employed our approach on single- and double-precision versions of 13 real-world data sets. Based on a detailed analysis and generalization of the best four-stage compression algorithms we found for each data set as well as the best overall algorithm, we were able to derive the MPC algorithm that works well on many different types of floating-point data.

MPC treats double- and single-precision floating-point values as 8- or 4-byte integers, respectively, and exclusively uses integer instructions for performance reasons as well as to avoid the possibility of floating-point exceptions or rounding inaccuracies. This means that positive and negative zeros and infinities, not-a-number (NaN), denormals, and all other possible floating-point values are fully supported.
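
For instance, the reinterpretation can be done with a plain copy of the bit patterns; the helper functions below are our illustration, not code from MPC itself.

#include <cstddef>
#include <cstdint>
#include <cstring>

// Copy IEEE 754 bit patterns verbatim into unsigned integers. memcpy
// performs no floating-point operations, so NaNs, infinities, denormals,
// and signed zeros survive the round trip bit for bit.
void to_ints(const double* src, std::uint64_t* dst, std::size_t n) {
  std::memcpy(dst, src, n * sizeof(std::uint64_t));
}

void to_doubles(const std::uint64_t* src, double* dst, std::size_t n) {
  std::memcpy(dst, src, n * sizeof(double));
}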

The first stage of MPC subtracts the nth prior value from the current value to produce a residual sequence, where n is the dimensionality of the input data. The second stage rearranges the residuals by bit position, that is, it emits all the most significant bits of the residuals packed into words, followed by the second-most significant bits, and so on. The third stage computes the residual of the consecutive words holding these bits. The fourth stage compresses the data by eliminating zero words.
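
To make the data flow concrete, the following sequential single-precision sketch of the four stages is our own reconstruction from the description above: the names are ours, the input length is assumed to be a multiple of 32, operations are applied per 32-word group rather than per 1024-element chunk, and the tuned CUDA implementation is considerably more involved. Decompression applies the four inverse components in reverse order.

#include <cstdint>
#include <vector>

// Stage 1 (and stage 3 with n = 1): subtract the nth prior value.
// The first n values have no predecessor and pass through unchanged.
static std::vector<std::uint32_t> lnv_sub(const std::vector<std::uint32_t>& v,
                                          std::size_t n) {
  std::vector<std::uint32_t> r(v.size());
  for (std::size_t i = 0; i < v.size(); i++)
    r[i] = v[i] - (i >= n ? v[i - n] : 0);  // wrap-around subtraction
  return r;
}

// Stage 2: bit transposition within groups of 32 words. Output word k
// of a group holds bit (31 - k) of all 32 input words, MSBs first.
static std::vector<std::uint32_t> bit_transpose(const std::vector<std::uint32_t>& v) {
  std::vector<std::uint32_t> r(v.size(), 0);
  for (std::size_t g = 0; g + 32 <= v.size(); g += 32)
    for (int k = 0; k < 32; k++)
      for (int j = 0; j < 32; j++)
        r[g + k] |= ((v[g + j] >> (31 - k)) & 1u) << (31 - j);
  return r;
}

// Stage 4: zero elimination. Each group of up to 32 words is replaced
// by a bitmap flagging the nonzero words, followed by those words.
static std::vector<std::uint32_t> zero_elim(const std::vector<std::uint32_t>& v) {
  std::vector<std::uint32_t> out;
  for (std::size_t g = 0; g < v.size(); g += 32) {
    std::size_t map_pos = out.size();
    std::uint32_t map = 0;
    out.push_back(0);  // bitmap placeholder, patched below
    for (int j = 0; j < 32 && g + j < v.size(); j++)
      if (v[g + j] != 0) { map |= 1u << j; out.push_back(v[g + j]); }
    out[map_pos] = map;
  }
  return out;
}

std::vector<std::uint32_t> mpc_compress(const std::vector<std::uint32_t>& v,
                                        std::size_t dim) {
  return zero_elim(lnv_sub(bit_transpose(lnv_sub(v, dim)), 1));
}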

MPC is quite different from the floating-point compression algorithms in the current literature. In particular, it requires almost no internal state, making it suitable both for massively parallel software implementations and for hardware implementations. On several of the studied data sets, MPC outperforms the general-purpose compressors bzip2, gzip, and lzop as well as the special-purpose compressors pFPC and GFC by up to 33% in CR. Throughput evaluations show our CUDA implementation running on a single K40 GPU to be faster in all cases than even the parallel pFPC code running on twenty high-end Xeon CPU cores. Moreover, MPC’s throughput of more than 10 GB/s matches or exceeds the throughputs of current off-the-shelf networks and PCI-Express buses, making real-time compression possible for results that are computed on a GPU before they are transferred to the host or the network interface card (NIC), and making real-time decompression possible for compressed data that are streamed to the GPU from the CPU or the NIC. The performance-optimized CUDA implementation of MPC is available at http://cs.txstate.edu/~burtscher/research/MPC/.

This chapter makes the following key contributions:

 It presents the lossless MPC compression algorithm for single- and double-precision floating-point data that is suitable for massively parallel execution.

 It systematically evaluates millions of component combinations to determine well-performing compression algorithms within the given search space.

 It analyzes the chains of components that work well to gain insight into the design of effective parallel compression algorithms and to predict how to adapt them to other data sets.

 It describes previously unknown algorithms that compress several real-world scientific numeric data sets significantly better than prior work.

 It demonstrates that, in spite of substantial constraints, MPC’s CRs rival those of the best CPU-based compressors while yielding much higher throughputs and energy efficiency than (multicore) CPU-based compressors.

2 Methodology

2.1 System and Compilers

We evaluated the CPU compressors on a system with dual 10-core Xeon E5-2687W v3 processors running at 3.1 GHz and 128 GB of main memory. The operating system was CentOS 6.7. We used the gcc compiler version 4.4.7 with “-O3 -pthread -march=native” for the CPU implementations.

For GPU compressors, we used a Kepler-based Tesla K40c GPU. It had 15 streaming multiprocessors with a total of 2880 CUDA cores running at 745 MHz and 12 GB of global memory. We also used a Maxwell-based GeForce GTX Titan X GPU. It had 24 streaming multiprocessors with a total of 3072 CUDA cores running at 1 GHz and 12 GB of global memory. We used nvcc version 7.0 with the “-O3 -arch=sm_35” flags to compile the GPU codes for the K40 and “-O3 -arch=sm_52” to compile the GPU codes for the Titan X.

2.2 Measuring Throughput and Energy

For all special-purpose floating-point compressors, the timing measurements were performed by adding code to read a timer before and after the compression and decompression code sections and recording the difference. For the general-purpose compressors, we measured the runtime of compression and decompression when reading the input from a disk cache in main memory and writing the output to /dev/null. In the case of GPU code, we excluded the time to transfer the data to or from the GPU as we assumed the data to have been produced or transferred there with or without compression. Each experiment was conducted three times, and the median throughput was reported. Note that the three measured throughputs were very similar in all cases. The decompressed results were always compared to the original data to verify that every bit was identical.
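
As an illustration of the GPU timing methodology (kernel time only, no PCIe transfers), a measurement harness could look like the following CUDA sketch; the kernel body is a placeholder, not the MPC code.

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the MPC compression stages.
__global__ void compress_kernel(const double* in, unsigned long long* out,
                                int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = __double_as_longlong(in[i]);
}

// Measures the kernel time only, excluding host-device transfers, as
// described above; returns gigabytes processed per second.
double measure_gbs(const double* d_in, unsigned long long* d_out, int n) {
  cudaEvent_t start, stop;
  float ms;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start);
  compress_kernel<<<(n + 1023) / 1024, 1024>>>(d_in, d_out, n);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&ms, start, stop);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return n * sizeof(double) / (ms * 1.0e6);  // GB/s
}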

For measuring energy on the CPU, we used a custom power measurement tool based on the PAPI framework [15]. PAPI makes use of the Running Average Power Limit (RAPL) functionality of certain Intel processors, which in turn measures Model Specific Registers to calculate the energy consumption of a processor in real time. Our power tool simply made calls to the PAPI framework to start the energy measurements, to run the code to be measured, and then to stop the measurements and output the result. Doing so incurred negligible overhead.
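
For reference, a minimal PAPI/RAPL energy reader might look like the sketch below. The authors’ actual power tool is not published in this chapter, and the RAPL event name varies by system, so every identifier here is an assumption.

#include <papi.h>
#include <cstdio>

// Placeholder for the code region whose energy is being measured.
static void run_workload() { /* compression code under test */ }

int main() {
  int evset = PAPI_NULL;
  long long nj = 0;  // the RAPL component reports nanojoules

  PAPI_library_init(PAPI_VER_CURRENT);
  PAPI_create_eventset(&evset);
  // System-dependent event name; requires a CPU with RAPL support.
  PAPI_add_named_event(evset, "rapl:::PACKAGE_ENERGY:PACKAGE0");

  PAPI_start(evset);
  run_workload();
  PAPI_stop(evset, &nj);

  printf("energy: %.3f J\n", nj * 1e-9);
  return 0;
}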

For measuring energy on the GPU, we used the K20 Power Tool, which also works for the K40c GPU used in our experiments [16]. To measure the energy consumption of the GPU, the K20 Power Tool combines power samples from the GPU’s internal power sensor with the “active” runtime, which the tool defines as the amount of time the GPU draws power above the idle level. Fig. 2 illustrates this.

Fig. 2 Sample power profile.

Because of how the GPU draws power and how the built-in power sensor samples, only readings above a certain threshold (the dashed line at 55 W in this case) reflect when the GPU is actually executing the program [16]. Measurements below the threshold are either the idle power (less than about 26 W) or the “tail power” due to the driver keeping the GPU active for a time before powering it down. Using the active runtime ignores any execution time that may take place on the CPU. The power threshold is dynamically adjusted to maximize accuracy for different GPU settings.
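
In other words, the energy is obtained by integrating only the samples above the threshold; a minimal sketch of that computation (our illustration, with names of our choosing) follows.

#include <vector>

// Integrate only the power samples above the activity threshold
// (e.g., the 55 W dashed line in Fig. 2), with dt being the sampling
// interval in seconds; rectangle rule per sample.
double active_energy(const std::vector<double>& watts, double dt,
                     double threshold) {
  double joules = 0.0;
  for (double p : watts)
    if (p > threshold) joules += p * dt;
  return joules;
}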

2.3 Data Sets

We used the 13 FPC data sets for our evaluation [5]. Each data set consisted of a binary sequence of IEEE 754 double-precision floating-point values. They included MPI messages (msg), numeric results (num), and observational data (obs).

MPI messages: These five data sets contain the numeric messages sent by a node in a parallel system running NAS Parallel Benchmark (NPB) and ASCI Purple applications.

 msg_bt: NPB computational fluid dynamics pseudo-application bt

 msg_lu: NPB computational fluid dynamics pseudo-application lu

 msg_sp: NPB computational fluid dynamics pseudo-application sp

 msg_sppm: ASCI Purple solver sppm

 msg_sweep3d: ASCI Purple solver sweep3d

Numeric simulations: These four data sets are the result of numeric simulations.

 num_brain: Simulation of the velocity field of a human brain during a head impact

 num_comet: Simulation of the comet Shoemaker-Levy 9 entering Jupiter’s atmosphere

 num_control: Control vector output between two minimization steps in weather-satellite data assimilation

 num_plasma: Simulated plasma temperature evolution of a wire array z-pinch experiment

Observational data: These four data sets comprise measurements from scientific instruments.

 obs_error: Data values specifying brightness temperature errors of a weather satellite

 obs_info: Latitude and longitude information of the observation points of a weather satellite

 obs_spitzer: Data from the Spitzer Space Telescope showing a slight darkening as an extrasolar planet disappears behind its star

 obs_temp: Data from a weather satellite denoting how much the observed temperature differs from the actual contiguous analysis temperature field

Table 1 provides pertinent information about each data set. The first two data columns list the size in megabytes and in millions of double-precision values. The middle column shows the percentage of values that are unique. The fourth column displays the first-order entropy of the values in bits. The last column expresses the randomness of each data set in percent, that is, it reflects how close the first-order entropy is to that of a truly random data set with the same number of unique values. For the single-precision experiments, we simply converted the double-precision data sets.
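
Assuming the definition carried over from the FPC study, the randomness column is the first-order entropy H expressed as a percentage of the entropy of a uniformly random sequence over the same number u of unique values:

randomness = 100% × H / log2(u)

This reproduces the table’s numbers; for example, all 4.99 million values of obs_temp are unique, so log2(4,990,000) ≈ 22.25 bits, which equals its first-order entropy and yields a randomness of 100.0%.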

Table 1

Information About the Double-Precision Data Sets

Data Set     Size (MB)  Doubles (Millions)  Unique Values (%)  First-Order Entropy (Bits)  Randomness (%)
msg_bt       254.0      33.30               92.9               23.67                       95.1
msg_lu       185.1      24.26               99.2               24.47                       99.8
msg_sp       276.7      36.26               98.9               25.03                       99.7
msg_sppm     266.1      34.87               10.2               11.24                       51.6
msg_sweep3d  119.9      15.72               89.8               23.41                       98.6
num_brain    135.3      17.73               94.9               23.97                       99.9
num_comet    102.4      13.42               88.9               22.04                       93.8
num_control  152.1      19.94               98.5               24.14                       99.6
num_plasma   33.5       4.39                0.3                13.65                       99.4
obs_error    59.3       7.77                18.0               17.80                       87.2
obs_info     18.1       2.37                23.9               18.07                       94.5
obs_spitzer  189.0      24.77               5.7                17.36                       85.0
obs_temp     38.1       4.99                100.0              22.25                       100.0

2.4 Algorithmic Components

We tested the following algorithmic components in our experiments. They are generalizations or approximations of components extracted from previously proposed compression algorithms. Each component takes a block of data as input, transforms it, and outputs the transformed block.

The input data are broken down into fixed-size chunks. Each chunk is assigned to a thread block for parallel processing. We chose 1024-element chunks to match the maximum number of threads per thread block in our GPUs.
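
In CUDA terms, this mapping could look like the following kernel skeleton; component_kernel and CHUNK are placeholder names of our choosing, not the actual implementation.

#define CHUNK 1024  // elements per chunk = max threads per block here

// Each thread block transforms one 1024-element chunk; thread t of the
// block handles element t of its chunk.
__global__ void component_kernel(unsigned int* data, int n) {
  int idx = blockIdx.x * CHUNK + threadIdx.x;
  if (idx < n) {
    unsigned int val = data[idx];
    // ... the component's transformation of val would go here ...
    data[idx] = val;
  }
}

// Launch with one thread block per chunk:
// component_kernel<<<(n + CHUNK - 1) / CHUNK, CHUNK>>>(d_data, n);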

The NUL component simply outputs the input block.

The INV component flips all the bits.

The BIT component breaks a block of data into chunks and then emits the most significant bit of each word in the chunk, followed by the second most significant bits, and so on.

The DIMn component also breaks the blocks into chunks and then rearranges the values in each chunk such that the values from each dimension are grouped together. For example, DIM2 emits all the values from the even positions first and then all the values from the odd positions. We tested the dimensionalities n = 2, 3, 4, 5, 8, 16, and 32.

The LNVns component uses the nth prior value in the same chunk as a prediction of the current value, subtracts the prediction from the current value, and emits the residual. The LNVnx component is identical except it XORs the prediction with the current value to form the residual. In both cases, we tested n = 1, 2, 3, 5, 6, and 8.

All of the preceding components transform the data blocks without changing their size. The following two components are the only ones that can actually reduce the length of a data block. The ZE component outputs a bitmap for each chunk that specifies which values in the chunk are zero. The bitmap is followed by the nonzero values. The RLE component performs run-length encoding, that is, it replaces repeating values by a count and a single copy of the value. Each component has a corresponding inverse that performs the opposite transformation for decompression; a sketch of one such inverse follows.
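
As a concrete example of a component/inverse pair, the sketch below reverses the ZE transformation, assuming the same 32-word group layout as the sequential sketch in Section 1; the function name and the explicit original_len parameter are our illustration.

#include <cstdint>
#include <vector>

// Inverse of ZE: each bitmap word tells which of the next 32 output
// words are nonzero; the flagged words follow it, and zeros are
// reinserted everywhere else.
std::vector<std::uint32_t> zero_elim_inverse(const std::vector<std::uint32_t>& in,
                                             std::size_t original_len) {
  std::vector<std::uint32_t> out;
  std::size_t pos = 0;
  while (out.size() < original_len) {
    std::uint32_t map = in[pos++];
    for (int j = 0; j < 32 && out.size() < original_len; j++)
      out.push_back(((map >> j) & 1u) ? in[pos++] : 0u);
  }
  return out;
}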

Because it may be more effective to operate at byte rather than word granularity, we also included the singleton pseudo component “|”, which we call the cut, that converts a block of words into a block of bytes through type casting (i.e., no computation is necessary). As a result, we need three versions of each component and its inverse, one for double-precision values (8-byte words), one for single-precision values (4-byte words), and one for byte values. Each component operates on an integer representation of the floating-point data, that is, the bit pattern representing the floating-point value is copied verbatim into an appropriately sized integer variable.

We chose this limited number of components because we only included components that we could implement in a massively parallel manner. Nevertheless, as the results in the next section show, very effective compression algorithms can be created from these components. In other words, the sophistication and effectiveness of the ultimate algorithm is the result of the clever combination of components, not the capability of each individual component. This is akin to how complex programs can be expressed through a suitable sequence of very simple machine instructions.

To derive MPC, we investigated all four-stage compression algorithms that can be built from the preceding components. Because of the presence of the NUL component, this includes all one-, two-, and three-stage algorithms as well. Note that only the first three stages can contain any one of the 24 components just described. The last stage must contain a component that can reduce the amount of data, that is, either ZE or RLE. The cut can be before the first component, in which case the data is exclusively treated as a sequence of bytes; after the last component, in which case the data is exclusively treated as a sequence of words; or between components, in which case the data is initially treated as words and then as bytes. The 5 possible locations for the cut, the 24 possible components in each of the first three stages, and the 2 possible components in the last stage result in 5 * 24 * 24 * 24 * 2 = 138,240 possible compression algorithms with four stages.
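
The search space is small enough to enumerate directly; this C++ sketch merely counts the candidates (in the real search, the body would synthesize and evaluate each chain):

#include <cstdio>

int main() {
  const int COMPONENTS = 24, REDUCERS = 2, CUTS = 5;
  long count = 0;
  for (int cut = 0; cut < CUTS; cut++)
    for (int s1 = 0; s1 < COMPONENTS; s1++)
      for (int s2 = 0; s2 < COMPONENTS; s2++)
        for (int s3 = 0; s3 < COMPONENTS; s3++)
          for (int s4 = 0; s4 < REDUCERS; s4++)
            count++;  // real search: build chain, compress, record CR
  printf("%ld candidates\n", count);  // prints: 138240 candidates
  return 0;
}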

3 Experimental results

3.1 Synthesis Analysis and Derivation of MPC

Table 2 shows the four chained components and the location of the cut that the exhaustive search found to work best for each data set as well as across all 13 data sets, which is denoted as “Best.” For Best, the CR is the harmonic mean over the data sets.

Table 2

Best Four-Stage Algorithm for Each Data Set, for All Single-Precision Data Sets, and for All Double-Precision Data Sets

Data Set     Double Precision                      Single Precision
             CR     Four-Stage Algorithm with Cut  CR     Four-Stage Algorithm with Cut
msg_bt       1.143  LNV1s BIT LNV1s ZE |           1.233  DIM5 ZE LNV6x | ZE
msg_lu       1.244  LNV5s | DIM8 BIT RLE           1.588  LNV5s LNV5s LNV5x | ZE
msg_sp       1.192  DIM3 LNV5x BIT ZE |            1.362  DIM3 LNV5x BIT ZE |
msg_sppm     3.359  DIM5 LNV6x ZE | ZE             4.828  RLE DIM5 LNV6s ZE |
msg_sweep3d  1.293  LNV1s DIM32 | DIM8 RLE         1.545  LNV1s DIM32 | DIM4 RLE
num_brain    1.182  LNV1s BIT LNV1s ZE |           1.344  LNV1s BIT LNV1s ZE |
num_comet    1.267  LNV1s BIT LNV1s ZE |           1.199  LNV1s | DIM4 BIT RLE
num_control  1.106  LNV1s BIT LNV1s ZE |           1.122  LNV1s BIT LNV1s ZE |
num_plasma   1.454  LNV2s LNV2s LNV2x | ZE         1.978  LNV2s LNV2s LNV2x | ZE
obs_error    1.210  LNV1x ZE LNV1s ZE |            1.289  LNV6s BIT LNV1s ZE |
obs_info     1.245  LNV2s | DIM8 BIT RLE           1.477  LNV8s DIM2 | DIM4 RLE
obs_spitzer  1.231  ZE BIT LNV1s ZE |              1.080  ZE BIT LNV1s ZE |
obs_temp     1.101  LNV8s BIT LNV1s ZE |           1.126  BIT LNV1x DIM32 | RLE
Best         1.214  LNV6s BIT LNV1s ZE |           1.265  LNV6s BIT LNV1s ZE |

3.1.1 Observations about individual algorithms

The individual best algorithms are truly the best within the search space, that is, the next best algorithms compress successively worse (by a fraction of a percent). Hence it is not the case that an entire set of algorithms performs equally well.

Whereas the last component (ignoring the cut) has to be ZE or RLE, ZE is chosen more often. This indicates that the earlier components manage to transform the data in a way that generates zeros but not many zeros in a row, which RLE would be better able to compress.

Interestingly, in most cases, the first three stages do not include a ZE or RLE component, that is, they do not change the length of the data stream but transform it to make the final stage more effective. Clearly, these noncompressing transformations are very important, emphasizing that the best overall algorithm is generally not the one that maximally compresses the data in every stage. This also demonstrates that chaining whole compression algorithms, as proposed in some related work, is unlikely to yield the most effective algorithms.

There were several instances of DIM8 right after the cut in the double-precision algorithms and of DIM4 right after the cut in the single-precision algorithms. This utilization of the DIM component differs from the anticipated usage. Instead of employing it for multidimensional data sets, these algorithms use DIM to separate the different byte positions from each 4- or 8-byte word. Note that the frequently used BIT component serves a similar purpose but at a finer granularity. This repurposing of the DIM component shows that automatic synthesis is able to devise algorithms that the authors of the components may not have foreseen.

The very frequent occurrence of BIT indicates that the individual bits of floating-point values tend to correlate more strongly across values than within values. This might be expected as, for example, the top exponent bits of consecutive values are likely to be the same.

Another interesting observation is that the cut was often at the end, that is, it is not used. This means that the entire algorithm operates at word granularity, including the Best algorithm, and that there is no benefit from switching to byte granularity. This is good news because it simplifies and speeds up the implementation as only word-granular components are needed. NUL and INV are also not needed. Clearly, inverting all the bits is unnecessary and algorithms with fewer than four components (i.e., that include NUL) compress less well.

The LNV predictor component is obviously very important. Every listed algorithm includes it, and most of them include two such components. LNV comes in two versions, one that uses integer subtraction to form the residual and the other that uses XOR (i.e., bitwise subtraction). Subtraction is much more frequent, which is noteworthy as the current literature seems undecided as to which method to prefer.

In many cases, the single-precision algorithm is the same as the corresponding double-precision algorithm, especially when excluding the aforementioned DIM4 versus DIM8 difference immediately after the cut. This similarity is perhaps expected since the data sets contain the same values (albeit in different formats). However, in about half the cases, the algorithms are different, sometimes substantially (e.g., on msg_bt) and yield significantly different CRs. This implies that the bits that are dropped when converting from double- to single-precision benefit more from different compression algorithms than the remaining bits. Hence it would probably be advantageous if distinct algorithms were used for compressing the bits or bytes at different positions within the floating-point words.

3.1.2 Observations about the Best algorithm

Focusing on the Best algorithm, which is the algorithm shown in Fig. 1, we find that the single- and double-precision data sets result in the same algorithm that maximizes the harmonic-mean CR. Hence the following discussion applies to both formats. The most frequent pattern of components in the individual algorithms is “LNV*s BIT LNV1s ZE,” where the star represents a digit. Consequently, it is not surprising that the Best algorithm also follows this pattern. It is interesting, however, that Best uses a “6” in the starred position even though, with one exception, none of the individual algorithms do. We believe that the exhaustive search selected a 6 because 6 is the least common multiple of 1, 2, and 3, all of which occur more often than 6. In other words, the first component tries to predict each value using a similar prior value, which is best done when looking back n positions, where n is the least common multiple of the dimensionality of the various data sets. Hence it is possible that a larger n will work better, but we only tested up to n = 8.

With this in mind, we can now explain the operation of the Best algorithm. The job of the LNV6s component is to predict each value using a similar prior value to obtain a residual sequence with many small values. Because not all bit positions are equally predictable (e.g., the most significant exponent bits are more likely than the other bits to be predicted correctly and to therefore be zero in the residual sequence), it is beneficial to group bits from the same bit position together, which is what the BIT component does. The resulting sequence of values apparently contains consecutive “words” that are identical, which the LNV1s component turns into zeros. The ZE component then eliminates these zeros.

3.1.3 Derivation of the MPC algorithm

With this insight, a good compression algorithm for floating-point data can be derived by adapting the first component to the dimensionality of the data set and keeping the other three components fixed. We named the resulting algorithm MPC for “Massively Parallel Compressor.” It is based on the aforementioned “LNV*s BIT LNV1s ZE” pattern but uses the data-set dimensionality in the starred location. Note that we obtained the same pattern when using cross-validation, that is, when excluding one of the inputs so as not to train and test on the same input. MPC is identical to the best algorithm the exhaustive search found for several of the studied data sets. Even better, in cases where the actual dimensionality is above eight, MPC yields CRs exceeding those of the Best algorithm (cf. the Best results in Table 2 vs the MPC results in Tables 3 and 4), which validates our generalization of the Best algorithm.

Table 3

Compression Ratios on the Double-Precision Data Sets

Compressor  msg_bt  msg_lu  msg_sp  msg_sppm  msg_sweep3d  num_brain  num_comet  num_control  num_plasma  obs_error  obs_info  obs_spitzer  obs_temp  harmean
lzop        1.049   1.000   1.000   4.935     1.017        1.000      1.072      1.011        1.031       1.258      1.000     1.030        1.000     1.102
pigz        1.131   1.057   1.111   7.483     1.093        1.064      1.160      1.057        1.608       1.447      1.157     1.228        1.035     1.240
pbzip2      1.088   1.018   1.055   6.920     1.292        1.043      1.173      1.029        5.787       1.335      1.217     1.745        1.023     1.320
pFPC        1.250   1.144   1.240   4.702     1.643        1.144      1.146      1.037        4.740       1.313      1.192     1.013        1.005     1.326
GFC         1.200   1.148   1.203   3.521     1.219        1.091      1.094      1.013        1.128       1.237      1.148     1.013        1.039     1.185
MPC         1.210   1.215   1.211   3.015     1.290        1.185      1.270      1.108        1.167       1.183      1.217     1.187        1.103     1.251

Bold values denote the best value in each column.

Table 4

Compression Ratios on the Single-Precision Data Sets

Compressor  msg_bt  msg_lu  msg_sp  msg_sppm  msg_sweep3d  num_brain  num_comet  num_control  num_plasma  obs_error  obs_info  obs_spitzer  obs_temp  harmean
lzop        1.044   1.000   1.000   6.608     1.017        1.000      1.077      1.013        1.000       1.165      1.000     1.003        1.000     1.096
pigz        1.185   1.095   1.209   9.635     1.159        1.128      1.151      1.082        1.386       1.468      1.211     1.185        1.079     1.271
pbzip2      1.129   1.041   1.141   8.732     2.361        1.113      1.117      1.043        8.647       1.338      1.327     1.391        1.049     1.398
MPC         1.339   1.444   1.389   3.843     1.537        1.347      1.181      1.124        1.348       1.301      1.440     1.049        1.117     1.354

Bold values denote the best value in each column.

3.2 Compression Ratios

Tables 3 and 4 show the CRs on the double- and single-precision data sets, respectively, for the general-purpose compressors pbzip2 [17], pigz [18], and lzop [19], the special-purpose floating-point compressors pFPC [20] and GFC [21] (they only support double precision), and for MPC. The highest CR for each data set is highlighted in the tables. Because we wanted to maximize the CR, we selected the command-line flags that result in the highest CR where possible. For lzop we used -c6, as this is the highest compression level before the runtime becomes intractable. For pigz and pbzip2, we used -c9 -p20. For GFC and MPC, we specified the data-set dimensionality on the command line. GFC and MPC are parallel GPU implementations. pFPC, pbzip2, and pigz are parallel CPU implementations. The remaining compressor, lzop, is a serial CPU implementation.

All of the tested algorithms except lzop and GFC compressed at least one data set best. MPC delivered the highest CR on six double-precision and eight single-precision data sets. This was a rather surprising result given that MPC is “handicapped” by being constrained to only utilize GPU-friendly components that retain almost no internal state.

MPC proved to be superior to lzop and GFC, both of which it outperformed on most of the tested data sets in terms of CR. Moreover, GFC and pFPC did not support single-precision data. pFPC and pbzip2 outperformed MPC on average. However, this was the case only because of two data sets, msg_sppm and num_plasma, on which they yielded much higher CRs than MPC.

Most of the double-precision data sets were less compressible than the single-precision data sets. This was expected as the least significant mantissa bits tend to be the most random in floating-point values and many of those bits are dropped when converting from double to single precision.

3.3 Compression and Decompression Speed

Tables 5 and 7 list the double- and single-precision throughputs in gigabytes per second on the 13 data sets for all tested compressors. Tables 6 and 8 list the same information, but for decompression. As stated before, GFC and pFPC do not support single-precision data. For the GPU-based compressors, we show results for both the K40, which has a compute capability of 3.5, and the Titan X, which has a compute capability of 5.2.

Table 5

Double-Precision Compression Throughput in Gigabytes per Second

Compressor  msg_bt  msg_lu  msg_sp  msg_sppm  msg_sweep3d  num_brain  num_comet  num_control  num_plasma  obs_error  obs_info  obs_spitzer  obs_temp  Average
lzop        0.573   0.629   0.624   0.391     0.623        0.626      0.561      0.624        0.435       0.200      0.576     0.215        0.595     0.513
pigz        0.168   0.184   0.133   0.096     0.181        0.174      0.140      0.187        0.223       0.104      0.160     0.123        0.175     0.157
pbzip2      0.040   0.047   0.038   0.028     0.038        0.043      0.046      0.035        0.009       0.041      0.029     0.052        0.031     0.037
pFPC        0.637   0.570   0.666   0.844     0.555        0.514      0.467      0.506        0.268       0.348      0.149     0.556        0.255     0.487
GFC_35      42.20   30.75   45.96   44.20     19.92        22.47      17.01      25.27        5.559       9.848      2.999     31.39        6.327     23.38
GFC_52      38.63   28.34   40.75   40.73     18.35        20.71      15.67      23.29        5.123       9.075      2.764     27.84        5.830     21.32
MPC_35      13.46   13.70   13.63   13.54     13.39        13.40      13.43      13.60        13.23       13.40      12.79     13.59        13.12     13.40
MPC_52      18.47   18.41   18.61   18.74     18.61        18.48      18.49      18.23        17.79       18.16      16.86     18.65        17.57     18.23

Bold values denote the best value in each column.

Table 6

Double-Precision Decompression Throughput in Gigabytes per Second

Compressor  msg_bt  msg_lu  msg_sp  msg_sppm  msg_sweep3d  num_brain  num_comet  num_control  num_plasma  obs_error  obs_info  obs_spitzer  obs_temp  Average
lzop        0.674   0.810   0.811   0.523     0.771        0.805      0.687      0.804        0.580       0.435      0.689     0.442        0.759     0.676
pigz        0.084   0.082   0.075   0.228     0.081        0.081      0.087      0.089        0.108       0.097      0.085     0.074        0.087     0.097
pbzip2      0.145   0.125   0.138   0.383     0.121        0.116      0.125      0.122        0.152       0.110      0.063     0.148        0.089     0.141
pFPC        1.108   1.013   1.172   2.112     1.101        0.951      0.848      0.875        0.760       0.773      0.402     0.935        0.562     0.970
GFC_35      53.67   39.11   58.45   56.21     25.33        28.58      21.63      32.14        7.070       12.52      3.814     39.93        8.046     29.73
GFC_52      58.07   42.32   63.07   60.82     27.41        30.92      23.40      34.77        7.650       13.55      4.127     43.20        8.706     32.16
MPC_35      19.25   19.06   19.31   19.09     18.87        18.87      18.94      18.73        18.10       18.85      17.40     18.76        18.21     18.72
MPC_52      18.93   18.80   18.93   18.84     18.55        18.57      18.60      18.47        17.77       18.49      17.03     18.43        17.93     18.41

Bold values denote the best value in each column.

Table 7

Single-Precision Compression Throughput in Gigabytes per Second

Compressor  msg_bt  msg_lu  msg_sp  msg_sppm  msg_sweep3d  num_brain  num_comet  num_control  num_plasma  obs_error  obs_info  obs_spitzer  obs_temp  Average
lzop        0.482   0.619   0.614   0.509     0.581        0.612      0.507      0.601        0.561       0.166      0.520     0.318        0.566     0.512
pigz        0.160   0.171   0.142   0.477     0.161        0.152      0.134      0.176        0.168       0.053      0.130     0.087        0.146     0.166
pbzip2      0.050   0.034   0.038   0.019     0.013        0.042      0.041      0.036        0.009       0.042      0.025     0.048        0.030     0.033
MPC_35      13.25   13.20   13.17   13.22     13.17        13.17      12.93      13.05        12.43       12.71      11.83     13.17        12.67     12.92
MPC_52      18.97   18.80   18.76   19.03     18.55        18.75      18.39      18.79        17.51       18.07      16.13     18.96        17.62     18.33

Bold values denote the best value in each column.

Table 8

Single-Precision Decompression Throughput in Gigabytes per Second

Compressor  msg_bt  msg_lu  msg_sp  msg_sppm  msg_sweep3d  num_brain  num_comet  num_control  num_plasma  obs_error  obs_info  obs_spitzer  obs_temp  Average
lzop        0.653   0.812   0.801   0.503     0.768        0.787      0.656      0.794        0.701       0.333      0.597     0.555        0.705     0.667
pigz        0.082   0.082   0.073   0.240     0.079        0.079      0.088      0.085        0.085       0.089      0.081     0.070        0.081     0.093
pbzip2      0.128   0.112   0.131   0.408     0.137        0.109      0.110      0.108        0.097       0.089      0.047     0.117        0.060     0.127
MPC_35      18.85   18.41   18.59   18.98     17.25        18.73      18.22      18.56        17.48       17.75      15.94     18.77        17.61     18.09
MPC_52      18.57   18.17   18.36   18.72     17.13        18.45      17.87      18.35        17.34       17.68      15.83     18.55        17.44     17.88

Bold values denote the best value in each column.

lzop is the fastest general-purpose CPU-based algorithm at compression and decompression, but it is still 28 times slower than MPC. Thus MPC not only compresses better than lzop but also greatly outperforms it in throughput.

pFPC, the implementation with the overall best CR on the double-precision data, is roughly 18 times slower than MPC even when running on 20 CPU cores. GFC is the fastest tested implementation on the double-precision data sets. It is roughly 1.5–2 times faster than MPC, which is expected as GFC is essentially a two-component algorithm whereas MPC has four. However, GFC compresses relatively poorly. Importantly, the single- and double-precision compression and decompression throughput of MPC matches or exceeds that of the PCI-Express v3.0 bus linking the GPU to the CPU in our system, making real-time operation possible.

3.4 Compression and Decompression Energy

Tables 9 and 11 show the energy efficiency when compressing each of the 13 double- and single-precision inputs. Tables 10 and 12 show the same information, but for decompression. As stated before, pFPC does not support single-precision data. Also, the K20 Power Tool does not currently support measuring energy on the Titan X. Thus we provide energy measurements only for MPC_35.

Table 9

Double-Precision Energy Efficiency in Words Compressed per Micro-Joule

Compressor  msg_bt  msg_lu  msg_sp  msg_sppm  msg_sweep3d  num_brain  num_comet  num_control  num_plasma  obs_error  obs_info  obs_spitzer  obs_temp  Average
lzop        0.755   0.827   0.820   0.516     0.817        0.822      0.737      0.820        0.569       0.266      0.748     0.285        0.777     0.674
pigz        0.165   0.180   0.132   0.101     0.178        0.170      0.142      0.183        0.222       0.105      0.162     0.122        0.173     0.157
pbzip2      0.040   0.045   0.039   0.026     0.038        0.042      0.046      0.036        0.008       0.042      0.030     0.051        0.032     0.037
pFPC        0.597   0.532   0.631   0.876     0.562        0.487      0.432      0.456        0.282       0.342      0.152     0.519        0.246     0.470
MPC_35      44.53   30.37   46.54   56.14     53.56        52.80      117.67     52.10        4.134       3.942      6.379     43.46        4.840     39.73

Bold values denote the best value in each column.

Table 10

Double-Precision Energy Efficiency in Words Decompressed per Micro-Joule

Compressor  msg_bt  msg_lu  msg_sp  msg_sppm  msg_sweep3d  num_brain  num_comet  num_control  num_plasma  obs_error  obs_info  obs_spitzer  obs_temp  Average
lzop        0.887   1.061   1.064   0.697     1.014        1.054      0.906      1.053        0.753       0.576      0.903     0.586        0.992     0.888
pigz        0.108   0.105   0.097   0.292     0.104        0.105      0.111      0.113        0.138       0.125      0.109     0.095        0.111     0.124
pbzip2      0.134   0.116   0.128   0.348     0.114        0.109      0.119      0.114        0.146       0.106      0.067     0.136        0.088     0.133
pFPC        1.190   1.086   1.267   2.249     1.200        1.019      0.909      0.932        0.820       0.835      0.445     1.003        0.613     1.044
MPC_35      43.64   31.80   47.52   45.70     20.60        23.23      17.58      26.13        5.748       10.18      3.101     32.46        6.541     24.17

Bold values denote the best value in each column.

Table 11

Single-Precision Energy Efficiency in Words Compressed per Micro-Joule

Compressor  msg_bt  msg_lu  msg_sp  msg_sppm  msg_sweep3d  num_brain  num_comet  num_control  num_plasma  obs_error  obs_info  obs_spitzer  obs_temp  Average
lzop        1.273   1.621   1.605   1.328     1.520        1.605      1.336      1.572        1.450       0.438      1.383     0.841        1.490     1.343
pigz        0.317   0.339   0.282   0.973     0.319        0.300      0.289      0.346        0.341       0.112      0.270     0.175        0.294     0.335
pbzip2      0.097   0.070   0.076   0.037     0.026        0.085      0.084      0.074        0.017       0.085      0.057     0.094        0.064     0.067
MPC_35      50.94   165.1   51.53   84.31     119.36       90.32      107.7      93.18        6.239       7.929      3.620     79.42        3.206     66.37

Bold values denote the best value in each column.

Table 12

Single-Precision Energy Efficiency in Words Decompressed per Micro-Joule

Compressor  msg_bt  msg_lu  msg_sp  msg_sppm  msg_sweep3d  num_brain  num_comet  num_control  num_plasma  obs_error  obs_info  obs_spitzer  obs_temp  Average
lzop        1.708   1.621   1.605   1.328     1.520        1.605      1.336      1.572        1.450       0.438      1.383     0.841        1.490     1.377
pigz        0.210   0.339   0.282   0.973     0.319        0.300      0.289      0.346        0.341       0.112      0.270     0.175        0.294     0.327
pbzip2      0.242   0.070   0.076   0.037     0.026        0.085      0.084      0.074        0.017       0.085      0.057     0.094        0.064     0.078
MPC_35      49.43   109.7   45.83   82.37     10.37        10.46      19.92      135.6        7.269       8.488      2.082     65.69        2.772     42.31

Bold values denote the best value in each column.

We can clearly see that MPC dominates the CPU compressors in terms of energy efficiency. For the CPU compressors, lzop is best for double- and single-precision compression as well as single-precision decompression, while pFPC is best for double-precision decompression.

Data compression on the GPU is between 1.8 and more than 80 times as energy efficient as even the most energy-efficient of the tested CPU compressors. Furthermore, decompression is, on average, more energy efficient than compression for lzop, pbzip2, and pFPC. For pigz and MPC, compression is more efficient than decompression, though, on average, MPC decompression is still over 25 times more energy efficient than CPU decompression.

3.5 Component Count

Fig. 3 illustrates by how much the CR is affected when changing the number of chained components on the double-precision data sets. The bars marked “Best” refer to the single algorithm that yields the highest harmonic-mean CR over all 13 data sets. The other bars refer to the best algorithm found for each individual data set. We used exhaustive search for up to four-component algorithms. Because the runtime of exhaustive search is exponential in the number of components, we had to use a different approach for algorithms with more than four stages. In particular, we chose a genetic algorithm to find good (though not generally the best) algorithms in those vast search spaces. The single-precision results are not shown as they are qualitatively similar.

Fig. 3 Best exhaustive-search-based double-precision CRs with 1, 2, 3, and 4 stages and best genetic-algorithm-based CRs with 5, 6, and 7 stages relative to the best CRs with four stages; the y-axis starts at 0.5 and the cut-off bar for msg_sppm reaches 0.4.

Except on two data sets, a single component suffices to reach roughly 80–90% of the CR achieved with four components. Moreover, the increase in CR when going from one to two stages is generally larger than going from two to three stages, which in turn is larger than going from three to four stages, and so on. For example, looking at the harmonic mean, going from one to two stages, the CR improves by 8.9%; going from two to three stages, it improves by 8.4%, then by 4.8%, 2.3%, and 0.7%; and going from six to seven stages, the harmonic mean no longer improves at all (which is in part because of the genetic algorithm not finding better solutions). In other words, the improvements start to flatten out, that is, algorithms with large numbers of stages do not yield much higher CRs and are slower than the chosen four-stage algorithms. Note that, due to the presence of the NUL component, the algorithms that can be created with a larger number of stages represent a strict superset of the algorithms that can be created with fewer stages. As a consequence, the exhaustive-search results in Fig. 3 monotonically increase with more stages. However, this is not the case for the genetic algorithm, which does not guarantee finding the best solution in the search space.
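
The chapter does not spell out the genetic algorithm, so the following C++ sketch is purely illustrative: a population of component chains evolves by keeping the fitter half and mutating clones of the survivors. The population size, mutation scheme, and the evaluate() stub (which in the real search would run the synthesized compressor over all data sets) are our assumptions.

#include <algorithm>
#include <random>
#include <vector>

using Chain = std::vector<int>;  // one component ID per stage

// Placeholder fitness: the real search synthesizes the chain, runs it
// on all data sets, and returns the harmonic-mean compression ratio.
double evaluate(const Chain&) { return 1.0; }

Chain genetic_search(int stages, int num_components, int generations) {
  std::mt19937 rng(42);
  std::uniform_int_distribution<int> comp(0, num_components - 1);
  std::uniform_int_distribution<int> pick(0, stages - 1);
  std::vector<Chain> pop(64, Chain(stages));
  for (Chain& c : pop)
    for (int& gene : c) gene = comp(rng);  // random initial population
  for (int gen = 0; gen < generations; gen++) {
    // Keep the fitter half, replace the rest with mutated clones.
    std::sort(pop.begin(), pop.end(), [](const Chain& a, const Chain& b) {
      return evaluate(a) > evaluate(b);
    });
    for (std::size_t i = pop.size() / 2; i < pop.size(); i++) {
      pop[i] = pop[i - pop.size() / 2];  // clone a survivor
      pop[i][pick(rng)] = comp(rng);     // mutate one random stage
    }
  }
  return pop[0];  // best chain found
}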

4 Summary and Conclusions

The goal of this work is to determine whether effective algorithms exist for floating-point data compression that are suitable for massively parallel architectures. To this end, we evaluated millions of possible combinations of 24 GPU-friendly algorithmic components to find the best algorithm for each tested data set as well as the best algorithm for all 13 data sets together, both for single- and double-precision representations. This study resulted in well-performing algorithms that have never before been described. A detailed analysis thereof yielded important insights that helped us understand why and how these automatically synthesized algorithms work. This, in turn, enabled us to make predictions as to which algorithms will likely work well on other data sets, which ultimately led to our MPC algorithm. It constitutes a generalization of the best algorithms we found using exhaustive search and requires only two generally known parameters about the input data: the word size (single- or double-precision) and the dimensionality.

It rivals the CRs of the best CPU-based algorithms, which is surprising because MPC is limited to using only algorithmic components that can be easily parallelized and that do not use much internal state. In contrast, some of the CPU compressors we tested utilize megabytes of internal state per thread. We believe the almost stateless operation of MPC makes it a great algorithm for any highly parallel compute device as well as for FPGAs and hardware implementations, for instance, in a NIC.

Our open-source implementation of MPC, which is available at http://cs.txstate.edu/~burtscher/research/MPC/, greatly outperforms all other tested algorithms that compress similarly well in compression and decompression throughput as well as in energy efficiency. Moreover, the throughput of MPC is sufficient for real-time compression/decompression of data transmitted over the PCIe bus or a LAN. Clearly, highly parallel, effective compression algorithms for floating-point data sets do exist.

References

[1] Yang A., Mukka H., Hesaaraki F., Burtscher M. MPC: a massively parallel compression algorithm for scientific data. In: IEEE Cluster Conference, September. 2015.

[2] Balevic A. Parallel variable-length encoding on GPGPUs. In: International Conference on Parallel Processing. Springer-Verlag; 2009:26–35.

[3] Balevic A., Rockstroh L., Wroblewski M., Simon S. Using arithmetic coding for reduction of resulting simulation data size on massively parallel GPGPUs. In: 15th European PVM/MPI Users’ Group Meeting. Springer Verlag; 2008:295–302.

[4] Bicer T., Yiny J., Chiuz D., Agrawal G., Schuchardt K. Integrating online compression to accelerate large-scale data analytics applications. In: International Parallel and Distributed Processing Symposium. 2013.

[5] Burtscher M., Ratanaworabhan P. FPC: a high-speed compressor for double-precision floating-point data. IEEE Trans. Comput. 2009;58(1):18–31.

[6] Burtscher M., Ratanaworabhan P. pFPC: a parallel compressor for floating-point data. In: Data Compression Conference. 2009:43–52.

[7] Chen D., Chiang Y.J., Memon N., Wu X. Lossless geometry compression for steady-state and time-varying irregular grids. In: IEEE Symposium on Visualization. 2006:275–282.

[8] Filgueira R., Singh D.E., Calderón A., Carretero J. CoMPI: enhancing MPI-based applications performance and scalability using run-time compression. In: EUROPVM/MPI. 2009.

[9] Filgueira R., Singh D.E., Carretero J., Calderón A., Garcia F. Adaptive-CoMPI: enhancing MPI-based applications’s performance and scalability by using adaptive compression. Int. J. High Perform. Comput. Appl. 2011;25(1):93–114.

[10] Lindstrom P., Isenburg M. Fast and efficient compression of floating-point data. IEEE Trans. Vis. Comput. Graph. 2006;12(5):1245–1250.

[11] O’Neil M.A., Burtscher M. Floating-point data compression at 75 Gb/s on a GPU. In: Workshop on General Purpose Processing Using GPUs. 2011.

[12] Ozsoy A., Swany M. CULZSS: LZSS lossless data compression on CUDA. In: IEEE International Conference on Cluster Computing. 2011:403–411.

[13] Patel R., Zhang Y., Mak J., Davidson A., Owens J. Parallel lossless data compression on the GPU. Innov. Parallel Comput. 2012;1–9.

[14] Schendel E.R., Jin Y., Shah N., Chen J., Chang C.S., Ku S.H., Ethier S., Klasky S., Latham R., Ross R.B., Samatova N.F. ISOBAR preconditioner for effective and high-throughput lossless data compression. In: 28th Annual IEEE International Conference on Data Engineering (ICDE). 2012:138–149.

[15] http://icl.cs.utk.edu/papi/.

[16] Burtscher M., Zecena I., Zong Z. Measuring GPU power with the K20 built-in sensor. In: Seventh Workshop on General Purpose Processing on Graphics Processing Units, March. 2014.

[17] http://www.bzip.org/.

[18] http://www.gzip.org/.

[19] http://www.lzop.org/.

[20] http://users.ices.utexas.edu/~burtscher/research/pFPC/.

[21] http://cs.txstate.edu/~burtscher/research/GFC/.


^a This chapter is a modified version of our IEEE Cluster 2015 publication [1]. The chapter contains more results as well as updated results from a new implementation that is up to 2.5 times faster than the one described in the conference paper.
