Chapter 2. GPUs: A Breakthrough Technology

The foundation for affordable, scalable high-performance data analytics already exists, built on steady advances in CPU, memory, storage, and networking technologies. As noted in Chapter 1, these evolutionary changes have shifted the performance bottleneck from memory I/O to compute.

To address the need for faster processing at scale, CPUs now contain as many as 32 cores. But even multicore CPUs deployed in large clusters of servers can make sophisticated analytical applications unaffordable for all but a handful of organizations.

A far more cost-effective way to address the compute performance bottleneck today is the graphics processing unit (GPU). GPUs are capable of processing data up to 100 times faster than configurations containing CPUs alone. The reason for such a dramatic improvement is their massively parallel processing capabilities, with some GPUs containing nearly 6,000 cores—upwards of 200 times more than the 16 to 32 cores found in today’s most powerful CPUs. For example, the Tesla V100—powered by the latest NVIDIA Volta GPU architecture, and equipped with 5,120 NVIDIA CUDA cores and 640 NVIDIA Tensor cores—offers the performance of up to 100 CPUs in a single GPU.

The GPU's small, efficient cores are also better suited to executing the same instruction repeatedly across many data elements in parallel, making the GPU ideal for accelerating the processing-intensive workloads common in today's data analysis applications.
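
To make that parallelism concrete, here is a minimal CUDA sketch; the array size and the scaling operation are purely illustrative assumptions. Each element is handled by its own lightweight thread, and the hardware schedules those threads across the GPU's cores.

#include <cstdio>
#include <cuda_runtime.h>

// Every thread applies the same operation to a different array element;
// thousands of these threads execute concurrently across the GPU's cores.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= factor;
    }
}

int main() {
    const int n = 1 << 20;                      // ~1 million elements (illustrative)
    float *d_data;
    cudaMalloc((void**)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // Launch enough 256-thread blocks to cover all n elements.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    std::printf("kernel finished\n");
    return 0;
}

This single-instruction, many-elements structure is what lets a few thousand simple GPU cores outpace a handful of sophisticated CPU cores on data-parallel workloads.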

The Evolution of the GPU

As the name implies, GPUs were initially used to process graphics. The first-generation GPU was installed on a separate video interface card with its own memory (video RAM or VRAM). The configuration was especially popular with gamers who wanted high-quality real-time graphics. Over time, both the processing power and the programmability of the GPU advanced, making it suitable for additional applications.

GPU architectures designed for high-performance computing applications were initially categorized as General-Purpose GPUs (GPGPUs). But the rather awkward GPGPU moniker soon fell out of favor when the industry came to realize that both graphics and data analysis applications share the same fundamental requirement for fast floating-point processing.

Subsequent generations of fully programmable GPUs increased performance in two ways: more cores and faster I/O with the host server’s CPU and memory. NVIDIA’s K80 GPU, for example, contains 4,992 cores. And most GPU accelerator cards now utilize the PCI Express bus with a bidirectional bandwidth of 32 GBps for a 16-lane PCIe interconnect. Although this throughput is adequate for most applications, others stand to benefit from NVIDIA’s NVLink technology, which provides five times the bandwidth (160 GBps) between the CPU and GPU, and among GPUs.
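
The interconnect figures above can be checked empirically. The following CUDA sketch, in which the buffer size is an arbitrary assumption and the measured rate depends entirely on the platform, times a single host-to-device copy and reports the achieved bandwidth over PCIe (or NVLink, where present).

#include <cstdio>
#include <cuda_runtime.h>

// Copies a pinned host buffer to the GPU and reports the transfer rate,
// which is bounded by the PCIe or NVLink interconnect described above.
int main() {
    const size_t bytes = 256ULL << 20;          // 256 MB test buffer (arbitrary)

    float *h_buf, *d_buf;
    cudaMallocHost((void**)&h_buf, bytes);      // pinned host memory for best throughput
    cudaMalloc((void**)&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    std::printf("host-to-device: %.1f GB/s\n", (bytes / 1e9) / (ms / 1e3));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}

On a 16-lane PCIe 3.0 link, measured rates typically fall somewhat below the roughly 16 GBps per-direction half of the 32 GBps bidirectional figure cited above.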

For the latest generation of GPU cards, the memory bandwidth is significantly higher, as illustrated in Figure 2-1, with rates of up to 732 GBps. Compare this to the 68 GBps memory bandwidth of a Xeon E5 CPU, which is just over twice that of a PCIe x16 bus. The combination of such fast I/O feeding several thousand cores enables a GPU card equipped with 16 GB of VRAM to achieve single-precision performance of over 9 teraFLOPS (trillions of floating-point operations per second).
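
As a rough sanity check on that figure, peak single-precision throughput can be estimated as core count times two floating-point operations per cycle (a fused multiply-add) times clock rate. The 3,584-core count and roughly 1.3 GHz clock used below correspond to a Tesla P100-class card, an assumption since the text does not name the specific model:

\[
3584 \;\text{cores} \times 2 \;\frac{\text{FLOPs}}{\text{cycle}} \times 1.3\ \text{GHz} \approx 9.3\ \text{TFLOPS}
\]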

Figure 2-1. The latest generation of GPUs from NVIDIA contain nearly 6,000 cores and deliver peak double-precision processing performance of 7.5 TFLOPS; note also the relatively minor performance improvement over time for multicore x86 CPUs (source: NVIDIA)

“Small” Versus “Big” Data Analytics

The relatively small amount of VRAM on a GPU card compared to the few terabytes of RAM now supported in servers has led some to believe that GPU acceleration is limited to “small data” applications. But that belief ignores two practices common in “big data” applications.

The first is that it is rarely necessary to process an entire dataset at once to achieve the desired results. Managing data in tiers across GPU VRAM, system RAM, and storage (direct-attached storage [DAS], storage area networks [SAN], network-attached storage [NAS], etc.) can deliver virtually unlimited scale for big data workloads. In machine learning, for example, training data can be streamed from memory or storage as needed. Live data streams from the Internet of Things (IoT), or delivered through frameworks such as Kafka or Spark, can be ingested in a similar, "piecemeal continuous" manner, as sketched below.
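
As a concrete sketch of this piecemeal approach, with the dataset size, tile size, and placeholder computation all illustrative assumptions rather than any particular product's implementation, the following CUDA fragment works through a host-resident dataset one VRAM-sized chunk at a time, using two streams so that one tile transfers while another computes.

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder per-element computation; each thread handles one element of a tile.
__global__ void transform(float *chunk, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        chunk[i] = chunk[i] * 2.0f + 1.0f;
    }
}

int main() {
    const size_t total = 64ULL << 20;           // 64M elements in host RAM (illustrative)
    const size_t chunk = 4ULL << 20;            // 4M-element tiles that fit easily in VRAM

    float *h_data;
    cudaMallocHost((void**)&h_data, total * sizeof(float));   // pinned for async copies

    // Two device buffers and streams: one tile computes while the next one copies.
    float *d_buf[2];
    cudaStream_t stream[2];
    for (int s = 0; s < 2; ++s) {
        cudaMalloc((void**)&d_buf[s], chunk * sizeof(float));
        cudaStreamCreate(&stream[s]);
    }

    for (size_t off = 0, tile = 0; off < total; off += chunk, ++tile) {
        int s = (int)(tile % 2);
        size_t n = (total - off < chunk) ? (total - off) : chunk;

        cudaMemcpyAsync(d_buf[s], h_data + off, n * sizeof(float),
                        cudaMemcpyHostToDevice, stream[s]);
        transform<<<(int)((n + 255) / 256), 256, 0, stream[s]>>>(d_buf[s], (int)n);
        cudaMemcpyAsync(h_data + off, d_buf[s], n * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < 2; ++s) {
        cudaFree(d_buf[s]);
        cudaStreamDestroy(stream[s]);
    }
    cudaFreeHost(h_data);
    std::printf("processed %zu elements in %zu-element tiles\n", total, chunk);
    return 0;
}

The same tiling pattern extends naturally to live ingestion: each tile simply becomes the next batch pulled from the message stream rather than the next offset into a preloaded buffer.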

The second practice is the ability to scale GPU-accelerated configurations both up and out. Multiple GPU cards can be placed in a single server, and multiple servers can be configured in a cluster. Such scaling yields more cores and more memory, all working massively in parallel to process data at unprecedented speed. The only real limit to the potential processing power of GPU acceleration is, therefore, the budget.

But whatever the available budget, a GPU-accelerated configuration will always deliver more FLOPS per dollar, because for equivalent compute CPUs are, and are likely to remain, far more expensive than GPUs. So, whether in a single server or a cluster, the GPU database delivers a clear and potentially substantial price/performance advantage.
