The history of high-performance computing

HPC has always pushed the limits of computing in order to deliver scientific discoveries. Fundamental shifts in processor architecture and design have helped cross successive FLOP barriers, starting from Mega Floating-Point Operations (MFLOPs) to performing PetaFLOPs of calculations per second today.

Floating-Point Operations per second (FLOPS) is the fundamental unit for measuring the theoretical peak performance of any compute processor. A MegaFLOP is 10 to the 6th power FLOPS; a PetaFLOP is 10 to the 15th power FLOPS.
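The theoretical peak of a processor is typically calculated as the number of cores multiplied by the clock rate and by the FLOPs each core can retire per cycle. The following minimal C sketch works through the arithmetic; the core count, clock rate, and FLOPs-per-cycle figures are hypothetical round numbers, not the specifications of any real processor:

#include <stdio.h>

int main(void) {
    /* Hypothetical processor: 16 cores at 3 GHz, each retiring
       16 FLOPs per cycle (for example, via wide SIMD units). */
    double cores = 16.0;
    double clock_hz = 3.0e9;
    double flops_per_cycle = 16.0;

    double peak = cores * clock_hz * flops_per_cycle;
    printf("Theoretical peak: %.0f GigaFLOPS\n", peak / 1.0e9);  /* 768 */
    return 0;
}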
Instruction-Level Parallelism (ILP) is a concept wherein independent instructions can execute at the same time. For instructions to execute in parallel, they must not depend on each other's results. All modern CPU architectures (and GPU architectures, too) provide pipelines of five to 15+ stages to allow for faster clock rates, and independent instructions are what keep those pipelines full. Consider the following three instructions:

Instr 1: add = inp1 + inp2
Instr 2: mult = inp1 * inp2
Instr 3: final_result = mult / add

The operations calculating the mult and add variables do not depend on each other, so they can execute simultaneously. final_result, on the other hand, depends on the results of the Instr 1 and Instr 2 operations, so it cannot be calculated until add and mult are available.
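As a compilable C version of the same example (the variable names mirror the pseudocode above), a superscalar CPU or an optimizing compiler is free to overlap the first two statements because neither reads the other's result:

#include <stdio.h>

int main(void) {
    float inp1 = 4.0f, inp2 = 2.0f;

    float add  = inp1 + inp2;        /* Instr 1 */
    float mult = inp1 * inp2;        /* Instr 2: independent of Instr 1,
                                        so it may issue in the same cycle */
    float final_result = mult / add; /* Instr 3: must wait for add and mult */

    printf("final_result = %f\n", final_result);
    return 0;
}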

When we look at the history of HPC in terms of the technology changes that fundamentally reshaped processor design and, in turn, the scientific community, three stand out and can be referred to as epochs:

  • Epoch 1: The history of the supercomputer goes back to the CRAY-1, a single vector CPU architecture providing a peak of 160 MegaFLOPs (MFLOPs) of compute power.
  • Epoch 2: The MegaFLOP barrier was crossed by moving from a single-core to a multi-core design in the CRAY-2, a 4-core vector CPU that delivered 2 GigaFLOPs of peak performance.
  • Epoch 3: Crossing the GigaFLOP barrier was a fundamental shift and required compute nodes to work together, communicating over a network to deliver higher performance. The Cray T3D was one of the first machines to deliver 1 TeraFLOP of compute performance. Its network was a 3D torus providing a bandwidth of 300 MB/s, and it was the first significant implementation of a rich shell around a standard microprocessor.

After this, for almost 20 years, there were no fundamental shifts. Technological progress focused primarily on three architectural improvements:

  • Moving from an 8-bit to a 16-bit to a 32-bit and now a 64-bit instruction set
  • Increasing ILP
  • Increasing the number of cores

This was supported by increasing the clock rate, which today stands at around 4 GHz. These gains were possible because of the fundamental laws that drove the semiconductor industry.


Moore's Law: This law observes that the number of transistors in a dense integrated circuit doubles approximately every two years.

Moore's prediction proved accurate for several decades and largely still holds. Moore's Law is an observation and projection of a historical trend, not a law of physics.

Dennard scaling: This is a scaling law that kept Moore's Law alive. Dennard observed a relationship between transistor size and power density and summarized it in the following formula:

P = Q·f·C·V² + V·Ileakage

In this equation, Q is the number of transistors, f is the operating frequency, C is the capacitance, V is the operating voltage, and Ileakage is the leakage current. The first term is the dynamic (switching) power; the second is the static (leakage) power.
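To see the two terms at work, here is a short worked sketch in C; every value below is a made-up round number chosen only to illustrate the formula, not a measurement from a real chip:

#include <stdio.h>

int main(void) {
    /* Hypothetical, illustrative values only. */
    double Q = 1.0e9;        /* number of transistors                   */
    double f = 3.0e9;        /* operating frequency (Hz)                */
    double C = 1.0e-17;      /* switched capacitance per transistor (F) */
    double V = 1.0;          /* operating voltage (V)                   */
    double I_leakage = 10.0; /* total leakage current (A)               */

    /* Dynamic (switching) power plus static (leakage) power. */
    double P = Q * f * C * V * V + V * I_leakage;
    printf("P = %.0f W (dynamic) + %.0f W (leakage) = %.0f W\n",
           Q * f * C * V * V, V * I_leakage, P);
    return 0;
}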

Dennard scaling and Moore's Law are related: because power density remains roughly constant as transistors shrink, reducing transistor size allows more and more transistors per chip cost-effectively, without a corresponding rise in total chip power.

Under Dennard scaling rules, the total chip power for a given die size stayed the same across many processor generations. Transistor count doubled while feature size kept shrinking (at a 1/S rate), and frequency increased by 40% every two years. This stopped once feature sizes dropped below 65 nm, as these rules could no longer be sustained: the leakage current grew exponentially. To reduce the effect of leakage current, innovations in the switching process were introduced, but these breakthroughs were still not sufficient to restore voltage scaling. The voltage remained constant at around 1 V across many processor designs, and it was no longer possible to keep the power envelope constant. This is popularly known as the power wall.

Dennard scaling held from roughly 1977 until 1997 and then began to fade. As a result, between 2007 and 2017, processors went from 45 nm down to 16 nm features, but this came with roughly a threefold increase in energy consumed per unit of chip area.

At the same time, pipelines went from five stages to 15+ in the latest architectures. To keep the instruction pipeline full, advanced techniques such as speculation were introduced. The speculation unit predicts the program's behavior, such as branch outcomes and memory addresses. If a prediction is accurate, execution proceeds; otherwise, the processor undoes the speculative work and restarts. Deep pipelines, combined with the way legacy software is written, resulted in unused transistors and wasted clock cycles, meaning no performance improvement for the application.
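The classic sorted-versus-unsorted pattern illustrates the cost of misprediction. The following C sketch is a hedged illustration (the magnitude of the effect varies by processor and compiler): it contains a data-dependent branch that the predictor handles well on sorted input and poorly on random input:

#include <stdio.h>
#include <stdlib.h>

/* Sums only the large elements. The data-dependent branch is easy
   to predict when the data is sorted (long runs of the same outcome)
   and hard to predict when it is random, in which case mispredictions
   force the pipeline to discard speculative work and restart. */
long sum_large(const int *data, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++)
        if (data[i] >= 128)  /* unpredictable on random data */
            sum += data[i];
    return sum;
}

int main(void) {
    enum { N = 1 << 20 };
    int *data = malloc(N * sizeof *data);
    for (int i = 0; i < N; i++)
        data[i] = rand() % 256;
    printf("sum = %ld\n", sum_large(data, N));
    free(data);
    return 0;
}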

Then came the GPU, which was primarily used for graphics processing. A researcher named Mark Harris was among the first to use GPUs for non-graphics tasks, and the new term General-Purpose Computation on GPUs (GPGPU) was coined. GPUs proved efficient at certain tasks that fall into the category of data parallelism. Unsurprisingly, most compute-intensive tasks in HPC applications are data-parallel in nature; they are dominated by matrix-matrix multiplications, a routine in the Basic Linear Algebra Subprograms (BLAS) that is used extensively.

The only problem for users when it came to adopting GPUs was that they had to understand the graphics pipeline: the only interface provided for computation on the GPU centered around shader execution. A more general interface, familiar to developers in the HPC community, was needed. This was solved by the introduction of CUDA in 2007.
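As a minimal sketch of what such a general interface looks like, here is a naive CUDA kernel for the data-parallel matrix-matrix multiplication mentioned earlier. It is a simplified stand-in for a tuned BLAS routine; the square-matrix layout, block size, and buffer names are our assumptions:

__global__ void matmul(const float *A, const float *B, float *C, int n) {
    /* Each thread computes one element of C; all n*n elements are
       independent, which is exactly the data parallelism GPUs exploit. */
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; k++)
            sum += A[row * n + k] * B[k * n + col];
        C[row * n + col] = sum;
    }
}

/* Host-side launch, assuming device buffers d_A, d_B, and d_C of
   n*n floats have already been allocated and initialized:
       dim3 block(16, 16);
       dim3 grid((n + 15) / 16, (n + 15) / 16);
       matmul<<<grid, block>>>(d_A, d_B, d_C, n);
*/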

While GPU architecture is bound by the same laws (Moore's Law and Dennard scaling), its processor design takes a different approach, dedicating far more transistors to arithmetic units than to caches and control logic, and thereby achieving higher performance than traditional homogeneous architectures.

The following diagram shows the evolution of computer architecture from sequential processing to distributed memory and its impact on programming models:

With GPUs added to existing servers, applications run on two types of processors (CPU and GPU), which brings in the notion of heterogeneity. This is what we will introduce in the next section.
