3.10 SM

A stream multiprocessor is an SIMD or MIMD machine whose constituent processors are streaming processors (SPs), also called thread processors. A stream processor is a processor that operates on data streams, and its instruction set architecture (ISA) contains kernels to process these streams [32]. Stream processing is closely associated with the graphics processing unit (GPU), which is thereby able to perform compute-intensive general-purpose computations; the GPU thus becomes a general-purpose GPU (GPGPU). Examples of data streams are vectors of floating-point numbers or groups of frame pixels in video data processing. This type of data shows temporal and spatial locality. Temporal locality means an input data stream is used only a few times to produce the output stream. Spatial locality means the input data stream is located in the same memory block. A successful example of a stream multiprocessor is found in recent generations of GPUs such as Fermi from NVIDIA [33].
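The kernel-over-stream model can be sketched in plain Python. This is an illustrative emulation only, not the text's own code; the names `saxpy_kernel` and `run_kernel` are invented here, and the hardware would run every element in a separate parallel thread rather than a loop.

```python
# Sketch of the stream model: a "kernel" is applied uniformly to every
# element of the input streams to produce the output stream.
# Example kernel: saxpy, y = a*x + y, over two input streams.

def saxpy_kernel(a, xi, yi):
    """Per-element kernel: each input element is read once and used once
    (producer-consumer locality)."""
    return a * xi + yi

def run_kernel(a, x, y):
    # A real stream processor assigns each element to its own thread;
    # here parallel execution is emulated with a sequential loop.
    return [saxpy_kernel(a, xi, yi) for xi, yi in zip(x, y)]

out = run_kernel(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
print(out)  # [12.0, 24.0, 36.0]
```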

Applications suited for SM must satisfy three characteristics [34]:

1. Compute intensity

2. Data parallelism

3. Consumer–producer locality, that is, temporal and spatial locality

Compute intensity is defined as the number of arithmetic operations per I/O or global memory reference. In applications suited to stream processing, this ratio can reach 50:1 or more. Data parallelism means the same operation is performed on all data in an input stream in parallel. Producer–consumer locality means data are read once and used once, or a few times, to produce the output stream. GPUs such as NVIDIA's Fermi can sustain tens of thousands of parallel threads.
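Dense matrix multiplication is a standard illustration of high compute intensity: an n × n multiply performs about 2n³ arithmetic operations on about 3n² memory references (read A and B, write C), so the ratio grows linearly with n. The following sketch computes this ratio; the function name is made up for illustration.

```python
# Compute intensity = arithmetic operations / memory references.
# For an n x n dense matrix multiply C = A * B:
#   arithmetic ops  ~ 2*n**3  (n multiplies and ~n adds per output element)
#   memory refs     ~ 3*n**2  (each matrix touched once, assuming caching)

def matmul_intensity(n):
    ops = 2 * n ** 3
    refs = 3 * n ** 2
    return ops / refs  # simplifies to 2*n/3, growing linearly with n

print(matmul_intensity(75))   # 50.0 -- already at the 50:1 threshold
print(matmul_intensity(128))  # about 85 operations per memory reference
```

Even a modest 75 × 75 multiply reaches the 50:1 ratio quoted above, which is why dense linear algebra maps so well onto stream multiprocessors.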

Data suited for stream multiprocessing use the local caches with few cache misses, since the data exhibit locality. This mitigates the problem of long memory latency [32]. In short, an SM or a GPU is suited for applications that process long sequences of data using thousands of threads.

Figure 3.6 shows a block diagram of the Fermi GPU from NVIDIA. Fermi has 3 billion transistors and 512 cores, or SPs, and is capable of delivering up to 1.5 tera floating-point operations per second (TFLOPS). Figure 3.6a is a simplified view of the Fermi GPU stream multiprocessor. It consists of 16 stream multiprocessor (SM) blocks sharing an L2 cache. Surrounding the SMs are six 64-bit interfaces to dynamic random access memory (DRAM), giving a 384-bit-wide path to memory.

Figure 3.6 Simplified view of the Fermi GPU stream multiprocessor. (a) Block diagram of the stream multiprocessor. (b) Block diagram of the stream processor or thread processor. (c) Block diagram of the CUDA core processor. INT: integer unit; FP: floating point unit; LD: load unit; ST: store unit; SFU: special function unit.

© NVIDIA Corporation 2009.


Figure 3.6b is a simplified expanded view of one of the 16 SM blocks of Fig. 3.6a. Each SM block consists of 32 stream processors, or thread processors, labeled SP and arranged in the four columns shown in the figure; 16 SM blocks of 32 SPs each give the 512 cores of Fig. 3.6a. Instructions arrive at, and are scheduled by, the block labeled instruction at the top of the figure. The interconnection network block provides communication between the SPs and the L1 cache at the bottom of the figure. The block labeled SFU is a special function unit capable of evaluating elementary functions, such as square roots and trigonometric functions, that are common in scientific applications.

Figure 3.6c is an expanded view of one of the SP blocks in Fig. 3.6b. These blocks are called compute unified device architecture (CUDA) cores; each contains a full integer arithmetic and logic unit (ALU) and a floating-point unit, and can perform both integer and floating-point arithmetic operations.
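The CUDA programming model that runs on these cores launches one thread per data element, with every thread executing the same kernel body on its own index. The sketch below emulates that model in plain Python rather than actual CUDA C; the names `vec_add_kernel` and `launch` are invented for illustration, and a real launch would run the threads in parallel on the SPs.

```python
# Emulation of the CUDA execution model: every "thread" runs the same
# kernel body, distinguished only by its thread index.

def vec_add_kernel(thread_id, a, b, c):
    # Body executed by every thread; thread_id selects this thread's element.
    if thread_id < len(c):          # bounds guard, as in real CUDA kernels
        c[thread_id] = a[thread_id] + b[thread_id]

def launch(kernel, n_threads, *args):
    # Emulate launching n_threads parallel threads with a sequential loop.
    for tid in range(n_threads):
        kernel(tid, *args)

a, b = [1, 2, 3, 4], [10, 20, 30, 40]
c = [0] * 4
launch(vec_add_kernel, 4, a, b, c)
print(c)  # [11, 22, 33, 44]
```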

Figure 3.7 compares the ratio of the different resources allocated to a CPU versus a GPU. General-purpose computers have a CPU that performs sophisticated control operations such as branch prediction, which is why the area allocated to control in a CPU is about eight times larger than in a GPU. A GPU dedicates less area to control and instead hides cache misses and long memory latency by switching among its thousands of threads and by exploiting the data locality of streams. The ALU resources in a GPU are larger, since a stream multiprocessor dedicates most of its area to ALUs. Finally, the DRAM is almost the same size in both systems.

Figure 3.7 Ratio of the different resources allocated to a CPU versus a GPU.
