Components of TPUs

In all the deep learning models covered in this book, irrespective of the learning paradigm, three basic computations are required: multiplication, addition, and the application of an activation function.

The first two make up matrix multiplication: the weight matrix W must be multiplied by the input matrix X, generally expressed as W^T X. Matrix multiplication is computationally expensive on a CPU, and although a GPU parallelizes the operation, there is still room for improvement.
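
As a concrete illustration, a single dense layer reduces to exactly these three operations. Here is a minimal NumPy sketch; the shapes and the choice of ReLU are arbitrary, for illustration only:

```python
import numpy as np

# One dense layer: multiply, add, activate.
# Shapes are arbitrary: 784 inputs, 128 outputs, a batch of 32 columns.
W = np.random.randn(784, 128).astype(np.float32)  # weight matrix W
b = np.zeros(128, dtype=np.float32)               # bias vector
X = np.random.randn(784, 32).astype(np.float32)   # input matrix X

Z = W.T @ X + b[:, None]   # W^T X: the multiplications and additions
A = np.maximum(Z, 0.0)     # the activation function (ReLU here)
```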

The TPU has a matrix multiply unit (MXU) containing 65,536 8-bit integer multiplier-accumulators, giving a peak throughput of 92 TOPS (tera-operations per second). The major difference between GPU and TPU multiplication is that GPUs use floating-point multipliers, while the TPU uses 8-bit integer multipliers. The TPU also contains a Unified Buffer (UB), 24 MiB of SRAM that works as registers, and an Activation Unit (AU), which contains hardwired activation functions.
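
A practical consequence of 8-bit multipliers is that floating-point weights and inputs must be quantized to integers before the multiply, and the 32-bit accumulated results rescaled afterwards. The sketch below uses a simple symmetric linear quantization scheme; it is illustrative and not the TPU's exact pipeline:

```python
import numpy as np

def quantize(x, num_bits=8):
    """Symmetric linear quantization of a float array to signed ints."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for 8 bits
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

W = np.random.randn(128, 784).astype(np.float32)
X = np.random.randn(784, 32).astype(np.float32)

Wq, w_scale = quantize(W)
Xq, x_scale = quantize(X)

# Integer multiply-accumulate in int32 (as the MXU accumulators do),
# then a single float rescale recovers the approximate real product.
Z_int = Wq.astype(np.int32) @ Xq.astype(np.int32)
Z_approx = Z_int * (w_scale * x_scale)

print(np.abs(Z_approx - W @ X).max())  # small quantization error
```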

The MXU is implemented using a systolic array architecture: an array of Arithmetic Logic Units (ALUs), each connected to a small number of nearest neighbors in a mesh topology. Each data value is read only once but reused for many different operations as it flows through the ALU array, without being written back to a register in between. The ALUs perform only multiplications and additions, in a fixed pattern, so the MXU is highly optimized for matrix multiplication but not suited to general-purpose computation.
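
The flow of data through a systolic array can be mimicked in software. The sketch below computes C = A·B the way a weight-stationary mesh would: the weights stay in place while inputs stream through and partial sums accumulate from cell to cell. It is a simplified functional model, not cycle-accurate:

```python
import numpy as np

def systolic_matmul(A, B):
    """Weight-stationary systolic-style matmul: each (i, j) cell holds
    B[i, j]; inputs stream through and partial sums flow onward.
    Functionally identical to A @ B, structured to mirror the data flow."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    # Each input value A[r, i] is read once and reused across all m cells
    # in row i of the array, never returning to a register in between.
    for r in range(n):             # rows of A stream through the array
        acc = np.zeros(m)          # partial sums moving through the mesh
        for i in range(k):         # one step per row of stationary weights
            acc += A[r, i] * B[i]  # multiply-and-add at each cell
        C[r] = acc
    return C

A = np.random.randn(4, 3)
B = np.random.randn(3, 5)
assert np.allclose(systolic_matmul(A, B), A @ B)
```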

Each TPU also has an off-chip 8 GiB DRAM pool, called weight memory. The chip has a four-stage pipeline and executes CISC instructions. The production workloads Google reported running on TPUs at the time consisted of six neural network applications: two MLPs, two CNNs, and two LSTMs.

The TPU is programmed with the help of high-level instructions; some of the instructions used to program the TPU are the following (a sketch of how one inference pass might sequence them follows the list):

  • Read_Weights: Reads weights from weight memory into the MXU
  • Read_Host_Memory: Reads input data from host memory into the Unified Buffer
  • MatrixMultiply/Convolve: Multiplies or convolves the weights with the data and accumulates the results
  • Activate: Applies the activation function
  • Write_Host_Memory: Writes the results back to host memory
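
To make the control flow concrete, here is a hypothetical software model of one inference pass. The comments map each step to an instruction above, but the function and variable names are invented for illustration:

```python
import numpy as np

def run_inference(host_inputs, host_weights):
    """Hypothetical model of one TPU inference pass (illustrative only)."""
    unified_buffer = host_inputs                     # Read_Host_Memory
    mxu_weights = host_weights                       # Read_Weights
    accumulators = mxu_weights.T @ unified_buffer    # MatrixMultiply
    activations = np.maximum(accumulators, 0.0)      # Activate (ReLU here)
    return activations                               # Write_Host_Memory

X = np.random.randn(784, 32).astype(np.float32)  # input batch
W = np.random.randn(784, 128).astype(np.float32) # layer weights
result = run_inference(X, W)                     # shape (128, 32)
```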

Google has created an API stack to facilitate TPU programming; it translates the API calls from TensorFlow graphs into TPU instructions.
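
In practice, this means user code stays at the TensorFlow level and never issues TPU instructions directly. A minimal sketch of targeting a TPU from TensorFlow 2.x follows; it assumes a TPU runtime is attached (for example, on Cloud TPU or Colab), and the exact API has changed across TensorFlow versions:

```python
import tensorflow as tf

# Connect to the attached TPU runtime and initialize it.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    # Any model built in this scope is compiled for and replicated across
    # the TPU cores; the graph-to-TPU-instruction translation is handled
    # by the stack, not by user code.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
```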
