Each TensorFlow computation is described in terms of a graph. This allows a natural degree of flexibility in the structure and the placement of operations, which can be distributed across multiple nodes of computation. The graph can be split into multiple subgraphs that are assigned to different nodes in a cluster of servers.
I strongly suggest the reader have a look at the paper Large Scale Distributed Deep Networks, by Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, and Andrew Y. Ng, NIPS 2012, https://research.google.com/archive/large_deep_networks_nips2012.html
One key result of the paper is to prove that it is possible to run distributed stochastic gradient descent (SGD), where multiple nodes work in parallel on data shards and independently and asynchronously update the gradients by sending updates to a parameter server. Quoting the abstract of the paper:
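To make the idea concrete, here is a minimal, framework-free sketch of asynchronous SGD with a parameter server, fitting a one-dimensional linear model. The class and function names (ParameterServer, worker) are hypothetical and for illustration only; a real system would shard data and parameters across machines rather than threads.

```python
import threading
import random

class ParameterServer:
    """Holds the shared parameter; workers push gradient updates to it."""
    def __init__(self, w0, lr):
        self.w = w0
        self.lr = lr
        self.lock = threading.Lock()

    def read(self):
        with self.lock:
            return self.w

    def apply_gradient(self, grad):
        # Each worker applies its update independently, without waiting
        # for the others -- this is what makes the scheme asynchronous.
        with self.lock:
            self.w -= self.lr * grad

def worker(ps, shard, steps):
    for _ in range(steps):
        x, y = random.choice(shard)
        w = ps.read()                  # fetch (possibly stale) parameters
        grad = 2 * (w * x - y) * x     # gradient of the loss (w*x - y)^2
        ps.apply_gradient(grad)

# Data generated from y = 3x, split into two shards (one per worker).
data = [(x, 3.0 * x) for x in [0.5, 1.0, 1.5, 2.0]]
ps = ParameterServer(w0=0.0, lr=0.05)
threads = [threading.Thread(target=worker, args=(ps, data[i::2], 500))
           for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(round(ps.read(), 1))  # converges near the true slope, 3.0
```

Even though each worker may compute its gradient from slightly stale parameters, the updates still drive the shared parameter toward the optimum, which is the central observation of the paper.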
This is well explained by the following picture taken from the paper itself:
Another document worth reading is the white paper TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems, by Martín Abadi et al., November 2015, http://download.tensorflow.org/paper/whitepaper2015.pdf
Considering one of the examples contained in it, we can see a fragment of TensorFlow code, shown on the left in the picture below, which is then represented as the graph on the right:
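The graph idea can be sketched without TensorFlow at all. The following is a conceptual toy (not the TensorFlow API): each named node is an operation over the outputs of its input nodes, and evaluating a node recursively evaluates its dependencies, much like computing ReLU(W*x + b) in the whitepaper's example:

```python
# A dataflow graph as a plain dictionary: node name -> (op, *inputs).
graph = {
    "b":      ("const", 1.0),
    "W":      ("const", 2.0),
    "x":      ("const", 3.0),
    "matmul": ("mul", "W", "x"),       # scalar stand-in for MatMul
    "add":    ("add", "matmul", "b"),
    "relu":   ("relu", "add"),
}

def evaluate(graph, name):
    """Recursively evaluate a node by first evaluating its inputs."""
    op, *args = graph[name]
    if op == "const":
        return args[0]
    vals = [evaluate(graph, a) for a in args]
    if op == "mul":
        return vals[0] * vals[1]
    if op == "add":
        return vals[0] + vals[1]
    if op == "relu":
        return max(0.0, vals[0])

print(evaluate(graph, "relu"))  # 2.0 * 3.0 + 1.0 = 7.0, ReLU leaves it unchanged
```

Because the computation is declared as data rather than executed line by line, the runtime is free to reorder, place, and split the nodes, which is exactly what enables the distributed execution discussed next.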
A graph can be partitioned across multiple nodes by keeping computations local and by transparently adding remote communication nodes to the graph when needed. This is well explained in the following figure, again taken from the paper mentioned earlier:
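A minimal sketch of this rewriting step, under the assumption that each node has been assigned a device: whenever an edge crosses a device boundary, the partitioner replaces it with a send node on the producer's device and a matching recv node on the consumer's device. The names and the edge-list representation here are illustrative, not TensorFlow internals:

```python
# Edges of a small graph and a (hypothetical) device placement.
edges = [("x", "matmul"), ("W", "matmul"), ("matmul", "add"), ("b", "add")]
placement = {"x": "cpu:0", "W": "gpu:0", "b": "cpu:0",
             "matmul": "gpu:0", "add": "cpu:0"}

def partition(edges, placement):
    """Rewrite cross-device edges to route through send/recv pairs."""
    rewritten = []
    for src, dst in edges:
        if placement[src] == placement[dst]:
            rewritten.append((src, dst))       # local edge: unchanged
        else:
            send = f"send({src})@{placement[src]}"
            recv = f"recv({src})@{placement[dst]}"
            rewritten += [(src, send), (send, recv), (recv, dst)]
    return rewritten

for edge in partition(edges, placement):
    print(edge)
```

Here x -> matmul and matmul -> add cross the cpu/gpu boundary and each becomes a three-edge send/recv path, while the two local edges pass through untouched; the rest of the graph never needs to know that communication happened.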
Gradient descent and all the major optimization algorithms can be computed either in a centralized way (left side of the figure below) or in a distributed way (right side). The latter involves a master process that talks to multiple workers provisioning both GPUs and CPUs:
Distributed computation can be either synchronous (all the workers update the gradient on sharded data at the same time) or asynchronous (the updates do not happen at the same time). The latter typically offers higher scalability, with larger graph computations still working pretty well in terms of convergence to an optimal solution. Again, the pictures are taken from the TensorFlow white paper, and I strongly encourage the interested reader to have a look at it to learn more:
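For contrast with the asynchronous scheme sketched earlier, here is a minimal sketch of the synchronous variant: at every step each worker computes a gradient on its own data shard, and the parameter is updated once with the averaged gradient, i.e. there is a barrier per step. The shard layout and step count are illustrative assumptions:

```python
# Two data shards generated from y = 3x; one shard per worker.
shards = [[(0.5, 1.5), (1.0, 3.0)], [(1.5, 4.5), (2.0, 6.0)]]
w, lr = 0.0, 0.1

for step in range(200):
    grads = []
    for shard in shards:  # in a real cluster these run in parallel
        g = sum(2 * (w * x - y) * x for x, y in shard) / len(shard)
        grads.append(g)
    # Barrier: a single update only after every worker has reported.
    w -= lr * sum(grads) / len(grads)

print(round(w, 1))  # converges near the true slope, 3.0
```

The synchronous version behaves exactly like large-batch SGD, which makes convergence easy to reason about, but each step is gated by the slowest worker; dropping the barrier (as in the asynchronous example above) removes that bottleneck at the cost of applying gradients computed from stale parameters.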