Each TensorFlow computation is described in terms of a graph. This allows a natural degree of flexibility in the structure and the placement of operations, which can be distributed across multiple nodes of computation. The graph can be split into multiple subgraphs that are assigned to different nodes in a cluster of servers.
I strongly suggest the reader have a look at the paper Large Scale Distributed Deep Networks, by Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, and Andrew Y. Ng, NIPS 2012, https://research.google.com/archive/large_deep_networks_nips2012.html
One key result of the paper is to prove that it is possible to run distributed stochastic gradient descent (SGD), where multiple nodes work in parallel on data shards and independently and asynchronously update the gradients by sending updates to a parameter server. Quoting the abstract of the paper:
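To make the idea concrete, here is a minimal, framework-free sketch of asynchronous SGD with a parameter server, fitting a one-dimensional linear model. The class and function names (ParameterServer, worker) are hypothetical and for illustration only; a real system would shard data and parameters across machines rather than threads.

```python
import threading
import random

class ParameterServer:
    """Holds the shared parameter; workers push gradient updates to it."""
    def __init__(self, w0, lr):
        self.w = w0
        self.lr = lr
        self.lock = threading.Lock()

    def read(self):
        with self.lock:
            return self.w

    def apply_gradient(self, grad):
        # Each worker applies its update independently, without waiting
        # for the others -- this is what makes the scheme asynchronous.
        with self.lock:
            self.w -= self.lr * grad

def worker(ps, shard, steps):
    for _ in range(steps):
        x, y = random.choice(shard)
        w = ps.read()                  # fetch (possibly stale) parameters
        grad = 2 * (w * x - y) * x     # gradient of the loss (w*x - y)^2
        ps.apply_gradient(grad)

# Data generated from y = 3x, split into two shards (one per worker).
data = [(x, 3.0 * x) for x in [0.5, 1.0, 1.5, 2.0]]
ps = ParameterServer(w0=0.0, lr=0.05)
threads = [threading.Thread(target=worker, args=(ps, data[i::2], 500))
           for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(round(ps.read(), 1))  # converges near the true slope, 3.0
```

Even though each worker may compute its gradient from slightly stale parameters, the updates still drive the shared parameter toward the optimum, which is the central observation of the paper.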
This is well explained by the following picture taken from the paper itself:
Another document worth reading is the white paper TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems, by Martín Abadi et al., November 2015, http://download.tensorflow.org/paper/whitepaper2015.pdf
Considering one of the examples contained in it, we can see a fragment of TensorFlow code, shown on the left in the picture below, which is then represented as the graph on the right:
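The graph idea can be sketched without TensorFlow at all. The following is a conceptual toy (not the TensorFlow API): each named node is an operation over the outputs of its input nodes, and evaluating a node recursively evaluates its dependencies, much like computing ReLU(W*x + b) in the whitepaper's example:

```python
# A dataflow graph as a plain dictionary: node name -> (op, *inputs).
graph = {
    "b":      ("const", 1.0),
    "W":      ("const", 2.0),
    "x":      ("const", 3.0),
    "matmul": ("mul", "W", "x"),       # scalar stand-in for MatMul
    "add":    ("add", "matmul", "b"),
    "relu":   ("relu", "add"),
}

def evaluate(graph, name):
    """Recursively evaluate a node by first evaluating its inputs."""
    op, *args = graph[name]
    if op == "const":
        return args[0]
    vals = [evaluate(graph, a) for a in args]
    if op == "mul":
        return vals[0] * vals[1]
    if op == "add":
        return vals[0] + vals[1]
    if op == "relu":
        return max(0.0, vals[0])

print(evaluate(graph, "relu"))  # 2.0 * 3.0 + 1.0 = 7.0, ReLU leaves it unchanged
```

Because the computation is declared as data rather than executed line by line, the runtime is free to reorder, place, and split the nodes, which is exactly what enables the distributed execution discussed next.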
A graph can be partitioned across multiple nodes by keeping computations local and by transparently adding remote communication nodes to the graph when needed. This is well explained in the following figure, again taken from the paper mentioned earlier:
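A minimal sketch of this rewriting step, under the assumption that each node has been assigned a device: whenever an edge crosses a device boundary, the partitioner replaces it with a send node on the producer's device and a matching recv node on the consumer's device. The names and the edge-list representation here are illustrative, not TensorFlow internals:

```python
# Edges of a small graph and a (hypothetical) device placement.
edges = [("x", "matmul"), ("W", "matmul"), ("matmul", "add"), ("b", "add")]
placement = {"x": "cpu:0", "W": "gpu:0", "b": "cpu:0",
             "matmul": "gpu:0", "add": "cpu:0"}

def partition(edges, placement):
    """Rewrite cross-device edges to route through send/recv pairs."""
    rewritten = []
    for src, dst in edges:
        if placement[src] == placement[dst]:
            rewritten.append((src, dst))       # local edge: unchanged
        else:
            send = f"send({src})@{placement[src]}"
            recv = f"recv({src})@{placement[dst]}"
            rewritten += [(src, send), (send, recv), (recv, dst)]
    return rewritten

for edge in partition(edges, placement):
    print(edge)
```

Here x -> matmul and matmul -> add cross the cpu/gpu boundary and each becomes a three-edge send/recv path, while the two local edges pass through untouched; the rest of the graph never needs to know that communication happened.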
Gradient descent and all the major optimization algorithms can be computed either in a centralized way (left side of the figure below) or in a distributed way (right side). The latter involves a master process that talks to multiple workers provisioning both GPUs and CPUs:
Distributed computation can be either synchronous (all the workers update the gradient on sharded data at the same time) or asynchronous (the updates do not happen at the same time). The latter typically offers higher scalability, with larger graph computations still working pretty well in terms of convergence to an optimal solution. Again, the pictures are taken from the TensorFlow white paper, and I strongly encourage the interested reader to have a look at it to learn more:
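For contrast with the asynchronous scheme sketched earlier, here is a minimal sketch of the synchronous variant: at every step each worker computes a gradient on its own data shard, and the parameter is updated once with the averaged gradient, i.e. there is a barrier per step. The shard layout and step count are illustrative assumptions:

```python
# Two data shards generated from y = 3x; one shard per worker.
shards = [[(0.5, 1.5), (1.0, 3.0)], [(1.5, 4.5), (2.0, 6.0)]]
w, lr = 0.0, 0.1

for step in range(200):
    grads = []
    for shard in shards:  # in a real cluster these run in parallel
        g = sum(2 * (w * x - y) * x for x, y in shard) / len(shard)
        grads.append(g)
    # Barrier: a single update only after every worker has reported.
    w -= lr * sum(grads) / len(grads)

print(round(w, 1))  # converges near the true slope, 3.0
```

The synchronous version behaves exactly like large-batch SGD, which makes convergence easy to reason about, but each step is gated by the slowest worker; dropping the barrier (as in the asynchronous example above) removes that bottleneck at the cost of applying gradients computed from stale parameters.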