How to do it...

We proceed with the recipe as follows:

  1. Consider this piece of code, where we specify the cluster architecture with one parameter server running on 192.168.1.1:1111 and two workers running on 192.168.1.2:1111 and 192.168.1.3:1111, respectively.
import sys
import tensorflow as tf

# specify the cluster's architecture
cluster = tf.train.ClusterSpec({'ps': ['192.168.1.1:1111'],
                                'worker': ['192.168.1.2:1111',
                                           '192.168.1.3:1111']
                                })
  2. Note that the code is replicated on multiple machines, so it is important to know the role of the current execution node. We get this information from the command line (an example of how the script is launched on each machine follows this snippet). A machine can be either a worker or a parameter server (ps).
# parse the command line to specify the role of this machine
job_type = sys.argv[1]       # job type: "worker" or "ps"
task_idx = int(sys.argv[2])  # index of the job in the worker or ps list,
                             # as defined in the ClusterSpec
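For example, assuming the script is saved as distributed_train.py (the file name here is just a placeholder), the three machines in the ClusterSpec above would each launch the same script with different arguments:

# on 192.168.1.1 - the parameter server
python distributed_train.py ps 0
# on 192.168.1.2 - the first worker
python distributed_train.py worker 0
# on 192.168.1.3 - the second worker
python distributed_train.py worker 1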
  3. Run the training server where, given the cluster, we bless each computational node with a role (either worker or ps) and an id.
# create TensorFlow Server. This is how the machines communicate.
server = tf.train.Server(cluster, job_name=job_type, task_index=task_idx)
  4. The computation differs according to the role of the specific computational node:
    • If the role is a parameter server, then the condition is to join the server. In this case there is no code to execute, because the workers continuously push updates and the only thing the parameter server has to do is wait.
    • Otherwise, the worker code is executed on a specific device within the cluster. This part of the code is similar to the one executed on a single machine, where we first build the model and then train it locally. Note that all the distribution of the work and the collection of the updated results is done transparently by TensorFlow. TensorFlow provides a convenient tf.train.replica_device_setter that automatically assigns operations to devices (a minimal sketch that fills in the placeholders follows the code below).
# the parameter server is updated by remote clients.
# it will not proceed beyond this if statement.
if job_type == 'ps':
    server.join()
else:
    # workers only
    with tf.device(tf.train.replica_device_setter(
            worker_device='/job:worker/task:' + str(task_idx),
            cluster=cluster)):
        # build your model here as if you were using only a single machine

    with tf.Session(server.target):
        # train your model here
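As a reference for the two placeholders above, the following is a minimal sketch of what the worker branch might look like with TensorFlow 1.x. The toy linear-regression model, the synthetic data, and the number of training steps are purely illustrative choices and not part of the recipe:

import numpy as np

# illustrative data only: y = 2x + 1 plus a little noise
x_train = np.random.rand(100, 1).astype(np.float32)
y_train = 2 * x_train + 1 + 0.01 * np.random.randn(100, 1).astype(np.float32)

if job_type == 'ps':
    server.join()
else:
    # ops created in this scope are placed on the local worker, while the
    # variables are placed on the parameter server by replica_device_setter
    with tf.device(tf.train.replica_device_setter(
            worker_device='/job:worker/task:' + str(task_idx),
            cluster=cluster)):
        x = tf.placeholder(tf.float32, shape=[None, 1])
        y = tf.placeholder(tf.float32, shape=[None, 1])
        w = tf.Variable(tf.zeros([1, 1]))
        b = tf.Variable(tf.zeros([1]))
        loss = tf.reduce_mean(tf.square(tf.matmul(x, w) + b - y))
        train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)
        init_op = tf.global_variables_initializer()

    # the session connects to the local server, which talks to the cluster
    with tf.Session(server.target) as sess:
        sess.run(init_op)
        for step in range(1000):
            _, current_loss = sess.run([train_op, loss],
                                       feed_dict={x: x_train, y: y_train})
        print('final loss:', current_loss)

In a real job you would typically let only one chief worker run the variable initialization, for example through tf.train.MonitoredTrainingSession; the plain tf.Session is kept here to stay close to the skeleton shown in the recipe.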