As mentioned before, in data parallelism each worker grabs a batch of data from the training set and computes its own gradient; we then need some way to synchronize these gradients before updating the model, so that every worker keeps an identical copy of the model.
In synchronous SGD, every worker computes its gradient and then waits until all gradients are available; the gradients are aggregated, the model is updated, and the new parameters are distributed back to all workers:
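This compute-wait-aggregate-broadcast cycle can be sketched in a single process. The toy linear model, data shards, and learning rate below are illustrative assumptions, not from the text; a real setup would use a distributed framework rather than a plain loop:

```python
# Minimal single-process simulation of synchronous SGD (illustrative only).
# Model: fit y = w * x by minimizing mean squared error.

def gradient(w, shard):
    # dL/dw for L = mean((w*x - y)^2) over one worker's data shard
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def synchronous_sgd_step(w, shards, lr):
    # 1. Every worker computes a gradient on its own shard.
    grads = [gradient(w, shard) for shard in shards]
    # 2. Barrier: all gradients must be present, then they are averaged.
    avg_grad = sum(grads) / len(grads)
    # 3. The model is updated once; the same new weight goes to all workers.
    return w - lr * avg_grad

# Toy data generated from y = 3x, split across two workers.
data = [(x, 3.0 * x) for x in range(1, 9)]
shards = [data[:4], data[4:]]

w = 0.0
for _ in range(200):
    w = synchronous_sgd_step(w, shards, lr=0.01)
print(round(w, 3))  # converges toward 3.0
```

Because every worker starts each step from the same weights and applies the same averaged gradient, the copies of the model never diverge; the cost is that the whole group waits for the slowest worker at the barrier.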