Create another input layer for spatially replicated and compressed embeddings:
input_layer2 = Input(shape=(4, 4, 128))
Concatenate the output of the downsampling blocks with the spatially compressed embeddings along the channel axis (the default, axis=-1, for concatenate):
merged_input = concatenate([added_x, input_layer2])
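For the concatenation to succeed, both tensors need the same spatial dimensions (4 x 4 here); only the channel dimension may differ, and the channel counts add up. The following is a minimal pure-Python sketch of that shape rule, not part of the model code. The helper concat_shape is hypothetical, and the shape (4, 4, 512) for added_x is an assumption made for illustration, since this section does not state it.

```python
def concat_shape(shape_a, shape_b, axis=-1):
    """Return the shape that results from concatenating two tensors along `axis`.

    Hypothetical helper for illustration; shapes exclude the batch dimension.
    """
    if len(shape_a) != len(shape_b):
        raise ValueError("tensors must have the same number of dimensions")
    axis = axis % len(shape_a)  # normalise a negative axis such as -1
    for i, (a, b) in enumerate(zip(shape_a, shape_b)):
        # Every dimension except the concatenation axis must match
        if i != axis and a != b:
            raise ValueError(f"dimension {i} differs: {a} vs {b}")
    merged = list(shape_a)
    merged[axis] = shape_a[axis] + shape_b[axis]
    return tuple(merged)


added_x_shape = (4, 4, 512)    # ASSUMED output shape of the downsampling blocks
embedding_shape = (4, 4, 128)  # shape of input_layer2, as given above

merged_shape = concat_shape(added_x_shape, embedding_shape)
print(merged_shape)  # (4, 4, 640)
```

If the downsampling output had a different spatial size, say 8 x 8, the check would raise an error, which mirrors the shape mismatch that concatenate would report when building the model.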