The sole purpose of the text encoder network is to convert a text description (t) into a text embedding (φ_t). The network encodes each sentence as a 1,024-dimensional embedding vector. We have already downloaded the pretrained char-CNN-RNN text embeddings, and we will use them to train our network.
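As a rough illustration of how the downloaded embeddings might be read, the sketch below loads a pickle file and checks that the last dimension is 1,024. The function name `load_embeddings`, the file path passed to it, and the assumption that the file holds a single array-like object whose last axis is the embedding dimension are all hypothetical; adjust them to match the files you actually downloaded.

```python
import pickle

import numpy as np

EMBEDDING_DIM = 1024  # embedding size stated in the text


def load_embeddings(pickle_path):
    """Load precomputed text embeddings from a pickle file.

    Assumption: the file contains one array-like object whose last
    axis has length EMBEDDING_DIM. The path is supplied by the caller.
    """
    with open(pickle_path, "rb") as f:
        # encoding="latin1" allows Python 3 to read pickles written by Python 2
        embeddings = pickle.load(f, encoding="latin1")

    embeddings = np.asarray(embeddings, dtype=np.float32)

    # Fail early if the file does not hold 1,024-dimensional embeddings
    if embeddings.shape[-1] != EMBEDDING_DIM:
        raise ValueError(
            "expected last dimension %d, got %d"
            % (EMBEDDING_DIM, embeddings.shape[-1])
        )
    return embeddings
```

Calling `load_embeddings` with the path to the downloaded file returns a NumPy array that can be inspected with `.shape` before it is passed to the rest of the training code.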