The sole purpose of the text encoder network is to convert a text description (t) into a text embedding (φ_t). In this chapter, we won't train the text encoder network; instead, we will work with pre-trained text embeddings. Follow the steps given in the Data preparation section to download them. If you want to train your own text encoder, refer to the paper Learning Deep Representations of Fine-Grained Visual Descriptions, which is available at https://arxiv.org/pdf/1605.05395.pdf. The text encoder network encodes a sentence into a 1,024-dimensional text embedding, and the same network is shared by both stages.
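
Because we only consume pre-trained embeddings, the practical task is to load them and confirm that each one has 1,024 dimensions. The following is a minimal sketch, not the chapter's own loader. It assumes the embeddings are stored as a pickled array whose last axis is the embedding dimension. The function name `load_text_embeddings` and the constant `EMBEDDING_DIM` are hypothetical names introduced for illustration, and the actual file layout depends on the download described in the Data preparation section.

```python
import pickle

import numpy as np

# Dimensionality of a single text embedding, as stated in the text.
EMBEDDING_DIM = 1024


def load_text_embeddings(pickle_path):
    """Load pre-trained text embeddings from a pickle file.

    Assumption: the file holds an array-like object whose last axis
    has length EMBEDDING_DIM. Adjust this to the real file layout.
    """
    with open(pickle_path, "rb") as f:
        # encoding="latin1" lets Python 3 read pickles written by Python 2.
        embeddings = pickle.load(f, encoding="latin1")

    embeddings = np.asarray(embeddings, dtype=np.float32)

    # Fail early if the file does not contain 1,024-dimensional vectors.
    if embeddings.ndim == 0 or embeddings.shape[-1] != EMBEDDING_DIM:
        raise ValueError(
            "Expected embeddings with last dimension "
            f"{EMBEDDING_DIM}, got shape {embeddings.shape}"
        )
    return embeddings
```

Since the text encoder is common to both stages, the array returned here could be loaded once and passed to each stage's training code.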