Obtaining a good deep neural network for most modern datasets is a difficult task. Optimizers need to traverse extraordinarily high-dimensional, jagged loss landscapes and differentiate between good and mediocre solutions using a limited set of tools. Hence, even for relatively well-studied problem types, you will seldom see modern deep learning designs train the neural network on the data right after initialization – it’s simply too difficult to get good results feasibly by training directly from scratch.

Part of the analytical creativity embedded within deep learning design, then, is a need to turn this difficult task – modeling a complex phenomenon with a deep neural network – into a more approachable and efficient process.

In this chapter, we’ll discuss pretraining strategies and transfer learning: in essence, creatively stealing from knowledge contained in the dataset and in the weights of other models to solve problems.

Developing Creative Training Structures

A model’s training structure consists of the general “flow” with which it is trained – the dataset(s) it is trained on, the order of the entities it is trained on, how those datasets are derived, etc. Here, we are assuming the standard supervised mode of learning, in which the model is presented with a dataset consisting of input data (the x – what the model takes in) and the labels (the y – corresponding to what the model should output) (Figure 2-1). The model’s ultimate task is to take in input data and to output the correct label; this dataset is the task dataset. Although we may train the model on other datasets and to perform other tasks, those auxiliary tasks are performed with the intent of helping the model succeed at the ultimate task. This distinction in terminology will prove to be helpful later.
Figure 2-1

Task dataset structure

The simplest training structure is the initialize-train-complete flow that was discussed in Chapter 1 (Figure 2-2). The model begins training right after initialization, being fed the input data and predicting the labels. After training, we select a final model: either the model as it stands when training completes or the best-performing model with its weights restored (for instance, via Keras’ ModelCheckpoint callback).
Figure 2-2

Simple training structure of initialization and training

Let’s see how we can alter this simple training structure to improve training performance more easily.

The Power of Pretraining

The “pre” in “pretraining” indicates that it refers to a process that occurs before the formal training of the model on the dataset (Figure 2-3).
Figure 2-3

Training structure representation of pretraining

Given that we want pretraining to benefit training, pretraining should orient the model in a way that makes training “easier.” With the pretraining step, the model’s performance should be better than without pretraining. If pretraining makes model performance worse, you should reevaluate your methods of pretraining.

In pretraining, a model must be trained to perform a task different from its ultimate task of predicting the label based on the input data. This pretraining task ideally presents important “context” and skills for the ultimate task such that the model can attain better performance on that ultimate task.

Before we discuss the specific pretraining methods, we need to gain an intuitive understanding of what the vague terms “easier” and “better performance” mean; this will allow us to better understand how to use and design pretraining strategies. Pretraining is often discussed in relation to two of its primary advantages:
  • Time: Pretraining can decrease the time needed for the neural network to converge to some solution.

  • Better metric score: The model attains a metric score higher/lower (depending on the metric) than it would have without pretraining. For instance, it attains a lower cross-entropy score or a higher accuracy.

While these are two dominant attributes of pretraining, they are the result of an underlying phenomenon: conceptually, the process of pretraining brings the optimizer “closer” to the true solution in the loss landscape.

Consider the following loss landscape: it is the objective of the neural network optimizer to find the set of parameters such that the corresponding loss is minimized. This landscape has several features common in most modern loss landscapes: it has several local minima but only one global minimum and sloping, jagged fluctuations (Figure 2-4).
Figure 2-4

Sample loss landscape with global minimum marked

Consider the path that a model without pretraining may follow after initialization. Since the first minimum it encounters is more or less shallow, say the optimization algorithm overcomes it and discovers the next minimum. Since this minimum resides in a deeper pit and the optimizer has travelled a long distance, it will likely judge this to be the global optimum and converge at that location (Figure 2-5).
Figure 2-5

Example optimizer movement without pretraining

With pretraining, however, we’re able to get the model “closer” to the true global minimum of the loss landscape such that it converges to a more optimal solution faster, because it already begins from a place “close” to the global optimum. Correspondingly, we can use a less risky or erratically behaving optimizer because it begins from a convenient position (Figure 2-6).

Keep in mind that pretraining does not necessarily bring the model “closer” to the optimum in the literal sense of distance but rather that it becomes more convenient or “easier” for the model to find the optimum than if it hadn’t undergone pretraining. An optimizer very close to the global optimum but with several very deep local minima and very high local maxima would find it more difficult to arrive at the global optimum than an optimizer that was farther but whose path was smoother and easier to descend. What “closer” exactly means is dependent on the shape of the loss landscape, the optimizer’s behavior, and a host of other factors. For simplicity, though, thinking of “closer” in terms of “how easy it would be to navigate to the global optimum” suffices.
Figure 2-6

Example optimizer movement with pretraining

Note that the actual loss landscape for the pretraining task is different from that of the loss landscape of the ultimate task (as is displayed in Figure 2-6). It is not that pretraining operates within the loss landscape of the ultimate task and moves the model to a convenient position – that is just training on the task dataset, not pretraining.

Rather, in pretraining, we rely upon a similarity between the loss landscapes of the pretraining task and the ultimate task the model aims to perform. Therefore, a model that succeeds at the pretraining task should be at a generally successful location in the loss landscape for the ultimate task, from which it can further improve. This also means that pretraining may not be helpful if you believe the loss landscapes of the pretraining dataset and the task dataset are very “different.” Because the two often cannot be compared quantitatively for technical reasons, it is up to you to decide the necessity and performance boost of pretraining in a particular context.

This visual depiction is intended to be more a conceptual method of understanding what pretraining intends to do to aid your designing and usage of pretraining strategies. It is, of course, not entirely representative of the high-dimensional spaces that modern neural network optimizers operate in or the technicalities of transfer learning loss landscape shapes and manipulations. For instance, it’s unlikely that pretraining will get the model to a point where the optimizer encounters no local minima, as presented in Figure 2-6. Nevertheless, we can use this conceptual framework to justify and explicate the two key observed benefits of pretraining – speed and better metric scores.
  • Speed: Pretraining brings the model “closer” to a solution that the optimizer would be “satisfied with” (could converge to), because it’s already done much of the work. Moreover, most modern optimization strategies involve a decreasing learning rate. A neural network that has not had the benefit of pretraining needs to travel more (perhaps in distance, perhaps in overcoming obstacles like local minima or maxima) to near the true solution. By then, its learning rate can be expected to have decayed significantly, and it may not be able to get out of local minima. On the other hand, if a pretrained model begins already near the true solution, its learning rate begins “fresh” and undecayed; it can quickly overcome local minima that lie between it and the true solution.

  • Better metric scores: Pretraining reduces the quantity of obstacles between a model and the true solution. Thus, the model is more likely to converge to the true solution. Moreover, as discussed prior, the pretrained optimizer’s learning rate is more “fresh” near the true solution than that of an optimizer without pretraining, and thus it is less vulnerable to minor obstacles that may have trapped the optimizer without pretraining.
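The effect of the starting position can be sketched on a toy one-dimensional loss. This is only an illustration under stated assumptions: the double-well function, the two starting points, and the learning rate schedule are hypothetical choices, and the "pretrained" start is simply placed by hand on the side of the landscape that leads to the global minimum.

```python
def loss(x):
    # Toy 1-D loss with a shallow local minimum near x = 0.96
    # and the global minimum near x = -1.04 (hypothetical landscape).
    return (x ** 2 - 1) ** 2 + 0.3 * x


def grad(x):
    # Derivative of the toy loss above.
    return 4 * x * (x ** 2 - 1) + 0.3


def descend(x0, lr0=0.01, decay=0.01, steps=2000):
    """Plain gradient descent with a decaying learning rate."""
    x = x0
    for t in range(steps):
        lr = lr0 / (1 + decay * t)  # the learning rate shrinks as training goes on
        x -= lr * grad(x)
    return x


scratch = descend(2.0)      # far-away start, standing in for "no pretraining"
pretrained = descend(-0.5)  # start assumed to have been placed by pretraining

print(f"from scratch: x = {scratch:.3f}, loss = {loss(scratch):.3f}")
print(f"pretrained:   x = {pretrained:.3f}, loss = {loss(pretrained):.3f}")
```

With the same optimizer and schedule, the run that starts on the far side settles into the local minimum, while the run that starts in a convenient position reaches the deeper one. Real loss landscapes are high-dimensional, so this is a picture of the intuition rather than of actual training.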

Next, we’ll build upon this conceptual model to discuss the intuition behind two pretraining methods: transfer learning and self-supervised learning.

Transfer Learning Intuition

Transfer learning is premised upon the idea that knowledge gained from solving one problem can be used, or transferred, in solving another problem. Usually, the knowledge derived from a more general problem is used to aid a model’s ability to address a more specific problem.

Consider two datasets: a general dataset and a task dataset. The general dataset is designated for pretraining, whereas it is the model’s ultimate purpose to perform well on the task dataset (Figure 2-7).
Figure 2-7

General dataset and task dataset visual representations. GD = general dataset, TD = task dataset

In transfer learning, an initialized model is first trained on the general dataset to provide context and skills required to succeed on the task dataset. After the model is pretrained on the general dataset, it is then trained on the task dataset. This training structure is visualized in Figure 2-8. Because the weights are retained from the end of pretraining to the beginning of training, the model has already acquired important skills and representations from pretraining that it can use when training on the task dataset.
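The retained-weights idea can be sketched with a toy linear model in NumPy. Everything here is an assumption made for illustration: the "general" and "task" relationships (general_w, task_w), the dataset sizes, and the step counts are hypothetical, and a linear model stands in for a deep network.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_dataset(n, true_w):
    """Random inputs with labels generated by a known linear relationship."""
    x = rng.normal(size=(n, true_w.size))
    return x, x @ true_w


def mse(w, x, y):
    return float(np.mean((x @ w - y) ** 2))


def train(w, x, y, steps, lr=0.1):
    """Gradient descent on mean squared error, starting from the weights w."""
    w = w.copy()
    for _ in range(steps):
        w -= lr * 2 * x.T @ (x @ w - y) / len(x)
    return w


general_w = np.array([2.0, -1.0, 0.5, 3.0, -2.0])  # hypothetical general relationship
task_w = general_w + 0.1                            # task assumed to be similar to it

x_gd, y_gd = make_dataset(500, general_w)  # general dataset (GD)
x_td, y_td = make_dataset(50, task_w)      # task dataset (TD)

init = np.zeros(5)
pretrained = train(init, x_gd, y_gd, steps=200)    # pretraining on GD
transfer = train(pretrained, x_td, y_td, steps=5)  # weights retained, then trained on TD
scratch = train(init, x_td, y_td, steps=5)         # same budget, no pretraining

print("task loss with transfer:", mse(transfer, x_td, y_td))
print("task loss from scratch: ", mse(scratch, x_td, y_td))
```

The only difference between the last two runs is the starting point: one begins from the weights left behind by pretraining and the other from the fresh initialization.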
Figure 2-8

Training structure visual representation of transfer learning

Usually, the pretraining component of transfer learning is not done in house (by yourself). Selecting an appropriate general dataset and conceptualizing, implementing, and training a model to perform well on that general dataset is a significant amount of work, and it’s not necessary. Repositories of pretrained models are available in most deep learning frameworks (Figure 2-9), like Keras/TensorFlow and PyTorch, and from other sources, like PyPI libraries, online code forums, and hosting sites. Each of these pretrained models is trained on some general dataset and is available for you to use.
Figure 2-9

Pretrained models in a model repository. GD = general dataset. GD1, GD2, …, GDX indicate several different general datasets upon which several pretrained models are correspondingly trained. Note that in practice the relationship between general datasets and pretrained models is one to many (many pretrained models with different architectures are trained on the same general dataset)

You can then choose one of the already pretrained models and begin training it on your specific task dataset (Figure 2-10).
Figure 2-10

Using models from a pretrained model repository to directly begin training on the task dataset

The number of problems with which this set of pretrained models could be used vastly outnumbers the number of pretrained models in the repository. This is fine, though, because we can expect each pretrained model to be capable of being applied to a wide array of problem types, since each model is expected to possess a form of “general knowledge.”

For instance, let’s say that we want to train a model to classify images of dogs and cats (Figure 2-11). We decide we want to use transfer learning and choose a general dataset of all sorts of real-life items and objects, like a parrot, an airplane, a car, a tomato, a fish, etc. For the purposes of this example, let’s say that images of dogs and cats are not present in the general dataset.
Figure 2-11

Example task dataset and general dataset

There may seem to be little connection between the general dataset and the task dataset. However, if we distance ourselves from instinctive human topic-based thinking – “this is a car, that is a dog; they belong to different categories (i.e., transportation and animals)” – we can see many similarities.
  • There are many edges: Edges define the shapes of the airplane, car, and fish; likewise, they are important in defining the shape of a dog or cat’s head.

  • Light is important: Since both datasets are two-dimensional representations of three-dimensional objects, light is important because it helps determine the shape and contour of the object.

  • Texture matters: Real-life objects often have similar shapes but are differentiated by the texture of their surface. Likewise, this seems to be important toward differentiating between images of dogs and cats.

A model that succeeds on the general dataset must have already adapted to and accounted for these features of the general dataset. In a sense, it “understands” how to “interpret” edges, the dynamics of light, texture, and other important qualities. All that is necessary afterward is to adapt those learned skills and representations toward a similar but more specific task.

You may ask, “why not approach the task dataset directly? Why develop these auxiliary skills when the model could have developed those skills more specifically and directly for the task dataset?” Indeed, in some cases transfer learning is not a suitable choice. However, in most sufficiently difficult tasks, pretraining via transfer learning helps to set up the foundation for learning that would be difficult for the model to build by itself.

Imagine you are teaching a young child how to add two-digit numbers only through repeated examples. You have two general possible teaching strategies:
  • Teach the child directly to add two-digit numbers by repeatedly showing them examples (e.g., 23+49=72).

  • First, teach the child how to add one-digit numbers by repeatedly showing them examples (e.g., 3+9=12, 2+6=8). Then, teach the child to add two-digit numbers by repeatedly showing them examples (e.g., 23+49=72).

The latter is more likely to be a successful strategy, because the average length of the jump from knowing nothing about addition to adding one-digit numbers to adding two-digit numbers is much smaller than the average length of the jump from knowing nothing about addition to adding two-digit numbers. Transfer learning should operate by this same intuition of teaching the model a general task such that a more specific, ultimate task becomes easier to approach.

Self-Supervised Learning Intuition

Self-supervised learning follows the logic of pretraining – some sort of pretraining task is performed before training the model on the task dataset to orient it toward better attaining representations and skills needed to perform well on the ultimate task.

The difference between self-supervised learning and transfer learning is that in transfer learning, the pretraining dataset is different from the task dataset, whereas in self-supervised learning, the pretraining dataset is constructed from the input data of the task dataset. Thus, while you would need two datasets to build the complete transfer learning training structure, only one is technically needed to build the complete self-supervised learning training structure (Figure 2-12).


Here, we are using a unique definition of what a “different” dataset constitutes. If dataset A can be derived completely from dataset B (for instance, by flipping the images or changing image color), for the purposes of this concept, the two datasets are not different, even though on a technical level there are training instances in one dataset that cannot be found in the other. On the other hand, if dataset A cannot be derived completely from dataset B (for instance, deriving ImageNet from CelebA), the two datasets are different. The main focus of what constitutes “difference” here is the informational content of the dataset, not the individual specific training instances. This allows us to distinguish transfer learning and self-supervised learning.

Figure 2-12

Self-supervised learning datasets. ATD = altered task dataset

It’s important to note that the altered task dataset is generally derived only from the input data of the task dataset. For instance, if a model’s task is to identify whether an image is of a dog or a cat, the altered task dataset could only be built upon the images (the input data of the task dataset), not the labels (the labels of the task dataset).

Designing altered task datasets is a fascinating demonstration of the creativity embedded within successful deep learning solutions. There are many examples of ways you can construct labels from the task dataset to form an altered task dataset:
  • Add noise to some of your data and none to others. Train the model to classify to which data instances noise was added (this is a binary classification problem: noise/no noise). This can help the model better separate noise and data, allowing it to develop an underlying representation of what key features of the data should look like.

  • Add varying degrees of noise to all instances in the dataset. Train the model to predict the degree of noise (this is a regression problem). For instance, if you varied the standard deviation of Gaussian noise, the model would predict the standard deviation. This helps the model detect not only whether there is noise but also to what extent it exists. While this is a more difficult pretraining task, the model would be encouraged to develop sophisticated representations of key features and structures of the data.

  • Assume there is a dataset of colored images and color is important to the model’s ultimate task. Convert the images to grayscale and train the model to construct a colorized image from the corresponding grayscale one (this is an image-to-image task). In this case, the altered task dataset input data is the grayscale image and the labels are the colorized images. With this self-supervised pretraining exercise, the model gains an understanding of what color certain objects should be.

  • Assume there is a model that needs to perform the NLP (Natural Language Processing) task of text classification. Take the text samples (input data for the task dataset) and randomly hide one of the words. Train the model to predict what the hidden word is. For instance, the model would predict “cats” when given “it’s raining ____ and dogs.” This allows the model to gain an understanding of how words function in relation to each other bidirectionally – it needs to take advantage of information both before and after the hidden word. This method of self-supervised learning is commonly used in modern NLP architectures.

In each of these examples, none of the task dataset’s labels are needed – only the input data, or the x, of the task dataset is used in constructing these supervised datasets. You, as the deep learning engineer, can make some change to the task dataset input data and construct labels from that change.
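The four constructions above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions: the function names, the noise levels, the channel-average grayscale conversion, and the random stand-in images are all hypothetical choices made for illustration, not a prescribed recipe.

```python
import numpy as np


def make_noise_detection(x, rng, noise_std=0.1):
    """Binary task: was Gaussian noise added to this instance (1) or not (0)?"""
    labels = rng.integers(0, 2, size=len(x))
    noise = rng.normal(0.0, noise_std, size=x.shape)
    mask = labels.reshape(-1, *([1] * (x.ndim - 1)))  # broadcast over image axes
    return x + mask * noise, labels


def make_noise_regression(x, rng, max_std=0.5):
    """Regression task: predict the standard deviation of the added noise."""
    stds = rng.uniform(0.0, max_std, size=len(x))
    scale = stds.reshape(-1, *([1] * (x.ndim - 1)))
    return x + scale * rng.normal(size=x.shape), stds


def make_colorization(x_rgb):
    """Image-to-image task: grayscale input, original color image as the label."""
    gray = x_rgb.mean(axis=-1, keepdims=True)  # simple channel average
    return gray, x_rgb


def make_masked_word(sentences, rng, mask_token="____"):
    """Hide one random word per sentence; the hidden word is the label."""
    inputs, targets = [], []
    for sentence in sentences:
        words = sentence.split()
        idx = int(rng.integers(len(words)))
        targets.append(words[idx])
        words[idx] = mask_token
        inputs.append(" ".join(words))
    return inputs, targets


rng = np.random.default_rng(0)
images = rng.random((6, 4, 4, 3))  # stand-in for the task dataset's input images

x_det, y_det = make_noise_detection(images, rng)
x_reg, y_reg = make_noise_regression(images, rng)
x_gray, y_color = make_colorization(images)
x_text, y_text = make_masked_word(["it's raining cats and dogs"], rng)
```

Note that none of these functions receives the task dataset's labels; each one builds its own labels from a change made to the input data.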

Once the altered task dataset has been derived from the input data of the task dataset, it can be used in the pretraining training structure (Figure 2-13).
Figure 2-13

Self-supervised learning training structure

Note that in self-supervised learning (and pretraining in general), attaining a high performance metric is not the ultimate goal. In fact, if a model performs too well on the pretraining task, it may be that the pretraining task was too easy and hence did not support the growth of valuable skills and representations of data for formal training. On the other hand, if the model performs very poorly, the pretraining task may be too difficult.

It should be noted that self-supervised learning is technically unsupervised learning because we are extracting insights from the input data without any knowledge of the labels. However, self-supervised learning is a more commonly used and appropriate term: there is undoubtedly a supervised character to this sort of operation that distinguishes it from traditional unsupervised machine learning algorithms like K-means. While the data generation (constructing the altered task dataset) is technically unsupervised, the learning procedure (gradient updates, etc.) is supervised.

Self-supervised learning is also valuable because it allows us to capture valuable information without needing labels. Datamation estimates that the amount of unstructured data is increasing by 55% to 65% every year,² and the International Data Corporation projects that 80% of data will be unstructured by 2025.³ Unstructured data is defined as data that cannot be easily fit into a standard database model. An auxiliary characteristic of unstructured data, thus, is that there are seldom corresponding labels to train massive deep learning models with. However, the logic of self-supervised learning allows us to find hidden structures in unstructured data without labels and to exploit that knowledge to solve supervised problems.

For this reason, self-supervised learning is especially valuable for training on small datasets, in which there are few labels. Pretraining via self-supervised learning can allow the model to derive important insights it may not have obtained with only the traditional training structure.

Unlike transfer learning, there is no repository of self-supervised pretrained models, because in self-supervised learning, the pretraining task must be designed based on the task dataset. However, altering task datasets for self-supervised learning is generally not difficult if you have a good grasp of programming and data flows (see Chapter 1).

Yann LeCun provided an analog for self-supervised learning in how humans and animals learn at a 2020 AAAI (Association for the Advancement of Artificial Intelligence) conference:

Self-supervised learning … is basically learning to fill in the blanks. Basically, it’s the idea of learning to represent the world before learning a task. This is what babies and animals do. We run about the world, we learn how it works before we learn any task. Once we have good representations of the world, learning a task requires few trials and few samples.

– Yann LeCun, Chief AI Scientist at Facebook, speaking at AAAI 20⁴

Transfer learning and self-supervised learning both fall under the general strategy of pretraining and are similar in that the model engages in some supervised task before it is trained on the task dataset. However, the two pretraining methods are different not only in terms of technicalities (i.e., “different” datasets vs. “same” datasets) as discussed prior but also as a matter of outcome and what knowledge is developed.
  • Transfer learning tends to develop “prediction skills.” That is, the weights from transfer learning are derived from the general dataset, which usually comes from a significantly different context than the task dataset (topic/content-wise). Much of the value of transfer learning is in the predictive skills it develops – learning how to recognize edges, to look for texture, to process color, etc. rather than actual knowledge of the content or topic of the dataset.

  • Self-supervised learning tends to develop “world-representing knowledge.” It builds the fundamental representations of the “world,” or context, that the model will need to understand to perform well on the task dataset. The actual skill of predicting whether – for instance – noise was or was not added to a training instance may not be of much use, but the process of deriving that skill requires gaining an understanding of the data “world.” Self-supervised learning allows for the model to get a glimpse into building fundamental representations and “feeling around” for the dataset’s content and topics.

Of course, this is not to suggest that transfer learning does not develop world-representing knowledge or that self-supervised learning does not develop important skills. Rather, these are the root-level “spirits” or “characters” of transfer learning and self-supervised learning that can be used to guide intuition and what to expect from either method.

You may notice that we have been exploring these concepts through relatively fluid and artistic descriptors and ideas – “extracting insights,” “developing representations,” “world-representing.” While a textbook may attempt to formulate these ideas in equations and mathematical relationships, the truth is that even the most modern deep learning knowledge cannot fully explain and understand the depth of neural network behavior. Having an intuitive grasp of how neural networks function and “learn” – even if it is not mathematically rigorous – is valuable toward successful and efficient design.

Next, we will explore practical concepts in pretraining to understand how to manipulate neural network architectures for the implementation of transfer learning and self-supervised learning.

Transfer Learning Practical Theory

Previously, we discussed conceptual frameworks to gain an intuitive understanding for what transfer learning is, what it does, and how it operates. However, because there are many “hidden” considerations when implementing transfer learning, in this section we will explore the more practical aspect of theory – concepts and ideas to aid your implementation of transfer learning.

The purpose of this section is not to provide specific code or discuss examples (that will be left to the next section), but instead to provide an introduction to important concepts for implementing pretraining, like the structure of pretrained models and how pretrained models are organized in Keras.

Transfer Learning Models and Model Structure

There are several important pretrained models and datasets to know. In this section, we will primarily be exploring image-based models, although you can apply much of the logic to other contexts. Image-based pretrained models generally follow a standard structure, consisting of two key steps: feature extraction and feature interpretation (Figure 2-14).
Figure 2-14

Conceptual components of an image-based pretrained model

Feature extraction serves the purpose of assembling information into meaningful observations, like identifying and amplifying (or reducing) the presence of certain edges, shapes, or color patterns. A well-tuned feature extraction component is able to identify and amplify characteristics that are relevant to the problem. For instance, if the problem is to classify various grayscale images of shapes, the feature extraction component must be highly capable at detecting and amplifying the shape of edges.

Feature interpretation interprets the compiled extracted features to make a final judgment about the image. It takes in extracted information regarding the features of the image and can perform comparisons and other complex analyses across various regions of the image. A well-tuned feature interpretation component is able to effectively aggregate and make sense of extracted features in relation to the target output.

In practice, the roles of “feature extraction” and “feature interpretation” are played by convolutional and fully connected components (Figure 2-15). The convolutional component generally consists of layers like convolutions, pooling, and other image-based feature extraction layers. The fully connected component generally consists of fully connected or dense (in Keras terminology) layers.
Figure 2-15

The practical components of an image-based pretrained model

  1. The raw input is passed into the convolutional component for feature extraction.

  2. The convolutional component of the pretrained model takes in the raw input and performs convolutions, pooling, and other image-processing operations to extract features from the image. It then passes the extracted features into the flattening component.

  3. The flattening component takes in two-dimensional extracted features (two-dimensional spatially, not counting the channel depth), since the convolutional component performs operations that take in and output data in image form. Flattening converts these two-dimensional features into one-dimensional data so that the fully connected component can operate upon them.

  4. The fully connected component, also known as the “top” or the “head” of the neural network, takes in the one-dimensional flattened data containing the information from the extracted features. After interpreting the extracted features through a series of fully connected layers (and other corresponding layers, like activations, dropout, etc.), the fully connected component produces the output of the neural network.
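The four steps can be traced with a shape-by-shape sketch in NumPy. This is an illustration only: the single-channel 8×8 input, the hand-written edge filter, the naive convolution and pooling helpers, and the random head weights are all assumptions, and a real pretrained model has many learned filters and layers.

```python
import numpy as np

rng = np.random.default_rng(0)


def conv2d_valid(image, kernel):
    """Naive single-channel convolution without padding (cross-correlation)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


def max_pool2x2(fmap):
    """Keep the largest value in each non-overlapping 2x2 window."""
    h, w = fmap.shape[0] // 2 * 2, fmap.shape[1] // 2 * 2
    return fmap[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))


# 1. Raw input: a single-channel 8x8 "image".
image = rng.random((8, 8))

# 2. Convolutional component: convolution, ReLU activation, then pooling.
kernel = np.array([[1.0, 0.0, -1.0]] * 3)  # hand-written vertical-edge filter
features = max_pool2x2(np.maximum(conv2d_valid(image, kernel), 0.0))

# 3. Flattening: two-dimensional feature map -> one-dimensional vector.
flat = features.reshape(-1)

# 4. Fully connected component ("top"/"head"): one dense layer plus softmax.
weights = rng.normal(size=(flat.size, 2))  # random stand-in for learned weights
logits = flat @ weights
probs = np.exp(logits - logits.max())
probs /= probs.sum()

print(image.shape, features.shape, flat.shape, probs.shape)
```

The 8×8 input becomes a 6×6 map after the 3×3 convolution, a 3×3 map after pooling, a length-9 vector after flattening, and two class probabilities at the output.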


We’ll talk more about certain architectures in depth in later chapters, but here we’ll do a very quick overview of important models.

In Keras, pretrained models are organized into modules: each model and its related functions can be found within keras.applications.module_name. For instance, if you wanted to find the pretrained model and processing functions relating to the InceptionV3 model, you would find them in keras.applications.inception_v3. In addition to the pretrained model object, these modules often contain processing functions to apply to your data or the model output. We will see how these processing functions work in relation to the model soon.

The ImageNet Dataset

ImageNet is one of the most important datasets in image recognition. Professor Fei-Fei Li began working on ImageNet at Princeton in early 2007. Throughout the development of neural network applications in the image domain, ImageNet has been a core dataset upon which new methods and designs were conceived, tested, and reinvented.

The ImageNet dataset is structurally organized according to the WordNet hierarchy, in which objects are arranged in hierarchical relation to each other (Figure 2-16).
Figure 2-16

Hypothetical example branch of WordNet hierarchy organization

Each “meaningful concept” in WordNet is referred to as a “synonym set” or a “synset.” Each “synset” can be described by multiple words or series of words. WordNet contains over 100,000 synsets, and ImageNet aims to provide an average of 1000 images for each synset it covers. Across roughly 22,000 synsets, ImageNet contains over 14 million images.
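A hierarchical organization of this kind can be sketched as a nested dictionary. The branch below is hypothetical, in the same spirit as Figure 2-16: the entries are illustrative and are not copied from WordNet or ImageNet, and the helper names are assumptions.

```python
# Hypothetical branch; each key is a concept and its value holds its children.
hierarchy = {
    "animal": {
        "mammal": {"dog": {}, "cat": {}},
        "bird": {"parrot": {}},
    }
}


def leaves(tree):
    """Return the most specific concepts (nodes without children)."""
    found = []
    for name, children in tree.items():
        found.extend(leaves(children) if children else [name])
    return found


def path_to(tree, target, prefix=()):
    """Return the chain of ancestors leading to `target`, or None if absent."""
    for name, children in tree.items():
        chain = prefix + (name,)
        if name == target:
            return chain
        deeper = path_to(children, target, chain)
        if deeper:
            return deeper
    return None


print(leaves(hierarchy))
print(path_to(hierarchy, "cat"))
```

Walking from the root to a concept gives its increasingly specific ancestors, which is the sense in which objects are arranged in hierarchical relation to each other.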

The ImageNet project was inspired by two important needs in computer vision research. The first was the need to establish a clear North Star problem in computer vision… Second, there was a critical need for more data to enable more generalizable machine learning methods. Ever since the birth of the digital era and the availability of web-scale data exchanges, researchers in these fields have been working hard to design more and more sophisticated algorithms to index, retrieve, organize and annotate multimedia data. But good research requires good resources…The convergence of these two intellectual reasons motivated us to build ImageNet.

– ImageNet Research Team⁵

The ImageNet Large Scale Visual Recognition Challenge ran from 2010 to 2017 and was instrumental to the development of image-processing deep learning work. Hence, many of the pretrained models available in Keras/TensorFlow and other deep learning frameworks are pretrained on the ImageNet dataset. Models pretrained on the ImageNet dataset have developed skills to recognize edges, textures, and other attributes of images of three-dimensional spaces and objects.


The ResNet architectures, or residual neural networks, were introduced in 2015. ResNet won first place in the 2015 ImageNet challenge, obtaining a 3.57% top-5 error on the test dataset. ResNet uses residual connections, or skip connections, which are connections between layers that skip over at least one intermediate layer. We’ll explore these sorts of designs and strategies in neural network architectures in later chapters.
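As a rough illustration of what a residual (skip) connection does, the sketch below adds a block's input back onto the block's output. The `transform` argument and the `halve` function are hypothetical stand-ins for the layers being skipped. This is not ResNet's actual block, only the arithmetic of the skip path.

```python
def residual_block(x, transform):
    """Return transform(x) + x element-wise; the skip path carries x around the transform."""
    fx = transform(x)
    if len(fx) != len(x):
        raise ValueError("the skip connection needs matching shapes")
    return [f + xi for f, xi in zip(fx, x)]


def halve(values):
    # Hypothetical stand-in for the intermediate layers that the connection skips over.
    return [0.5 * v for v in values]


features = [2.0, -4.0, 6.0]
output = residual_block(features, halve)
print(output)
```

If the transform outputs zeros, the block simply passes its input through unchanged, which is part of why very deep residual networks remain trainable.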

Keras/TensorFlow offers several different versions of the ResNet architecture: ResNet50, ResNet101, ResNet152, ResNet50V2, ResNet101V2, and ResNet152V2. The number after the “ResNet” (e.g., the “50” in “ResNet50”) indicates the number of layers in it. Thus, ResNet152V2 has more layers than ResNet101V2.

ResNet also comes in two versions, with two (very general) key differences:
  • ResNetV1 adds more nonlinearities than ResNetV2. The V2 architecture thus allows for a more direct and clear path for data flow throughout the network.

  • ResNetV1 passes the data through a convolutional layer before batch normalization and an activation layer, whereas ResNetV2 passes data through the batch normalization and activation layers before the convolutional layer (reversed).

You can find Keras ResNet model architectures in corresponding modules within keras.applications:
  • ResNet50: keras.applications.resnet50.ResNet50()

  • ResNet101: keras.applications.resnet.ResNet101()

  • ResNet152: keras.applications.resnet.ResNet152()

  • ResNet50V2: keras.applications.resnet_v2.ResNet50V2()

  • ResNet101V2: keras.applications.resnet_v2.ResNet101V2()

  • ResNet152V2: keras.applications.resnet_v2.ResNet152V2()

We will explore how to use the architecture given the model object later.


The 2015 InceptionV3 model is a popular model from the Inception family of models, using a module/cell-based structure in which certain sets of layers are repeated. Inception networks are more computationally efficient by reducing the number of parameters necessary and limiting the memory and resources needed to be consumed. InceptionV3 was designed primarily with a focus on minimizing computational cost, whereas ResNet focuses on maximizing accuracy.

You can find the InceptionV3 model architecture at keras.applications.inception_v3.InceptionV3().

Inspired by the high performance of ResNet, the Inception-ResNet architecture uses a hybrid module by incorporating residuals. Correspondingly, Inception-ResNet generally has a low computational cost with high performance.

You can find the Inception-ResNetV2 architecture at keras.applications.inception_resnet_v2.InceptionResNetV2().

For more specific and technical discussion on residual connections, cell-based structures, and the InceptionV3 architecture, see Chapter 6.


The 2017 MobileNet models are designed to perform well in mobile phone deep learning applications. MobileNets use depth-wise separable convolutions, which split a standard convolution into a depth-wise step (one spatial filter per input channel) and a point-wise 1x1 step that mixes channels. MobileNet has far fewer parameters than both Inception and ResNet; correspondingly, it has generally been observed to trade some accuracy for a much smaller and faster model.
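To see why depth-wise separable convolutions save parameters, the sketch below counts weights for one layer under the usual formulation: a per-channel spatial convolution followed by a 1x1 convolution that mixes channels. The layer sizes (3x3 kernel, 128 input channels, 256 output channels) are arbitrary values chosen for illustration, and biases are ignored.

```python
def standard_conv_params(kernel, in_channels, out_channels):
    """Weights in a standard convolution (biases ignored)."""
    return kernel * kernel * in_channels * out_channels


def separable_conv_params(kernel, in_channels, out_channels):
    """Weights in a depth-wise separable convolution (biases ignored)."""
    depthwise = kernel * kernel * in_channels  # one k x k filter per input channel
    pointwise = in_channels * out_channels     # 1x1 convolution that mixes channels
    return depthwise + pointwise


standard = standard_conv_params(3, 128, 256)
separable = separable_conv_params(3, 128, 256)
print(standard, separable, round(standard / separable, 1))
```

For these sizes the separable layer needs roughly an eighth to a ninth of the weights of the standard layer.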


When we compare models (e.g., “MobileNet vs. ResNet”), keep in mind that we are referring to the complete family of architectures. Most models have architectures of different versions and depths. When comparing model families, we are referring to architectures of comparable versions and layers.

Keras/TensorFlow offers four versions of MobileNet: MobileNetV1, MobileNetV2, MobileNetV3Small, and MobileNetV3Large. MobileNetV2, like MobileNetV1, uses depth-wise separable convolutions but also introduces linear bottlenecks and shortcut connections between bottlenecks. MobileNetV2 can effectively extract features for object detection and segmentation and generally reaches the same performance as MobileNetV1 faster. MobileNetV3Small and MobileNetV3Large are MobileNet architectures designed for low-resource and high-resource scenarios, respectively, derived from Neural Architecture Search algorithms (we will discuss Neural Architecture Search and other methods in later chapters).

You can find Keras MobileNet model architectures in corresponding modules within keras.applications:
  • MobileNetV1: keras.applications.mobilenet.MobileNet()

  • MobileNetV2: keras.applications.mobilenet_v2.MobileNetV2()

  • MobileNetV3Small: keras.applications.MobileNetV3Small()

  • MobileNetV3Large: keras.applications.MobileNetV3Large()


Deep convolutional neural networks have continually been growing larger in an attempt to be more powerful. Exactly how this enlargement is performed, however, has varied. Some approaches increase the resolution of the image by increasing the number of pixels in the image handled by the network. Others increase the depth of the network by adding more layers or the width by increasing the number of nodes in each layer.

The 2019 EfficientNet family of models takes a new approach to neural network enlargement via compound scaling, in which the resolution, depth, and width of the network are scaled together by a single compound coefficient. By applying this scaling method to a small base model named EfficientNetB0, seven compound-scaled architectures were generated, named EfficientNetB1, EfficientNetB2, …, EfficientNetB7 as the magnitude of scaling increases. The EfficientNet family of models is both powerful and efficient, in terms of both computation and time. Compound scaling allows EfficientNet to improve upon the performance of models like MobileNet and ResNet on the ImageNet dataset.
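The idea of compound scaling can be sketched numerically. The code below assumes the base multipliers reported in the EfficientNet paper (1.2 for depth, 1.1 for width, 1.15 for resolution) and raises each to a shared coefficient `phi`. The released B1 to B7 models do not follow these values exactly, so treat this as an illustration of the principle rather than a recipe for reproducing them.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution bases (assumed from the paper)


def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi


def compute_multiplier(phi):
    """Approximate growth in compute: depth * width^2 * resolution^2."""
    depth, width, resolution = compound_scale(phi)
    return depth * width ** 2 * resolution ** 2


for phi in range(4):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, "
          f"resolution x{r:.2f}, compute x{compute_multiplier(phi):.2f}")
```

The bases are chosen so that each unit increase in `phi` roughly doubles the compute budget, while all three dimensions grow together instead of one at a time.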

You can find the Keras implementation of EfficientNet model architectures in the corresponding module within keras.applications:
  • EfficientNetB0: keras.applications.efficientnet.EfficientNetB0()

  • EfficientNetB1: keras.applications.efficientnet.EfficientNetB1()

  • EfficientNetB2: keras.applications.efficientnet.EfficientNetB2()

  • EfficientNetB6: keras.applications.efficientnet.EfficientNetB6()

  • EfficientNetB7: keras.applications.efficientnet.EfficientNetB7()

Find a more specific and technical discussion of EfficientNet in a Chapter 6 case study.

Other Models

Keras/TensorFlow offers a host of other pretrained models in keras.applications. To view them, refer to the TensorFlow documentation, which not only provides information on usage, parameters, and methods but also includes the links to each pretrained model’s corresponding paper: www.tensorflow.org/api_docs/python/tf/keras/applications. Later in this chapter, we will also cover how to convert PyTorch models into Keras/TensorFlow models to take advantage of PyTorch’s library of pretrained models.

Changing Pretrained Model Architectures

When changing pretrained model architectures, usually only the pretrained model’s convolutional component is transferred. Recall that transfer learning is primarily concerned with transferring the skills learned from one general dataset for application to a specific task dataset. Because the task dataset is different from the general dataset, a model operating on the task dataset would still benefit from the transferred feature-extracting skills, but it would need to develop its own interpretations.

As a conceptual analogy, imagine you are an art student trained in the analysis of Impressionist-era works – the art of Monet, Renoir, Matisse, and so on. If now you were to approach a contemporary work by the likes of Picasso and Pollock, you would want to
  • …retain the feature-extracting skills: The fundamental feature extraction capabilities that you learned as an art student are important to developing and organizing meaningful and astute observations that can be used for interpretation. Without understanding how to look at and take away observations of art (without actually interpreting them), there are no observations to serve as the basis of interpretation.

  • …but develop a fresh set of interpretations: Now that you have kept your feature extraction skills and have developed a set of observations (the extracted features), it wouldn’t in most contexts make sense to analyze the contemporary work through Impressionist interpretations. Rather, this new data calls for a fresh set of interpretations suited toward the context of contemporary art.

Thus, while the skills for extracting features are valuable and can be transferred almost universally across most generally similar problems, often the interpretations of said features vary widely enough that the most successful strategy is to build a custom fully connected component, instead of transferring it (Figure 2-17).
Figure 2-17

The process of transferring weights from a pretrained model to a custom model. PM = pretrained model, CM = custom model. “Updated” indicates that the corresponding component is being trained/updated in retraining

However, on some occasions the problem set may be similar enough that the fully connected layer weights may also be transferred (Figure 2-18). This method requires no architecture manipulation of the pretrained model, since all the weights from the pretrained model are being transferred. To use this method, simply instantiate the pretrained model and train it on the task dataset without any architecture manipulation.
Figure 2-18

An alternative process of transferring weights from a pretrained model to a custom model in which both the pretrained model’s convolutional and FC components are transferred

Another alternative is to keep the pretrained model’s fully connected component and to add more custom layers afterward (Figure 2-19). Think back to the art student analogy: you’re starting from the interpretative perspective of the Impressionist style that you were trained in, but additionally developing new interpretations of your interpretations to account for the new problem type. However, this method is generally less often used than only transferring weights in the convolutional component, since to be effective the neural network must be significantly deepened (more layers added), which can pose problems for updating weights, time, and model performance. Under certain conditions, though, it can be the optimal strategy.
Figure 2-19

An alternative process of transferring weights from a pretrained model to a custom model in which both the pretrained model’s convolutional and FC components are transferred, and new fully connected layers are added to the custom model

Although we will only explore the first method through examples, all can be implemented through tools covered in Chapter 1, like the Functional API. Models can be treated like functional layers, expressed as functions of an input. See Chapter 3 for more complex exploration into neural network architecture manipulation.

Neural Network “Top” Inclusivity

All Keras/TensorFlow pretrained models have an important parameter, named include_top, which allows us to easily implement the first method of changing pretrained model architectures for training.

Recall that the fully connected component of an image-based neural network architecture is often referred to as its “top.” Thus, if the include_top parameter is set to True, the fully connected component will be retained in the instantiated pretrained model object. On the other hand, if the include_top parameter is set to False, the fully connected component will not be retained and the instantiated pretrained model object will contain only the input and convolutional components.

It’s important to realize that when there is no top, the input image shape is more or less arbitrary (see Note) because convolutional layers do not rely upon an absolute shape. A convolutional layer can act on an image of almost any shape (see Note for the qualifier “almost”), since it simply “slides” filters over an image. On the other hand, a dense layer in the fully connected component can only act upon inputs of a certain shape. Thus, if the include_top parameter is set to True, the shape of your input data is limited to the shape of the data the model was pretrained on.


The input image shape is not completely arbitrary. If the image shape is too small, the image may not be large enough for the whole depth of the network to be able to be valid, since filters rely on a minimum image size. One cannot perform ten convolutions with kernel size (3,3) on a 16x16 pixel image because in the middle of the sequence the feature map will have been reduced so much that it is too small to perform convolutions on. This is likely the culprit if you are using a model architecture with a small input size that throws an error when defining the architecture.

It should also be noted that the somewhat arbitrary input size that comes with the absence of a top does not mean you should be too extreme with how much you change the input shape of your image. Say a model was trained on 512x512 pixel images and you used transfer learning by extracting the convolutional component of that model (along with its weights). If you were to train the model on 32x32 pixel images (likely with image resizing), even though the operation is technically valid, the transferred weights are rendered more or less useless because the skills in the original model were developed on higher-resolution images. In short, make use of the relative arbitrariness of image sizes with no top, but also take your freedom in image size with a grain of salt.
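The minimum-size constraint in the Note can be checked with simple arithmetic. The sketch below assumes unpadded ("valid") convolutions, where each layer shrinks the feature map; real pretrained architectures mix padding, strides, and pooling, so this is a simplified model. The function name is hypothetical.

```python
def max_valid_convolutions(image_size, kernel_size=3, stride=1):
    """Count how many unpadded convolutions fit before the map is smaller than the kernel.

    Returns (number_of_convolutions, final_feature_map_size).
    """
    count = 0
    size = image_size
    while size >= kernel_size:
        size = (size - kernel_size) // stride + 1  # output size of one valid convolution
        count += 1
    return count, size


fits, final_size = max_valid_convolutions(16)
print(fits, final_size)
```

For a 16x16 image and 3x3 kernels, only seven convolutions fit before the feature map shrinks to 2x2, so a sequence of ten cannot be built.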

Layer Freezing

Layer freezing is a useful tool for the implementation of transfer learning. When a layer is “frozen,” its weights are fixed and it cannot be trained. The purpose of transfer learning is to utilize important skills the pretrained model has developed. Thus, it is common to freeze the convolutional component of the pretrained model and to train only the custom fully connected component (Figure 2-20). This way, the feature-extracting skills learned and stored in the weights of the convolutional component are kept as is, but the interpretative fully connected component is trained to best interpret the extracted features. After this, sometimes the entire network is trained for fine-tuning. This allows the convolutional component to be updated in accordance with the developed interpretations in the fully connected component. Be aware that too much fine-tuning can lead to overfitting.

Think of layer freezing in relation to the art student analogy. Initially, you want to utilize the feature-extracting skills you developed in your art training and to construct new interpretations of contemporary art corresponding to the extracted features. However, after you have constructed new interpretations for the particular problem, you may find it helpful to go back and slightly update your feature-extracting skills to better service your new interpretative capabilities.
Figure 2-20

A multistep layer freezing process

On a practical level, layer freezing is immensely helpful. Although growing computational power makes it easy to build massive neural networks and call .fit(), it remains difficult to optimize hundreds of millions (or even billions) of parameters by brute force and achieve good performance. Layer freezing allows the neural network to focus on optimizing one segment of its architecture at a time and allows you to better use the weights gained from pretraining by restricting how much the neural network can deviate from the learned pretrained weights.

Implementing Transfer Learning

In this section, we will discuss three examples of implementing transfer learning with Keras/TensorFlow:
  1. No architecture or weight changes: An exercise in using (ImageNet) pretrained models for their original purpose – to predict what object is in an image. Also helpful for dealing with reshaping inputs and using encoding and decoding functions associated with the pretrained model module

  2. Transfer learning without layer freezing: Using the standard procedure of transferring the weights from the convolutional component and building a custom fully connected component, without layer freezing. Helpful for manipulating model architectures both using the Functional API and using the include_top parameter

  3. Transfer learning with layer freezing: Using the standard procedure of transferring the weights from the convolutional component and building a custom fully connected component, with layer freezing


We’ll also discuss how to take advantage of PyTorch’s pretrained model library by converting them into Keras/TensorFlow models.

General Implementation Structure: A Template

Before we build a transfer learning model, let’s first compile the topics and methods discussed before to lay out a template for the general implementation structure of transfer learning:
  1. Choose an appropriate model from the Keras/TensorFlow repository of pretrained models or from another source, like a PyPI library offering an implementation of a pretrained model or PyTorch.

  2. Instantiate the model with the desired architectural settings (include or do not include top).

  3. Set up the model’s input and output flows such that they match the data. This may require using a reshaping layer, using an input layer, adding a fully connected component such that the extracted features are interpreted and outputted, some other mechanism, or some combination.

  4. Freeze layers, as necessary.

  5. Compile and fit.

  6. Change which layers are frozen if necessary, and compile and fit again.


It should also be noted that although in this context we are discussing image-based pretrained models, as transfer learning began its extensive development in the domain of images, the logic of a feature-extracting segment and an interpretative segment can be applied to other problem domains as well.

No Architecture or Weight Changes

If we make no changes to the architecture, we cannot adjust the input or the output of the pretrained model. Therefore, if we want to make no changes to the architecture, we must use the pretrained model for its original purpose.

The InceptionV3 model was trained on the ImageNet dataset. Thus, it takes an image of size (299,299,3) and returns an output vector with the probabilities that the image belongs to each of 1000 classes.

Let’s begin by loading an example image to demonstrate how InceptionV3 would function (Listing 2-1). We will use requests to read a URL, PIL to read the corresponding image (Figure 2-21), and numpy to convert the image into an array.
import PIL, requests, numpy as np
from PIL import Image
url = 'https://cdn.pixabay.com/photo/2018/04/13/21/24/lion-3317670_1280.jpg'
im = np.array(Image.open(requests.get(url, stream=True).raw))
Listing 2-1

Loading image of a lion

Figure 2-21

Sample image to be classified by a model architecture trained on the ImageNet dataset

Let’s begin by importing necessary layers, applications, and functions (Listing 2-2):
  • The keras.layers.Input layer defines the input size. An input specification is required for constructing a neural network, either through this layer or in some other form, like the input_shape parameter.

  • The keras.layers.experimental.preprocessing.Resizing layer resizes an image to a new shape. We can use this to reshape an image of any size to the shape (299,299,3) such that it is in the proper input shape for InceptionV3.

  • The keras.applications.inception_v3.InceptionV3 model is the core InceptionV3 model that can be used and manipulated in relation to the other layers.

  • The keras.applications.inception_v3.preprocess_input preprocesses data to be in the same format InceptionV3 was trained on. If the inputs are not preprocessed, the pretrained weights may not be suited toward the unprocessed inputs and yield inaccurate results.

from keras.layers import Input
from keras.layers.experimental.preprocessing import Resizing
from keras.applications.inception_v3 import InceptionV3
from keras.applications.inception_v3 import preprocess_input
Listing 2-2

Importing necessary libraries

We can use these layers to set up a series of data input and preprocessing steps (Listing 2-3). First, the Input layer takes in data with a shape of (None, None, 3); None indicates that the value for that dimension is left unspecified. This allows us to pass images of any size into the neural network input. The Resizing layer reshapes the input image to the proper shape for InceptionV3. The third layer preprocesses the resized data into InceptionV3 format.
input_layer = Input((None, None, 3))
reshape_layer = Resizing(299,299)(input_layer)
preprocess_layer = preprocess_input(reshape_layer)
Listing 2-3

Building input of a transfer learning neural network

The output of preprocess_layer (the result of the resized, preprocessed input data) is passed as the input to the InceptionV3 model. Note that we are making no changes to the model architecture or weights, so we set include_top=True and weights='imagenet'. We can treat the model as a layer that takes in the output of a previous layer and can be passed as an input to the following layer: Inceptionv3 = InceptionV3(include_top=True, weights="imagenet")(preprocess_layer).

Note that you can set weights to None to just use the model architectures, without the pretrained weights. Although this doesn’t quite count as transfer learning, you may want to use this when the task dataset is so different from the pretraining dataset that any pretrained weights wouldn’t be of much benefit, but want to take advantage of some architecture’s characteristics, like efficiency or power.

We can create a model out of this set of layers using keras.models.Model (note that the pretrained model is treated like a layer and thus is considered an outputs layer): model = keras.models.Model(inputs=input_layer, outputs=Inceptionv3).

Now that the model has been created, we can use it to make predictions (Listing 2-4). We will use the decode_predictions function to help us parse the outputs of the InceptionV3 model, which in its raw form is a list of numbers. The model predictions for the image (stored in the variable im) can be passed into decode_predictions to obtain interpretable results.
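from keras.applications.inception_v3 import decode_predictions
reshaped_im = np.array([im])  # add a batch dimension
# predict on the batch, then decode the top classes
decode_predictions(model.predict(reshaped_im))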
Listing 2-4

Running and decoding predictions for a pretrained model

Note that the model expects four-dimensional data – for example, data with shape (100,299,299,3), indicating that it consists of 100 299x299 pixel RGB colored images. Even though we are submitting an individual image for prediction, it still needs to be four-dimensional. An easy way to accomplish this is to wrap it as an element in another array with np.array([im]).
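The batch dimension can be added in a few equivalent ways. The sketch below uses a small zero-filled array as a stand-in for the decoded image (the real `im` would be much larger) and shows that wrapping, `np.expand_dims`, and `np.newaxis` indexing all give the same four-dimensional shape.

```python
import numpy as np

# Stand-in for a decoded RGB image: height 4, width 6, 3 channels.
im = np.zeros((4, 6, 3), dtype=np.uint8)

batch_a = np.array([im])              # wrap the image in an outer array
batch_b = np.expand_dims(im, axis=0)  # insert a batch axis at position 0
batch_c = im[np.newaxis, ...]         # indexing shorthand for the same thing

print(batch_a.shape, batch_b.shape, batch_c.shape)
```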

Each element of the results of decoded predictions is in the format (class name, class description, probability). The results of the decoded predictions show that InceptionV3 has correctly identified the object of the image – a lion, with 91% confidence (Listing 2-5).
[[('n02129165', 'lion', 0.91112673),
  ('n02112137', 'chow', 0.008575964),
  ('n02130308', 'cheetah', 0.0024228022),
  ('n04328186', 'stopwatch', 0.00097137265),
  ('n02106030', 'collie', 0.00083191396)]]
Listing 2-5

Results of InceptionV3 decoded predictions on an image of a lion

A similar process can be applied to other pretrained models available via Keras. Most Keras pretrained models have associated preprocess_input and decode_predictions functions.
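Conceptually, decoding predictions amounts to ranking the output probabilities and looking up the matching class labels. The sketch below is a simplified illustration, not Keras' actual decode_predictions implementation (which also returns the WordNet class ID); the four class names and probabilities are made up for the example.

```python
def top_k(probabilities, class_names, k=5):
    """Return the k most probable (class_name, probability) pairs, highest first."""
    ranked = sorted(zip(class_names, probabilities),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical four-class output, for illustration only.
names = ["chow", "lion", "collie", "cheetah"]
probs = [0.05, 0.91, 0.01, 0.03]

best = top_k(probs, names, k=2)
print(best)
```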

Transfer Learning Without Layer Freezing

If we want to adapt the pretrained model for our own task dataset, we need to make some minimal architecture changes such that the input and the output can accommodate our dataset. This follows a very similar structure to the previous application of no architecture or weight changes.

Let’s begin by importing necessary layers and models (Listing 2-6):
  • The keras.layers.Input layer defines the input shape.

  • The keras.layers.Dense layer provides the neural network’s “interpretative” or “predictive” power.

  • The keras.layers.GlobalAveragePooling2D layer “collapses” image data into the average of its elements. For instance, image data with shape (a, b, c, d) will have shape (a, d) afterward, where (b, c, d) is the shape of each image element and there are a training instances. An alternative to Global Average Pooling is the raw Flatten layer. The Flatten layer (keras.layers.Flatten()), unlike the Global Average Pooling layer, retains all the elements of multidimensional data and converts them into one-dimensional data by stacking them end to end. Thus, image data with shape (a, b, c, d) will have shape (a, b*c*d).

  • The keras.applications.inception_v3.InceptionV3 model provides the architecture and pretrained weights for the InceptionV3 model.

from keras.layers import Input, Dense, GlobalAveragePooling2D
from keras.applications.inception_v3 import InceptionV3
Listing 2-6

Importing important layers for transfer learning
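The difference between Global Average Pooling and Flatten is easiest to see in the output shapes. The sketch below reproduces what the two layers compute using NumPy on a hypothetical channels-last batch of feature maps; it mimics the layers' arithmetic and does not call the Keras layers themselves.

```python
import numpy as np

# Hypothetical batch of feature maps: 2 instances, 4x4 spatial grid, 8 channels.
features = np.arange(2 * 4 * 4 * 8, dtype=float).reshape(2, 4, 4, 8)

# Like GlobalAveragePooling2D: average over the spatial axes, giving (a, d).
pooled = features.mean(axis=(1, 2))

# Like Flatten: keep every element, giving (a, b*c*d).
flattened = features.reshape(features.shape[0], -1)

print(pooled.shape, flattened.shape)
```

Pooling keeps one value per channel, so the dense layer that follows needs far fewer weights than it would after flattening.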

Let’s begin by defining the input layer and the InceptionV3 model it feeds into (Listing 2-7).
input_layer = Input((512,512,3))
inceptionv3 = InceptionV3(include_top=False,
                          weights='imagenet')(input_layer)
Listing 2-7

Building the input and pretrained model of a transfer learning model

Note that we are keeping the ImageNet weights, but we do not include the top. While we want to keep the feature extraction skills in the convolutional layers, because we are appropriating the pretrained model for our own purposes, we don’t need the weights for the interpretative fully connected component.

Because we are not keeping the top of the neural network, the output of the truncated InceptionV3 model is still image-like: a stack of two-dimensional feature maps. The Global Average Pooling layer allows us to compress the data out of image form (this is the flattening component). Afterward, we can place two more dense layers to “interpret” the extracted features and one more to form the output layer (Listing 2-8).
pooling = GlobalAveragePooling2D()(inceptionv3)
custom_dense_1 = Dense(256, activation='relu')(pooling)
custom_dense_2 = Dense(64, activation='relu')(custom_dense_1)
output_layer = Dense(1, activation='sigmoid')(custom_dense_2)
Listing 2-8

Building the flatten and custom FC components of the transfer learning model

We can form a model as such (Listing 2-9).
model = keras.models.Model(inputs = input_layer, outputs = output_layer)
Listing 2-9

Compiling the transfer learning model into a Keras model

Visualizing the model shows the shape of the data over time (Figure 2-22) with keras.utils.plot_model(model, show_shapes=True). The visualization also demonstrates that the InceptionV3 model is treated very much like a standard layer.
Figure 2-22

Diagram of an example model architecture using transfer learning

The model can be correspondingly compiled and trained. It should be noted that usually the fully connected component is much longer and more complex to account for the complex nature of modern deep learning datasets.

Transfer Learning with Layer Freezing

In order to freeze a layer, set layer_obj.trainable=False.

We need to first instantiate the layer to set the trainable attribute to False (Listing 2-10).
inception_model = InceptionV3(include_top = False, weights = 'imagenet')
Listing 2-10

Instantiating the pretrained model as a layer without directly taking in inputs

We can then use this pretrained model object as a standard layer in conjunction with other layers using the syntax of the Functional API (Listing 2-11).
input_layer = Input((512,512,3))
inceptionv3 = inception_model(input_layer)
pooling = GlobalAveragePooling2D()(inceptionv3)
Listing 2-11

Compiling layers into a transfer learning model

After building the remainder of the model and aggregating the layers into a model with keras.models.Model, you can call model.summary() to get a rundown of the model’s parameters (Listing 2-12). Note that most of the parameters are trainable. A small quantity of non-trainable parameters are built into the model’s architecture.
Total params: 22,343,841
Trainable params: 22,309,409
Non-trainable params: 34,432
Listing 2-12

Number of trainable and non-trainable parameters before layer freezing

You can now call inception_model.trainable = False to freeze the layer. By re-aggregating the layers into a new model and calling model.summary(), you can see the results of layer freezing (Listing 2-13). Because most of the parameters in the complete model were transferred weights from the pretrained model, after layer freezing, most of the parameters are non-trainable.
Total params: 22,343,841
Trainable params: 541,057
Non-trainable params: 21,802,784
Listing 2-13

Number of trainable and non-trainable parameters after layer freezing
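The trainable count after freezing can be verified by hand. The sketch below assumes the custom head built earlier (dense layers with 256, 64, and 1 units) and assumes that the pooled InceptionV3 output has 2048 features; with those assumptions, the head accounts for every trainable parameter that remains.

```python
def dense_params(inputs, units):
    """Weights plus biases in a fully connected layer."""
    return inputs * units + units


POOLED_FEATURES = 2048  # assumed channel count of the InceptionV3 convolutional output

head = (dense_params(POOLED_FEATURES, 256)
        + dense_params(256, 64)
        + dense_params(64, 1))

total = 22_343_841          # "Total params" reported by model.summary()
frozen_base = total - head  # what is left once the head is subtracted

print(head, frozen_base)
```

Both numbers match the summary after freezing: the head is the trainable portion and the rest is the frozen pretrained model.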

Calling model.summary() is a good way to double-check if layer freezing worked or not. When several forms and series of neural network manipulations are being performed, it can be difficult to keep track and separate different forms of functions and instantiated objects.

You can unfreeze with inception_model.trainable = True and continue fitting again for fine-tuning.

You can also freeze individual layers with a similar method or by aggregating layers together into models and unfreezing groups of layers.
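The freeze, fit, unfreeze, and fit-again cycle can be sketched end to end. To keep the example small and free of downloads, a tiny randomly initialized convolutional stack stands in for the pretrained model, and the data is random; both are placeholders. The import path (`from tensorflow import keras`) and the smaller learning rate in the second phase are assumptions that reflect common practice rather than anything prescribed above.

```python
import numpy as np
from tensorflow import keras

layers = keras.layers

# Tiny stand-in for a pretrained convolutional base (hypothetical, random weights).
base = keras.Sequential([
    keras.Input((32, 32, 3)),
    layers.Conv2D(8, 3, activation="relu"),
    layers.Conv2D(16, 3, activation="relu"),
], name="stand_in_base")

inputs = keras.Input((32, 32, 3))
x = base(inputs)
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)

# Random data, only to make the sketch executable.
x_train = np.random.rand(16, 32, 32, 3).astype("float32")
y_train = np.random.randint(0, 2, size=(16, 1)).astype("float32")

# Phase 1: freeze the base and train only the custom head.
base.trainable = False
model.compile(optimizer=keras.optimizers.Adam(1e-3), loss="binary_crossentropy")
frozen_trainable = len(model.trainable_weights)  # kernel and bias of the Dense head
model.fit(x_train, y_train, epochs=1, verbose=0)

# Phase 2: unfreeze, recompile, and fine-tune everything with a smaller learning rate.
base.trainable = True
model.compile(optimizer=keras.optimizers.Adam(1e-5), loss="binary_crossentropy")
unfrozen_trainable = len(model.trainable_weights)
model.fit(x_train, y_train, epochs=1, verbose=0)

print(frozen_trainable, unfrozen_trainable)
```

Recompiling after changing the trainable attribute ensures the change is taken into account during training.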

Accessing PyTorch Models

Keras/TensorFlow offers a large selection of pretrained models, and platforms like the model zoo (https://modelzoo.co/) or PyPI libraries offer a wide range of pretrained models and/or architectures in Keras/TensorFlow.

PyTorch, a major alternative framework to Keras/TensorFlow, offers a larger selection of pretrained models. If a PyTorch pretrained model is not already implemented in Keras/TensorFlow, there’s a simple method to convert PyTorch models into Keras model objects so you can work with the model using familiar methods and steps.

PyTorch and Keras/TensorFlow, as well as most other frameworks, support the Open Neural Network Exchange (ONNX), an open ecosystem that establishes standards for representing various AI algorithms. ONNX was formulated with the purpose of encouraging innovation via conversion of algorithm representations across frameworks. Correspondingly, to convert a model from PyTorch to Keras/TensorFlow, the PyTorch model is first converted into ONNX format and then from ONNX format to the Keras/TensorFlow framework (Figure 2-23).
Figure 2-23

Diagram converting a PyTorch model to a Keras/TensorFlow model

Let’s start by installing PyTorch with pip install torch. It can be imported as import torch.

You can find a list of PyTorch image-based pretrained models at https://pytorch.org/vision/stable/models.html (note that you can access other non-image-based pretrained models as well). These image-based pretrained models can be accessed at torchvision.models.model_name. You can install TorchVision via pip with pip install torchvision.

For the purposes of this example, we will convert the PyTorch implementation of the SqueezeNet model into Keras. The 2016 SqueezeNet model focuses on retaining model performance while decreasing computational cost, notably by using 1x1 convolutions to reduce the number of parameters. The first step is to instantiate the PyTorch model (Listing 2-14).
from torchvision import models
pytorch_model = models.squeezenet1_0(pretrained=True)
Listing 2-14

Using PyTorch to instantiate a PyTorch pretrained model

Now that we have instantiated the PyTorch model, we can convert it into a Keras model using the pytorch2keras library, which can be installed via pip with pip install pytorch2keras.

Let’s begin by importing the pytorch_to_keras function (Listing 2-15).
from pytorch2keras.converter import pytorch_to_keras
Listing 2-15

Importing the necessary function to convert a PyTorch model to a Keras model

The pytorch_to_keras function takes in three arguments: the PyTorch model, the input tensor, and the shape of the input tensor (Listing 2-16).
from torch.autograd import Variable
from torch import FloatTensor
input_np = np.random.uniform(0, 1, (1, 3, 256, 256))
input_var = Variable(FloatTensor(input_np))
keras_model = pytorch_to_keras(pytorch_model,
                               input_var,
                               [(3, None, None,)])
Listing 2-16

Converting a PyTorch model to a Keras model. Note that we did not explicitly import it, but np refers to numpy; this can be imported as import numpy as np

The input_np variable is a dummy input whose shape is given in channels-first format, where the “depth” (number of channels) of the image is listed before its height and width, directly after the batch dimension. torch.FloatTensor and torch.autograd.Variable convert the numpy array into a valid tensor format.
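Because PyTorch works with channels-first layouts by default while Keras/TensorFlow defaults to channels-last, it can help to see the two layouts side by side. The following is a minimal numpy sketch and not part of the pytorch2keras API; the helper name to_channels_last is made up for illustration.

```python
import numpy as np

def to_channels_last(batch_chw):
    """Reorder (batch, channels, height, width) to (batch, height, width, channels)."""
    return np.transpose(batch_chw, (0, 2, 3, 1))

# Same dummy input shape as in Listing 2-16: one 3-channel 256x256 image
input_np = np.random.uniform(0, 1, (1, 3, 256, 256))
input_hwc = to_channels_last(input_np)
print(input_np.shape)   # (1, 3, 256, 256)
print(input_hwc.shape)  # (1, 256, 256, 3)
```

The pixel values are untouched; only the order of the axes changes, so the value at channel 2, row 5, column 7 in the first layout sits at row 5, column 7, channel 2 in the second.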

The resulting Keras model can be visualized to double-check that the architecture has been transferred correctly and then compiled and fitted.

Note that not all PyTorch architectures can be converted into Keras/TensorFlow if some layer or operation is not supported for export by ONNX. See a list of transferable layers on the pytorch2keras GitHub: https://github.com/gmalivenko/pytorch2keras .

Implementing Simple Self-Supervised Learning

In this section, we will discuss the implementation of a rudimentary form of self-supervised learning. You will find that many deep learning concepts are intertwined; more advanced forms of self-supervised learning need to wait until Chapter 3, because we first need a strong grasp of what autoencoders are before we can understand their effective use in pretraining.

In more complex forms of self-supervised learning, modifications to the architecture are performed after the pretraining task often by adding more layers, usually to get the output into the desired shape to train on the task dataset or to add more processing power. In a simpler application of self-supervised learning, we will design our self-supervised pretraining strategy such that no architectural changes need to be made after pretraining and before formal training on the task dataset.

In order to design a self-supervised pretraining strategy that does not require any architecture changes, we need to focus primarily on the input and the output. If the shapes of the pretraining task’s x and y match that of the ultimate task’s, then there is no need to alter the architecture to accommodate it.

Consider the task of classifying the gender of a person in an image. Here, we have x = image and y = 1-dimensional vector (one value: 0 or 1, indicating gender). Thus, our self-supervised pretraining task should similarly require the model to take in an image and output one value.

For this example, the self-supervised pretraining task will be to identify the degree of noise. For each training example, we will add Gaussian noise with a certain standard deviation, σ. The objective of the model for pretraining is to predict the degree of noise, σ, given the noisy image. In order to perform such a task, the model would need to identify which features in the image count as noise and which ones do not, and thus it works toward discovering and representing fundamental structures and patterns in faces. Adding Gaussian noise changes the color of the images, not the “location” of certain features. However, since color determines the presence of shadows, edges, and other characteristics that define what a female and male face would generally look like, a good understanding of common relationships between the colors of pixels would be valuable. Moreover, it moves the model closer toward being robust to small perturbations and noise.

Fundamentally, at a rudimentary level without architecture changes, building self-supervised learning pretraining strategies is concerned primarily with data. We can build two datasets – an augmented task dataset used for pretraining and a task dataset used for formal training – and fit the model on those two datasets, in that order. After being trained on the augmented task dataset, ideally the model will have developed representations of the “world” it is about to enter and can approach the ultimate task more aware and with a more complex understanding than if it had not undergone pretraining.

To construct the augmented task dataset, we will use the .map method of loading images from directories into the efficient TensorFlow dataset format (see Chapter 1). For brevity, here we will provide only the function to be mapped in constructing the altered dataset.

Let us first define a function, parse_file, which takes in the path of a file and uses TensorFlow input/output and image operations to read the image and convert it into proper format (Listing 2-17). The function returns a tensor containing the information from the image, upon which we can perform operations.
def parse_file(filename):
    raw_image = tf.io.read_file(filename)
    image = tf.image.decode_png(raw_image, channels=3)
    image = tf.image.convert_image_dtype(image, tf.float32)
    image = tf.image.resize(image, [512,512])
    return image
Listing 2-17

Function to parse a file path using TensorFlow functions

The alter_data function will be the “ultimate function” used in .map() (Listing 2-18). First, we use the parse_file function to collect information from the image. Next, we will use TensorFlow’s tf.random.uniform() function to randomly choose a number from a uniform distribution ranging from 0 to 0.5; this will be the standard deviation of the Gaussian noise applied. Using the standard deviation and mean 0.0, we will produce a Gaussian noise tensor with the same shape as the image. The noise can be applied to the image by adding the image tensor and the noise tensor together. Lastly, we return the altered image (which now has noise applied to it) and the standard deviation (the label). The task dataset has been altered such that the pretrained task is to predict the degree of noise applied to the image.
def alter_data(filename, label):
    image = parse_file(filename)
    std = tf.random.uniform(shape=[1], minval=0, maxval=0.5)
    noise = tf.random.normal(shape=tf.shape(image),
                             mean=0.0, stddev=std, dtype=tf.float32)
    image = tf.add(image, noise)
    return image, std
Listing 2-18

TensorFlow function to implement an alteration to the dataset for pretraining

If you had a multidimensional output, you could also have the model predict changes to brightness, hue, etc. – features that are important to this task. TensorFlow offers a host of image processing and mathematical operations that you can use to alter data. It is best to stick to TensorFlow functions and objects when constructing functions to map to avoid errors or computational inefficiency.
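As a rough illustration of the multidimensional-label idea, the sketch below alters an image in two ways and returns both alteration amounts as the label. It uses numpy rather than TensorFlow so that it runs on its own; inside a .map pipeline you would use TensorFlow operations instead. The function name alter_image and the ranges max_shift and max_std are assumptions chosen for the example.

```python
import numpy as np

def alter_image(image, rng, max_shift=0.2, max_std=0.5):
    """Return (altered image, [brightness shift, noise std]) for a float image in [0, 1]."""
    shift = rng.uniform(-max_shift, max_shift)   # brightness change to predict
    std = rng.uniform(0.0, max_std)              # noise level to predict
    noise = rng.normal(0.0, std, size=image.shape)
    altered = image + shift + noise
    label = np.array([shift, std], dtype=np.float32)
    return altered.astype(np.float32), label

rng = np.random.default_rng(0)
image = rng.uniform(0, 1, size=(64, 64, 3)).astype(np.float32)
altered, label = alter_image(image, rng)
print(altered.shape, label.shape)  # (64, 64, 3) (2,)
```

Note that a brightness shift can be negative, so a model trained on such labels would need an output activation that is not restricted to the range 0 to 1.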

We can see the visual results of our alteration (Figure 2-24).
Figure 2-24

Example images from the altered task dataset. The standard deviations of noise (labels) for images from left to right are 0.003, 0.186, and 0.304

You can then fit the model (after compiling and performing other necessary steps) on the two datasets (Listing 2-19). Note that the number of epochs the model uses fitting on the altered task dataset compared to the number of epochs used fitting on the task dataset varies significantly by the complexity of the task and the design of the altered task dataset, as well as other factors like computational resources available and the size of the datasets. If the task is immensely complex and the altered task dataset offers difficult to obtain but meaningful and deep representations of the task dataset (for instance, modern NLP problems), it may be fruitful to train the model on the altered task dataset for a significant period of time. On the other hand, in simpler tasks, like in this example of classifying the gender of a person in an image, and with altered task datasets that offer meaningful representations but are not integral toward representing the world of the task dataset, pretraining probably does not need to last as long. Getting a feel for the relationship between pretraining and formal training takes experience and time, but it’s good to experiment with your unique conditions and environment.
model.fit(altered_task_dataset, epochs=10)
model.fit(task_dataset, epochs=40)
Listing 2-19

Performing pretraining and ultimate training

It should also be noted that the altered task dataset and the task dataset in this case are different types of problems – the former is a regression problem (predicting a continuous value), whereas the latter is a classification problem (categorizing an image). In this case, this difference is not a problem because all labels in both datasets reside between 0 and 1, meaning that a standard model with a sigmoid activation function output can operate on both. Binary cross-entropy is also a valid loss function for regression problems bounded between 0 and 1, although if you would like you can compile the model with a formally regression-based loss function (e.g., MSE, Huber) before training on the altered task dataset and recompile the model with binary cross-entropy (or some other loss function for classification) before training on the task dataset. In more advanced cases, adjusting from the altered task dataset to the task dataset requires making architectural modifications. You’ll find examples of this in the second case study of this chapter, as well as in Chapter 3 on autoencoders. With the tools of autoencoders, we can perform more complex pretraining strategies, like using image-to-image pretraining tasks (e.g., colorization, denoising, resolution recovery) to pretrain a model to perform an image-to-vector or image-to-image task.
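To see why binary cross-entropy can serve as a loss for targets that lie between 0 and 1, the small check below scans candidate predictions for a fixed soft target and reports which one yields the lowest loss. The target value 0.3 and the grid of candidates are arbitrary choices for this sketch.

```python
import math

def binary_cross_entropy(target, pred):
    """Binary cross-entropy for a single soft target and prediction in (0, 1)."""
    return -(target * math.log(pred) + (1 - target) * math.log(1 - pred))

target = 0.3
candidates = [i / 1000 for i in range(1, 1000)]  # predictions 0.001 ... 0.999
best = min(candidates, key=lambda p: binary_cross_entropy(target, p))
print(best)  # 0.3
```

The loss is lowest when the prediction equals the target, which is the property a regression loss needs. The minimum value itself is not zero for soft targets, so the reported loss will not approach zero even for a perfect model.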

Case Studies

These three case studies allow us to explore new ideas and recent research in transfer learning and self-supervised learning. Some ideas may be out of the scope of this book in terms of implementation but offer fresh perspectives and things to think about when designing your deep learning approaches.

Transfer Learning Case Study: Adversarial Exploitation of Transfer Learning

As deep learning models are increasingly deployed in common production, it is becoming more important to incorporate the findings from the rising field of adversarial learning into deep learning designs. Adversarial learning is concerned with how to exploit weaknesses in deep learning models, often by making small changes that are imperceptible to the human eye but that completely change the model’s output. Vulnerabilities in deep learning can lead to dangerous outcomes – imagine, for instance, if malicious or accidental alterations to a traffic sign cause a self-driving car to suddenly accelerate or brake.

In a 2020 paper entitled “A Target-Agnostic Attack on Deep Models: Exploiting Security Vulnerabilities of Transfer Learning,” Shahbaz Rezaei and Xin Liu6 explore how adversarial learning can be applied to transfer learning. Because many modern applications of deep learning use transfer learning, a malicious hacker could analyze the architecture and weights of the pretrained model to exploit weaknesses in the deployed model.

Rezaei and Liu show that a hacker, with knowledge only of the pretrained model used in the application model, can develop adversarial inputs leading to any desired outcome. Because the hacker does not need to know the weights or architecture of the application model or the data it was retrained on, the method of attack is target agnostic, which allows it to exploit wide swaths of deployed deep learning models that utilize publicly available pretrained models.

Consider the following model structure, which uses a simple form of transfer learning (Figure 2-25). The input to the model is passed through the pretrained model (both convolutional and FC components). The outputs of the pretrained model (a three-dimensional vector) are passed through a custom (newly added) FC layer, which yields the output (a two-dimensional vector). The weights of the FC layer are labeled – for instance, the connection between the top node of the pretrained model’s output and the bottom node of the custom FC layer is 4. For the purposes of simplicity, ignore the bias term in each node.
Figure 2-25

Example transfer learning model architecture with hacker visibility marked

The hacker has access only to the pretrained model, not the custom FC layer and any operations that process the output of the pretrained model.

Say that the hacker wanted to construct an input that would trigger a certain output neuron, like the bottom neuron in the custom FC layer. The hacker can construct an input such that the top neuron output of the pretrained model is some very large number and all the other output neurons are 0. When multiplied by the corresponding weights, the outputs of the complete model yield 10 and 40; passing through the softmax activation function magnifies this difference, and the bottom output neuron of the complete model is the final decision of the model, as desired, as it has the largest associated value/probability (Figure 2-26).
Figure 2-26

Example transfer learning model architecture with a pretrained model output neuron activated to trigger a specific output from the model

Alternatively, if the hacker wanted to activate the top output neuron of the complete model, it could design an input such that the output neurons of the pretrained model were all zero except for the bottom one, which would be an arbitrarily large number (Figure 2-27).
Figure 2-27

Example transfer learning model architecture with a pretrained model output neuron activated to trigger a specific output from the model

However, the hacker does not know the weights of the custom FC layer, which sits outside of the pretrained model, and so cannot tell in advance which pretrained output neuron triggers which final output. Instead, they can perform a brute-force search to attempt to force the model to output a certain result. In this search, the hacker aims to construct an input such that the output of the pretrained model is [a, 0, …, 0, 0] the first time, [0, a, …, 0, 0] the second time, and so on until the last time, [0, 0, …, 0, a], where a is some arbitrarily large number. For each desired output, some node in the complete network’s output will be activated. The hacker simply records which one of the complete network’s output nodes is activated for each designed output of the pretrained model.
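The probing loop can be sketched in a few lines. In this toy version, the custom FC layer is a hidden weight table with made-up values (chosen so that activating the first pretrained output yields 10 and 40, as in the example above), and the attacker is assumed to be able to set the pretrained outputs directly; in the actual attack, each probe would first have to be turned into an input image. All names here are hypothetical.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hidden from the attacker: weights of the custom FC layer (made-up values).
# Row i holds the weights from pretrained output i to each class (bias ignored).
HIDDEN_WEIGHTS = [[1.0, 4.0],
                  [2.0, 1.0],
                  [3.0, 0.5]]

def application_head(features):
    """Black box the attacker can only query: pretrained outputs -> class probabilities."""
    n_classes = len(HIDDEN_WEIGHTS[0])
    logits = [sum(f * row[j] for f, row in zip(features, HIDDEN_WEIGHTS))
              for j in range(n_classes)]
    return softmax(logits)

def probe_triggers(head, n_features, a=10.0):
    """Try [a, 0, ..], [0, a, ..], ... and record which class each one activates."""
    triggers = {}
    for i in range(n_features):
        features = [0.0] * n_features
        features[i] = a
        probs = head(features)
        winner = max(range(len(probs)), key=probs.__getitem__)
        triggers.setdefault(winner, i)  # keep the first feature found per class
    return triggers

triggers = probe_triggers(application_head, n_features=3)
print(triggers)  # {1: 0, 0: 1}
```

The result maps each class to a pretrained output neuron that triggers it: here, activating neuron 0 forces class 1 and activating neuron 1 forces class 0.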

It should be noted that for this strategy to be capable of activating every output node of the complete model, for each output node of the complete model, there must exist a connection from an output node of the pretrained model whose weight is larger than the connection from the output node of that pretrained model to any other output node in the complete model. This may not be true in more complex datasets, especially heavy multiclass problems, but in most networks, this strategy works to “force” the network to predict almost any desired result.
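This condition can be checked directly when the weights are known, which is useful for reasoning about the attack even though the attacker never sees them. The sketch below, with a hypothetical helper and made-up weight tables, lists the classes that some one-hot probe can trigger, ignoring bias terms as in the simplified example.

```python
def reachable_classes(weights):
    """weights[i][j] is the weight from pretrained output i to class j.

    A one-hot probe on output i triggers the class with the largest weight in
    row i, so the reachable classes are the row-wise argmax values.
    """
    return {max(range(len(row)), key=row.__getitem__) for row in weights}

covered = [[1.0, 4.0], [2.0, 1.0], [3.0, 0.5]]    # every class wins some row
uncovered = [[1.0, 4.0], [2.0, 5.0], [0.5, 3.0]]  # class 1 wins every row
print(reachable_classes(covered))    # {0, 1}
print(reachable_classes(uncovered))  # {1}
```

In the second table, no single pretrained output favors class 0, so the one-hot probing strategy cannot force that class.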

Because the problem has now been boiled down to finding the input to the model that will lead to a certain output of the pretrained component, the hacker can use methods like gradient descent to solve the following optimization problem: adjust the values of the input image such that the output of the pretrained model matches the desired output as closely as possible (minimize the error between the output of the pretrained model and the desired output). In this case, the desired output is one in which all the output nodes except for one are zero, and the one nonzero node is some very large number. The optimization algorithm begins with a standard input image and makes small changes to the image such that the output of the pretrained model approaches the desired output, constructing an adversarial image. Passing this adversarial image through a model that uses that pretrained model via transfer learning fools it, even though to the human eye no significant changes were made (Figure 2-28).
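The optimization can be illustrated on a deliberately tiny problem. In the sketch below, a random linear map stands in for the pretrained model so that the gradient can be written by hand; a real attack would backpropagate through the actual pretrained network with an automatic differentiation library and would also limit how far the image may change. The sizes, learning rate, and step count are assumptions picked for the toy setting.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 16))  # toy "pretrained model": 16 input values -> 3 outputs

def pretrained_output(x):
    return W @ x

def craft_input(x_start, desired, lr=0.005, steps=3000):
    """Gradient descent on the input to minimize ||pretrained_output(x) - desired||^2."""
    x = x_start.copy()
    for _ in range(steps):
        error = pretrained_output(x) - desired
        grad = 2.0 * W.T @ error   # gradient of the squared error with respect to x
        x -= lr * grad
    return x

x_start = rng.uniform(0, 1, size=16)   # stands in for a normal input image
desired = np.array([10.0, 0.0, 0.0])   # one large output, the rest zero
x_adv = craft_input(x_start, desired)

error_before = np.linalg.norm(pretrained_output(x_start) - desired)
error_after = np.linalg.norm(pretrained_output(x_adv) - desired)
print(error_before, error_after)
```

Only the input is updated; the weights of the stand-in model stay fixed throughout, which mirrors the attack setting in which the pretrained model is public and unchanged.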
Figure 2-28

Left: example input image to the optimization algorithm. Right: example adversarial image generated to produce a certain network output. (By Shahbaz Rezaei and Xin Liu.)

Making small, specialized distortions to the original input image results in the model making different classifications between the first and second images. Moreover, this could be done without any knowledge of the complete networks’ weights or architectures – an impressive feat.

As the number of target classes (of the complete neural network) increases, the performance of the attack method decreases – but it is still alarmingly high. Consider the attack’s performance on a neural network using the VGG face pretrained model on the University of Massachusetts Labeled Faces in the Wild (LFW) dataset (Table 2-1).
Table 2-1

Metrics of attack performance for different numbers of target classes. Target classes: number of target classes/output nodes in the complete neural network. NABAC: number of attempts to break all classes, referring to the number of queries the attack needs to make to “break” a class (force the model to misclassify an item from that class as another class). Effectiveness (x%): the ratio of adversarial inputs that trigger a target class with x% confidence to the total number of adversarial inputs. (By Shahbaz Rezaei and Xin Liu.)

| Target Classes | NABAC | Effectiveness (95%) | Effectiveness (99%) |
|  | 48.25 ± 42.5 | 91.68% ± 5.69 | 87.82% ± 6.98 |
|  | 149.97 ± 132.15 | 88.87% ± 2.46 | 83.07% ± 3.31 |
|  | 323.36 ± 253.56 | 87.79% ± 2.42 | 82.08% ± 2.74 |

As the number of target classes increases, the attack needs to make more queries to break all classes, and its effectiveness decreases.
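The effectiveness metric in the table caption is a simple ratio, sketched below. The confidence values are made up for illustration, and treating “with x% confidence” as “at least x%” is an assumption of this sketch.

```python
def effectiveness(target_confidences, threshold):
    """Fraction of adversarial inputs whose target-class confidence is at least threshold."""
    hits = sum(1 for c in target_confidences if c >= threshold)
    return hits / len(target_confidences)

# Made-up confidences assigned to the target class for five adversarial inputs
confidences = [0.999, 0.97, 0.62, 0.995, 0.96]
print(effectiveness(confidences, 0.95))  # 0.8
print(effectiveness(confidences, 0.99))  # 0.4
```

Raising the confidence threshold can only lower the ratio, which is why the 99% column is never higher than the 95% column.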

In the simplified example, only one fully connected layer was added after the pretrained model. While this is one method of integrating the pretrained model into the model structure, often many more layers are added. As the number of layers added afterward increases, the attack’s performance also decreases (Table 2-2).
Table 2-2

Metrics of attack performance for different numbers of layers. (By Shahbaz Rezaei and Xin Liu.)

| # New Layers | NABAC | Effectiveness (95%) | Effectiveness (99%) |
|  | 48.25 ± 42.5 | 91.68% ± 5.69 | 87.82% ± 6.98 |
|  | 51.87 ± 39.94 | 91.57% ± 4.87 | 86.45% ± 5.35 |
|  | 257.26 ± 387.16 | 89% ± 8.20 | 85.67% ± 8.88 |

Nevertheless, the effectiveness is high, even as the number of target classes and new layers increases. This study is an interesting investigation into the implications of transfer learning for the deployment of models in real-world applications. It’s important to keep an eye toward deployment and the model’s role in the real world rather than in an airtight experimentation laboratory environment when designing deep learning approaches.

This overview was a simplification of the study; to read more, you can find the paper at https://openreview.net/pdf?id=BylVcTNtDS.

Self-Supervised Learning Case Study: Predicting Rotations

In “Unsupervised Representation Learning by Predicting Image Rotations,” Spyros Gidaris, Praveer Singh, and Nikos Komodakis7 propose a simple self-supervised learning method that develops strong representations of the data and yields effective results.

Gidaris et al. propose a pretraining strategy in which images are randomly rotated by 0, 90, 180, or 270 degrees and the model predicts the angle by which the image was rotated. The core intuition behind the design of the rotation-based self-supervised learning pretraining strategy is that the model must first understand the concept of the object depicted in the images in order to identify the rotation that was applied to the image. A successful model, moreover, has a sense of the “natural orientation” of an object, and thus of how gravity and other forces should act on objects in an image.

…it is essentially impossible for a ConvNet model to effectively perform the above rotation recognition task unless it has first learnt to recognize and detect classes of objects as well as their semantic parts in images. More specifically, to successfully predict the rotation of an image the ConvNet model must necessarily learn to localize salient objects in the image, recognize their orientation and object type, and then relate the object orientation with the dominant orientation that each type of object tends to be depicted within the available images.

—Gidaris et al., “Unsupervised Representation Learning by Predicting Image Rotations”

With no access to the original, unrotated image, given an image rotated θ degrees (chosen from θ = 0°, 90°, 180°, 270°), the network must maximize the probability that the image is rotated θ degrees (Figure 2-29).
Figure 2-29

The convolutional neural network’s prediction objectives under self-supervised learning by predicting rotations. (By Gidaris et al.)

To visually see the difference in the learned representations for self-supervised learning as opposed to standard supervised learning (only training the model on the task dataset without constructing a pretraining task), Gidaris et al. visualize the attention maps from a model trained with a standard supervised procedure and another trained with a self-supervised procedure (Figure 2-30).
Figure 2-30

Attention maps from various layers from a supervised model and a self-supervised model given three example input images. (By Gidaris et al.)

The attention map of a convolutional layer can be found by computing the feature maps for that layer and aggregating them. You can visually see that the attention maps from the self-supervised model focus more on the high-level objects, like the eyes, nose, tails, and heads – in accordance with important features required to perform object recognition.


You can perform something similar by extracting features from a layer in the middle of a neural network in Keras by referring to individual layers (assuming each layer is associated with a variable you defined using the Functional API). Visualizing layer weights and features can be helpful to understanding the improvement in representation capabilities that self-supervised learning provides.
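Once the feature maps of a layer have been extracted, aggregating them into a single map is a short numpy operation. The sketch below sums the absolute activations, raised to a power, across channels; this particular aggregation and the exponent are assumptions about one common way to do it, and the random array merely stands in for the output of an intermediate layer.

```python
import numpy as np

def attention_map(feature_maps, power=2):
    """Collapse (height, width, channels) feature maps into one (height, width) map."""
    aggregated = np.sum(np.abs(feature_maps) ** power, axis=-1)
    peak = aggregated.max()
    # scale to [0, 1] so maps from different layers can be compared visually
    return aggregated / peak if peak > 0 else aggregated

rng = np.random.default_rng(0)
feature_maps = rng.normal(size=(8, 8, 64))  # stand-in for one layer's output
amap = attention_map(feature_maps)
print(amap.shape)  # (8, 8)
```

The resulting array can be resized to the input resolution and overlaid on the image to see which regions produce the strongest activations.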

Interestingly, Gidaris et al. find that classifying 0°-90°-180°-270° rotations performs better than other rotations, like 0°-180° rotations, 90°-270° rotations, and 0°-45°-90°-135°-180°-225°-270°-315° (Table 2-3). This result highlights the importance of experimenting with variations of self-supervised strategies to explore which ones perform the best for your particular problem and approach.
Table 2-3

Performance on CIFAR-10 classification for different rotations needed for prediction. (By Gidaris et al.)

| # Rotations |  | CIFAR-10 Classification Accuracy |

We can implement self-supervised learning via predicting rotations relatively easily. We will use the CIFAR-10 dataset, which can be loaded directly from Keras datasets with (x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data(). The first step after loading data is to generate self-supervised altered task data, in this case predicting which degree the image has been rotated (Listing 2-20). Although there are more efficient methods, we will simply loop through every data instance in x_train, randomly rotate it by 0, 90, 180, or 270 degrees, and save the rotated image and corresponding label into a new list. The scipy library’s scipy.ndimage.rotate(img, deg) allows us to easily perform rotations.
import numpy as np
from scipy.ndimage import rotate
angles = [0, 90, 180, 270]
ss_x_train, ss_y_train = [], []
for ind in range(len(x_train)):
    # the label is the class index 0-3 of the chosen angle
    label = np.random.randint(len(angles))
    # reshape=False keeps the original 32x32 image shape
    ss_x_train.append(rotate(x_train[ind], angles[label], reshape=False))
    ss_y_train.append(label)
ss_x_train = np.array(ss_x_train)
ss_y_train = np.array(ss_y_train)
Listing 2-20

Constructing rotation-based altered task dataset
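As one possible example of a more efficient method, the sketch below rotates whole groups of images at once with numpy’s rot90, which performs exact quarter turns without interpolation. The function name make_rotation_dataset is hypothetical, the images are assumed to be square, and a small synthetic array stands in for x_train.

```python
import numpy as np

def make_rotation_dataset(images, seed=0):
    """images: (N, H, W, C) with H == W. Returns rotated copies and labels 0-3."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 4, size=len(images))  # number of quarter turns per image
    rotated = np.empty_like(images)
    for k in range(4):
        mask = labels == k
        # rotate every image with this label by k * 90 degrees in the height-width plane
        rotated[mask] = np.rot90(images[mask], k=k, axes=(1, 2))
    return rotated, labels

images = np.arange(8 * 4 * 4 * 3).reshape(8, 4, 4, 3)  # stand-in for x_train
rotated, labels = make_rotation_dataset(images)
print(rotated.shape, labels.shape)  # (8, 4, 4, 3) (8,)
```

Each label is the number of counterclockwise quarter turns applied, so label 0 means the image was left unchanged.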

We’ll use the EfficientNetB4 architecture (not using its weights) to predict which one of four rotations was applied (Listing 2-21). Note that the output of the EfficientNetB4 architecture is 100-dimensional, which is projected via the final layer out into the four desired outputs. A more efficient approach, at first glance, may seem to simply define the number of output classes in the base EfficientNetB4 model to be 4. We will see soon why this design instead is required for this sort of self-supervised learning operation.

Additionally, we use input_tensor=inp rather than the input_shape=(a,b,c) and base_model(params)(inp) syntax discussed prior. If we were to use the latter method, the variable base would store a tensor (i.e., the output of the model) rather than the model itself. We need to be able to access the EfficientNet base model to transfer it into a new architecture after the self-supervised process has been completed.
inp = L.Input((32,32,3))
base = EfficientNetB4(input_tensor=inp,
                      weights=None,
                      classes=100)
out = L.Dense(4, activation='softmax')(base.output)
model = Model(inputs=inp,
              outputs=out)
Listing 2-21

Constructing model architecture

The model can be compiled and fitted on the altered task dataset (Listing 2-22). In this case, because the dataset is so small (largely because of the small image resolution rather than the number of items in the dataset), passing numpy arrays suffices and no TensorFlow dataset conversions are needed.
model.fit(ss_x_train, ss_y_train)
Listing 2-22

Fitting the model on the altered task dataset

After fitting on the altered task dataset, though, we run into a problem: the current model outputs only four values, since the altered task dataset had four unique labels, but the task dataset (i.e., the formal CIFAR-10) dataset has ten classes. This requires an architectural modification.

Herein lies the reason behind defining a specialized output layer for the altered task dataset and not specifying classes=4 in the base EfficientNetB4 model when training on the altered task dataset: we can extract the base model and attach it to a new output layer with ten classes, as follows (Listing 2-23). Moreover, because we defined base to be a model rather than an output tensor, we can easily connect it to the new input via base(inp).
inp = L.Input((32,32,3))
base_out = base(inp)
out = L.Dense(10, activation='softmax')(base_out)
model = Model(inputs=inp,
              outputs=out)
Listing 2-23

Constructing model architecture

If we had instead trained a base model with a four-class output and attached another output layer with ten classes afterward, which is another technically available option (in that the code runs), performance would likely be poor due to the bottleneck imposed by the four-class output. In this case, think of the EfficientNet model not as outputting 100 classes but as outputting 100 features that are compiled and used to make decisions in the output layer. The base model still contains all the weights learned from self-supervised learning. The new model can then be compiled and trained on the task dataset.

Gidaris et al. report that this self-supervised learning approach drastically improved the state-of-the-art results on unsupervised feature learning for a variety of benchmarks, including ImageNet, PASCAL, and CIFAR-10.

The work of Gidaris et al. points to the existence of simple yet effective approaches to apply self-supervised learning to boost deep learning model performance.

Self-Supervised Learning Case Study: Learning Image Context and Designing Nontrivial Pretraining Tasks

Earlier, we discussed relatively simple self-supervised learning examples, like adding noise to the task dataset and pretraining the model to predict the degree of noise. Other examples were simple conceptually (although perhaps difficult to implement or accomplish), like pretraining the model to colorize grayscale images. While simpler self-supervised strategies like predicting the degree of rotation are sometimes effective, on other occasions a more complex pretraining task needs to be used for a bigger performance boost.

Carl Doersch, Abhinav Gupta, and Alexei A. Efros propose a complex, creative, and effective self-supervised learning pretraining task in their 2016 paper, “Unsupervised Visual Representation Learning by Context Prediction.”8 Their design is a great case study of the implications and possibilities that need to be considered to design a successful self-supervised pretraining strategy.

In the context prediction pretraining task, the model must predict the relative location, or the context, of one patch to another (Figure 2-31). Patches are small, square-shaped regions of the image. The model is fed a base patch (outlined in blue) and a neighboring patch selected from the eight regions around the base patch (outlined in dotted red) and trained to predict the location of the neighboring patch in relation to the base patch. In the example, the label would be “top left.”
Figure 2-31

Example context and patch relationships

Each general direction that could describe the location of a neighboring patch relative to the base patch is associated with a number. Thus, the model takes in two images (the base patch and the neighboring patch) and outputs a vector indicating which region relative to the base patch the neighboring patch is most likely to be located (Figure 2-32).
Figure 2-32

Assigning numbers to neighbor patch directions

To perform this pretraining task, Doersch et al. constructed a multi-input network that takes in two inputs, processes the images through a series of image-based layers in separate “branches,” merges the two branches, processes the merged results, and outputs the location of one input relative to the other (Figure 2-33). Dotted lines in Figure 2-33 indicate shared weights, which means that the weights for the two layers are the same. We’ll discuss multi-input networks, branching, and weight sharing in later chapters.
Figure 2-33

Two-input architecture used to perform the pretraining task

This pretraining task, at least in theory, requires a deep understanding of the core features of certain items and the high-level relationships between attributes of objects. This makes it difficult, even for humans. Try answering these two visual questions (Figure 2-34).
Figure 2-34

Example “questions” with two patch pairs of inputs

However, Doersch et al. found that the neural network was learning “trivial solutions” – that is, it was finding shortcuts and easy approaches to the pretraining task that were not related to the concepts Doersch et al. had hoped it would learn.

For instance, low-level cues like boundary patterns or the continuation of textures across patches serve as trivial shortcuts that the neural network can exploit without learning the high-level relationships it would ideally learn. To address this, Doersch et al. added both a gap between patches about half the length of the patch width and a random “jitter” for each patch location by up to 7 pixels. (These are shown visually in Figure 2-32.) With these modifications, the network could not reliably exploit trivial shortcuts like the continuation of textures and lines.
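The gap and jitter can be made concrete with a small sampling routine. The sketch below picks one of the eight neighbor positions and computes the neighbor’s top-left corner; the patch size of 96 pixels, the row-major numbering of the eight directions, and the choice to jitter only the neighbor patch are assumptions of this sketch rather than details taken from the text.

```python
import random

# Assumed numbering: 0-7 in row-major order around the base patch, skipping the center
DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1),
              (0, -1),           (0, 1),
              (1, -1),  (1, 0),  (1, 1)]

def sample_neighbor(base_top, base_left, patch=96, gap=48, jitter=7, rng=random):
    """Pick one of eight neighbors; return (top, left, label) for the neighbor patch."""
    label = rng.randrange(8)
    row_step, col_step = DIRECTIONS[label]
    stride = patch + gap   # the gap hides textures that continue across patch edges
    # the jitter shifts the patch by up to `jitter` pixels in each direction
    top = base_top + row_step * stride + rng.randint(-jitter, jitter)
    left = base_left + col_step * stride + rng.randint(-jitter, jitter)
    return top, left, label

rng = random.Random(0)
top, left, label = sample_neighbor(300, 300, rng=rng)
print(top, left, label)
```

Because the offset between patches is no longer a fixed grid step, the network cannot solve the task by matching lines or textures that run off the edge of one patch and onto the next.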

However, another unexpected problem arose: a phenomenon called chromatic aberration, in which light rays passing through a lens focus at different points depending on their wavelength. Thus, in camera lenses that are affected by chromatic aberration, one color – usually green – is “shrunk” toward the center of the image relative to the other color channels (red and blue). The network can thus parse the absolute location of patches by detecting the shape and position of the separation between green and magenta (the color comprising the non-green color channels, red and blue). Finding the relative position given knowledge about the absolute location of two patches thus becomes a trivial task.

To address this, Doersch et al. randomly dropped two of the three color channels from each patch and replaced the dropped channels with Gaussian noise. This method prevents the neural network from exploiting information in the color channels and instead forces it to focus on the general shapes and edges – the higher-level features – in its pretraining task.
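A minimal numpy sketch of this channel-dropping step follows. The function name drop_color_channels and the noise standard deviation are assumptions chosen for illustration, and a random array stands in for an image patch.

```python
import numpy as np

def drop_color_channels(patch, rng, noise_std=0.01):
    """Keep one random color channel; replace the other two with Gaussian noise."""
    keep = int(rng.integers(0, 3))
    # start from pure noise, then copy the surviving channel back in
    out = rng.normal(0.0, noise_std, size=patch.shape).astype(patch.dtype)
    out[..., keep] = patch[..., keep]
    return out, keep

rng = np.random.default_rng(0)
patch = rng.uniform(0, 1, size=(96, 96, 3)).astype(np.float32)
dropped, keep = drop_color_channels(patch, rng)
print(dropped.shape, keep)
```

Since only one channel survives in each patch, the network can no longer compare how the color channels are displaced relative to one another.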

By making modifications to the architecture, like only using one “branch” and adding more convolutional layers, Doersch et al. converted the two-image-to-vector architecture into an image-to-vector architecture. The authors tested this pretrained architecture on the PASCAL Visual Object Classes (VOC) Challenge and found that this self-supervised learning strategy achieved the best result among approaches that did not use labels from outside the dataset (to their knowledge, and at the time of paper release). The authors also found that this context prediction task yielded a significant boost to model performance on other datasets.

This paper is an example both of a well-thought-out self-supervised pretraining task and of the process of modifying that task to discourage trivial solutions, so that it serves the ultimate objective. Self-supervised learning offers us a clever way to find, extract, and utilize rich insights and information without needing to search for costly labels or wander outside of the dataset.

Key Points

In this chapter, we discussed the intuition and implementation behind pretraining, transfer learning, and self-supervised learning:
  • Pretraining is a task that is performed before the formal training of a model and can orient the model toward performing better on its ultimate task.

  • Because pretraining moves the model into a more convenient location in the loss landscape, it can arrive at a better solution more quickly than a model without pretraining. Thus, pretraining generally has two key advantages that boost model performance: decreased training time and better metric scores.

  • Transfer learning uses two separate datasets – a general dataset and a task dataset. The model is first trained on the general dataset, and then the weights and “skills” learned from it are transferred to another model that is trained on the task dataset. Pretrained models are often kept in publicly available repositories and can be accessed and used in your own model.
    • Image-based pretrained models generally consist of two components: a feature extraction component and a fully connected (FC) component (the “top”). You can (a) keep only the feature extraction component and build a custom FC component, (b) keep both the feature extraction component and the FC component and add nothing else, or (c) keep both the feature extraction component and the FC component, but add more layers afterward.

    • Layer freezing allows you to selectively train certain parts of a model using transfer learning to better take advantage of pretrained weights. Usually, weights from the pretrained model are frozen and the custom added layers are trained. After this step, sometimes the pretrained model is unfrozen and the entire network is trained to fine-tune on the data.

    • Keras/TensorFlow has a built-in repository of pretrained models organized into modules, in which each module contains the pretrained model(s) and associated functions (usually for encoding and decoding). Important pretrained models include the ResNet family, the InceptionV3 model, the MobileNet family, and the EfficientNet family. Keras/TensorFlow models are pretrained on the ImageNet dataset. You can also convert PyTorch models into Keras/TensorFlow models to take advantage of PyTorch’s large repository of pretrained models.

  • Self-supervised learning pretrains a model on an altered task dataset derived from the input data of the task dataset. The x and y of the altered task dataset come entirely from the x of the task dataset, often by performing distortions or other actions to generate labels. Self-supervised learning helps orient the model to representing and understanding the “world” of the dataset so it is prepared to understand the dynamics and core features of the task dataset.
    • Self-supervised learning is especially valuable for training on small datasets with few labels, or when there is an abundance of unlabeled data that otherwise would not have been used.

  • There are two key differences between transfer learning and self-supervised learning. Firstly, the pretraining dataset and the task dataset generally come from completely different sources in transfer learning, whereas in self-supervised learning, the pretraining dataset is derived from the input data of the task dataset. Secondly, transfer learning aims to develop “skills,” while self-supervised learning encourages “representations of the world.”

Although we implemented relatively simple models and methods in this chapter, with these conceptual tools and the design intuition of what transfer learning and self-supervised learning are, you will be able to build many more complex training structures, like stacking pretrained models (“double transfer learning”) or designing innovative self-supervised pretraining methods.

In the next chapter, we’ll expand upon our study of advanced neural network structures and manipulations with autoencoders, a versatile tool often used in developing successful training structures.

