Separable convolution

If we consider an image X ∈ ℜ^(w × h) (single channel) and a kernel k ∈ ℜ^(n × m), the number of operations is nmwh. When the kernel is not very small and the image is large, the cost of this computation can be quite high, even with GPU support. An improvement can be achieved by taking into account the associative property of convolutions. In particular, if the original kernel can be expressed as the matrix product of two vectorial kernels, k^(1) with dimensions (n × 1) and k^(2) with dimensions (1 × m), the convolution is said to be separable. This means that we can perform the (n × m) convolution with two subsequent operations:

$$
X \ast k = X \ast \left(k^{(1)} \, k^{(2)}\right) = \left(X \ast k^{(1)}\right) \ast k^{(2)}
$$

The advantage is clear, because now the number of operations is (n + m)wh. In particular, when nm ≫ n + m, it's possible to avoid a large number of multiplications and speed up both the training and the prediction processes.
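As an illustration, the following is a minimal NumPy/SciPy sketch (the 128 × 128 image and the 5 × 3 kernel are arbitrary choices) verifying that the two sequential vectorial convolutions produce the same result as the full one:

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 128))    # single-channel image
k1 = rng.normal(size=(5, 1))       # vertical kernel k^(1), shape (n, 1)
k2 = rng.normal(size=(1, 3))       # horizontal kernel k^(2), shape (1, m)
k = k1 @ k2                        # full (n, m) kernel built from the two factors

# Direct 2D convolution: nmwh multiplications
y_full = convolve2d(X, k, mode='valid')

# Two sequential convolutions: (n + m)wh multiplications
y_sep = convolve2d(convolve2d(X, k1, mode='valid'), k2, mode='valid')

print(np.allclose(y_full, y_sep))  # True (up to floating-point error)
```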

A slightly different approach has been proposed in Xception: Deep Learning with Depthwise Separable Convolutions, Chollet F., arXiv:1610.02357 [cs.CV]. In this case, which is properly called depthwise separable convolution, the process is split into two steps. The first one operates along the channel axis, applying a separate spatial kernel to each channel, so that every input channel is transformed into a single-channel map (for example, if the original image is 768 × 1024 × 3, each of the three channels yields a 768 × 1024 × 1 output in the first stage). Then, a standard 1 × 1 convolution is applied across the resulting maps to mix the channels. In the majority of implementations, the default number of output maps per input channel in the depthwise convolution is 1 (this is conventionally expressed by saying that the depth multiplier is 1). This approach allows a dramatic parameter reduction with respect to a standard convolution. In fact, if a generic input feature map is X ∈ ℜ^(w × h × p) and we want to perform a standard convolution with q kernels k^(i) ∈ ℜ^(n × m), we need to learn nmqp parameters (each kernel k^(i) is applied to all input channels). Employing a depthwise separable convolution, the first step (working on each channel separately) requires nmp parameters. As the output still has p feature maps and we need q output channels, the process employs a trick: each feature map is processed with q 1 × 1 kernels (in this way, the output will have q layers and the same spatial dimensions). The number of parameters required for the second step is pq, so the total number of parameters becomes nmp + pq. Comparing this value with the one required for a standard convolution, we obtain an interesting result:

$$
\frac{nmp + pq}{nmqp} = \frac{1}{q} + \frac{1}{nm} \ll 1 \quad \text{when } q \gg 1 \text{ and } nm \gg 1
$$

As this condition is easily satisfied whenever q and nm are not too small, this approach is extremely effective in optimizing the training and prediction processes, as well as the memory consumption, in almost any scenario. It's not surprising that the Xception model was immediately implemented on mobile devices, allowing real-time image classification with very limited resources. Of course, depthwise separable convolutions don't always achieve the same accuracy as standard ones, because they are based on the assumption that the geometrical features observable inside a channel of a composite feature map are independent of one another. This is not always true, because we know that the effect of multiple channels is also based on their combinations (which increases the expressivity of a network). However, in many cases the final result has an accuracy comparable to some state-of-the-art models; therefore, this technique can very often be considered as a valid alternative to a standard convolution.
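To get a feeling for this ratio, consider a minimal sketch (assuming Keras with a TensorFlow backend, here through tf.keras; the choices n = m = 3, p = 3, and q = 32 are arbitrary, and biases are disabled to make the counts exact):

```python
import tensorflow as tf

inp = tf.keras.Input(shape=(768, 1024, 3))          # p = 3 input channels

# Standard convolution: nmqp = 3*3*32*3 = 864 parameters
standard = tf.keras.layers.Conv2D(32, (3, 3), use_bias=False)(inp)

# Depthwise separable convolution: nmp + pq = 3*3*3 + 3*32 = 123 parameters
separable = tf.keras.layers.SeparableConv2D(32, (3, 3), use_bias=False)(inp)

print(tf.keras.Model(inp, standard).count_params())   # 864
print(tf.keras.Model(inp, separable).count_params())  # 123
```

The measured ratio 123/864 ≈ 0.14 matches 1/q + 1/(nm) = 1/32 + 1/9 ≈ 0.14.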

Since version 2.1.5, Keras has introduced a layer called DepthwiseConv2D that implements the depthwise step of this operation (a spatial convolution performed independently on each input channel), complementing the existing SeparableConv2D layer, which performs the full depthwise separable convolution (the depthwise step followed by the pointwise 1 × 1 convolution).
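A minimal usage sketch (shown with tf.keras, which exposes the same layers; the kernel size and channel counts are arbitrary) highlights the difference between the two layers:

```python
import tensorflow as tf

# Depthwise step only: one 3 x 3 kernel per input channel (depth_multiplier=1),
# so the number of output channels equals the number of input channels
depthwise = tf.keras.layers.DepthwiseConv2D((3, 3), depth_multiplier=1)

# Full depthwise separable convolution: depthwise step followed by a
# pointwise 1 x 1 convolution mapping the channels to 32 output channels
separable = tf.keras.layers.SeparableConv2D(32, (3, 3))

x = tf.random.normal((1, 64, 64, 3))  # a dummy batch with p = 3 channels
print(depthwise(x).shape)   # (1, 62, 62, 3)  - channel count preserved
print(separable(x).shape)   # (1, 62, 62, 32) - channels mapped to q = 32
```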