Classifying videos with pre-trained nets in six different ways

Classifying videos is an area of active research because of the large amount of data needed to process this type of media. Memory requirements frequently reach the limits of modern GPUs, and distributed training across multiple machines may be required. Research is currently exploring several directions of increasing complexity; let's review them.

The first approach consists of classifying one video frame at a time, treating each of them as a separate image processed with a 2D CNN. This approach simply reduces the video classification problem to an image classification problem. Each video frame emits a classification output, and the video is classified by taking the most frequently predicted category across its frames.
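A minimal sketch of this approach in tf.keras, assuming the frames have already been decoded into a NumPy array; the choice of ResNet50 and the 224×224 frame size are our illustrative assumptions, not requirements:

```python
import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input

# A pre-trained 2D CNN used as a per-frame image classifier.
model = ResNet50(weights="imagenet")

def classify_video(frames):
    """frames: array of shape (num_frames, 224, 224, 3), RGB in 0-255."""
    x = preprocess_input(frames.astype("float32"))
    preds = model.predict(x)            # (num_frames, 1000) class scores
    per_frame = preds.argmax(axis=-1)   # top-1 class for each frame
    # The video label is the class predicted most often across frames.
    labels, counts = np.unique(per_frame, return_counts=True)
    return labels[counts.argmax()]
```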

The second approach consists of creating a single network in which a 2D CNN is combined with an RNN. The idea is that the CNN takes into account the image content of each frame, while the RNN takes into account the sequence information across frames. This type of network can be very difficult to train because of the very large number of parameters to optimize.
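A minimal sketch of such a combined network, with all layer sizes chosen arbitrarily for illustration: a small 2D CNN is applied to every frame through TimeDistributed, and an LSTM consumes the resulting feature sequence. Because the CNN and the LSTM are trained jointly, every parameter below is optimized end to end:

```python
from tensorflow.keras import layers, models

NUM_FRAMES, HEIGHT, WIDTH, NUM_CLASSES = 16, 112, 112, 10  # assumed sizes

# A small 2D CNN that maps one frame to a feature vector.
cnn = models.Sequential([
    layers.Conv2D(32, 3, activation="relu", input_shape=(HEIGHT, WIDTH, 3)),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.GlobalAveragePooling2D(),
])

# The CNN is applied to every frame; the LSTM models the frame sequence.
model = models.Sequential([
    layers.TimeDistributed(cnn, input_shape=(NUM_FRAMES, HEIGHT, WIDTH, 3)),
    layers.LSTM(128),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
```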

The third approach is to use a 3D ConvNet, an extension of 2D ConvNets that operates on a 3D tensor (time, image_width, image_height). This approach is another natural extension of image classification but, again, 3D ConvNets can be hard to train.
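A minimal sketch with assumed sizes: the only structural change with respect to an image classifier is that Conv2D/MaxPooling2D become Conv3D/MaxPooling3D and the input gains a time dimension, so the same filters learn motion and appearance jointly:

```python
from tensorflow.keras import layers, models

NUM_FRAMES, HEIGHT, WIDTH, NUM_CLASSES = 16, 112, 112, 10  # assumed sizes

model = models.Sequential([
    # Kernels slide jointly over (time, height, width).
    layers.Conv3D(32, (3, 3, 3), activation="relu", padding="same",
                  input_shape=(NUM_FRAMES, HEIGHT, WIDTH, 3)),
    layers.MaxPooling3D(pool_size=(1, 2, 2)),   # pool space only at first
    layers.Conv3D(64, (3, 3, 3), activation="relu", padding="same"),
    layers.MaxPooling3D(pool_size=(2, 2, 2)),   # then pool time as well
    layers.GlobalAveragePooling3D(),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
```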

The fourth approach is based on a clever intuition: instead of using CNNs directly for classification, they can be used to compute and store offline features for each frame in the video. The idea is that feature extraction can be made very efficient with transfer learning, as shown in a previous recipe. After all the features have been extracted, they can be passed as a sequence of inputs to an RNN, which learns the sequence across multiple frames and emits the final classification.
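A minimal sketch of the two decoupled stages, assuming InceptionV3 as the frozen extractor (our choice for illustration); only the small RNN at the end is actually trained:

```python
import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from tensorflow.keras import layers, models

# Stage 1: a frozen, pre-trained CNN turns each frame into a 2048-d vector.
extractor = InceptionV3(weights="imagenet", include_top=False, pooling="avg")

def extract_features(frames):
    """frames: (num_frames, 299, 299, 3) -> (num_frames, 2048), stored offline."""
    return extractor.predict(preprocess_input(frames.astype("float32")))

# Stage 2: only this small LSTM is trained, on the stored feature sequences.
NUM_FRAMES, NUM_CLASSES = 40, 10  # assumed sizes
rnn = models.Sequential([
    layers.LSTM(256, input_shape=(NUM_FRAMES, 2048)),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
rnn.compile(optimizer="adam", loss="categorical_crossentropy")
```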

The fifth approach is a simple variant of the fourth, in which the final stage is an MLP instead of an RNN. In certain situations, this approach can be simpler and less expensive in terms of computational requirements.
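A minimal sketch of such an MLP head, under the same assumed feature shapes as above: the frame features are averaged over time, so temporal ordering is deliberately ignored, which is exactly what keeps the model cheap:

```python
from tensorflow.keras import layers, models

NUM_FRAMES, FEATURE_DIM, NUM_CLASSES = 40, 2048, 10  # assumed sizes

mlp = models.Sequential([
    # Collapse the time axis by averaging the per-frame features.
    layers.GlobalAveragePooling1D(input_shape=(NUM_FRAMES, FEATURE_DIM)),
    layers.Dense(512, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
mlp.compile(optimizer="adam", loss="categorical_crossentropy")
```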

The sixth approach is a variant of the fourth, in which the feature-extraction phase is carried out with a 3D CNN that extracts spatial and temporal features. These features are then passed to either an RNN or an MLP.
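A minimal sketch of a 3D CNN used purely as a feature extractor, with assumed sizes: its pooled output is one fixed-size spatiotemporal descriptor per clip, which can then be classified across clips by an RNN, or individually by an MLP:

```python
from tensorflow.keras import layers, models

CLIP_LEN, HEIGHT, WIDTH = 16, 112, 112  # assumed clip geometry

extractor_3d = models.Sequential([
    layers.Conv3D(64, (3, 3, 3), activation="relu", padding="same",
                  input_shape=(CLIP_LEN, HEIGHT, WIDTH, 3)),
    layers.MaxPooling3D(pool_size=(1, 2, 2)),
    layers.Conv3D(128, (3, 3, 3), activation="relu", padding="same"),
    layers.GlobalAveragePooling3D(),   # one 128-d vector per clip
])
# extractor_3d.predict(clips) has shape (num_clips, 128)
```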

Deciding on the best approach is strictly dependent on your specific application, and there is no definitive answer. The first three approaches are generally more computationally expensive, while the last three are less expensive and frequently achieve better performance.

In this recipe, we show how to use the sixth approach by describing the results presented in the paper Temporal Activity Detection in Untrimmed Videos with Recurrent Neural Networks by Alberto Montes, Amaia Salvador, Santiago Pascual, and Xavier Giro-i-Nieto, 2016 (https://arxiv.org/abs/1608.08128). This work aims at solving the ActivityNet Challenge (http://activity-net.org/challenges/2016/), which focuses on recognizing high-level and goal-oriented activities from user-generated videos, similar to those found in internet portals. The challenge is tailored to 200 activity categories in two different tasks:

  • Untrimmed Classification Challenge: Given a long video, predict the labels of the activities present in the video
  • Detection Challenge: Given a long video, predict the labels and temporal extents of the activities present in the video

The architecture presented consists of two stages, as depicted in the following figure. The first stage encodes the video information into a single vector representation for each small video clip. To achieve this, the C3D network is used. The C3D network uses 3D convolutions to extract spatiotemporal features from the videos, which have previously been split into 16-frame clips.
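The clip splitting itself is simple bookkeeping; a minimal helper (ours, not from the paper's code) could look like this:

```python
import numpy as np

def split_into_clips(frames, clip_len=16):
    """Split a video of shape (num_frames, H, W, 3) into non-overlapping
    clips of shape (num_clips, clip_len, H, W, 3); trailing frames that
    do not fill a whole clip are dropped."""
    num_clips = frames.shape[0] // clip_len
    return frames[:num_clips * clip_len].reshape(
        (num_clips, clip_len) + frames.shape[1:])
```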

In the second stage, once the video features have been extracted, the activity in each clip is classified. This classification is performed with an RNN, more specifically an LSTM network, which tries to exploit long-term correlations and predict the video sequence. This is the stage that has been trained:

[Figure: the two-stage architecture. The first stage extracts C3D features from 16-frame clips; the second stage classifies the activity in each clip with an LSTM.]
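To make the trained stage concrete, here is a minimal sketch of an LSTM reading the sequence of C3D clip descriptors and emitting an activity prediction for every clip. The 4096-d feature size and the 200-activities-plus-background output are consistent with the paper's setup, but the layer sizes and the MAX_CLIPS sequence length are our assumptions:

```python
from tensorflow.keras import layers, models

MAX_CLIPS, C3D_DIM, NUM_CLASSES = 64, 4096, 201  # assumed shapes

model = models.Sequential([
    # One C3D descriptor per 16-frame clip; one prediction per clip.
    layers.LSTM(512, return_sequences=True, input_shape=(MAX_CLIPS, C3D_DIM)),
    layers.TimeDistributed(layers.Dense(NUM_CLASSES, activation="softmax")),
])
model.compile(optimizer="rmsprop", loss="categorical_crossentropy")
```

Because predictions are made per clip, a network of this shape can address both challenge tasks: aggregating the clip labels yields the video-level classification, while the sequence of clip labels localizes activities in time for detection.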