Conclusion to Part 2

Benjamin ALLAERT¹, Ioan Marius BILASCO² and Chaabane DJERABA²

¹ IMT Nord Europe, Lille, France

² University of Lille, France

C2.1. Summary

The facial expression analysis systems proposed in the literature achieve very good performance when both the environment (uniform background, homogeneous lighting) and the facial expressions (fixed pose, face turned toward the camera, high-intensity expressions) are controlled. However, such data do not reflect the conditions encountered in natural interaction situations (surveillance cameras, videoconferencing), in which a person is free to move.

In this context, the environment (lighting changes, indoor/outdoor scenes) and the interaction (pose variations, large displacements within the scene, face occlusions, variable expression intensity) are poorly constrained. All of these factors significantly decrease the performance of facial expression analysis systems and highlight several unresolved scientific issues.

Part 2 of this book has presented several contributions addressing different problems related to the analysis of facial expressions in image sequences acquired under natural interaction conditions. In particular, we have presented innovative solutions to the problems of facial motion intensity variations and pose variations.

C2.1.1. Variations in motion intensity

In natural interaction situations, the intensity of facial movements tends to vary from one person to another, since people do not display their expressions in the same way (low- or high-intensity expressions). Moreover, as Ekman suggests, facial expressions are composed of a set of movements of low (micro-expressions) and high (macro-expressions) intensity. Combining these two types of movement allows the nuances of expressions to be captured more finely.

The difficulty raised by variations in motion intensity lies in the fact that motion characteristics differ greatly between low- and high-intensity expressions. In most cases, this requires preprocessing techniques adapted to the motion being analyzed. As a general rule, preprocessing applied to faces better characterizes one specific type of movement at the expense of another. This implies that no single solution accounts for both small and large motion variations.

For this purpose, we have presented an innovative descriptor called LMP (Local Motion Patterns), based on the deformable physical characteristics of the face, which keeps only the main directions of facial motion. The motion is computed in different regions of the face. These regions are defined in relation to the FACS system and allow us to directly analyze the coherent movements induced by the facial muscles. Building on previous work on macro- and micro-expression analysis, this descriptor uses a dense motion approach (the Farnebäck algorithm) to extract facial motion. It then exploits motion coherence assumptions to improve the distinction between motion-related information and noise present in the data.
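To make the principle concrete, here is a minimal sketch of an LMP-style pipeline using OpenCV's Farnebäck implementation. The regular grid, the magnitude threshold and the coherence ratio are illustrative stand-ins for the FACS-based regions and the exact filtering rules of the descriptor; they are not the book's parameters.

```python
import cv2
import numpy as np

N_BINS = 8          # orientation histogram resolution (illustrative)
MIN_MAG = 0.5       # ignore near-static pixels (illustrative threshold)
COHERENCE = 0.4     # fraction of motion that must share the dominant direction

def lmp_like_features(prev_gray, next_gray, grid=(4, 4)):
    """Sketch of an LMP-style descriptor: dense Farneback flow,
    per-region orientation histograms, and a coherence filter that
    keeps only regions whose motion has a clear dominant direction."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag = np.linalg.norm(flow, axis=2)
    ang = np.arctan2(flow[..., 1], flow[..., 0])  # in [-pi, pi]

    h, w = mag.shape
    rh, rw = h // grid[0], w // grid[1]
    features = []
    for i in range(grid[0]):
        for j in range(grid[1]):
            m = mag[i*rh:(i+1)*rh, j*rw:(j+1)*rw].ravel()
            a = ang[i*rh:(i+1)*rh, j*rw:(j+1)*rw].ravel()
            moving = m > MIN_MAG
            hist, _ = np.histogram(a[moving], bins=N_BINS,
                                   range=(-np.pi, np.pi),
                                   weights=m[moving])
            total = hist.sum()
            # Coherence filter: a region dominated by one direction is
            # treated as muscle-induced motion; otherwise it is noise.
            if total > 0 and hist.max() / total >= COHERENCE:
                features.append(hist / total)
            else:
                features.append(np.zeros(N_BINS))
    return np.concatenate(features)
```

In this simplified form, the per-region histograms play the role of the "main directions" kept by LMP, and regions without a dominant direction are zeroed out as noise.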

We demonstrated the effectiveness of the LMP motion descriptor on several databases, for both macro- and micro-expression characterization. The results obtained on the CASME2 (70.20%) and SMIC-VIS (86.11%) databases show that the proposed descriptor outperforms recent state-of-the-art methods for micro-expression analysis. Without artificially augmenting the training data, as deep learning methods do, the descriptor obtains competitive results for macro-expression recognition (97.25% on CK+, 84.58% on Oulu-CASIA and 78.26% on MMI). These databases demonstrate the robustness of the descriptor in the presence of intensity variations and small pose variations, on data acquired by both visible and near-infrared sensors.

The method also manages to recognize facial expressions when only a fraction of the expression-related movement is considered. The recognition rates obtained by considering 100%, 75%, 50% and 25% of the expression activation sequences are 96.94%, 96.69%, 95.11% and 87.16%, respectively. These results highlight the method's ability to recognize a facial expression at different intensity levels, which makes it possible to recognize expressions earlier, or even to anticipate them.

To conclude on this problem, the LMP motion descriptor provides an innovative, movement-based solution that characterizes micro- and macro-expressions simultaneously, using the same analysis system (descriptor, facial model, configuration) for both intensity categories.

C2.1.2. Head pose variation

Current methods generally require the face to be static and frontal to the camera in order to analyze expressions. This ensures that the face is in optimal conditions for analysis: no pose variation (hence no occlusion) and no head movement (hence no degradation of the motion information). This explains why current systems do not perform well in natural interaction situations, in which a person is free to move.

From the literature, we have seen that methods based on motion information are more effective in a controlled context. However, this is not the case in natural interaction situations. In these conditions, pose variations and large displacements induce motion discontinuities that make the analysis more difficult. The noise induced by head movement corrupts the motion extracted within the face, making it difficult to isolate the movement induced by facial expressions alone.

Normalization solutions are used to correct the geometric deformation of a face in the scene, in order to bring it into an ideal analysis frame (frontal to the camera). Although these solutions reduce the distance between two faces extracted from successive images, normalization induces deformations that are harmful to the geometry of the face, resulting in the appearance of incoherent movements on the face.
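As an illustration, the following sketch shows the simplest family of such methods: an eye-based normalization that maps detected eye centers onto canonical positions through a similarity transform. Landmark detection is assumed to be done upstream (e.g. by dlib or MediaPipe), and the canonical eye positions are arbitrary choices, not values from the book.

```python
import cv2
import numpy as np

def normalize_by_eyes(img, left_eye, right_eye, out_size=128,
                      target_left=(0.35, 0.40), target_right=(0.65, 0.40)):
    """Eye-based normalization sketch: a similarity transform that maps
    the detected eye centers onto fixed canonical positions.
    left_eye / right_eye are (x, y) pixel coordinates."""
    lx, ly = left_eye
    rx, ry = right_eye
    # In-plane rotation and scale are read off the eye segment.
    angle = np.degrees(np.arctan2(ry - ly, rx - lx))
    dist = np.hypot(rx - lx, ry - ly)
    target_dist = (target_right[0] - target_left[0]) * out_size
    scale = target_dist / dist

    center = ((lx + rx) / 2.0, (ly + ry) / 2.0)
    M = cv2.getRotationMatrix2D(center, angle, scale)
    # Translate the eye midpoint to its canonical location.
    M[0, 2] += out_size * 0.5 - center[0]
    M[1, 2] += out_size * target_left[1] - center[1]
    return cv2.warpAffine(img, M, (out_size, out_size))
```

Note that such a similarity transform only compensates in-plane rotation, scale and translation; out-of-plane head rotations call for 2D shape or 3DMM-based methods, whose behavior is compared below.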

Currently, it is difficult to accurately quantify the robustness of normalization methods for facial expression analysis. Until now, these methods have been evaluated only as components of complete facial expression analysis systems, whose end-to-end results inform us only indirectly about the contribution of normalization to overall performance. This does not allow us to precisely identify the weaknesses of the normalization methods, making it difficult to effectively improve their robustness in the presence of facial expressions.

In Part 2 of this book, we have presented an innovative acquisition system called SNaP-2DFe (Simultaneous Natural and Posed 2D Facial expression) to improve the robustness of normalization methods for facial expression analysis. This system captures a face simultaneously in a fixed camera plane and in a moving plane that follows the head. The resulting database thus provides a ground-truth view of the face to be reconstructed, despite the occlusions induced by out-of-plane (3D) rotations of the head.

We have shown that the normalization methods used in current systems are not perfectly suited to facial expression analysis. Each of the analyzed methods (eye-based alignment, 2D shape, 3DMM) has its advantages and disadvantages depending on the type of head movement. Although none of these methods matches the performance obtained under controlled conditions, those based on 3D models appear to be the most robust.

The proposed system draws the attention of researchers in the field to the problem of pose variations in facial expression analysis, and emphasizes the importance of judiciously choosing the most suitable normalization method according to the observed head movements. Finally, we hope that this kind of system will, in the near future, help improve the robustness of normalization methods in the presence of facial expressions.

C2.2. Perspectives

As with most approaches in the literature, faces are mainly analyzed in a controlled context in order to quantify the robustness of the descriptor for characterizing facial expressions. Currently, the majority of descriptors do not allow reliable analysis of facial expressions in the presence of head movement. This is why, in the near future, research will tend to integrate a normalization method upstream of expression-related motion extraction, in order to abstract away head movements. However, we have observed that current normalization methods do not fully solve this problem, as they induce artifacts that add spurious movements. To address this, exploring new solutions for normalizing a face based on dense optical flow information seems promising, for example, a solution able to separate the motion originating from head movements from the motion originating from facial expressions. Unlike other normalization methods, we believe that normalizing and correcting motion directly in the optical flow domain will have a less negative impact than reconstructing the face geometry in the image. Recent work by Yang et al. (2017) tends to confirm this hypothesis.
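As a rough illustration of this perspective, the sketch below models head movement as a global affine motion robustly fitted to the dense flow, then subtracts it so that the residual flow approximates the expression-induced motion. The affine model, the RANSAC fit and the sampling step are our own illustrative assumptions, not the method of Yang et al. (2017).

```python
import cv2
import numpy as np

def expression_flow(flow, step=8):
    """Sketch of flow-domain normalization: fit a global affine motion
    to the dense flow (RANSAC) as a head-movement model, then subtract
    it so the residual approximates expression-induced motion."""
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h:step, 0:w:step]
    src = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float32)
    dst = src + flow[ys.ravel(), xs.ravel()]

    # Robust fit: expression pixels are a minority and end up as outliers,
    # so the affine model mostly captures the rigid head motion.
    M, _ = cv2.estimateAffine2D(src, dst, method=cv2.RANSAC,
                                ransacReprojThreshold=1.0)
    if M is None:
        return flow  # fit failed; fall back to the raw flow

    # Head-motion flow predicted by the affine model, at every pixel.
    yy, xx = np.mgrid[0:h, 0:w].astype(np.float32)
    head_u = M[0, 0] * xx + M[0, 1] * yy + M[0, 2] - xx
    head_v = M[1, 0] * xx + M[1, 1] * yy + M[1, 2] - yy
    return flow - np.dstack([head_u, head_v])
```

An affine model only approximates out-of-plane head rotations, which is precisely where richer motion-separation schemes would be needed.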

Several perspectives are also possible for improving the robustness of a motion-based descriptor such as LMP. Adding new spatio-temporal constraints to filter out motion discontinuities may be of interest. For example, it would be worthwhile to add a constraint checking that the motion remains locally coherent across several successive images, as LBP-TOP does; a sketch of such a filter is given below. With the rise of neural approaches, new architectures (3D, recurrent or graph-based) or new cost functions can also be studied. We can imagine designing a temporal architecture in which the evolution of the synaptic weights between neurons is directly related to the deformable constraints of the face (skin elasticity). Several modalities can also be added to the architecture, so as to take into account the geometry and texture of the face and improve the final decision at the output of the network.
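A minimal sketch of such a temporal coherence constraint follows: a pixel's motion is kept only if its direction remains stable across a window of consecutive flow fields. The angular tolerance and magnitude threshold are illustrative values.

```python
import numpy as np

def temporally_coherent(flows, max_angle=np.pi / 4, min_mag=0.3):
    """Temporal coherence filter sketch: keep a pixel's motion only if
    its direction stays stable across the whole window of consecutive
    dense flow fields (each of shape (H, W, 2))."""
    mags = [np.linalg.norm(f, axis=2) for f in flows]
    angs = [np.arctan2(f[..., 1], f[..., 0]) for f in flows]

    keep = np.ones(mags[0].shape, dtype=bool)
    for m in mags:                      # every frame must show real motion
        keep &= m > min_mag
    ref = angs[0]
    for a in angs[1:]:                  # direction must not drift too far
        diff = np.abs(np.angle(np.exp(1j * (a - ref))))  # wrap to [-pi, pi]
        keep &= diff < max_angle

    filtered = flows[-1].copy()
    filtered[~keep] = 0.0               # discard incoherent motion as noise
    return filtered
```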

Finally, the application of motion descriptors can be envisaged in contexts other than face analysis. In particular, we believe that the fine-grained motion analysis work highlighted in Part 2 of this book can be applied to action analysis (behavior, sport) or gesture analysis (human–machine interaction), domains that are essentially based on motion and where strong constraints (muscular, skeletal) exist. This type of descriptor can also be used to support decision-making by healthcare staff, for example by extracting medical parameters such as heart rate or oxygen saturation from perceived facial information (Rahman et al. 2019).
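As a hint of what such an extraction can look like, here is a deliberately minimal rPPG-style sketch that estimates heart rate from the mean green value of a face region over time. Real systems such as that of Rahman et al. (2019) add illumination, motion and vibration compensation that are omitted here; the 0.7-3.0 Hz band (42-180 BPM) is a conventional but illustrative choice.

```python
import numpy as np

def heart_rate_bpm(green_means, fps, lo=0.7, hi=3.0):
    """Minimal rPPG-style sketch: estimate heart rate from the mean
    green channel of a face ROI over time.

    green_means: 1D array, mean green value of the face ROI per frame."""
    x = np.asarray(green_means, dtype=np.float64)
    x = x - x.mean()                      # remove the DC component
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fps)

    band = (freqs >= lo) & (freqs <= hi)  # plausible heart-rate band
    if not band.any():
        raise ValueError("sequence too short for the requested band")
    peak = freqs[band][np.argmax(spectrum[band])]
    return peak * 60.0                    # Hz -> beats per minute
```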

C2.3. References

  1. Rahman, H., Ahmed, M.U., Begum, S. (2019). Non-contact physiological parameters extraction using facial video considering illumination, motion, movement and vibration. IEEE Transactions on Biomedical Engineering, 67(1), 88–98.
  2. Yang, S., An, L., Lei, Y., Li, M., Thakoor, N., Bhanu, B., Liu, Y. (2017). A dense flow-based framework for real-time object registration under compound motion. Pattern Recognition, 63, 279–290.