Chapter 8

Classifying Recurrent Dynamics on Emotional Speech Signals

Sudhangshu Sarkar, Department of Electrical Engineering, Narula Institute of Technology, Kolkata, India
Anilesh Dey, Department of Electronics and Communication Engineering, Narula Institute of Technology, Kolkata, India

Abstract

The current study was an attempt to understand the dynamics of emotional speech signals. The nonlinear dynamics of speech signals carrying two different emotions were investigated using recurrence analysis. First, two types of speech signals were collected while the subject spoke eight sentences in the Bengali language in angry and normal emotion. The recurrence plot and the phase space plot were constructed for each signal, and recurrence-based features, namely %isometry and %consecutive isometry, were extracted. A marked distinction between the normal and the angry emotion-based signals was visually observed in the recurrence and phase space plots. The variation was further confirmed by the values of the recurrence parameters: the normal emotion exhibited higher %isometry and %consecutive isometry than the angry emotion. The robustness of the recurrence-based method for classifying emotional states was established by the consistency of the results provided by the recurrence parameters in both noise-free and noisy environments.

Keywords

Nonlinear dynamics; Recurrence plot; Isometry; Consecutive isometry; Speech signal

8.1 Introduction

Speech signal processing is a vast field of study that contributes greatly to human-computer interaction (HCI). Speech is simply the best carrier of information in human communication. Apart from the verbal message, it carries a great deal of additional information, such as the speaker's identity, age, sex, locality, and emotion. Recognition of the emotion contained in a speech signal is one of the fastest growing areas of interest in HCI research. Emotion in speech plays an important role in expressing feelings: humans speak differently under different emotions, and the characteristics of speech change accordingly. Humans need no practice to recognize the emotional state of a speaker; it comes naturally [1]. It is, however, a complex process to implement in a machine.

In this regard, a review of past studies on speech-based emotion recognition systems was made. Researchers have proposed many techniques for emotion-based speech recognition [2, 3]. Schuller et al. [4] reported the use of continuous hidden Markov models (HMM) for speech-emotion recognition. The same group extended their work in 2004 by combining acoustic features with linguistic information for more robust emotion detection using a support vector machine (SVM) [5]. Lin and Wei [6] used the aforementioned two methods, namely HMM and SVM, for classification of five different emotional states, that is, anger, happiness, sadness, surprise, and a neutral state. Lalitha et al. [7] reported the use of time-domain speech features such as pitch and prosody for recognition of seven different emotional states. Kamal et al. [8] predicted protein structures from images using HMM and the Chapman-Kolmogorov equation. Dey et al. [9, 10] analyzed the propagation of acoustic waves in biomedical technology.

Identifying suitable features that characterize different emotions is an important step in developing a speech emotion recognition (SER) system [11]. Altrov and Pajupuu [12] studied the influence of language and culture on the understanding of vocal emotions. Effective conversation depends largely on how we communicate our own emotions, how we recognize those of others, and how appropriately we respond to them. Cowie et al. [13, 14] illustrated that emotions play an essential role in our lives, since they are typically present in everyday communication. Park and Sim [15] demonstrated emotion recognition using a DRNN and found that pitch was a significant component in the identification of emotion. Accordingly, acoustic features for accurate detection [16, 17] have been analyzed for emotional speech. The study of event-related potentials (ERPs) is recognized as a useful technique for exploring the cognitive mechanisms of processing emotional speech [18]. Tao et al. [19] attempted to convert neutral speech into emotional speech with "strong," "medium," and "weak" intensities using various models, such as a linear modification model (LMM), a Gaussian mixture model (GMM), and a classification and regression tree (CART) model. In 2010, Wu et al. [20] presented an approach to hierarchical prosody conversion for emotional speech synthesis. Jia et al. [21] proposed emotional audio-visual speech synthesis based on the PAD model, while Dey and Ashour [22, 23] discussed direction-of-arrival estimation and localization of multiple speech sources. Emotional text-to-speech systems [24] are also required in emotion-oriented speech applications. Neural networks [25–27] have shown remarkable success in capturing the emotional content of communication signals. Direction-of-arrival estimation of speech sources at a localized level has been thoroughly described by Dey and Ashour [28–30]. Least-squares regression [31] is one of the noted methods for speech emotion recognition. An ideal SER system would be one that can handle real-life, noisy speech and still recognize different emotional states. In this chapter, we attempt to classify the recurrent dynamics of two different emotions, namely angry and normal, with the help of recurrence plots, phase space plots, and recurrence-based parameters. The investigation was performed in both noise-free and noisy environments to establish the suitability of the proposed method.

8.2 Data Collection and Processing

A healthy male volunteer (age 23 years) was asked to participate in the study. He was informed about the details of the study, and written consent to participate was obtained. Two types of speech signals were acquired from the volunteer using a Behringer C-1U microphone while he uttered eight different sentences (in the Bengali language) in angry and normal emotion. The speech signals were recorded at a sampling frequency of 16 kHz using Audacity version 1.3.6 in the Electronics and Communication Engineering Department, Narula Institute of Technology, Agarpara, Kolkata.
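For illustration, a minimal Python sketch of loading one such recording for analysis is given below; the file name is hypothetical, and the amplitude normalization step is our own assumption rather than part of the recording protocol.

```python
# Minimal sketch: load one 16 kHz speech recording for analysis.
# The file name is hypothetical; amplitude normalization is an assumption.
import numpy as np
from scipy.io import wavfile

fs, x = wavfile.read("angry_sentence_01.wav")   # hypothetical file name
assert fs == 16000, "recordings in this study were sampled at 16 kHz"
x = x.astype(np.float64)
x /= np.max(np.abs(x))                          # normalize amplitude to [-1, 1]
```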

8.3 Research Methodology

A phase space approach was used to investigate the nonlinear properties of the speech signals. A phase space was reconstructed for each speech signal with an appropriate time delay and a proper embedding dimension.

8.3.1 Phase Space Reconstruction

The state of a dynamical system can be represented in a space called the phase space. A phase space is a multidimensional space in which every point corresponds to one state of the dynamical system [32]. The path traced by the phase space diagram of a system over time describes its evolution from an initial state; this path is known as the phase space trajectory.

The basic problem is that information about all the variables governing a system usually cannot be obtained from its time series. Most of the time, the series is single-valued; although numerous simultaneous measurements can be performed, they may not cover every degree of freedom of the system. However, the time-delay embedding theorem [33] allows the recovery of the lost information, making it possible to reconstruct the phase space of a system from its time series. This method of phase space reconstruction requires the determination of a suitable time delay τ and a proper embedding dimension m.
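For illustration, the delay-vector construction X(t) = [x(t), x(t + τ), …, x(t + (m − 1)τ)] implied by the embedding theorem can be sketched in a few lines of Python (a generic sketch, assuming τ and m have already been chosen; the function name is our own):

```python
import numpy as np

def delay_embed(x, m, tau):
    """Reconstruct an m-dimensional phase space from a scalar series x
    using delay vectors X_t = [x(t), x(t+tau), ..., x(t+(m-1)tau)]."""
    n = len(x) - (m - 1) * tau
    return np.column_stack([x[i * tau : i * tau + n] for i in range(m)])

# Example: embed a short test signal in 3 dimensions with delay 5.
x_test = np.sin(0.1 * np.arange(1000))
X_test = delay_embed(x_test, m=3, tau=5)   # shape (990, 3)
```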

To determine the most favorable value of the time delay τ for phase space reconstruction: for any time series $\{x(t)\}_{t=1}^{N}$ at a given state x(t), a proper time delay τ is one that reveals the most new information at x(t + τ). The auto mutual information (AMI) technique [34] is usually adopted to find the proper value of τ. The AMI of a time series for a given τ is calculated using Eq. (8.1) [35]. The optimal value of τ is the one for which AMI(τ) reaches its first minimum [34].

$$\mathrm{AMI}(\tau)=\sum_{t=1}^{N-\tau} P\left(x_{t}, x_{t+\tau}\right) \log \frac{P\left(x_{t}, x_{t+\tau}\right)}{P\left(x_{t}\right) P\left(x_{t+\tau}\right)} \tag{8.1}$$

where τ ∈ {1, 2, …, N − 1} and P[·] denotes probability.
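A minimal sketch of Eq. (8.1) in Python follows, estimating the probabilities with a two-dimensional histogram (17 bins, as used in Section 8.4) and scanning τ = 1, …, 60 for the first local minimum; the function name is our own, and x denotes the speech series loaded earlier.

```python
import numpy as np

def ami(x, tau, bins=17):
    """Auto mutual information between x(t) and x(t + tau), Eq. (8.1);
    probabilities are estimated from a 2-D histogram."""
    a, b = x[:-tau], x[tau:]
    p_ab, _, _ = np.histogram2d(a, b, bins=bins)
    p_ab /= p_ab.sum()
    p_a, p_b = p_ab.sum(axis=1), p_ab.sum(axis=0)
    mask = p_ab > 0
    return np.sum(p_ab[mask] * np.log(p_ab[mask] / np.outer(p_a, p_b)[mask]))

taus = np.arange(1, 61)
curve = np.array([ami(x, t) for t in taus])
# The first local minimum of AMI(tau) gives the optimal delay.
local_min = (curve[1:-1] < curve[:-2]) & (curve[1:-1] < curve[2:])
tau_opt = taus[1:-1][np.argmax(local_min)] if local_min.any() else taus[-1]
```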

The embedding dimension is a measure of the smallest dimension of the phase space in which the attractor of a dynamical system can be reconstructed [36, 37]. Kennel et al. [38] proposed the false nearest neighbor (FNN) method to determine the minimum sufficient embedding dimension m. The FNN algorithm can be described as follows.

For every point $R_i$ in the reconstructed series, its nearest neighbor $R_j$ is found in an m-dimensional space, and the distance $\|R_i - R_j\|$ is calculated. Both points are then iterated one step forward, and the ratio $R_i$ is computed as given in Eq. (8.2).

$$R_{i}=\frac{\left\|R_{i+1}-R_{j+1}\right\|}{\left\|R_{i}-R_{j}\right\|} \tag{8.2}$$

If the computed value of $R_i$ exceeds a specified heuristic threshold $R_t$, the point is regarded as having a false nearest neighbor. The minimal embedding dimension is obtained when the percentage of FNN at a given dimension falls to zero.
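The FNN test of Eq. (8.2) can be sketched as follows. This simplified version applies only the distance-ratio criterion with the common heuristic threshold R_t = 15 (Kennel et al. also use a second, attractor-size criterion, omitted here); it reuses delay_embed from the sketch above.

```python
import numpy as np
from scipy.spatial import cKDTree

def fnn_fraction(x, m, tau, r_t=15.0):
    """Fraction of false nearest neighbours at embedding dimension m,
    using the distance-ratio test of Eq. (8.2)."""
    X = delay_embed(x, m, tau)
    n = len(X) - tau                      # points extendable to dimension m+1
    tree = cKDTree(X[:n])
    dist, idx = tree.query(X[:n], k=2)    # k=2: the point itself + its neighbour
    d, j = dist[:, 1], idx[:, 1]
    # Distance gained along the (m+1)-th coordinate after extending.
    extra = np.abs(x[np.arange(n) + m * tau] - x[j + m * tau])
    valid = d > 0
    return np.mean(extra[valid] / d[valid] > r_t)

# Suitable m: the smallest dimension whose FNN fraction falls to ~0.
fractions = [fnn_fraction(x, m, tau_opt) for m in range(1, 11)]
```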

8.3.2 Recurrence Plot Analysis

The recurrence plot (RP), introduced by Eckmann et al. [39], is an efficient graphical method for revealing the hidden nonlinear structure of a phase space. For any two points $x_i$ and $x_j$ in a phase space, the distance between them is $\|x_i - x_j\|$, and the recurrence between the two points is given by

$$R_{ij}=\Theta\left(\varepsilon-\left\|x_{i}-x_{j}\right\|\right) \tag{8.3}$$

where Θ represents the Heaviside step function and ε is a threshold distance.

From the definition, it follows that the entries of the matrix $(R_{ij})_{N \times N}$ (N being the span of the phase space trajectory) are either 1 or 0. A "1" is represented by a black dot, and a "0" by a white dot, so an RP is a two-color visual representation of a phase space. From the structure of the RP, various dynamical patterns of a complex system can be identified, such as periodicity, quasiperiodicity, noise effects, nonstationary behavior, and chaos. This indicates that two different dynamics can be distinguished by RP analysis. Fig. 8.1(A) and (B) shows the recurrence plots of a speech signal in angry and normal emotion, respectively.

Fig. 8.1 (A) Recurrence plot of an angry emotional speech signal with proper time delay and embedding dimension. (B) Recurrence plot of a normal emotional speech signal with proper time delay and embedding dimension.
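A compact sketch of Eq. (8.3) in Python follows; the threshold ε, set here to a fraction of the signal's standard deviation, and the embedding dimension m = 3 are our own assumptions rather than the chapter's settings.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
import matplotlib.pyplot as plt

def recurrence_matrix(X, eps):
    """Binary recurrence matrix R_ij = Theta(eps - ||x_i - x_j||), Eq. (8.3)."""
    D = squareform(pdist(X))              # pairwise distances grow as N^2,
    return (D <= eps).astype(np.uint8)    # so use a short signal segment

eps = 0.1 * np.std(x)                     # assumed threshold choice
R = recurrence_matrix(delay_embed(x, m=3, tau=tau_opt), eps)
plt.imshow(R, cmap="binary", origin="lower")   # black dot = recurrence ("1")
plt.xlabel("i"); plt.ylabel("j")
plt.show()
```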

If two points $x_i$ and $x_j$ are recurrent, we say that there is an isometry. Two points $x_i$ and $x_j$ in a phase space are said to be in consecutive isometry if

$$R_{ij}=\Theta\left(\varepsilon-\left\|x_{i+L}-x_{j+L}\right\|\right) \tag{8.4}$$

where L ∈ ℤ⁺.

Since the periodicity or aperiodicity of a phase space is reflected in its isometry, the nature of the dynamics can be described by it. In fact, the complexity of the RP decreases as the consecutive isometry increases [40, 41].
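The consecutive-isometry condition of Eq. (8.4) amounts to checking whether a recurrent pair (i, j) remains recurrent after both indices advance by L steps, that is, whether the dark dots form diagonal lines in the RP. A sketch follows, assuming (our choice) that the percentage is normalized by the number of recurrent pairs:

```python
import numpy as np

def consecutive_isometry_pct(R, L=1):
    """Percentage of recurrent pairs (i, j) that are still recurrent at
    (i + L, j + L), cf. Eq. (8.4); the normalization is an assumption."""
    base, shifted = R[:-L, :-L], R[L:, L:]
    n_rec = base.sum()
    return 100.0 * np.logical_and(base, shifted).sum() / n_rec if n_rec else 0.0
```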

8.4 Numerical Experiments and Results

In this work, speech signals in angry and normal emotion were analyzed using RP analysis for two cases, that is, noise-free and noisy conditions. Two informative parameters, namely %isometry and %consecutive isometry, were calculated for each case, and the proper time delay and embedding dimension were recalculated for both. Fig. 8.2(A) shows the AMI as a function of time delay for a speech signal in the angry and normal emotions, represented by red and blue lines, respectively. To estimate the probabilities, 17 bins were considered.

Fig. 8.2 (A) Fluctuation of AMI with time delay τ ∈ [1, 60]. (B) Fluctuation of FNN with embedding dimension for a speech signal in the angry emotion. (C) A phase space diagram for speech signals in two different emotional conditions.

From the figure, it is observed that the AMI was minimum at τ = 60 for normal speech, whereas it was minimum at τ = 43 for the angry speech signal. This suggests that the optimal time delay differed for the same speech in two different emotions; in fact, it was higher for normal speech than for speech in angry emotion.

The percentage of FNN was calculated for embedding dimensions ranging from 1 to 10, with the time delay τ varying from 1 to 60. Fig. 8.2(B) shows the fluctuation of FNN with embedding dimension for a speech signal in the angry emotion. The dimension at which the percentage of FNN was minimum was taken as the suitable embedding dimension.

For the angry speech signal, the FNN was evaluated for dimensions ranging from 1 to 10 with τ = 43. The phase space diagram was then reconstructed for the speech signal in angry and normal emotion, with the proper time delay [42, 43] for each case, as given in Fig. 8.2(C).

From Fig. 8.2(C), it can be seen that the number of outliers increased in the angry emotion compared to the normal emotion, and that the normal emotion speech signal exhibited a denser orbit than the angry one. These results suggest that different emotional states exhibit different dynamics.

8.4.1 Noise-Free Environment

As stated above, isometry is the measure of recurrence in any phase space. It is calculated as

$$\text{Isometry}=\frac{1}{N^{2}} \sum_{i, j=1}^{N} R_{i, j}(\varepsilon) \tag{8.5}$$

where N is the number of speech samples and $R_{i,j}(\varepsilon)$ is the recurrence matrix corresponding to a threshold ε. Isometry is expressed as the number of isometric recurrences as a fraction of the total number of pairs of vectors compared in the sample (i.e., %isometry) [44]. The plot of %isometry, or any of its derived measures, as a function of embedding dimension is called an embedding plot. The embedding plot of %isometry in Fig. 8.3 compares the %isometry of two speech signals in angry and normal emotion in a noise-free environment.

Fig. 8.3 Embedding plot of %isometry with proper time delay for angry and normal emotional speech in a noise-free environment.
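Eq. (8.5) translates directly into code. A sketch of %isometry and of the embedding plot over m = 1, …, 10, reusing the helper functions from the sketches above:

```python
def isometry_pct(R):
    """%isometry: recurrences as a percentage of all N^2 vector pairs, Eq. (8.5)."""
    return 100.0 * R.sum() / R.size

# Embedding plot: %isometry as a function of embedding dimension m.
iso_curve = [isometry_pct(recurrence_matrix(delay_embed(x, m, tau_opt), eps))
             for m in range(1, 11)]
```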

Fig. 8.3 clearly shows that the %isometry of the normal emotional speech signal was higher than that of the angry speech signal, suggesting that the autocorrelation was higher for the normal speech signal. The same observation holds for %consecutive isometry, shown in Fig. 8.4.

Fig. 8.4 Embedding plot of %consecutive isometry with proper time delay for angry and normal emotional speech in a noise-free environment.
Fig. 8.5 Power spectral density estimation (spectrum) of one speech signal in an angry emotion in noise-free and noisy environments.

8.4.2 Noisy Environment

To investigate the effect of noise on the isometry, zero-mean Gaussian noise with density $\varphi(\xi)=e^{-\xi^{2}/2}$ (up to normalization) was added to the signal.

When the energy of a signal is concentrated in a restricted time interval (especially if its total energy is finite), one may calculate the energy spectral density; more commonly used, however, is the power spectral density. In the experiment, it was observed that the power of the signal in the power spectrum became higher after the addition of noise (Fig. 8.5).
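A sketch of the noise experiment follows: zero-mean Gaussian noise is added to the signal, and the power spectral density before and after is estimated with Welch's method (the noise amplitude of 0.05 is our own assumption; cf. Fig. 8.5).

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(0)
noisy = x + 0.05 * rng.standard_normal(len(x))   # assumed noise amplitude

# Welch power spectral density before and after adding noise (cf. Fig. 8.5).
f, psd_clean = signal.welch(x, fs=16000)
f, psd_noisy = signal.welch(noisy, fs=16000)
```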

Fig. 8.6(A) and (B) shows the recurrence plots of a speech signal in angry and normal emotion, respectively, after the addition of noise.

Fig. 8.6 (A) Recurrence plot of an angry emotional speech signal with a proper time delay and embedding dimension after the addition of noise. (B) Recurrence plot of a normal emotional speech signal with a proper time delay and embedding dimension after the addition of noise.

In the noisy condition, the recurrence plots were observed to be more complex, with many more dark dots, suggesting that the addition of noise to the signal increased the recurrence rate.

Both %isometry and %consecutive isometry were again calculated for the speech signals in angry and normal emotion after adding the noise. Figs. 8.7 and 8.8 show the embedding plots of %isometry and %consecutive isometry, respectively, for speech signals in angry and normal emotion after adding noise. It is evident from the figures that the speech signal in normal emotion exhibited higher %isometry and %consecutive isometry than the speech signal in angry emotion, just as in the noise-free condition. These results indicate the robustness of the two parameters even in the noisy condition.

Fig. 8.7 Embedding plot of %isometry with a proper time delay for angry and normal emotional speech after the addition of noise.
Fig. 8.8 Embedding plot of %consecutive isometry with a proper time delay for angry and normal emotional speech after the addition of noise.

8.5 Conclusion

Speech signals were acquired from a healthy male volunteer at a sampling rate of 16 kHz while the volunteer uttered eight sentences in the Bengali language in normal and angry emotion. Phase space analysis and recurrence analysis were performed on the speech signals to classify the normal and angry emotional conditions. For the reconstruction of the phase space diagram, the proper values of the time delay and the embedding dimension were estimated using the auto mutual information (AMI) and false nearest neighbor (FNN) methods, respectively. The proper time delay was found to be larger for normal emotion than for angry emotion. The phase space plot for normal emotion had a denser orbit and fewer outliers than that for angry emotion. Recurrence plots were drawn, and related parameters, namely %isometry and %consecutive isometry, were calculated for the speech signals acquired during normal and angry emotion in noise-free and noisy environments. In both environments, the %isometry and %consecutive isometry values were greater for normal emotion than for angry emotion. This suggests that these parameters are robust to environmental noise and can be used to develop an SER system for classifying different emotional states.

References

[1] Gogoi N.J., Kalita J. Emotion recognition from acted Assamese speech. Int. J. Innov. Res. Sci. Eng. Technol. 2015;4(6).

[2] Ververidis D., Kotropoulos C. Emotional speech recognition: resources, features, and methods. Speech Comm. 2006;48:1162–1181.

[3] El Ayadi M., et al. Survey on speech emotion recognition: features, classification schemes, and databases. Pattern Recogn. 2011;44:572–587.

[4] Schuller B., Rigoll G., Lang M. Hidden Markov model-based speech emotion recognition. In: 2003 International Conference on Multimedia and Expo. ICME'03. Proceedings (Cat. No.03TH8698), Baltimore, MD, USA; 2003:1–401.

[5] Schuller B., Rigoll G., Lang M. Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine-belief network architecture. In: 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, Montreal, Que; 2004:1–577.

[6] Lin Y.-L., Wei G. Speech emotion recognition based on HMM and SVM. In: 2005 International Conference on Machine Learning and Cybernetics, Guangzhou, China; 2005;vol. 8:4898–4901.

[7] Lalitha S., et al. Speech emotion recognition. In: 2014 International Conference on Advances in Electronics, Computers and Communications (ICAECC); 2014:1–4.

[8] Kamal M.S., Chowdhury L., Khan M.I., Ashour A.S., Tavares J.M.R., Dey N. Hidden Markov model and Chapman-Kolmogorov for protein structures prediction from images. Comput. Biol. Chem. 2017;68:231–244.

[9] Dey N., Ashour A.S., Mohamed W.S., Nguyen N.G. Acoustic sensors in biomedical applications. In: Acoustic Sensors for Biomedical Applications. Cham: Springer; 2019:43–47.

[10] Dey N., Ashour A.S., Mohamed W.S., Nguyen N.G. Acoustic wave technology. In: Acoustic Sensors for Biomedical Applications. Cham: Springer; 2019:21–31.

[11] Koolagudi S.G., et al. IITKGP-SESC: speech database for emotion analysis. In: International Conference on Contemporary Computing; 2009:485–492.

[12] Altrov R., Pajupuu H. The influence of language and culture on the understanding of vocal emotions. J. Estonian Finno-Ugric Linguistics. 2015;6(3).

[13] Cowie R., Sussman N., Ben-Ze'ev A. Emotions: concepts and definitions. In: Petta P., Pelachaud C., Cowie R., eds. Emotion-Oriented Systems: The HUMAINE Handbook. Berlin, Heidelberg: Springer; 2011:9–31.

[14] Douglas-Cowie E., Campbell N., Cowie R., Roach P. Emotional speech: towards a new generation of databases. Speech Comm. 2003;40(1):33–60.

[15] Park C.-H., Sim K.-B. Emotion recognition and acoustic analysis from speech signal. In: Proceedings of the International Joint Conference on Neural Networks, Portland, OR; 2003;vol. 4:2594–2598.

[16] Ntalampiras S., Fakotakis N. Modeling the temporal evolution of acoustic parameters for speech emotion recognition. IEEE Trans. Affect. Comput. 2012;3(1):116–125.

[17] Zão L., Cavalcante D., Coelho R. Time-frequency feature and AMS-GMM mask for acoustic emotion classification. IEEE Signal Process. Lett. 2014;21(5):620–624.

[18] Chang J., Zhang X., Zhang Q., Sun Y. Investigating duration effects of emotional speech stimuli in a tonal language by using event-related potentials. IEEE Access. 2018;6.

[19] Tao J., Kang Y., Li A. Prosody conversion from neutral speech to emotional speech. IEEE Trans. Audio Speech Lang. Process. 2006;14(4):1145–1154.

[20] Wu C., Hsia C., Lee C., Lin M. Hierarchical prosody conversion using regression-based clustering for emotional speech synthesis. IEEE Trans. Audio Speech Lang. Process. 2010;18(6):1394–1405.

[21] Jia J., Zhang S., Meng F., Wang Y., Cai L. Emotional audio-visual speech synthesis based on PAD. IEEE Trans. Audio Speech Lang. Process. 2011;19(3):570–582.

[22] Dey N., Ashour A.S. Direction of Arrival Estimation and Localization of Multi-Speech Sources. Springer International Publishing; 2018.

[23] Dey N., Ashour A.S. Applied examples and applications of localization and tracking problem of multiple speech sources. In: Direction of Arrival Estimation and Localization of Multi-Speech Sources. Cham: Springer; 2018:35–48.

[24] Navas E., Hernaez I., Luengo I. An objective and subjective study of the role of semantics and prosodic features in building corpora for emotional TTS. IEEE Trans. Audio Speech Lang. Process. 2006;14(4):1117–1127.

[25] Zhang S., Zhang S., Huang T., Gao W. Speech emotion recognition using deep convolutional neural network and discriminant temporal pyramid matching. IEEE Trans. Multimedia. 2018;20(6):1576–1590.

[26] Weninger F., Eyben F., Schuller B. On-line continuous-time music mood regression with deep recurrent neural networks. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, 2014; 2014:5412–5416.

[27] Avila A.R., Monteiro J., O'Shaughnessy D., Falk T.H. Speech emotion recognition on mobile devices based on modulation spectral feature pooling and deep neural networks. In: 2017 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT), Bilbao; 2017:360–365.

[28] Dey N., Ashour A.S. Challenges and future perspectives in speech-sources direction of arrival estimation and localization. In: Direction of Arrival Estimation and Localization of Multi-Speech Sources. Cham: Springer; 2018:49–52.

[29] Dey N., Ashour A.S. Sources localization and DOAE techniques of moving multiple sources. In: Direction of Arrival Estimation and Localization of Multi-Speech Sources. Cham: Springer; 2018:23–34.

[30] Dey N., Ashour A.S. Microphone array principles. In: Direction of Arrival Estimation and Localization of Multi-Speech Sources. Cham: Springer; 2018:5–22.

[31] Zong Y., Zheng W., Zhang T., Huang X. Cross-corpus speech emotion recognition based on domain-adaptive least-squares regression. IEEE Signal Process. Lett. 2016;23(5):585–589.

[32] Sivakumar B., et al. River flow forecasting: use of phase-space reconstruction and artificial neural networks approaches. J. Hydrol. 2002;265:225–245.

[33] Chandrasekaran S.P. A nonlinear dynamic modelling for speech recognition using recurrence plot—a dynamic Bayesian approach. In: IEEE International Conference on Signal Processing and Communications, ICSPC 2007; 2007:516–519.

[34] Fraser A.M., Swinney H.L. Independent coordinates for strange attractors from mutual information. Phys. Rev. A. 1986;33:1134.

[35] Dey A., et al. A new kind of dynamical pattern towards distinction of pre-meditative and meditative states through HRV. Science. 2012;3.

[36] Kennel M.B., Abarbanel H.D. False neighbors and false strands: a reliable minimum embedding dimension algorithm. Phys. Rev. E. 2002;66:026209.

[37] Dey A., Bhattacha D.K., Tibarewala D.N., Dey N., Ashour A.S., Le D.-N., Gospodinova E., Gospodinov M. Chinese-chi and kundalini yoga meditations effects on the autonomic nervous system: comparative study. Int. J. Interact. Multimedia Artif. Intell. 2016;3(7):87–95 9 p.

[38] Kennel M.B., et al. Determining embedding dimension for phase-space reconstruction using a geometrical construction. Phys. Rev. A. 1992;45:3403.

[39] Eckmann J.P., et al. Recurrence plots of dynamical systems. EPL (Europhys. Lett.). 1987;4:973.

[40] Marwan N., et al. Recurrence plots for the analysis of complex systems. Phys. Rep. 2007;438:237–329.

[41] Nayak S.K., Pande K., Patnaik P.K., Nayak S., Patel S.J., Anis A., Dey A., Pal K. Understanding the effect of cannabis abuse on the ANS and cardiac physiology of the Indian women paddy-field workers using RR interval and ECG signal analyses. In: IEEE International Conference APSIPA ASC 2017, Aloft Kuala Lumpur Sentral, Kuala Lumpur; 2017. doi:10.1109/APSIPA.2017.8282047.

[42] Dey A., Palit S.K., Mukherjee S., Bhattacharya D.K., Tibarewala D.N. A new technique for the classification of pre-meditative and meditative states. In: IEEE International Conference, “ICCIA-2011”; 2011 Print ISBN 978-1-4577-1915-8.

[43] Das M., Jana T., Dutta P., Banerjee R., Dey A., Bhattacharya D.K., Kanjilal M.R. Study the effect of music on HRV signal using 3D poincare plot in spherical co-ordinates—a signal processing approach. In: IEEE International Conference on Communication and Signal Processing, April 2–4, 2015, India; 2015 ISBN 978-1-4799-8080-2.

[44] Sabelli H., Lawandow A. Homeobios: the pattern of heartbeats in newborns, adults, and elderly patients. Nonlinear Dynamics Psychol. Life Sci. 2010;14:381.
