Tanmoy Roy1, Tshilidzi Marwala1, and Snehashish Chakraverty2
1Electrical and Electronic Engineering Science, University of Johannesburg, Johannesburg, 2006, South Africa
2Department of Mathematics, National Institute of Technology, Rourkela, Odisha, 769008, India
Speech recognition models are presently in a very advanced state; examples such as Siri, Cortana, and Alexa are technical marvels that can not only hear us comfortably but also reply with equal ease. However, speech recognition systems cannot detect the underlying emotion in speech signals. Speech emotion recognition (SER) is the field of study that explores avenues for detecting emotions concealed in speech signals. The study of emotion began in the psychology and acoustics of emotions. The first detailed study of emotions was reported as far back as 1872 by Charles Darwin [1]. Fairbanks and Pronovost [2] were among the first to study the pitch of the voice during simulated emotion. Since the late 1950s, there has been a significant increase in researchers' interest in the psychological and acoustic aspects of emotion [3–6]. In 1995, Picard [7] introduced the term “affective computing,” and since then, the study of emotional states has become an integral part of artificial intelligence (AI) research.
SER is a machine learning (ML) problem where speech utterances are classified according to their underlying emotions. This chapter gives an overview of the prominent classification techniques used in SER. Researchers have used different types of classifiers for SER, but in most situations, a proper justification is not provided for choosing a particular classification model [8]. Two plausible explanations are, first, that classifiers that are successful in automatic speech recognition (ASR), such as the hidden Markov model (HMM), are assumed to work well in SER, and second, that classifiers that perform well in most classification problems are chosen [9]. There are two broad categories of classifiers (Figure 3.1) used in SER: linear classifiers and non‐linear classifiers. Although many classification models have been used for SER, only a few have become effective and popular among researchers. Four classification models, namely HMM, Gaussian mixture model (GMM), support vector machine (SVM), and deep learning (DL), have been identified as prominent for this work, and they are discussed in Sections 3.4.1–3.4.4.
Researchers are trying to solve SER as an ML problem, and ML approaches are data‐driven. Moreover, the SER research field is not mature enough to identify the underlying emotion of a random spoken conversation. Thus, SER research depends heavily on emotional speech databases [10,11], which are created by researchers and organizations to support SER research. The naturalness of the database, the quality of the recordings, the number and type of emotions considered, and the speech collection strategy are critical inputs for the classification stage because these characteristics of the database decide the classification methodology [12–15]. The design of a speech database can vary along several factors [13,16]. First of all, the existing databases can be divided into three categories: (i) simulated by actors, (ii) seminatural, and (iii) natural. The simulated databases, created by actors enacting emotions, are usually well annotated, adequately labeled, and of better quality because the recordings are performed in a controlled, near noise‐free environment. The number of recordings is also usually high for simulated databases. However, acted emotions are not natural enough, and sometimes an expression of the same emotion varies a lot depending on the actor, which makes the feature selection process very difficult. Seminatural databases are also collections of enactments by professional or nonprofessional actors, but here, the actors try to keep the speech as natural as possible. Natural emotional databases are difficult to label because manually labeling a big set of speech recordings is a daunting task, and there is no method available yet to label the emotions automatically. As a result, both the number of emotions covered in a natural dataset and the number of data points are low. Natural recordings usually depict continuous emotions, which can create hurdles during the classification phase because of the presence of overlapping emotions.
Feature selection and feature extraction are vital steps toward building a successful classification model. In SER, various types of features are extracted from the speech signals. Speech signals carry an enormous amount of information apart from the intended message. Researchers agree that speech signals also carry vital information regarding the emotional state of the speaker [17]. However, researchers are still undecided on the right set of speech signal features to represent the underlying emotional state. This section details the feature sets that have been heavily used in SER research so far and have performed well in the classification stage. There are three prominent categories of speech features used in SER: (i) prosodic features, (ii) spectral or vocal tract features, and (iii) excitation source features.
Prosody features are characteristics of the speech sound generated by the human speech production system, for example, pitch or fundamental frequency ($F_0$) and energy. Researchers have used different derivatives of pitch and energy as various prosody features [18–20]. These are also called continuous features and can be grouped into the following categories [8,16,21,22]: (i) pitch‐related features, (ii) formant features, (iii) energy‐related features, (iv) timing features, and (v) articulation features. Several studies have tried to establish the relationship between prosodic speech features and the underlying patterns of different emotions [21–28].
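As an illustration (not taken from the cited studies), two basic prosodic measurements, short-time energy and an autocorrelation-based $F_0$ estimate, can be sketched as follows; all numeric values, including the pitch search range, are illustrative assumptions:

```python
import numpy as np

def frame_energy(frame):
    """Short-time energy of a speech frame."""
    return float(np.sum(frame.astype(float) ** 2))

def estimate_f0(frame, fs, fmin=50.0, fmax=400.0):
    """Estimate fundamental frequency from the autocorrelation peak
    within a plausible pitch-lag range."""
    frame = frame - np.mean(frame)
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)      # candidate pitch lags
    lag = lo + int(np.argmax(r[lo:hi]))
    return fs / lag

fs = 16000
t = np.arange(int(0.05 * fs)) / fs               # one 50 ms frame
frame = np.sin(2 * np.pi * 200.0 * t)            # synthetic 200 Hz "voiced" frame
print(round(estimate_f0(frame, fs), 1))          # close to 200 Hz
```

Derivatives of these two quantities (means, ranges, slopes across frames) are the kind of pitch- and energy-related features referred to above.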
The features used to represent glottal activity, mainly the vibration of the glottal folds, are known as the source or excitation source features. These are also called voice quality features because the glottal folds determine the characteristics of the voice. Some researchers believe that the emotional content of an utterance is strongly related to voice quality [21,29,30]. Voice quality measures for a speech signal include harshness, breathiness, and tenseness. The relation of voice quality features to different emotions is not a well‐explored area, and researchers have produced contradictory conclusions. For example, Scherer [29] associated anger with a tense voice, whereas Murray and Arnott [22] associated anger with a breathy voice.
Spectral features are the characteristics of the various sound components generated from different cavities of the vocal tract system. They are also called segmental or system features. Spectral features are extracted from the speech signal in various forms, such as linear prediction coefficients and the cepstral coefficients discussed next.
There is a particular type of spectral feature called cepstral features, which are extensively used by SER researchers. Cepstral features can be derived from the corresponding linear features; for example, linear predictor cepstral coefficients (LPCC) are derived from linear predictor (LP) coefficients. Mel‐frequency cepstral coefficients (MFCC) are one such cepstral feature that, along with their various derivatives, are widely used in SER research [35–39].
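To make the idea of a cepstrum concrete, a minimal sketch of the real cepstrum (inverse transform of the log magnitude spectrum) is shown below; this is the generic definition, not the mel-warped MFCC pipeline used in the cited works, and the frame length is an arbitrary choice:

```python
import numpy as np

def real_cepstrum(frame):
    """Real cepstrum: inverse FFT of the log magnitude spectrum."""
    spectrum = np.fft.fft(frame)
    log_mag = np.log(np.abs(spectrum) + 1e-10)   # small floor avoids log(0)
    return np.fft.ifft(log_mag).real

fs = 16000
t = np.arange(512) / fs
frame = np.sin(2 * np.pi * 200.0 * t) * np.hamming(512)  # windowed toy frame
ceps = real_cepstrum(frame)
print(ceps.shape)   # one cepstral coefficient per FFT bin
```

MFCCs follow the same log-spectrum-then-inverse-transform pattern, but apply a mel-scaled filter bank before the logarithm and keep only the first few coefficients.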
SER deals with speech signals. The analog (continuous) speech signal is sampled at a specified time interval to obtain the discrete time speech signal. A discrete time signal can be represented as follows:

$$s(n), \quad n = 1, 2, \ldots, N \tag{3.1}$$

where $N$ is the total number of sample points in the speech signal. First, only the speech utterance section is extracted from the speech sound by using a speech endpoint detection algorithm; in this case, an algorithm proposed by Roy et al. [40] is used.
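The endpoint detection step can be illustrated with a simple energy-threshold scheme; note this is a generic stand-in sketch, not the algorithm of Roy et al. [40], and the frame length and threshold are arbitrary assumptions:

```python
import numpy as np

def trim_silence(signal, frame_len=160, threshold=0.01):
    """Keep only the span of frames whose mean-square energy exceeds a
    threshold (a generic energy-based stand-in for endpoint detection)."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.mean(frames ** 2, axis=1)
    voiced = np.where(energy > threshold)[0]
    if voiced.size == 0:
        return signal[:0]
    start, end = voiced[0] * frame_len, (voiced[-1] + 1) * frame_len
    return signal[start:end]

fs = 16000
silence = np.zeros(fs // 2)                    # 0.5 s of silence
t = np.arange(fs) / fs
tone = 0.5 * np.sin(2 * np.pi * 200.0 * t)     # 1 s synthetic "utterance"
signal = np.concatenate([silence, tone, silence])
utterance = trim_silence(signal)
print(len(utterance) < len(signal))            # leading/trailing silence removed
```

Only the retained samples $s(n)$ are passed on to the feature extraction stage.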
This speech signal contains various information that can be retrieved for further processing. Emotional states guide human thoughts, and those thoughts are expressed in different forms [41], such as speech. The primary objective of an SER system is to find the patterns in speech signals that can describe the underlying emotions. The pattern recognition task is carried out by different ML algorithms. Features are extracted from the speech signal in two forms: local features, computed per frame of the signal, and global features, computed over the whole utterance.
Let there be $N$ sample points in the signal $s(n)$ after the feature extraction process. If local features are computed from $s(n)$ by assuming 10 splits, then there will be 10 data points from $s(n)$. Now, suppose there is a total of 100 recorded utterances, and each utterance is split into 10 frames; then there will be $100 \times 10 = 1000$ total data points available. When global features are computed, each utterance will produce one data point. The selection of local or global features depends on the feature extraction strategy. Now, suppose $M$ is the number of data points $x_i$ such that $i = 1, 2, \ldots, M$, where $i$ is the index. If $d$ features are extracted from $s(n)$, then each data point is a $d$-dimensional feature vector. Each utterance in the speech database is labeled properly so that it can be used for supervised classification. Therefore, the dataset is denoted as

$$\mathcal{D} = \{(x_1, y_1), (x_2, y_2), \ldots, (x_M, y_M)\} \tag{3.2}$$
Table 3.1 List of the literature on SER, grouped by the classification model used.
No. | Classifiers | References |
1. | Hidden Markov model | [42–51] |
2. | Gaussian mixture model | [12,15,39,52–59] |
3. | $k$‐Nearest neighbor | [57,60–64] |
4. | Support vector machine | [48,49,55,65–73] |
5. | Artificial neural network | [43,55,57,74–78] |
6. | Bayes classifier | [43,53,60,79] |
7. | Linear discriminant analysis | [16,64,80–82] |
8. | Deep neural network | [35–37,83–94] |
where $y_i$ is the label corresponding to the data point $x_i$, and $y_i \in \mathcal{Y}$, the set of emotion classes. Once the data is available, the next step is to find a predictive function $f$, called the predictor. More specifically, the task of finding a function $f: \mathcal{X} \to \mathcal{Y}$ is called learning, so that $f(x_i) \approx y_i$. Different classification models take different approaches to learning. For SER, the prediction task is usually considered a multiclass classification problem.
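The local-versus-global bookkeeping described above can be sketched numerically; the two statistics used as features here (mean energy and standard deviation) are arbitrary illustrative choices on random stand-in signals:

```python
import numpy as np

rng = np.random.default_rng(0)
utterances = [rng.standard_normal(8000) for _ in range(100)]  # 100 toy "utterances"

def local_features(sig, n_splits=10):
    """One feature vector per frame: [mean energy, std] of each split."""
    frames = np.array_split(sig, n_splits)
    return np.array([[np.mean(f ** 2), np.std(f)] for f in frames])

def global_features(sig):
    """One feature vector for the whole utterance."""
    return np.array([np.mean(sig ** 2), np.std(sig)])

X_local = np.vstack([local_features(u) for u in utterances])
X_global = np.vstack([global_features(u) for u in utterances])
print(X_local.shape)    # (1000, 2): 100 utterances x 10 frames
print(X_global.shape)   # (100, 2): one data point per utterance
```

With local features, each utterance's label is replicated across its 10 frame-level data points; with global features, there is one labeled point per utterance.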
Table 3.1 shows a list of classifiers commonly used in SER, along with literature references. Although eight classifiers are listed in Table 3.1, not all of them have become prominent for SER tasks. In Sections 3.4.1–3.4.4, the four most prominent classifiers for SER (HMM, GMM, SVM, and deep neural network (DNN)) are discussed to depict SER‐specific implementation techniques.
HMMs are suitable for sequence classification problems that consist of a process unfolding in time. That is why HMM is very successful in ASR systems, where the sequence of the spoken utterance is a time‐dependent process. The HMM parameters are tuned in the model training phase to best explain the training data for the known category. The model classifies an unseen pattern based on the highest posterior probability.
An HMM comprises two processes. The first process consists of a first‐order Markov chain whose states capture the temporal structure of the data, but these states are not observable, that is, they are hidden. The transition model, which is a stochastic model, drives the state transition process. Each hidden state has an observation associated with it. The observation model, again a stochastic model, gives the probability of the occurrence of the different observations in a given hidden state [95–97].
Figure 3.2 shows a generic HMM. Assume the length of the observation sequence to be $T$, so that $O = (o_1, o_2, \ldots, o_T)$ is an observation sequence, and let $N$ be the number of hidden states. The observation sequences are derived from $s(n)$ in Eq. 3.1 by computing features of the frames. The state transition probability matrix is denoted by $A$, whereas the observation probability matrix is denoted by $B$. Also, let $\pi$ be the initial state probability distribution for the hidden Markov chain.
In the training phase, the model parameters are determined. Here, the model is denoted by $\lambda$, which contains the three parameters $A$, $B$, and $\pi$; thus, $\lambda = (A, B, \pi)$. The parameters are usually determined using the expectation maximization (EM) algorithm [98] so that the probability $P(O \mid \lambda)$ of the observation sequence is maximum. After the model is determined, the probability of an unseen sequence $O'$, that is, $P(O' \mid \lambda)$, can be found to get the sequence classification results.
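Evaluating $P(O \mid \lambda)$ is done with the standard forward algorithm, sketched below for a discrete-observation HMM; the two-state, two-symbol model and all its probabilities are purely illustrative:

```python
import numpy as np

def forward_probability(A, B, pi, obs):
    """P(O | lambda) via the forward algorithm.
    A: N x N transition matrix, B: N x M observation matrix,
    pi: initial state distribution, obs: sequence of symbol indices."""
    alpha = pi * B[:, obs[0]]              # initialization
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]      # induction step
    return float(alpha.sum())              # termination

# A toy 2-state, 2-symbol model (all numbers illustrative)
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.6, 0.4])

p = forward_probability(A, B, pi, [0, 1, 0])
print(0.0 < p < 1.0)   # a valid sequence probability
```

For classification, one such model $\lambda_k$ is trained per emotion, and an unseen sequence is assigned to the class whose model gives the highest $P(O' \mid \lambda_k)$.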
SER researchers used HMM for a long time with various types of feature sets. For example, some researchers [42–45] used prosody features, and some [44–46] used spectral features. Researchers using HMM achieved average SER classification accuracies between 75.5% and 78.5% [45,47–51], which is comparable with other classification techniques, but the possibilities for further improvement are low. That is why, in later studies, HMM has been replaced by other classification techniques such as SVM, GMM, or DNN.
An unknown distribution can be described by a convex combination of base distributions, such as Gaussians, Bernoullis, or Gammas, using mixture models. GMM is the special case of mixture models where the base distribution is assumed to be Gaussian. GMM is a probabilistic density estimation process where a finite number of Gaussian distributions of the form $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ is combined, where $x$ is a $D$-dimensional vector, i.e. $x \in \mathbb{R}^D$, $\mu_k$ is the corresponding mean vector, and $\Sigma_k$ is the covariance matrix, such that [99,100]

$$p(x \mid \theta) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k) \tag{3.3}$$

where $\pi_k$ are the mixture weights, such that $0 \le \pi_k \le 1$ and $\sum_{k=1}^{K} \pi_k = 1$. In addition, $\theta = \{\pi_k, \mu_k, \Sigma_k : k = 1, \ldots, K\}$ denotes the collection of parameters of the model.
Now, consider the dataset $X = \{x_1, \ldots, x_M\}$, extracted similarly as in Eq. 3.2, and assume that the $x_i$ are drawn independent and identically distributed (i.i.d.) from an unknown distribution $p(x)$. The objective here is to find a good approximation of $p(x)$ by means of a GMM with $K$ mixture components, and for that, the maximum likelihood estimate (MLE) of the parameters $\theta$ needs to be obtained [100,101]. The i.i.d. assumption allows the likelihood $p(X \mid \theta)$ to be written [99,100] as follows:

$$p(X \mid \theta) = \prod_{i=1}^{M} p(x_i \mid \theta) \tag{3.4}$$
where each individual likelihood term $p(x_i \mid \theta)$ is a Gaussian mixture density as in Eq. 3.3. Then, it is required to get the log‐likelihood [99,100]

$$\log p(X \mid \theta) = \sum_{i=1}^{M} \log \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k) \tag{3.5}$$
Therefore, the MLE of the model parameters that maximizes the log‐likelihood defined in Eq. 3.5 needs to be obtained. The maximum likelihood estimate of the parameters is computed using the EM algorithm, which is a general iterative scheme for learning parameters in mixture models.
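The mixture density of Eq. 3.3 and the log-likelihood of Eq. 3.5 can be evaluated directly for the one-dimensional case; the sketch below uses illustrative parameter values and checks that the mixture is a valid density (it omits the EM fitting step itself):

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    """1-D Gaussian density N(x | mu, var)."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def gmm_pdf(x, weights, means, variances):
    """Mixture density p(x) = sum_k pi_k N(x | mu_k, var_k)  (Eq. 3.3, 1-D)."""
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))

def log_likelihood(data, weights, means, variances):
    """Log-likelihood of i.i.d. data under the mixture (Eq. 3.5)."""
    return float(np.sum(np.log(gmm_pdf(data, weights, means, variances))))

weights = [0.3, 0.7]        # illustrative parameters; weights sum to 1
means = [-2.0, 3.0]
variances = [1.0, 2.0]

grid = np.linspace(-15.0, 15.0, 20001)
density = gmm_pdf(grid, weights, means, variances)
integral = float(np.sum(density) * (grid[1] - grid[0]))
print(round(integral, 3))   # ~1.0: a valid probability density
```

For classification, one GMM is fitted per emotion class via EM, and a test vector is assigned to the class whose mixture gives the highest log-likelihood.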
GMM is one of the most popular classification techniques among SER researchers, and many research works are based on it [12,15,39,52–59]. Although the average accuracy achieved, around 74.83–81.94%, is not up to the mark, GMM's least training time among the prominent classifiers has made it an attractive choice as an SER classifier.
GMMs are efficient in modeling multimodal distributions [101] with far fewer data points than HMMs require. Therefore, when global features are extracted from speech for SER, fewer data points are available, but GMMs work well in those scenarios [8]. Moreover, the average training time is minimal for GMM [8].
The following difficulties in using GMM have been identified:
SVM is fundamentally a two‐class or binary classifier. The SVM provides state‐of‐the‐art results in many applications [104]. The possible values for the label or output are usually assumed to be $y \in \{-1, +1\}$, so that the predictor becomes $f: \mathbb{R}^d \to \{-1, +1\}$, where $f$ is the predictor and $d$ is the dimension of the feature vector. Therefore, given the training dataset of $M$ data points $\{(x_i, y_i)\}$, where $i = 1, \ldots, M$, $x_i \in \mathbb{R}^d$, and $y_i \in \{-1, +1\}$, the problem is to find the $f$ with the least classification error. Consider a linear model of the form

$$y(x) = w^{\mathsf{T}} x + b \tag{3.6}$$

to solve this binary classification problem, where $w$ is the weight vector and $b$ is the bias. Also, assume that the dataset is linearly separable in the feature space; the objective here is to find the separating hyperplane that maximizes the margin between the positive and negative examples, which means $y(x_i) > 0$ when $y_i = +1$ and $y(x_i) < 0$ when $y_i = -1$. Now, the requirement that the positive and the negative examples nearest to the hyperplane be at least one unit away from it yields the condition $y_i (w^{\mathsf{T}} x_i + b) \ge 1$ [100]. This condition is known as the canonical representation of the decision hyperplane. Here, the optimization problem is to maximize the distance to the margin, defined in terms of $w$ as $1/\lVert w \rVert$, which is equivalent to minimizing $\lVert w \rVert^2$, that is,

$$\min_{w, b} \; \frac{1}{2} \lVert w \rVert^2 \quad \text{subject to} \quad y_i (w^{\mathsf{T}} x_i + b) \ge 1, \quad i = 1, \ldots, M \tag{3.7}$$
Equation 3.7 is known as the hard margin, which is an example of quadratic programming. The margin is called hard because the formulation does not allow any violation of margin condition.
The assumption of a linearly separable dataset needs to be relaxed for better generalization because, in practice, the class conditional distributions may overlap. This is achieved by the introduction of a slack variable $\xi_i \ge 0$ for each training data point [105,106], which updates the optimization problem as follows:

$$\min_{w, b, \xi} \; \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{M} \xi_i \quad \text{subject to} \quad y_i (w^{\mathsf{T}} x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0 \tag{3.8}$$

where $C > 0$ manages the trade‐off between the size of the margin and the total amount of slack. This model allows some data points to be on the wrong side of the hyperplane to reduce the impact of overfitting.
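A minimal sketch of the soft-margin objective can be built by subgradient descent on its unconstrained (hinge-loss) form; this is a didactic stand-in for the quadratic programming solvers used in practice, and the toy data, learning rate, and epoch count are all illustrative assumptions:

```python
import numpy as np

def train_soft_margin_svm(X, y, C=1.0, lr=0.01, epochs=500):
    """Minimize (1/2)||w||^2 + C * sum_i max(0, 1 - y_i(w.x_i + b))
    by subgradient descent (a didactic stand-in for a QP solver)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1                  # points violating the margin
        grad_w = w - C * (y[active, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

rng = np.random.default_rng(1)
X_pos = rng.normal(loc=2.0, scale=0.5, size=(20, 2))    # toy positive class
X_neg = rng.normal(loc=-2.0, scale=0.5, size=(20, 2))   # toy negative class
X = np.vstack([X_pos, X_neg])
y = np.array([1.0] * 20 + [-1.0] * 20)

w, b = train_soft_margin_svm(X, y)
pred = np.sign(X @ w + b)
print(np.mean(pred == y))   # training accuracy on well-separated toy data
```

The hinge term $\max(0, 1 - y_i(w^{\mathsf{T}} x_i + b))$ plays the role of the slack $\xi_i$ at the optimum, so the two formulations penalize the same violations.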
Various methods have been proposed to combine multiple two‐class SVMs to build a multiclass classifier. One commonly used approach is the one‐versus‐the‐rest approach [107], where $K$ separate SVMs are constructed, $K$ being the number of classes. The $k$th model is trained using the data from class $k$ as the positive examples and the data from the remaining $K-1$ classes as the negative examples. There is another approach called one‐versus‐one, where all possible pairs of classes are used to train different two‐class SVM classifiers. Platt [108] proposed the directed acyclic graph support vector machine (DAGSVM).
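The one-versus-the-rest construction itself is independent of the binary learner. The sketch below demonstrates it with simple least-squares linear scorers standing in for the binary SVMs; the three toy "emotion" clusters and their centers are illustrative assumptions:

```python
import numpy as np

def train_one_vs_rest(X, y, n_classes):
    """Train K binary linear scorers (class k vs. rest); least squares
    stands in here for the binary SVMs described in the text."""
    X1 = np.hstack([X, np.ones((len(X), 1))])   # append bias column
    models = []
    for k in range(n_classes):
        t = np.where(y == k, 1.0, -1.0)         # class k positive, rest negative
        beta, *_ = np.linalg.lstsq(X1, t, rcond=None)
        models.append(beta)
    return np.array(models)

def predict_one_vs_rest(models, X):
    X1 = np.hstack([X, np.ones((len(X), 1))])
    scores = X1 @ models.T                      # one score per class
    return np.argmax(scores, axis=1)            # most confident scorer wins

rng = np.random.default_rng(2)
centers = np.array([[0.0, 4.0], [4.0, -2.0], [-4.0, -2.0]])  # 3 toy classes
X = np.vstack([rng.normal(c, 0.5, size=(30, 2)) for c in centers])
y = np.repeat(np.arange(3), 30)

models = train_one_vs_rest(X, y, 3)
pred = predict_one_vs_rest(models, X)
print(np.mean(pred == y))
```

Resolving ties by the largest raw score, as in `predict_one_vs_rest`, is the usual fix for inputs that several binary models claim (or none claims).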
SVM is extensively used in SER [48,49,55,65–71]. In most of the research studies carried out, the performance of SVM on the SER task yielded closely comparable results, with accuracy varying around the 80% mark. However, Hassan and Damper [69] achieved 92.3% and 94.6% classification accuracy using linear and hierarchical kernels, respectively. They used a linear kernel instead of a nonlinear radial basis function (RBF) kernel because of the very high dimensional feature space [72]. Hu et al. [73] explored GMM supervector‐based SVM with different kernels, such as linear, RBF, polynomial, and GMM KL divergence, and found that the GMM KL kernel performed best in classifying emotions.
There is no systematic way to choose the kernel functions, and hence, the separability of transformed features is not guaranteed. Moreover, in SER, complete separation in training data is not recommended to avoid overfitting.
Deep feedforward networks or multilayer perceptrons (MLPs) are the pure forms of DL models. The objective of an MLP is to approximate some function $f^*$ such that a classifier, $y = f^*(x)$, maps an input $x$ to a category $y$. An MLP defines a mapping $y = f(x; \theta)$, where $\theta$ is a set of parameters, and learns the value of the parameters that results in the best function approximation. Deep networks are represented as a composition of different functions, and a directed acyclic graph describes how those functions are composed together. For example, there might be three functions $f^{(1)}$, $f^{(2)}$, and $f^{(3)}$ connected in a chain to form $f(x) = f^{(3)}(f^{(2)}(f^{(1)}(x)))$, where $f^{(1)}$ is the first layer of the network, $f^{(2)}$ is the second layer, and so on. The length of the chain gives the depth of the model, and this depth is behind the name deep learning. The last layer of the network is the output layer.
The training phase of a neural network aims to approximate $f^*$. Each training example $x$ has a corresponding label $y \approx f^*(x)$, and training decides the values for $\theta$ such that the output layer produces values close to $y$. However, the behavior of the hidden layers is not directly specified by the training data, and that is why those layers are called hidden. The hidden layers bring nonlinearity into the system by transforming the input into $\phi(x)$, where $\phi$ is a nonlinear transform. The whole transformation process is done in the hidden layers, which provide a new representation of $x$. Therefore, it is now required to learn $\phi$, and the model becomes $y = f(x; \theta, w) = \phi(x; \theta)^{\mathsf{T}} w$, where $\theta$ is used to learn $\phi$ and the parameters $w$ map $\phi(x)$ to the desired output. The nonlinearity in $\phi$ comes from the so‐called activation function of the hidden layers of the feedforward network [109].
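The chained composition $f^{(3)}(f^{(2)}(f^{(1)}(x)))$ amounts to repeated affine transforms followed by nonlinearities; the sketch below shows a forward pass with illustrative layer sizes and random (untrained) weights:

```python
import numpy as np

def relu(z):
    """Rectified linear activation."""
    return np.maximum(0.0, z)

def layer(x, W, b, activation=relu):
    """One layer: affine transform followed by a nonlinearity."""
    return activation(x @ W + b)

rng = np.random.default_rng(3)
x = rng.standard_normal((5, 8))                  # batch of 5 inputs, 8 features

# f(x) = f3(f2(f1(x))): two hidden layers and a linear output layer
W1, b1 = rng.standard_normal((8, 16)), np.zeros(16)
W2, b2 = rng.standard_normal((16, 16)), np.zeros(16)
W3, b3 = rng.standard_normal((16, 4)), np.zeros(4)

h1 = layer(x, W1, b1)                            # first layer  f1
h2 = layer(h1, W2, b2)                           # second layer f2
out = layer(h2, W3, b3, activation=lambda z: z)  # output layer f3 (linear)
print(out.shape)   # (5, 4): 4 class scores per input
```

Here the two hidden layers play the role of $\phi(x; \theta)$, and the final linear layer plays the role of $w$.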
Most modern neural networks are trained using the MLE, which means the cost function is the negative log‐likelihood, or the cross‐entropy function between the training data and the model distribution. Therefore, the cost function becomes [109]

$$J(\theta) = -\mathbb{E}_{x, y \sim \hat{p}_{\text{data}}} \log p_{\text{model}}(y \mid x)$$
where $p_{\text{model}}$ is the model distribution, which varies depending on the selected model, and $\hat{p}_{\text{data}}$ is the target distribution from the data. The output distribution determines the choice of the output unit. For example, a Gaussian output distribution requires a linear output unit, a Bernoulli output distribution requires a sigmoid unit, a multinoulli output distribution requires softmax units, and so on. The choice of hidden unit is still an active area of research, but rectified linear units (ReLU) are the most versatile ones and work well in most scenarios. The logistic sigmoid and the hyperbolic tangent are two other options among the many functions researchers are using.
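For the multiclass (multinoulli) case relevant to SER, the cross-entropy cost over softmax outputs can be written directly; the logit values below are illustrative:

```python
import numpy as np

def softmax(logits):
    """Softmax over the class axis, with max-shift for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(logits, labels):
    """Negative log-likelihood of integer labels under the softmax model."""
    p = softmax(logits)
    return float(-np.mean(np.log(p[np.arange(len(labels)), labels])))

confident = np.array([[5.0, 0.0, 0.0]])   # strongly favors class 0
uniform = np.array([[0.0, 0.0, 0.0]])     # no preference among 3 classes
labels = np.array([0])
print(cross_entropy(confident, labels) < cross_entropy(uniform, labels))  # True
```

The cost shrinks as the model puts more probability on the correct class, which is exactly the negative log-likelihood behavior described above; for the uniform case, it equals $\log 3$.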
Therefore, in forward propagation, the prediction $\hat{y}$ is produced and the cost function $J(\theta)$ is computed. Now, the information generated in the form of $J(\theta)$ is processed so that the parameters $\theta$ can be appropriately chosen. This task is accomplished in two phases: first, the gradients $\nabla_\theta J(\theta)$ are computed using the famous back‐propagation algorithm, and second, the parameter values are updated based on the gradients computed by the backprop algorithm, through methods such as stochastic gradient descent (SGD). The backprop algorithm applies the chain rule recursively to compute the derivatives of the cost function $J(\theta)$.
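The recursive chain rule can be verified on a one-hidden-layer network by comparing the backprop gradient against a finite-difference estimate; the squared-error cost, tanh activation, and layer sizes are illustrative choices:

```python
import numpy as np

def forward(x, W1, W2):
    """One hidden layer (tanh) followed by a linear output."""
    h = np.tanh(x @ W1)
    return h, h @ W2

def loss(x, y, W1, W2):
    """Squared-error cost J."""
    _, out = forward(x, W1, W2)
    return 0.5 * np.sum((out - y) ** 2)

def backprop(x, y, W1, W2):
    """Chain rule applied layer by layer, from the cost back to the weights."""
    h, out = forward(x, W1, W2)
    d_out = out - y                          # dJ/d(out)
    gW2 = h.T @ d_out                        # dJ/dW2
    d_h = (d_out @ W2.T) * (1 - h ** 2)      # backprop through tanh
    gW1 = x.T @ d_h                          # dJ/dW1
    return gW1, gW2

rng = np.random.default_rng(4)
x = rng.standard_normal((3, 4))
y = rng.standard_normal((3, 2))
W1 = rng.standard_normal((4, 5)) * 0.1
W2 = rng.standard_normal((5, 2)) * 0.1

gW1, gW2 = backprop(x, y, W1, W2)

# Finite-difference check of one entry of gW1
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
numeric = (loss(x, y, W1p, W2) - loss(x, y, W1, W2)) / eps
print(abs(numeric - gW1[0, 0]) < 1e-4)   # backprop matches the numeric gradient
```

An SGD step then simply moves each weight matrix against its gradient, e.g. `W1 -= lr * gW1`.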
Different variants of DL exist now, but convolutional neural networks (CNNs) [110,111] and recurrent neural networks (RNNs) [112] are the most successful ones. Convolutional networks are neural networks that use convolution in place of general matrix multiplication in at least one of their layers, whereas when feedforward neural networks are extended to include feedback connections, they are called RNNs. RNNs are specialized in processing sequential data.
SER researchers have used CNNs [35,37,83–87], RNNs [36,84,88], or a combination of the two extensively for SER. Shallow one‐layer or two‐layer CNN structures may not be able to learn affective features that are discriminative enough to distinguish the subjective emotions [85]. Therefore, researchers recommend a deep structure. Researchers [36,37,89,90] have also studied the effectiveness of the attention mechanism.
Researchers are applying end‐to‐end DL systems in SER [84,91–94], and most of them use the arousal–valence model of emotions. Using end‐to‐end DL, the average classification accuracy for arousal is a decent 78.16%, but for valence it is pretty low at 43.06%. Among other DNN techniques, very recently, a maximum accuracy of 87.32% was achieved by using a fine‐tuned AlexNet on Emo‐DB [85]. Han et al. [113] used an extreme learning machine (ELM) for classification, where a DNN takes as input the popular acoustic features within a speech segment and produces segment‐level emotion state probability distributions, from which utterance‐level features are constructed.
Although researchers are using the DL architectures extensively for solving the SER problem, DL methods pose the following difficulties:
The SER problem is not yet solved, and it has proved to be difficult. The following are the prominent difficulties faced by researchers.
This chapter reviewed the different phases of SER. The primary focus is on the four prominent classification techniques used in SER to date. HMM was the first technique to see some success, and then GMM and SVM propelled that progress forward. Presently, DL techniques, mainly the CNN–long short‐term memory (LSTM) combination, are providing state‐of‐the‐art classification performance. However, things have not changed much in the case of selecting a feature set for SER: low‐level descriptors (LLDs) are still one of the prominent choices, although some researchers have recently been trying DL techniques for feature learning. The nature of SER databases is changing, and modalities such as facial expressions and body movements are being included along with the spoken utterances. However, the availability of quality databases is still a challenge.
This work is a survey of current research on SER systems. Specifically, the classification models used in SER are discussed in detail, while the features and databases are briefly mentioned. A long list of articles has been referred to during this study, and eight classification models used by researchers have been identified. However, four of these classification models have been identified as more effective for SER than the rest. Those four are HMM, GMM, SVM, and DL, which are discussed in the context of SER with the relevant mathematical background. Their drawbacks are also highlighted.
Research results show that DL models perform significantly better than HMM, GMM, and SVM. The classification accuracy for SER has improved with the introduction of DL techniques because of their higher capacity to learn the underlying patterns in emotional speech signals. The DL architectures are evolving every day, with more and more improvement in the classification accuracy for SER speech signals. It is observed that the SER field still faces many challenges, which are barring research outputs from being implemented as industry‐grade products. It is also noticed during this study that research on feature set enhancement is much less common than work on enhancing classification techniques. However, the available classification techniques in the ML field are in a very advanced state, and with the right feature set, they can yield very high classification accuracy. DL, as a sub‐field of ML, has even achieved state‐of‐the‐art classification accuracy in many fields, such as computer vision, text mining, and automatic speech recognition, to name a few. Therefore, the classification technique should no longer be a hindrance for SER; only the appropriate feature set needs to be fed into the classification system.