30

Machine Assessment of Speech Communication Quality

 

Wai-Yip Chan and Tiago H. Falk

30.1 Telephone Speech Communication

30.2 E-Model

30.3 Listening Quality and Tests

30.4 Machine-Based Quality Measurement

30.5 Reference-Based Algorithms

30.6 Reference-Free Algorithms

30.7 Parametric Methods

30.8 Usage Considerations

References

 

Internet access has been declared a human right in some countries, ahead of reaching a consensus on this issue at the United Nations. Prior to the Internet, speech telephony provided the most widely available form of instant worldwide connected interactivity. Speech communication does not require the user to be literate, that is, able to read and write. Speech naturally conveys human touch in numerous ways: conveying identity and emotions, enabling interactivity, and so on. With relatively low bandwidth requirements, speech communication can have greater geographical reach and can be supported even when resources are scarce. Thus, speech communication will always remain an essential component of the global communication network. The burgeoning growth of the global network is accompanied by many rapid trials and deployments of new networks and network technology. In such activities, the ability to rapidly and accurately assess end-to-end speech communication quality is needed to ensure that new network connections perform to customer expectations. “Rapid” can only be achieved with machines. Speech-processing algorithms that run on machines have matured to the point of being able to accurately assess subjective listening quality. This chapter provides a study of up-to-date speech quality assessment algorithms and methods, aiming to enable communication engineers to use them effectively.

30.1 Telephone Speech Communication

 

The ideal telephone service enables distantly located participants to communicate with one another as if they were collocated. While evolving teleconferencing technologies are moving closer to that ideal, conventional telephone services still provide only voice communication. Telephony speech communication typically involves two persons conversing with each other using terminals (“telephones”) connected to a (tele)communication network. Normal speaking produces speech sounds, which are acoustic signals. Telephones serve to convert between speech as an acoustic signal and speech as an electrical signal. The latter can be manipulated by and transmitted through the network. Telephone users are accustomed to the loss of visual contact; however, they would prefer their conversations to be unimpeded by the telephone equipment and network, as if they were conversing with one another through an opaque but acoustically transparent wall. This imagined ideal scenario can serve as a reference for gauging the effectiveness of telephone network-mediated conversations. When conversing over the network, if the acoustic signals that arrive at the two ears of each talker are identical to those arriving in the ideal scenario, the network is “transparent” to the signals exchanged in the conversation. The acoustic signals could be a mixture of signals from different sources: the talker, the third-party talker, and sounds from the surrounding background.

Realistic telephone communication deviates from the above ideal scenario in a number of ways. First, for economy, the talker's acoustic speech signal is captured monaurally by a single microphone. In natural settings, humans listen binaurally, with their left and right ears. A faithful binaural capture of speech sounds would require using two microphones, either set inside the ear canals of a head-and-torso (dummy) model of the listener, or with the duo-microphone captured free-field signals postprocessed to replicate the effect of immersing the listener's (upper) body in the sound field. Second, telephones and network equipment cannot preserve the fidelity of the electrical speech signal captured by the microphone. Processing of the electrical signal for transmission and the physical act of transmission degrade the speech signal. The degradations may appear to the ear as speech distortions, dropouts, extraneous signals, noise, and so on. Third, the speech signal takes longer to reach the receiving end through a network than the acoustic speech signal propagating through the ambience to the third-party talker. Fourth, the talker may hear his/her own voice differently from when speaking free of the presence of the telephone and transmission network. A telephone handset may block or otherwise modify the talker's mouth-to-ear propagation path. Moreover, the network may reflect the talker's speech back to his/her telephone earpiece or loudspeaker. The reflected signal is heard as an annoying “echo” when the reflected signal is delayed by more than about 50 ms relative to the talker's own speech signal. A properly engineered transmission network would attenuate echoes to a nonperceptible level before the echoes reach the talker.

30.2 E-Model

 

Telecommunication engineers have developed models to quantify the severity of the above impediments to carrying out a conversation. These models may serve one or more of the following functions: planning, research and development, benchmarking, monitoring, and maintaining speech telephony equipment and services. Among existing models, the ITU-T G.107 standard (ITU-T Rec. G.107, 2005), often called the “E-model,” was originally developed for provisioning (planning) network equipment. Here, we introduce the E-model as it captures the major conversation impairment factors and has wider usage than planning, as we shall show later. Moreover, E-model updates that track emerging technology trends are being developed, and the model has even served as a paradigm for quantifying the quality of other communication modalities such as audio-visual conferencing.

The E-model specifies the computation of a quality measure called “transmission rating” for a conversational connection between two talkers. The quality rating R is expressed as a function of “impairment factors” as follows:

R = R0 − Is − Id − Ie,eff + A        (30.1)

where R0 is a basic rating determined by the signal-to-noise ratio of the connection. The three I-factors quantify impairments, with subscript “s” denoting “simultaneous” (with the speech signal), “d” denoting “delayed,” and “e,eff” denoting “equipment, effective.” Factor Is encapsulates the effect of impairments occurring at the same time as the speech signal, such as an inappropriate sidetone level or the quantization noise produced by the mu-law and A-law quantizers used on 64/56 kilobits per second (kbps) digital connections. Sidetone is the talker's speech signal picked up by the telephone mouthpiece and fed back, attenuated, to the earpiece, in order to give the talker the sensation of hearing his/her own voice. Quantizers provide the key function of converting the analog amplitudes of signal samples to discrete amplitudes, which are then encoded into bits to form a digital representation. Factor Id captures the rating reduction caused by echoes and other impairments that arise later (by more than 50 ms) than the time point in the speech signal that triggered the impairment. Echoes may occur inside the network, commonly across four-wire-two-wire interfaces called “hybrids” (Gibson 2002), or outside the network, such as due to acoustic coupling between the loudspeakers and microphones that terminate the telephone connection. Factor Ie,eff accounts for degradations related to nonlinear and/or time-varying phenomena such as distortions caused by low-bit-rate (LBR) speech codecs, packet loss concealment (PLC), and dejitter buffers. LBR speech codecs and PLC are deployed in mobile wireless links and some voice-over-Internet-protocol (VoIP) connections. An essential component of VoIP connections, dejitter buffers serve to reduce dropouts and distortions in the decoded speech signal caused by randomly varying intervoice-packet arrival times at the receiving end. 
Factor “A,” denoting “advantage factor,” is used to adjust the baseline connection quality rating to reflect variation of user utility expectation with the service setting. “A” measures the amount of rating loss users are willing to sacrifice in order to acquire the additional convenience offered by the service. For instance, LBR speech codecs used in mobile phones deliver lower speech quality than 64/56 kbps wireline connections. Mobile phone users gain the convenience of mobility and are willing to accept somewhat lower speech quality than a wired connection. “A” can be set to the amount of quality rating mobile phone users are willing to give up in exchange for mobility.

Equation 30.1 embodies the idea that transmission impairments can be quantified on the same perceptual scale, and can be combined additively on the scale to form an overall quality measure. Numerically, “R” ranges from 0 to 100. A “toll quality” wireline connection would have an R value of around 93. Note that the quality of an R = 100 connection falls far short of our imagined ideal conversational setting. The E-model is for conventional telephone connections, which convey monaural speech, cut out speech signal frequencies above 3300 Hz (“narrowband speech”), and provide a smaller loudness range than possible with spoken speech and normal hearing. With rapidly expanding new technologies and increasing quality expectations, new rating scales are being considered. They can extend the upper limit to above 100, or redefine the scale so that 100 corresponds to a more enriched quality of experience than conventional telephone conversation.
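Under these conventions, Equation 30.1 reduces to simple arithmetic. The sketch below assumes the commonly cited default of R0 ≈ 93.2 for a clean narrowband connection; the example impairment values are illustrative assumptions, not values taken from the G.107 tables.

```python
# Sketch of the E-model transmission rating of Equation 30.1.
# R0 = 93.2 is the commonly cited default for a clean narrowband
# connection; the example impairments below are illustrative
# assumptions, not standardized values.

def transmission_rating(r0=93.2, i_s=0.0, i_d=0.0, i_e_eff=0.0, a=0.0):
    """R = R0 - Is - Id - Ie,eff + A, clamped to the [0, 100] scale."""
    r = r0 - i_s - i_d - i_e_eff + a
    return max(0.0, min(100.0, r))

# A hypothetical mobile connection: LBR codec distortion (Ie,eff = 10),
# some delay impairment (Id = 5), and an advantage factor for mobility.
r = transmission_rating(i_d=5.0, i_e_eff=10.0, a=10.0)
print(round(r, 1))  # 88.2
```

Note how the advantage factor partially offsets the codec and delay impairments, reflecting the mobility trade-off described above.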

The E-model offers a methodology for rating the quality of conversations held over telephone connections. The model covers quality factors related to listening, speaking, and interaction. For instance, long transmission delay increases Id through two phenomena. First, for a given intensity of heard echoes, their annoyance level increases with the delay. Second, large delay inhibits interaction. Talkers may become more hesitant and confused as they find it harder to promptly exchange turns to speak and to interrupt one another.

From the E-model, we see that in order to estimate the conversational quality of a telephone connection, a number of connection parameters have to be measured, such as delay, noise level, echo attenuation, and so on. This can be done using in-service nonintrusive measurement devices (INMDs). The ITU-T P.561 standard specifies requirements that INMDs on the telephone network have to satisfy but does not specify any measurement algorithm (ITU-T Rec. P.561, 2002).

30.3 Listening Quality and Tests

 

A critical component of conversational quality is listening quality, which refers to how telephone users rate the quality of the speech signals they hear. A common practice to assess listening quality is by conducting listening tests. These are done by assembling a panel of human subjects to listen to samples of speech signals that have been either recorded in the network or processed to simulate speech signals received under specific transmission conditions. Each listener is asked to assess the quality of each sample heard using an absolute category rating (ACR), that is, by selecting one descriptor out of five: excellent, good, fair, poor, and bad. For analysis of the test results, each rating is assigned an integer value from 5 to 1, with 5 for excellent and 1 for bad. The numerical scores can then be averaged to obtain statistics, such as averaging over all the listener ratings for the speech samples in each transmission condition, to obtain a mean opinion score (MOS) for each condition. MOS values are real numbers between 1 and 5 and are treated as estimates of the opinion of a “typical user” from the user population on the listening quality of the speech signals provided by a particular (set of) transmission condition(s). To facilitate interpreting MOSs, the ITU-T P.800 standard provides guidelines on conducting MOS listening tests (ITU-T Rec. P.800, 1996). The MOSs obtained from such tests are sometimes written as MOS–LQS, where “LQS” stands for “listening quality subjective.” (Note: “MOS” is often used synonymously with “ACR” even though ACR is a discrete scale and MOS is obtained from averaging ACR scores obtained from tests performed according to P.800. The ACR scale may in other occasions be used in other kinds of subjective tests.)
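The score averaging described above is straightforward to carry out. The sketch below uses hypothetical ratings and adds a normal-approximation 95% confidence interval, a common (though not P.800-mandated) way to report the uncertainty of a MOS.

```python
# Sketch: computing a MOS from ACR ratings (5 = excellent ... 1 = bad)
# for one transmission condition. The ratings below are hypothetical,
# not taken from a real P.800 test.
import math

def mos(ratings):
    return sum(ratings) / len(ratings)

def mos_ci95(ratings):
    """Half-width of a normal-approximation 95% confidence interval."""
    n = len(ratings)
    m = mos(ratings)
    var = sum((r - m) ** 2 for r in ratings) / (n - 1)  # sample variance
    return 1.96 * math.sqrt(var / n)

ratings = [4, 3, 4, 5, 3, 4, 2, 4, 3, 4]  # one condition, ten listeners
print(f"MOS-LQS = {mos(ratings):.2f} +/- {mos_ci95(ratings):.2f}")
```

With only ten listeners the interval is wide, which is one reason real tests use large listener panels and many speech samples per condition.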

A prominent application of MOS listening tests is the assessment of speech codec algorithms. By averaging the listener ratings obtained for each codec algorithm under a judiciously selected set of transmission conditions, one obtains a figure of merit for the speech quality provided by each codec. Subjective listening tests have played a pivotal role in the development and selection of speech codecs used in successive generations of mobile phone networks. These tests are nevertheless labor intensive, time consuming, and costly (e.g., $100K+ per test). These drawbacks mean that subjective tests have been done sparingly, at critical junctures during codec development and selection competitions. With the rapid introduction of new mobile and VoIP networks and services, there is a great need to assess the voice quality of newly deployed connections, some of which may comprise a tandem of links, with some links using legacy and others new technologies, that is, a combination of wireless, digital wireline, analog wireline, and VoIP links.

MOS is just one of a variety of quality measures that have been proposed and/or used; interested readers can refer to (Deller et al. 1999) and (Quackenbush et al. 1988) for other measures. However, rightly or wrongly, MOSs have been commonly treated as a currency of telephone speech quality, and MOSs have often been used to discriminate between different speech transmission equipment and services. The MOS measure is popular enough that a mapping from the E-model R-rating to MOS is provided (ITU-T Rec. G.107, 2005). The popularity of MOS has led to the ACR scale being used in other contexts. For instance, ITU standard P.835 recommends using the ACR scale to rate three different quality aspects (background only, speech only, and overall) of noise corrupted and noise-reduction processed speech signals (ITU-T Rec. P.835, 2003). P.835 embodies a multidimensional quality measure. The diagnostic acceptability measure (DAM) (Deller et al. 1999) is a well-known multidimensional measure (though the assessment scale of DAM is distinct from ACR). DAM provides altogether 16 perceptual quality attributes, covering the speech signal, background, and overall quality. Each attribute is assessed on a sliding scale to indicate how severe is a specific perceptual element (e.g., “thin” or “nasal” sounding). As a result, multidimensional measures can more precisely “diagnose” quality degradations and pinpoint system problems. MOS-like measures have also been used to rate image (Wang et al. 2004) and video quality (Winkler and Mohandas 2008).
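The R-to-MOS mapping mentioned above can be sketched as follows. The polynomial is the one commonly cited from G.107 (Annex B); consult the standard for the authoritative definition.

```python
# Sketch of the R-to-MOS mapping provided in ITU-T G.107 (Annex B),
# as commonly cited; the standard remains the authoritative source.

def r_to_mos(r):
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6

print(round(r_to_mos(93.2), 2))  # 4.41, a "toll quality" connection
```

The mapping is deliberately saturating: R values at or above 100 map to a MOS of 4.5, consistent with the observation that even an R = 100 narrowband connection falls short of the ideal conversational setting.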

Quality measurement may need to be done when connections are in-service, out-of-service, during installation, or undergoing development in the lab. For such needs as line monitoring, the measurement must be done in real time. Subjective listening tests, however, cannot provide real-time and low-cost assessments. Thus, there has been great interest in developing machine algorithms for estimating subjective listening quality. This can be viewed as a MOS estimation problem, and the MOSs estimated by machines are denoted MOS–LQO, where the suffix “O” stands for “objective.”

30.4 Machine-Based Quality Measurement

 

Today, sophisticated and standardized algorithms are available for measuring listening quality. These algorithms are implemented in software to compute MOS–LQO values from input speech signals. With such capability, MOS–LQO values can be made available instantly if the speech signals were recorded beforehand, or in the case of real-time monitoring, immediately after a sufficient amount of speech signal (and other information that may be required by the algorithm) has been observed. After decades of development, recent algorithms can provide accurate estimates of MOS–LQS when applied to transmission conditions for which they have been validated. “Accurate” can be taken to mean that the statistical error of the computed MOS–LQO is roughly of the same extent as the statistical uncertainty of the MOS–LQS obtained from P.800-guided MOS listening tests. Recent standards often provide a computer source code realization of the algorithm, though running and/or commercializing the code might require licensing from intellectual property rights holders.

Quality measurement algorithms can be classified by the type of their inputs: signal-based, parametric, and hybrid. Signal-based algorithms require the input of one or more speech signals. Parametric algorithms input only parameter values and no signals. Hybrid algorithms need both signal and parameter inputs. Signal-based algorithms can be further divided into reference-free, reference-based, and partial-reference. A reference-free algorithm needs as input only the (“target”) speech signal whose listening quality is to be determined. A reference-based algorithm needs two input signals: the target itself and the original nondegraded (“clean”) version of the target. A partial-reference algorithm inputs the target signal as well as parameters that characterize the clean signal. Reference-free algorithms are also characterized as “no-reference,” “single-ended,” or “nonintrusive”; reference-based algorithms as “full-reference,” “double-ended,” or “intrusive”; and partial-reference algorithms as “reduced-reference.” A nonintrusive algorithm can listen in on the target speech signal in order to determine its quality, whereas an intrusive algorithm relies on injecting a clean reference speech signal into a connection to produce the target speech signal. Taking a connection out of service in order to inject a clean signal disrupts (intrudes on) network service. Note that it is possible for the computation of one or more parameters of a parametric algorithm to require reference signal knowledge.

To illustrate, suppose an application based on the E-model is to be developed to monitor speech quality at the receiving end of a VoIP connection. Each of the parameters in Equation 30.1 is computed using formulas or algorithms which require their own input parameters (“subparameters”) or signals. The application may be able to obtain parameter values from devices it can communicate with and information carriers it can assess, such as the speech decoder, dejitter buffer, packet headers, INMDs, and so on. In such cases, the application is purely parametric. If in addition to acquiring parameters, the application has to process the received speech signal (e.g., the encoded speech in the packet payloads), then the application embodies a hybrid of signal-based and parametric measurement. If the application needs to input the clean speech signal or parameters that rely on it, the application is reference based or partial-reference based, respectively. The application may be developed to measure listening or conversational quality. For the latter, the application needs to be able to calculate the delay-related impairment factor Id using parameters measured by INMDs and other network devices. For listening-quality-only measurement, the E-model is operated by nulling unused impairment (sub)parameters such as Id in Equation 30.1.

In MOS listening tests, human subjects rate the quality of degraded speech signals without listening to their clean versions. The listener compares the heard speech signal with his/her listening-quality expectation established over years of telephone use. A signal-based algorithm can mimic human judgment by comparing the target signal with an expectation model. While reference-free algorithms do provide such a function, building an algorithm that closely mimics human perceptual and cognitive processes is beyond the state of current knowledge about these processes. Instead, state-of-the-art reference-free algorithms are designed to detect abnormal signal behaviors, gauge their severity, and map the severities of the abnormalities to a quality estimate. These algorithms thus rely on knowledge of perceptually relevant abnormalities, and signal and statistical models to represent such knowledge. Though there are numerous ways of transforming a speech signal that can change its perceived quality, the types of degradation that telephony transmission can impart on speech signals are known to a great extent and can be modeled. Moreover, apart from telephony, a multitude of signal models exists emulating the production and auditory perception of speech signals. These can be deployed to build models of normative behaviors, complementing models of abnormal behaviors. As we shall see below, a variety of models have been employed to build the existing algorithms. Besides such modeling, double-ended algorithms are provided with a clean version of the target speech signal. The clean signal provides a precise description of normative behavior for the target speech signal. Double-ended algorithms that exploit this well can offer more accurate estimates of subjective quality than single-ended algorithms. Thus, much effort has been invested in developing standard double-ended algorithms, the most widely deployed being ITU P.862 (PESQ) (ITU-T Rec. P.862, 2001). 
The recently standardized successor of PESQ, ITU-T P.863 (ITU-T Rec. P.863, 2011), also known as POLQA for “Perceptual Objective Listening Quality Assessment,” caters to a wider range of application conditions, and has been reported to provide more accurate quality estimates.

30.5 Reference-Based Algorithms

 

Reference-based algorithms (RBAs) rely on having the clean original speech signal x(t) on hand. Usually, we think of this signal as being processed and/or transmitted to produce the target speech signal y(t) whose listening quality is to be estimated. However, as we describe below, the ITU-T P.563 reference-free algorithm estimates a pseudoreference signal, in lieu of a “clean” signal, from the distorted target speech signal. The idea underlying RBAs is simple: the more dissimilar y(t) is from x(t), the greater the distortion and hence the greater the quality degradation relative to the quality of x(t). If one takes care to ensure that the x(t) used is chosen to have high quality, for example, Q = 4.4 (MOS–LQS), an RBA can estimate the degradation d(x(t), y(t)), and the estimated MOS–LQO of y(t) can be taken as Q − d. The notation d(x(t), y(t)) denotes a mapping of two signals to a nonnegative real value. A common measure of distortion is the signal-to-noise ratio (SNR), which can be written as ||x(t)||^2/||e(t)||^2, where the notation ||s(t)||^2 denotes the energy of signal s(t). The “noise” signal e(t) is in fact the “error” signal e(t) = x(t) − y(t). Thus, SNR measures how much y(t) differs from x(t) in terms of the power of the error signal relative to the power of the reference signal. Depending on how x(t) is distorted to produce y(t), SNR can be a very poor measure of degradation. For instance, the two signals y_1(t) = 2 x(t) and y_2(t) = x(t − 0.2) would have (nearly) identical listening quality to x(t), but the SNR is 0 dB for y_1(t) and below 0 dB for y_2(t). Thus, the RBA mapping d(x(t), y(t)) should only be sensitive to differences between x(t) and y(t) that contribute to the subjective listening quality judgment process. As such, many existing algorithms map the signals x(t) and y(t) into a perceptual domain which mimics the processing done by the human auditory system. 
A psychoacoustically motivated degradation measure is then used to characterize perceptually relevant distortions and mapped to a final quality rating. It is interesting to observe that in MOS tests, the listeners do not hear x(t), so that no precise differencing between x(t) and y(t) is available to form the subjective rating. Despite this, the recent standard RBAs demonstrate that there exist strong correlations between the measured differences and MOS degradation.
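The weakness of plain SNR described above is easy to demonstrate numerically. The sketch below uses a noise-like stand-in for a speech signal; the gain-doubled copy yields exactly 0 dB, and a 0.2 s delayed copy yields a negative SNR, even though neither transformation audibly degrades the signal.

```python
# Sketch illustrating why plain SNR is a poor perceptual measure:
# a gain change and a 0.2 s delay both destroy the SNR even though
# neither audibly degrades the signal. A noise-like signal stands
# in for speech here (an illustrative assumption).
import numpy as np

def snr_db(x, y):
    e = x - y  # "error" signal e(t) = x(t) - y(t)
    return 10.0 * np.log10(np.sum(x ** 2) / np.sum(e ** 2))

fs = 8000
rng = np.random.default_rng(0)
x = rng.standard_normal(fs)           # 1 s noise-like "speech"
y1 = 2.0 * x                          # doubled gain: sounds the same
y2 = np.roll(x, int(0.2 * fs))        # 0.2 s delay: sounds the same

print(snr_db(x, y1))  # exactly 0 dB: error energy equals signal energy
print(snr_db(x, y2))  # negative, since the shifted copy is uncorrelated
```

For an uncorrelated shift the error power is roughly twice the signal power, so the second SNR lands near −3 dB despite the unchanged listening quality.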

To get an idea of RBA processing, we now examine the ITU-T P.862 PESQ algorithm. At the time of writing (2011), PESQ has been the most widely employed double-ended algorithm for narrowband (up to 4 kHz speech content) speech codecs and networks. PESQ's algorithm structure is similar to that of the newly standardized ITU-T P.863 POLQA algorithm, whose new features are also highlighted below.

PESQ was standardized in 2001 and developed by a consortium of British, Dutch, and German companies (ITU-T Rec. P.862, 2001). PESQ was developed around the time packet-based transmission was increasing in popularity, when its predecessor standard quality measurement algorithm, PSQM (perceptual speech quality measure) (ITU-T Rec. P.861, 1996), failed at predicting the quality of such variable-delay communication systems. A key advancement of PESQ over PSQM was the introduction of techniques to estimate the constant and variable delays between the clean (original) input x(t) and the degraded output y(t) (Rix et al. 2002). Since RBAs depend on distortion parameters extracted by comparing the two signals frame-by-frame, even the smallest frame misalignment (on the order of 1 ms) can result in erroneous quality estimates (Rix et al. 2002). It is thus critical that the delays in y(t) relative to x(t) be detected and accounted for before the degradation measure d(x(t), y(t)) is computed. The PESQ algorithm assumes that the delay of the system is piecewise constant, a reasonable assumption for a wide range of systems, including VoIP, and introduces a two-stage constant delay estimation procedure. First, a crude delay estimate is obtained from the location of the maximum cross correlation between the original and processed signal envelopes. It is argued that if the speech signals contain at least 500 ms of speech content, the crude delay estimate could be accurate to within ±8 ms (Rix et al. 2002). Once a crude delay estimate is found, it is further refined via a weighted histogram method applied to the estimated frame-by-frame delays.
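The crude delay estimation step can be illustrated with a simple envelope cross-correlation, in the spirit of (though much simpler than) the P.862 procedure. The frame size, envelope definition, and test signals below are illustrative assumptions.

```python
# Sketch of crude delay estimation via cross correlation of signal
# envelopes. The frame size, log-energy envelope, and synthetic
# bursty test signal are illustrative assumptions, not the P.862
# implementation.
import numpy as np

def envelope(x, frame=32):
    """Per-frame log-energy envelope (frame length in samples)."""
    n = len(x) // frame
    frames = x[:n * frame].reshape(n, frame)
    return np.log10(np.sum(frames ** 2, axis=1) + 1e-10)

def crude_delay(x, y, frame=32, fs=8000):
    ex = envelope(x, frame)
    ey = envelope(y, frame)
    ex = ex - ex.mean()
    ey = ey - ey.mean()
    xc = np.correlate(ey, ex, mode="full")
    lag_frames = np.argmax(xc) - (len(ex) - 1)  # peak location = delay
    return lag_frames * frame / fs              # delay in seconds

fs = 8000
rng = np.random.default_rng(1)
x = rng.standard_normal(fs) * np.repeat(rng.random(250), 32)  # bursty
y = np.concatenate([np.zeros(800), x])[:len(x)]  # 100 ms delayed copy
print(crude_delay(x, y))  # expected close to 0.1 s
```

The envelope-domain correlation is cheap because it operates at the frame rate rather than the sample rate, which is precisely why it is only a crude first stage.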

To estimate varying delay, a similar two-stage process is performed on speech utterances, which correspond to 300+ millisecond speech bursts containing no silent period longer than 200 ms. Crude delay estimates are obtained for each utterance and further refined using the weighted histogram approach. To compensate for continuous varying delays during speech, an utterance splitting method is also used. Each utterance is divided in two and the crude/fine delay estimation procedures are applied. If there is evidence supporting a delay change (e.g., an absolute change in delay of 4 ms or greater), subutterance splitting proceeds in a recursive manner to each new “half” until no further delay changes are seen. A last time alignment step incorporated in PESQ consists of a bad frame identification procedure in the perceptual domain (more on this topic to follow). More specifically, it assumes that bad frame alignment will result in abnormally high perceptual disturbances. In such instances, realignment is performed.
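The recursive utterance-splitting logic can be sketched as follows. The 4 ms threshold comes from the text, the 300 ms minimum section length mirrors the minimum utterance duration, and `estimate_delay` is a hypothetical placeholder standing in for the crude/fine estimation procedure.

```python
# Sketch of PESQ-style recursive utterance splitting: a section is
# split in half whenever its two halves disagree on delay by 4 ms or
# more. `estimate_delay` is a hypothetical placeholder for the
# crude/fine delay estimation of the text; times are in seconds.

def split_on_delay_change(signal_pair, start, end, estimate_delay,
                          threshold=0.004, min_len=0.3):
    """Return a list of (start, end, delay) sections with stable delay."""
    if end - start < 2 * min_len:  # too short to split further
        return [(start, end, estimate_delay(signal_pair, start, end))]
    mid = (start + end) / 2.0
    d1 = estimate_delay(signal_pair, start, mid)
    d2 = estimate_delay(signal_pair, mid, end)
    if abs(d1 - d2) < threshold:   # no evidence of a delay change
        return [(start, end, estimate_delay(signal_pair, start, end))]
    return (split_on_delay_change(signal_pair, start, mid, estimate_delay) +
            split_on_delay_change(signal_pair, mid, end, estimate_delay))

# Toy check: a connection whose delay jumps from 10 ms to 50 ms at t = 1 s.
toy = lambda pair, s, e: 0.010 if (s + e) / 2 < 1.0 else 0.050
sections = split_on_delay_change(None, 0.0, 2.0, toy)
print(len(sections))  # 2
```

The recursion stops as soon as a section's halves agree, so a single delay jump yields exactly two stable-delay sections in this toy case.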

Once the original and degraded speech signals are time-aligned, a perceptual auditory mapping (Beerends et al. 2002) is performed to map the signals into an internal representation of perceived loudness in both time and frequency. This step has been borrowed from PSQM and consists of Bark-scale frequency bin grouping and loudness mapping using frequency-dependent thresholds and exponents. The difference between the internal representations of the degraded and reference speech signals is then calculated, representing the audible difference between the two signals. A positive difference indicates that (power-)additive components are present, such as background noise; a negative difference indicates that (power-)components were lost, such as in VoIP networks with packet losses. The last processing stage consists of the so-called cognitive modeling step, where disturbances are evaluated and aggregated using cognitive “insights” to form a final PESQ quality estimate. For example, positive disturbances and disturbances occurring during active speech periods are thought to be more objectionable than negative disturbances and those occurring during silent intervals, respectively, and thus receive higher weights. Additionally, disturbance integration is performed in three steps: first over frequency, then over short-time utterance intervals, and finally over the whole speech signal. Different p values are used in the Lp-norm integrations of the three steps to maximize the correlation with subjective quality ratings. It was later observed that an additional mapping was needed in order to fit the PESQ raw scores onto the five-point MOS–LQO scale; this mapping is described in ITU standard P.862.1 (ITU-T Rec. P.862.1, 2003).
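The three-step disturbance integration can be sketched as nested Lp norms. The disturbance values, interval grouping, and p exponents below are illustrative assumptions; P.862 specifies its own constants and weightings.

```python
# Sketch of PESQ-style multi-step Lp aggregation of a disturbance
# matrix: first over frequency bands, then over short utterance
# intervals, then over the whole signal. The disturbance values and
# the p exponents are illustrative assumptions, not the standard's.
import numpy as np

def lp(values, p, axis=None):
    """Power-mean (Lp) aggregation of nonnegative disturbances."""
    return np.mean(np.abs(values) ** p, axis=axis) ** (1.0 / p)

rng = np.random.default_rng(2)
disturbance = rng.random((120, 42))        # 120 frames x 42 Bark bands

per_frame = lp(disturbance, p=2.0, axis=1)  # step 1: over frequency
intervals = per_frame.reshape(6, 20)        # 6 short utterance chunks
per_interval = lp(intervals, p=6.0, axis=1) # step 2: over intervals
overall = lp(per_interval, p=2.0)           # step 3: over the signal
print(float(overall))
```

Larger p values make the aggregate more sensitive to isolated severe disturbances, which is why a higher exponent at the interval step lets brief loud artifacts dominate the score.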

In its original form, PESQ can only be used with narrowband speech content. Given the advances in wideband (up to 8 kHz bandwidth) technologies, a wideband extension of PESQ was later standardized in 2005 (ITU-T Rec. P.862.2, 2005). Despite these extensions, the PESQ application guide published at the end of 2007 suggested that the scope of PESQ was inadequate for burgeoning speech networks and technologies. PESQ did not cover electroacoustic transducers, voice quality enhancement algorithms (e.g., noise reduction, bandwidth extension, and automatic gain control), and time-warping/scaling algorithms (ITU-T Rec. P.862.3, 2007). PESQ performance was also compromised for CDMA codecs (e.g., EVRC), strong linear distortions (e.g., phone's frequency shaping), gain variations, and VoIP-transmitted speech with variable delays around 1 s. In 2011, the new ITU-T P.863 POLQA algorithm was standardized to overcome these limitations and allow for the majority of existing fixed, mobile, IP-based, and hybrid telephone network scenarios to be covered, either in narrow-, wide-, or superwide-band (up to 24 kHz speech bandwidth) modes.

The POLQA RBA is a joint development of three companies from Germany, Switzerland, and the Netherlands, two of which were involved in the development of PESQ. While the core of the POLQA algorithm is similar to that of PESQ, two key modifications have been introduced to tackle the PESQ weaknesses described above. First, a new time alignment algorithm has been proposed. In superwide-band mode, the original reference signal needs to be sampled at 48 kHz. The degraded signal, however, may have been downsampled due to bandwidth constraints, and this band limitation is considered a degradation that has to be scored accordingly. POLQA is able to handle this sampling rate discrepancy in conjunction with the delay estimation procedure by analyzing the histogram of estimated delay variations. Similar to PESQ, the delay estimation algorithm operates in multiple steps. First, an overall delay is estimated using the cross correlation between (i) the x(t) and y(t) signals in their entirety, and (ii) the first and second halves of the signals, respectively. So-called “reparse points,” which correspond to transition points between silence and active speech (and vice versa), are then found using a complex procedure involving a voice activity detector (VAD) applied to both the reference and degraded speech signals. Once reparse points are found in both signals, their time difference is computed as a coarse estimate of the delay for a particular reparse section. This procedure is conceptually similar to that of PESQ in the sense that it provides delay estimates for each “utterance.” Unlike PESQ, however, the refinement of this coarse estimate involves a multidimensional search in the per-frame signal energy and fractal dimension (used as a measure of signal complexity, which is assumed to be indirectly related to noise) domains using two different frame sizes. 
A backtracking algorithm similar to the Viterbi algorithm is then used to obtain a finer delay estimate for each speech frame. An additional cross-correlation-based refinement step estimates the exact delay in samples. As can be expected, this time alignment procedure is fairly complex and time consuming. For applications in which fixed or piecewise-fixed delays are present, POLQA uses a simpler alternative prealignment method. It is similar to that in PESQ in the sense that it finds a coarse estimate based on the cross correlation of signal envelopes and then refines this estimate by matching segments of the original signal with those of the degraded signal. This segmentwise matching procedure is also used to find segments missing from the degraded signal due to, for example, packet losses. This fast alternative may produce grossly inaccurate delay estimates if time-varying delay is present. As such, an exclusion criterion is run to verify the applicability of the fast prealignment method. If the criterion is not met, the more complex method described above is used.
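The envelope-based coarse alignment described above can be sketched as follows. This is an illustrative sketch only, not the P.863 procedure: the frame size and the use of per-frame energy as the “envelope” are assumptions made here for clarity.

```python
import numpy as np

def estimate_delay(reference, degraded, frame=64):
    """Coarse delay estimate (in samples) from the cross correlation of
    per-frame energy envelopes, in the spirit of the prealignment step
    described above.  Illustrative sketch, not the standardized code."""
    def envelope(x):
        n = len(x) // frame
        return np.array([np.sum(x[i*frame:(i+1)*frame] ** 2) for i in range(n)])

    e_ref = envelope(reference)
    e_deg = envelope(degraded)
    # Full cross correlation of mean-removed envelopes; the lag of the
    # maximum gives the delay in frames, converted back to samples.
    xc = np.correlate(e_deg - e_deg.mean(), e_ref - e_ref.mean(), mode="full")
    lag_frames = np.argmax(xc) - (len(e_ref) - 1)
    return lag_frames * frame
```

In a full prealignment method, this coarse estimate would then be refined by segmentwise matching, as described above.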

The second major difference between POLQA and PESQ lies in the perceptual model (see Figure 30.1, where shaded blocks represent the different modules available with POLQA) and in the processing of the reference and degraded speech signals. More specifically, POLQA introduces the concept of idealization. Since listeners in a MOS test are not presented with a reference signal, they likely judge quality against an internal “ideal” reference. This idealization process is modeled based on subjective experiments that measured human sensitivity to nonoptimal presentation levels and timbre, and to low levels of noise, when listening to the reference speech signal. As such, POLQA modifies the reference signal in three different stages (see shaded blocks in Figure 30.1) to model this idealization process. Modifications to the degraded signal and its internal representation are also performed, but these are motivated by higher-level cognitive processes. For example, humans are relatively insensitive to linear frequency response distortions and to steady-state wideband noises, so such disturbances are partially compensated for by the POLQA RBA. Additional processing steps available exclusively with POLQA are frequency dewarping (e.g., for speech processed by bandwidth extension methods), spectral and temporal masking, and playback-level calibration. The latter is performed because playback level has an important impact on perceived quality, particularly in superwide-band mode, and needs to be accounted for. In essence, these modifications are performed to preserve only relevant speech information and discard unwanted signal components. The goal is to focus only on speech-related distortions and have additional modules or indicators characterize disturbances related to frequency response, noise, and reverberation, as shown in Figure 30.1.
The signal processing steps shown in Figure 30.1 are performed four times using different internal parameters, such that four variants of the time–frequency “disturbance densities” are computed. These variants focus on general distortions, large general distortions, additive distortions only, and large additive distortions. Disturbance densities are aggregated over pitch, short-time utterance intervals, and time using different Lp norm integrations. Aggregate disturbance densities are then combined with level and spectral flatness indicators into a final POLQA raw score. Finally, the raw score is mapped to the MOS–LQO scale using a third-order polynomial mapping that was optimized for the databases used in the P.863 standardization process. In narrow-band mode, the maximum score is 4.5, while in superwide-band mode it is 4.75, which corresponds to the extended MOS scale used with high-definition (HD) voice technologies.
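The aggregation and mapping steps can be illustrated as follows. The polynomial coefficients below are placeholders chosen for illustration; the standardized P.863 coefficients are not reproduced here.

```python
import numpy as np

def lp_norm(values, p):
    """L_p norm-style average used to aggregate disturbance densities:
    larger p emphasizes the worst (largest) disturbances."""
    v = np.asarray(values, dtype=float)
    return float(np.mean(np.abs(v) ** p) ** (1.0 / p))

def raw_to_mos(raw, coeffs=(-0.005, 0.1, 0.8, 1.0), lo=1.0, hi=4.75):
    """Third-order polynomial mapping from a raw internal score to the
    MOS-LQO scale, clipped to the valid range.  Placeholder coefficients;
    the upper limit 4.75 corresponds to superwide-band mode."""
    a3, a2, a1, a0 = coeffs
    mos = a3 * raw**3 + a2 * raw**2 + a1 * raw + a0
    return float(np.clip(mos, lo, hi))
```

Note how raising p in the Lp norm makes a single large disturbance dominate the aggregate, which is the intended perceptual behavior.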


FIGURE 30.1 Block diagram of perceptual model used by PESQ and POLQA. Shaded blocks are exclusive to POLQA and are not available with the PESQ RBA. (*In PESQ, the bark scale mapping is followed by a loudness scale mapping, which is omitted in the block diagram due to space constraints.)

The combined advancements available with ITU-T P.863 POLQA allow for a new range of applications that could not previously be assessed using PESQ, such as voice quality enhancement algorithms and terminals using head-and-torso simulators (HATS). At the time of writing, third-party (i.e., outside the standardization process) POLQA performance results under these special conditions are not available. ITU-T standardization tests did suggest, however, that POLQA provided inaccurate predictions when used with acoustical recordings using free-field microphones without HATS or ear-canal simulators and with some enhanced variable rate codec (EVRC) speech codecs. Moreover, POLQA has not been validated for use with very LBR coding technologies (<4 kilobits/s).

30.6 Reference-Free Algorithms

 

Unlike RBAs, reference-free algorithms (RFAs) do not depend on a clean reference signal. RFAs have gained much attention recently as they can be used while the network is in service, obviating the need to take the network “offline” to inject controlled reference signals for quality measurement. RFAs function like participants in MOS listening tests who are presented only with the degraded speech signal and then rate its quality. Humans, through extensive usage of telephony services, have acquired knowledge of normal and abnormal phenomena in speech sounds and use this prior knowledge to judge signal quality. Unlike the POLQA RBA, which uses an idealization process to modify the reference signal to take this prior knowledge into account, existing RFAs employ models of normative speech production (Gray et al. 2000), speech perception (ANSI Rec. ANIQUE+ 2006), speech signal feature likelihood (Falk and Chan 2006), or a combination thereof for single-ended quality assessment. The last approach, a combination of models, is used by the ITU-T P.563 algorithm, which was standardized in 2004 after a two-year competition (ITU-T Rec. P.563, 2004). In this competition, two systems stood out: the standardized P.563 algorithm and the runner-up ANIQUE+ algorithm, which later became an American National Standards Institute (ANSI) standard (ANSI Rec. ANIQUE+ 2006). Focus will be placed here on the P.563 algorithm due to its popularity within the quality assessment community.
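The feature-likelihood idea can be illustrated in miniature: fit a density model to feature vectors extracted from clean, natural speech, then score a test signal by how likely its features are under that model. A single multivariate Gaussian is used below as a stand-in for the Gaussian mixture models of Falk and Chan (2006), and the feature vectors are hypothetical placeholders for perceptually motivated features.

```python
import numpy as np

def fit_gaussian(features):
    """Fit a multivariate Gaussian to feature vectors from natural
    speech; one component stands in for the mixtures used in practice."""
    mu = features.mean(axis=0)
    cov = np.cov(features, rowvar=False) + 1e-6 * np.eye(features.shape[1])
    return mu, cov

def log_likelihood(x, mu, cov):
    """Log density of a test feature vector under the natural-speech
    model; low values suggest unnatural (distorted) speech."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(cov)
    return float(-0.5 * (d * np.log(2 * np.pi) + logdet
                         + diff @ np.linalg.solve(cov, diff)))
```

A quality estimate would then be obtained by mapping such per-frame likelihoods to a MOS-like score via a trained regression, a step omitted here.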

Since RFAs do not have access to the clean reference speech signal, assumptions have to be made about the received signal. The P.563 algorithm combines three basic principles so that different distortion types can be covered (ITU-T Rec. P.563, 2004; Malfait et al. 2006). The first principle focuses on the human voice production system, modeling the vocal tract as a series of acoustic tubes with different time-varying cross-sectional areas. Cross-sectional areas estimated from the speech signal are evaluated for unnatural behavior, which indicates the presence of distorted speech. More specifically, pitch marks are first located and voiced/unvoiced classification performed using a normalized cross-correlation function. Pitch-synchronous vocal tract modeling is then applied, where the section areas of eight concatenated lossless acoustic tubes are obtained via linear predictive (LP) analysis of voiced speech segments. The resulting eight tube sections are then averaged to model three cavity articulators per pitch cycle, namely the rear, middle, and front cavities. Due to human articulatory limitations, temporal changes in vocal tract cavities are smooth. Unnaturally fast variations in the acoustic tube model or excessively large tube sections suggest that the speech signal being analyzed is distorted. In addition to vocal tract analysis, LP and cepstral analyses are performed on voiced speech segments. The skewness and kurtosis of the coefficients are then computed and checked against the restricted range expected for natural speech. Deviations from this normative range provide further evidence of distortions introduced into the speech signal. The second principle used by P.563 is to reconstruct a pseudoreference signal from the degraded signal by modifying the computed LP coefficients to fit the vocal tract model of a typical human speaker.
This is achieved by converting the LP coefficients to line spectral frequencies (LSFs), which are then quantized using two previously trained codebooks. The better of the two LSF approximations is used to reconstruct the pseudoreference signal. A “basic voice quality” measure is then computed using an RBA with the pseudoreference and degraded signals as input. The RBA used is a simplified version of PESQ without the temporal alignment module and with a modified “cognitive” mapping function. The third principle used by P.563 is to identify specific distortions encountered in voice channels, such as noise (additive and multiplicative), temporal clippings, and robotization effects (voice with metallic sounds). Additive noise is characterized by an estimated SNR level, which is obtained using VAD decisions computed in the preprocessing stage; both global and local (i.e., between phonemes) levels are estimated. Other parameters such as high-frequency spectral flatness and spectral clarity are also computed to characterize the noise source. Similarly, a segmental SNR measure is used to characterize the multiplicative noise (speech-correlated noise) introduced by cascaded logarithmic pulse-code modulation (PCM) and adaptive differential PCM systems or by waveform speech codecs (Malfait et al. 2006). The analysis of multiplicative noise is based on the evaluation of spectral statistics (i.e., range and deviation), where it is assumed that the noise has a flat spectral characteristic that forms a noise floor during active speech periods. Temporal clippings, in turn, are commonly encountered in packet-based networks. P.563 assumes two types of clippings: mutes/interruptions and front/back clippings. Mutes or speech interruptions can occur when silence or comfort noise is inserted as a simple PLC strategy during active speech periods. Front and back clippings may occur in systems that use VADs, such as digital circuit multiplication equipment (Malfait et al. 2006).
Such clippings are detected by analyzing abrupt variations in the signal envelope and are defined as signal level drops of more than 30 dB within 8 ms or rises of more than 30 dB within 16 ms (Malfait et al. 2006). Lastly, robotization effects often result from transmission errors in mobile radio networks (e.g., GSM) or from LBR codecs, where voiced speech segments sound highly periodic or “robotic.” To detect such events, P.563 monitors signal periodicity in the 2.2–3.3 kHz range by means of cross-correlation analysis; similar periodicity measurements are used to detect frame-repeat and unnatural beep impairments. The last step in the P.563 signal processing chain consists of distortion classification and distortion-dependent perceptual weighting. Using a subset of the extracted (internal) parameters, the P.563 algorithm detects one of six major distortion classes, listed here in decreasing order of “annoyance”: high level of background noise, signal interruptions, signal-correlated noise, speech robotization, unnatural male speech, and unnatural female speech. The assumption is that human listeners will focus only on the dominant distortion when different types of degradation occur simultaneously (Malfait et al. 2006). For each dominant distortion class, a subset of the extracted parameters is used to compute an intermediate quality rating. Once a major distortion class is detected, the intermediate score is linearly combined with eleven other parameters to derive a final P.563 raw score. During the standardization process, P.563 attained acceptable performance for a wide range of test factors, coding technologies, and applications. It was not suitable, however, for LBR codecs or for applications involving suboptimal listening levels, talker echoes, or sidetones.
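The first principle above, an acoustic tube model derived from LP analysis, can be sketched as follows. The Levinson–Durbin recursion yields reflection coefficients as a by-product, from which relative tube areas follow. This is a textbook sketch: the sign convention for reflection coefficients varies across texts, and the unit glottal area is an assumption made here, not part of P.563.

```python
import numpy as np

def levinson_durbin(r, order):
    """Levinson-Durbin recursion: autocorrelation sequence r[0..order]
    -> LP coefficients a (with a[0] = 1), reflection coefficients k,
    and the final prediction error."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    k = np.zeros(order)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[1:i][::-1])
        ki = -acc / err
        k[i - 1] = ki
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + ki * a_prev[i - j]
        a[i] = ki
        err *= 1.0 - ki * ki  # prediction error shrinks at each step
    return a, k, err

def tube_areas(k, glottal_area=1.0):
    """Cross-sectional areas of a concatenated lossless tube model:
    each reflection coefficient relates two adjacent tube sections via
    A[i+1] = A[i] * (1 - k[i]) / (1 + k[i])."""
    areas = [glottal_area]
    for ki in k:
        areas.append(areas[-1] * (1.0 - ki) / (1.0 + ki))
    return np.array(areas)
```

With an LP order of 8 on a voiced segment, `tube_areas` gives the eight section areas whose temporal behavior P.563 checks for naturalness.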

30.7 Parametric Methods

 

Machine “intelligence” can play a critical enabling role in assuring that customers serviced by machines are provided with a satisfying quality of experience (QoE). An important component of QoE assurance is quality monitoring. Network-wide automated connection quality monitoring enables not only QoE assurance, but also diagnosis and amelioration of faults. Implementation cost, however, is a critical factor in assessing the viability of network-wide deployment. If a monitoring function is located on portable devices, power consumption also becomes an issue. A distinct advantage of parametric methods is that they can offer very low computational complexity, a key determinant of implementation cost and power consumption. Parametric algorithms derive their computational economy from having access to, or being able to compute, one or a few statistics that sufficiently capture the degree of speech degradation. As an example, let us consider monitoring a set of VoIP connections that use a specific speech codec. At the transmit end, the encoder bit stream is loaded onto voice packets in a specific manner, for example, every two 10 ms speech frames onto one voice packet. Under perfect transmission conditions, the voice packets will arrive at the speech decoder with no errors and in exactly the order they were generated, and the difference between the arrival times of consecutive packets at the decoder will exactly equal the difference between the times at which they were dispatched. The decoded speech signal would suffer no transmission-related degradation and would be identical, except for a constant time delay due to transit through the network, to the best speech signal the speech encoder could synthesize from the encoded bits.
However, due to network transmission faults such as bit errors, packet discards, and variable transmission delay, some speech packets will not be available error-free by the deadline at which the decoder needs to start decoding their contents in order to maintain the integrity (e.g., time continuity) of the speech signal. In such cases, any speech degradation beyond that introduced by speech encoding is due solely to packet losses. Without any packet loss, the listening quality of the decoded speech signal is identical to the quality that can be synthesized by the encoder before transmission. Thus, akin to the E-model, wherein the overall impairments due to different degradation sources are combined by summing negative quality ratings corresponding to the sources, the listening quality of the VoIP speech at the receiving end can be estimated by measuring the quality degradation due to packet losses and then using the measurement to adjust the speech quality estimated at the encoder.
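The additive combination described above can be written down directly; the function below and its floor value are an illustrative sketch of the reasoning, not a standardized formula.

```python
def receive_side_mos(encoder_mos, loss_degradation, mos_min=1.0):
    """E-model-style combination: impairments add on a degradation
    scale, so the receive-side estimate is the encoder-side quality
    minus the degradation attributed to packet losses, floored at
    the bottom of the MOS scale."""
    return max(mos_min, encoder_mos - loss_degradation)
```

For example, an encoder-side quality of 4.1 MOS with a 0.6 MOS packet-loss penalty yields a receive-side estimate of 3.5 MOS.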

Packet loss events can be observed at the decoder and can be modeled as a binary-valued discrete-time random process. A simple approach to estimating quality degradation is to compute statistics of the process and map them to a degradation estimate. For instance, it is easy to see that the severity of speech degradation increases with the average rate of packet loss and, for a given loss rate, with the average length of loss bursts. Thus, it is reasonable to expect the severity of degradation to be a monotonically increasing function of each of these two statistics. A mapping from these statistics to a degradation estimate can be obtained by applying regression or machine-learning methods to training data. The data comprise “examples,” each a triplet: the two loss statistics and the corresponding amount of quality degradation. The last item can be obtained either from subjective listening tests or from an RBA such as PESQ or POLQA. By experimenting with different regression functions or machine-learning methods, one would obtain a suitable mapping from the loss statistics to a degradation estimate. The goal is to find mappings that perform well on novel data, that is, data not in the training set. This goal can be better met by selecting from mappings that have very few parameters that need to be adjusted by the training or regression algorithm, and by validating the trained mapping on novel data. Readers interested in exploring machine-learning techniques can refer to Bishop (2007). The degradation estimate provided by the mapping can be combined with the speech quality estimated at the encoder end, if that estimate is made available to the decoder end, for example, by piggybacking it onto VoIP packets. If unavailable, the degradation estimate can be treated as an amount of reduction of the average speech quality provided at the encoder end.
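A minimal version of this regression step is sketched below, with hypothetical training triplets (the numbers are invented for illustration; real targets would come from listening tests or an RBA) and a mapping that is linear in its parameters, fitted by least squares as a stand-in for a more general machine-learning method.

```python
import numpy as np

# Hypothetical training triplets: (loss rate in %, mean burst length
# in packets, quality degradation in MOS units).
train = np.array([
    [0.0,  0.0, 0.0],
    [1.0,  1.0, 0.2],
    [3.0,  1.5, 0.6],
    [5.0,  2.0, 1.0],
    [10.0, 3.0, 1.9],
])

# Linear-in-parameters mapping: degradation ~ w0 + w1*rate + w2*burst.
X = np.column_stack([np.ones(len(train)), train[:, 0], train[:, 1]])
w, *_ = np.linalg.lstsq(X, train[:, 2], rcond=None)

def predict_degradation(loss_rate, burst_len):
    """Map the two loss statistics to a degradation estimate."""
    return float(w[0] + w[1] * loss_rate + w[2] * burst_len)
```

In practice the trained mapping would be validated on held-out loss conditions before deployment, as noted above.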

An embodiment of the above parametric method that has been deployed on VoIP links is VQmon (Clark 2001), which uses a Markov chain to model the packet loss process observed at the receiving end. The empirically observed chain parameters are mapped to an estimate of degradation severity. The parametric method can also be applied to analog link parameters. For instance, Karlsson et al. (1999) reported mapping radio link quality parameters measured at GSM radio receivers to a “speech quality index” and obtained more accurate speech quality estimates than PSQM, the best available RBA at the time. The parametric approach can work well when the unknown amount of (additional) degradation can be attributed to one phenomenon that can be readily measured. When the scenario is complicated by a greater number and variety of degradation sources and interactions among them, hybrid parametric-cum-signal-based methods can be used. An embodiment of this approach is described in Falk and Chan (2008), which proposes a mapping that combines a “base quality” with measurements made on the received speech signal to obtain a quality estimate. The base quality can be obtained from a lookup table. Discrete attributes (e.g., codec type, number of speech frames in a packet) and discretized values (e.g., packet loss rate, average packet-loss-burst length) are used to index the table entries. The signal measurements are then used to cover degradations that are not included in the table, for example, the level of background noise and the severity of artifacts due to noise reduction processing.
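A Markov chain loss model of the kind used in VQmon can be sketched as a two-state (Gilbert-style) chain whose transition probabilities are estimated from the observed 0/1 loss sequence. This illustrates the modeling idea only; it is not the VQmon algorithm itself.

```python
import numpy as np

def gilbert_params(losses):
    """Estimate the transition probabilities of a two-state (Gilbert-
    style) Markov chain from a 0/1 loss sequence (1 = packet lost),
    and derive the stationary loss rate and mean burst length."""
    losses = np.asarray(losses)
    prev, curr = losses[:-1], losses[1:]
    # p: P(loss | previous received);  q: P(received | previous lost)
    n_received = np.sum(prev == 0)
    n_lost = np.sum(prev == 1)
    p = np.sum((prev == 0) & (curr == 1)) / max(n_received, 1)
    q = np.sum((prev == 1) & (curr == 0)) / max(n_lost, 1)
    loss_rate = p / (p + q) if p + q > 0 else 0.0
    mean_burst = 1.0 / q if q > 0 else float("inf")
    return p, q, loss_rate, mean_burst
```

The estimated parameters (or the derived loss rate and burst length) would then be mapped to a degradation severity, as described above.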

The above parametric measurement examples are not based on standard algorithms. Existing signal-based standard algorithms consume considerably more computational resources than the above parametric algorithms. Network and device design engineers will have to decide whether it is necessary to conform to standards when nonconformance may yield significant cost savings. Conformance may not be necessary for measurements done within “closed” proprietary (sub)systems. On the other hand, standard algorithms may be required to ensure interoperability between equipment built by different manufacturers. Also, conformance to standards is usually demanded by customers of link and network test equipment.

30.8 Usage Considerations

 

Speech quality measurement algorithms are designed and tested for a variety of degradation conditions. For ITU standard algorithms, the conditions under which the algorithms have been tested and the outcomes of the tests are documented. Such information provides important clues to users about the reliability of the quality estimates the algorithms provide. For each of the conditions listed in the documentation, the algorithm may be characterized as known to provide reliable or unreliable estimates, or as having undetermined or unresolved reliability. If one desires to apply a standard algorithm to conditions for which the algorithm is not known to provide reliable estimates, additional tests need to be done to validate the estimates, or additional evidence gathered to calibrate them. For instance, suppose one seeks to determine whether a standard algorithm can be “customized” to assess quality under conditions for which the reliability of the algorithm's estimates is uncertain. Suppose a set of speech signals covering the conditions of interest can be ranked in decreasing order of listening quality, but their MOS–LQS values are unknown. The standard RBA or RFA can be applied to the signals to obtain a corresponding set of MOS–LQO values. These values provide an objective ranking of the signals. The agreement between the subjective and objective rankings can be assessed using Spearman's rank correlation coefficient. If this coefficient turns out to be statistically significant and has a value close to 1, the standard algorithm is deemed able to discriminate well between the different quality levels represented by the speech signals. This capability could satisfy applications that need only discrimination by quality category as opposed to absolute MOS values. If, in addition, MOS–LQS values are available for the set of speech signals, a mapping similar to the one described in ITU-T Rec. P.862.1 (2003), or the one used in P.863 to map raw scores to MOS–LQO, can be designed to maximize the Pearson correlation (or minimize the mean square error) between the MOS–LQS values and the mapping outputs.
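The two validation steps above, rank correlation and MOS mapping design, can be sketched as follows. The Spearman implementation assumes tie-free data (with ties, average ranks should be used), and the polynomial degree follows the third-order mappings mentioned above.

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman's rank correlation: the Pearson correlation of the
    ranks.  Assumes no tied values."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return float(np.corrcoef(rx, ry)[0, 1])

def fit_mos_mapping(raw_scores, mos_lqs, degree=3):
    """Least-squares polynomial mapping from objective raw scores to
    MOS-LQS, analogous in spirit to the P.862.1 / P.863 mappings."""
    coeffs = np.polyfit(raw_scores, mos_lqs, degree)
    return lambda r: float(np.polyval(coeffs, r))
```

Minimizing the mean square error of the polynomial fit also maximizes the Pearson correlation for a mapping of this linear-in-parameters form.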

Technology for automating the measurement of user-perceived quality is rapidly evolving. The proliferation of mobile electronic devices and networking makes human-machine interaction a fact of life and provides unprecedented opportunities for machines to amass human-generated data and responses. Mining such information will accelerate the development of automated quality assessment technology and standards.

References

ATIS Recommendation PP-0100005.2006. 2006. Auditory non-intrusive quality estimation plus (ANIQUE+): Perceptual model for non-intrusive estimation of narrow-band speech quality. American National Standards Institute (ANSI).

Beerends, J., Hekstra, A., Rix, A., and Hollier, M. 2002. Perceptual evaluation of speech quality (PESQ), the new ITU standard for end-to-end speech quality assessment. Part II: Psychoacoustic model. Journal of the Audio Engineering Society 50(10): 765–778.

Bishop, C. M. 2007. Pattern Recognition and Machine Learning, 2nd edition. New York, NY: Springer.

Deller, J. R., Jr., Hansen, J. H. L., and Proakis, J. G. 1999. Discrete-Time Processing of Speech Signals. Piscataway, NJ: Wiley-IEEE Press.

Clark, A. D. 2001. Modeling the effects of burst packet loss and recency on subjective voice quality. Proceedings of the IP-Telephony Workshop 123–127.

Falk, T. and Chan, W.-Y. 2006. Nonintrusive speech quality estimation using Gaussian mixture models. IEEE Signal Processing Letters 13(2): 108–111.

Falk, T. and Chan, W.-Y. 2008. Hybrid signal-and-link-parametric speech quality measurement for VoIP communications. IEEE Transactions on Audio, Speech, and Language Processing 16(8): 1579–1589.

Gibson, J.D., Ed. 2002. The Communications Handbook, 2nd edition. Boca Raton, FL: CRC Press.

Gray, P., Hollier, M., and Massara, R. 2000. Non-intrusive speech quality assessment using vocal tract models. IEE Proceedings—Vision, Image, and Signal Processing 147(6): 493–501.

ITU-T Recommendation G.107. 2005. The E-model, a computational model for use in transmission planning. International Telecommunication Union.

ITU-T Recommendation P.561. 2002. In-service non-intrusive measurement device—Voice service measurements. International Telecommunication Union.

ITU-T Recommendation P.563. 2004. Single ended method for objective speech quality assessment in narrow-band telephony applications. International Telecommunication Union.

ITU-T Recommendation P.800. 1996. Methods for subjective determination of transmission quality. International Telecommunication Union.

ITU-T Recommendation P.835. 2003. Subjective test methodology for evaluating speech communication systems that include noise suppression algorithms. International Telecommunication Union.

ITU-T Recommendation P.861. 1996. Perceptual speech quality measure: Objective quality measurement of telephone-band speech codecs. International Telecommunication Union.

ITU-T Recommendation P.862. 2001. Perceptual evaluation of speech quality: An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs. International Telecommunication Union.

ITU-T Recommendation P.862.1. 2003. Mapping function for transforming P.862 raw result scores to MOS-LQO. International Telecommunication Union.

ITU-T Recommendation P.862.2. 2005. Wideband extension to Recommendation P.862 for the assessment of wideband telephone networks and speech codecs. International Telecommunication Union.

ITU-T Recommendation P.862.3. 2007. Application guide for objective quality measurement based on Recommendations P.862, P.862.1 and P.862.2. International Telecommunication Union.

ITU-T Recommendation P.863. 2011. Perceptual objective listening quality assessment. International Telecommunication Union.

Karlsson, A., Heikkila, G., Minde, T. B., Nordlund, M., Timus, B., and Wiren, N. 1999. Radio link parameter based speech quality index—SQI. Proc. 1999 IEEE Workshop on Speech Coding 147–149.

Malfait, L., Berger, J., and Kastner, M. 2006. P.563: The ITU-T standard for single-ended speech quality assessment. IEEE Transactions on Audio, Speech, and Language Processing 14(6): 1924–1934.

Quackenbush, S., Barnwell, T., and Clements, M. 1988. Objective Measures of Speech Quality. Englewood Cliffs, New Jersey: Prentice-Hall.

Rix, A., Hollier, M., Hekstra, A., and Beerends, J. 2002. Perceptual evaluation of speech quality (PESQ), the new ITU standard for end-to-end speech quality assessment. Part I: Time-delay compensation. Journal of the Audio Engineering Society 50(10): 755–764.

Wang, Z., Bovik, A.C., Sheikh, H.R., and Simoncelli, E.P. 2004. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing 13(4): 600–612.

Winkler, S. and Mohandas, P. 2008. The evolution of video quality measurement: From PSNR to hybrid metrics. IEEE Transactions on Broadcasting 54(3): 660–668.
