1.4. Speaker Recognition: Reality and Challenge

The goal of speaker recognition is to verify an individual's identity based on his or her voice. Because voice is one of the most natural forms of communication, identifying people by voice has drawn the attention of lawyers, judges, investigators, and law enforcement agencies [355]. The recent proliferation of home banking has also opened up business opportunities for vendors marketing speaker verification products. The following list summarizes some of the potential applications of speaker recognition.[3]

[3] For more potential applications and a list of vendors marketing speaker recognition products, visit http://www.biometrics.org.

  • Securing online transactions. Financial institutions and banks can use speaker verification to enhance their e-banking and phone banking services. Customers' voices can be used together with passwords to verify the identity of individuals before transactions take place. For example, in 1999, speaker verification technologies were used to enhance the user-friendliness and security of BACOB's phone banking systems [144]. Customers register with the system by uttering three short, random texts or passwords. Before a transaction, the system prompts the customers to utter one of these passwords.

  • Securing critical medical records. Speaker verification offers a means of verifying the identity of an individual who needs to access his or her own medical records via phone or Internet. Medical personnel can also use this technology to authenticate themselves before accessing the medical records of patients.

  • Preventing benefit fraud. Speaker verification can be used by governments to track individuals claiming benefits. If the voices of social benefit recipients are stored in a database, any fraudulent attempts to claim benefits twice can be detected.

  • Resetting passwords. A high proportion of phone calls to help desks are requests for resetting passwords. Speaker verification can help automate the password reset process.

  • Voice indexing. Speaker verification can be applied to create indexes for broadcast news. Given hours of news recordings containing the speech of several news reporters, it is possible to use a small part of the recordings to build a speaker model for each reporter. Once speaker models have been created, the time intervals during which a particular reporter is speaking can be spotted automatically.
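The last step of the voice indexing application, turning per-window decisions into time intervals, can be sketched as follows. This is a minimal illustration, not a description of any particular product: it assumes the recording has already been cut into fixed-length windows and that each window has already been labeled with the best-matching reporter model. The function name, the window length, and the reporter names are all hypothetical.

```python
def merge_labels(labels, window_seconds):
    """Collapse per-window speaker labels into (speaker, start, end) intervals.

    `labels` holds one speaker name per fixed-length window, in time order.
    How each window was labeled (i.e., the speaker models) is outside this sketch.
    """
    intervals = []
    for i, speaker in enumerate(labels):
        start = i * window_seconds
        end = start + window_seconds
        if intervals and intervals[-1][0] == speaker:
            # Same speaker as the previous window: extend the open interval.
            intervals[-1] = (speaker, intervals[-1][1], end)
        else:
            # Speaker change: open a new interval.
            intervals.append((speaker, start, end))
    return intervals


# Hypothetical labels for six consecutive 2-second windows.
labels = ["anna", "anna", "ben", "ben", "ben", "anna"]
index = merge_labels(labels, window_seconds=2.0)
for speaker, start, end in index:
    print(f"{speaker}: {start:.1f}s to {end:.1f}s")
```

A practical system would likely also smooth isolated mislabeled windows before merging; that step is omitted here.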

Voice Biometrics

To better understand what current technologies can offer, this section examines the results of a recent NIST speaker recognition evaluation [288]. In this evaluation, a system trained on a two-minute cellular phone conversation for each target speaker achieved a false acceptance (false alarm) rate (FAR) of 5% and a false rejection (miss) rate (FRR) of 10% given test segments (also cellular phone conversations) of 26 to 35 seconds. Handset mismatch was also found to play an important role in degrading performance. For example, Przybocki and Martin [288] reported that at a 5% miss rate, the FAR could increase from 1.5% to 10% when enrollment and verification sessions use different handsets.
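The error rates quoted above depend on where the decision threshold is placed. The following sketch shows how FAR and FRR might be computed from verification scores, and how an equal error rate (the operating point where the two rates coincide) can be approximated by sweeping the threshold. The scores are invented for illustration and the function names are hypothetical; this is not the NIST scoring procedure.

```python
def error_rates(target_scores, impostor_scores, threshold):
    """Return (far, frr) for the rule 'accept if score >= threshold'."""
    false_accepts = sum(1 for s in impostor_scores if s >= threshold)
    false_rejects = sum(1 for s in target_scores if s < threshold)
    far = false_accepts / len(impostor_scores)  # impostors wrongly accepted
    frr = false_rejects / len(target_scores)    # true speakers wrongly rejected
    return far, frr


def approximate_eer(target_scores, impostor_scores):
    """Sweep observed scores as thresholds; return (threshold, eer) where
    FAR and FRR are closest. The EER is taken as their average at that point."""
    best = None
    for t in sorted(set(target_scores) | set(impostor_scores)):
        far, frr = error_rates(target_scores, impostor_scores, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, t, (far + frr) / 2)
    return best[1], best[2]


# Made-up scores: higher means "more likely the claimed speaker".
target = [2.1, 1.7, 0.9, 1.4, 0.2, 1.9, 1.1, 2.5, 0.6, 1.3]
impostor = [-1.2, 0.1, -0.4, 0.8, -0.9, 0.3, -1.5, 1.0, -0.2, -0.7]

far, frr = error_rates(target, impostor, threshold=0.5)
eer_threshold, eer = approximate_eer(target, impostor)
print(f"FAR={far:.2f} FRR={frr:.2f} EER~{eer:.2f} at threshold {eer_threshold}")
```

Raising the threshold lowers FAR at the cost of a higher FRR, which is why results are often reported as one rate at a fixed value of the other.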

More recently, it was found that fusing low-level spectral features with high-level speaker information, such as idiolectal and prosodic information, can dramatically reduce error rates. For example, a recent report shows that with eight conversations (2.5 minutes each) for training a speaker model, and 2.5 minutes of speech for each verification session, the equal error rate can be reduced from 0.7% to 0.22% [42]. Compared to systems that use low-level features only, this represents a relative reduction of roughly 69% in the equal error rate. While high-level features can significantly improve speaker verification performance, they require long utterances to be effective, which may limit their applicability.
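A relative reduction is measured against the baseline error rate: (baseline - improved) / baseline. The short sketch below applies that formula to the two equal error rates quoted above; the function and variable names are illustrative only.

```python
def relative_reduction(baseline, improved):
    """Fraction by which `improved` is lower than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


eer_low_level = 0.7  # percent, low-level features only
eer_fused = 0.22     # percent, low-level and high-level features fused
reduction = relative_reduction(eer_low_level, eer_fused)
print(f"relative EER reduction: {reduction:.1%}")
```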

Commercial products have also been evaluated. For example, in 2000, the Centre for Communication Interface Research at the University of Edinburgh performed a large-scale evaluation of the Nuance Verifier [143]. The evaluation involved 1,000 participants making calls via the U.K. phone network to simulate phone banking services. The results show that with test utterances consisting of 19 digits, the Nuance Verifier achieved an equal error rate of 0.9%.

Although considerable progress has been made during the last decade, many unsolved problems still prevent voice biometrics from becoming ubiquitous. In particular, variations in speakers' voices over time (e.g., as a result of aging [108, 248]) could considerably affect system performance. Another challenge is that users may access a system through different devices (e.g., mobile phones, fixed-line handsets, speakerphones). Because different transducers introduce different degrees of distortion to speech signals, it is very difficult to compensate for their effect on speaker characteristics. The increasing popularity of mobile devices introduces another problem: coder distortion. For instance, if a person uses a carbon button handset over a wired network for enrollment and later uses an electret mobile handset over the cellular network for verification, the combination of handset and coder differences is likely to make the system classify him or her as an impostor. Finally, many speakers can alter their voices voluntarily, an ability that enables impersonators to attack speaker verification systems. Given these challenges, the following conclusion, presented at a recent conference on speech [31], is not surprising:

Despite the existence of technological solutions to some constrained applications, at the present time, there is no scientific process that enables one to uniquely characterize a person's voice or to identify with absolute certainty an individual from his or her voice.
