Explainer: Speaker Recognition
Speaker, or voice, recognition is a biometric modality that uses an individual’s voice for recognition purposes. (It is a different technology from “speech recognition,” which recognizes words as they are articulated and is not a biometric.) The speaker recognition process relies on features influenced by both the physical structure of an individual’s vocal tract and the behavioral characteristics of the individual.
It is a popular choice for remote authentication due to the availability of devices for collecting speech samples (e.g., telephone network and computer microphones) and its ease of integration. Speaker recognition is different from some other biometric methods in that speech samples are captured dynamically or over a period of time, such as a few seconds. Analysis occurs on a model in which changes over time are monitored, which is similar to other behavioral biometrics such as dynamic signature, gait, and keystroke recognition.
The physiological component of voice recognition is related to the physical shape of an individual’s vocal tract, which consists of an airway and the soft tissue cavities from which vocal sounds originate. To produce speech, these components work in combination with the physical movement of the jaw, tongue, and larynx and resonances in the nasal passages. The acoustic patterns of speech come from the physical characteristics of the airways.
Motion of the mouth and pronunciations are the behavioral components of this biometric. There are two forms of speaker recognition: text dependent (constrained mode) and text independent (unconstrained mode).
In a system using “text dependent” speech, the individual presents either a fixed or prompted phrase that is programmed into the system. Knowing the phrase in advance can improve performance, especially with cooperative users.
A “text independent” system has no advance knowledge of the presenter’s phrasing and is much more flexible in situations where the individual submitting the sample may be unaware of the collection or unwilling to cooperate, which presents a more difficult challenge.
Speech samples are waveforms with time on the horizontal axis and loudness on the vertical axis. The speaker recognition system analyzes the frequency content of the speech and compares characteristics such as the quality, duration, intensity dynamics, and pitch of the signal.
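As a rough illustration of what analyzing the loudness and frequency content of a waveform can mean in practice, the following minimal sketch computes the energy and the strongest frequency of one short frame. It is not taken from any particular speaker recognition system: the 8000 Hz sample rate, the synthetic 200 Hz tone standing in for recorded speech, and all function names are assumptions made for the example, and the transform is a plain discrete Fourier transform written out for clarity rather than speed.

```python
import cmath
import math

SAMPLE_RATE = 8000  # Hz; an assumed telephone-quality sampling rate


def make_tone(freq_hz, duration_s, amplitude=1.0, sample_rate=SAMPLE_RATE):
    """Synthesize a pure tone as a stand-in for a recorded speech frame."""
    n = int(duration_s * sample_rate)
    return [amplitude * math.sin(2 * math.pi * freq_hz * t / sample_rate)
            for t in range(n)]


def frame_energy(frame):
    """Mean squared amplitude: a simple proxy for loudness."""
    return sum(x * x for x in frame) / len(frame)


def magnitude_spectrum(frame):
    """Naive discrete Fourier transform magnitudes for bins 0..N/2."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def dominant_frequency(frame, sample_rate=SAMPLE_RATE):
    """Frequency in Hz of the strongest non-DC bin."""
    mags = magnitude_spectrum(frame)
    peak_bin = max(range(1, len(mags)), key=lambda k: mags[k])
    return peak_bin * sample_rate / len(frame)


# One 20 ms frame (160 samples) of a 200 Hz tone.
frame = make_tone(freq_hz=200.0, duration_s=0.02)
energy = frame_energy(frame)
peak_hz = dominant_frequency(frame)
```

A real system would slide such a frame along the whole utterance and would typically derive richer spectral features than a single peak, but the per-frame view of the signal is the same idea.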
In “text dependent” systems, during the collection or enrollment phase, the individual says a short word or phrase (utterance), typically captured using a microphone that can be as simple as a telephone. The voice sample is converted from an analog format to a digital format, the features of the individual’s voice are extracted, and then a model is created. Most “text dependent” speaker verification systems use the concept of Hidden Markov Models (HMMs), random-based models that provide a statistical representation of the sounds produced by the individual. The HMM represents the underlying variations and temporal changes found in the speech states using the quality / duration / intensity dynamics / pitch characteristics mentioned above.
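To make the idea of an HMM scoring an utterance more concrete, here is a minimal sketch of the standard forward algorithm on a toy model. Everything specific in it is an assumption for illustration: the two states, the two discrete observation symbols, and the probability values are invented, and real systems model continuous feature vectors rather than a two-symbol alphabet.

```python
def forward_likelihood(observations, start_prob, trans_prob, emit_prob):
    """Probability of an observation sequence under a discrete HMM.

    alpha[s] holds P(observations so far, current state = s).
    """
    n_states = len(start_prob)
    alpha = [start_prob[s] * emit_prob[s][observations[0]]
             for s in range(n_states)]
    for obs in observations[1:]:
        alpha = [
            emit_prob[s][obs]
            * sum(alpha[p] * trans_prob[p][s] for p in range(n_states))
            for s in range(n_states)
        ]
    return sum(alpha)


# Toy left-to-right model (hypothetical values): state 0 mostly emits
# symbol 0, state 1 mostly emits symbol 1, and the model always starts
# in state 0 and can only stay or move forward.
start_prob = [1.0, 0.0]
trans_prob = [[0.5, 0.5],
              [0.0, 1.0]]
emit_prob = [[0.9, 0.1],
             [0.2, 0.8]]

in_order = forward_likelihood([0, 0, 1, 1], start_prob, trans_prob, emit_prob)
reversed_order = forward_likelihood([1, 1, 0, 0], start_prob, trans_prob, emit_prob)
```

The sequence that follows the order the model expects receives a much higher probability than the same symbols in reverse, which is how an HMM captures the temporal structure of a known phrase.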
Another method is the Gaussian Mixture Model, a state-mapping model closely related to HMM, that is often used for unconstrained “text independent” applications. Like HMM, this method uses the voice to create a number of vector “states” representing the various sound forms, which are characteristic of the physiology and behavior of the individual.
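As a hedged sketch of how a Gaussian Mixture Model can score feature vectors, the code below evaluates the average log-likelihood of a few frames under a small mixture with diagonal covariances. The two-component, two-dimensional model and its parameter values are assumptions chosen for the example, and the sketch covers scoring only; fitting the model to enrollment data is not shown.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, weights, means, variances):
    """Log of the weighted sum of component densities (log-sum-exp form)."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def average_log_likelihood(frames, weights, means, variances):
    """Mean per-frame log-likelihood of a sequence of feature vectors."""
    return sum(gmm_log_likelihood(f, weights, means, variances)
               for f in frames) / len(frames)


# Hypothetical speaker model: two components in a 2-D feature space.
weights = [0.6, 0.4]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

close_frames = [[0.1, -0.2], [2.9, 3.1]]   # near the component means
distant_frames = [[10.0, 10.0], [-8.0, 7.0]]  # far from both components

close_score = average_log_likelihood(close_frames, weights, means, variances)
distant_score = average_log_likelihood(distant_frames, weights, means, variances)
```

Frames that fall near the model's components score higher than frames far from them, and because the frames are scored independently of their order, the approach does not depend on what was said.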
These methods all compare the similarities and differences between the input voice and the stored voice “states” to produce a recognition decision. After enrollment, during the recognition phase, the same quality / duration / loudness / pitch features are extracted from the submitted sample and compared to the model of the claimed or hypothesized identity and to models from other speakers. The other-speaker (or “anti-speaker”) models contain the “states” of a variety of individuals, not including that of the claimed or hypothesized identity. The input voice sample and enrolled models are compared to produce a “likelihood ratio,” indicating the likelihood that the input sample came from the claimed or hypothesized speaker. If the voice input belongs to the claimed or hypothesized identity, the score will show the sample to be more similar to that identity’s model than to the “anti-speaker” model.
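The likelihood-ratio comparison can be sketched in a few lines. This is a simplified illustration under stated assumptions, not a description of any deployed system: each model is reduced to a single one-dimensional Gaussian over a pitch-like feature, the parameter values and the zero threshold are hypothetical, and the function names are invented for the example.

```python
import math


def gaussian_log_pdf(x, mean, std):
    """Log density of a one-dimensional Gaussian."""
    return (-0.5 * math.log(2 * math.pi * std * std)
            - (x - mean) ** 2 / (2 * std * std))


def log_likelihood_ratio(features, claimed_model, anti_speaker_model):
    """Average per-frame log-likelihood ratio: claimed vs. anti-speaker model."""
    total = 0.0
    for x in features:
        total += (gaussian_log_pdf(x, *claimed_model)
                  - gaussian_log_pdf(x, *anti_speaker_model))
    return total / len(features)


def verify(features, claimed_model, anti_speaker_model, threshold=0.0):
    """Accept the identity claim when the ratio exceeds the threshold."""
    return log_likelihood_ratio(features, claimed_model, anti_speaker_model) > threshold


# Hypothetical (mean, standard deviation) pairs for a pitch-like feature in Hz.
claimed_model = (120.0, 10.0)       # narrow: one enrolled speaker
anti_speaker_model = (160.0, 40.0)  # broad: pooled from many other speakers

genuine_sample = [118.0, 122.0, 125.0, 119.0]
impostor_sample = [180.0, 175.0, 190.0, 185.0]

genuine_accepted = verify(genuine_sample, claimed_model, anti_speaker_model)
impostor_accepted = verify(impostor_sample, claimed_model, anti_speaker_model)
```

Raising the threshold makes the system stricter (fewer false acceptances, more false rejections), and lowering it does the opposite, so the value is a deployment choice rather than a property of the models.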
The seemingly easy implementation of speaker recognition systems contributes to their major weakness: susceptibility to transmission channel and microphone variability and noise.
Systems can face problems when end users have enrolled on a clean landline phone and attempt verification using a noisy cellular phone. The inability to control the factors affecting the input system can significantly decrease performance. Speaker verification systems, except those using prompted phrases, are also susceptible to spoofing attacks through the use of recorded voice. Anti-spoofing measures that require the utterance of a specified and random word or phrase are being implemented to combat this weakness.
For example, a system may request a randomly generated phrase, to prevent an attack from a pre-recorded voice sample. The user cannot anticipate the random sample that will be required and therefore cannot successfully attempt a “playback” spoofing attack on the system.
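A minimal sketch of such a random-phrase challenge follows. The word list, the phrase length, and the function names are hypothetical, and the transcript is assumed to come from a separate speech recognizer that is not shown. The check covers only whether the prompted words were spoken; confirming who spoke them remains the job of the speaker verification step.

```python
import secrets

# Hypothetical challenge vocabulary; a real deployment would use a much
# larger list so that phrases are hard to anticipate.
VOCABULARY = ["amber", "delta", "harbor", "lantern",
              "meadow", "orbit", "silver", "tunnel"]


def generate_challenge(n_words=4, vocabulary=VOCABULARY):
    """Pick words with a cryptographically strong random source."""
    return " ".join(secrets.choice(vocabulary) for _ in range(n_words))


def normalize(text):
    """Lowercase and collapse whitespace so formatting does not matter."""
    return " ".join(text.lower().split())


def transcript_matches(challenge, transcript):
    """True when the recognized words equal the prompted phrase."""
    return normalize(challenge) == normalize(transcript)


challenge = generate_challenge()
# A replayed recording of an older phrase would fail this check unless it
# happened to contain exactly the words just prompted.
replay_passes = transcript_matches(challenge, "my voice is my password")
```

With eight words and four positions there are only 4,096 possible phrases, which is why the vocabulary size and phrase length matter as much as the randomness itself.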
Current research in the area of “text independent” speaker recognition is mainly focused on moving beyond the low-level spectral analysis previously discussed. Although spectral-level information is still the driving force behind recognition, fusing higher-level characteristics with the low-level spectral information is becoming a popular laboratory technique.
Speaker recognition characteristics such as rhythm, speed, modulation, and intonation are based on personality type and parental influence, while semantics, idiolects, pronunciations, and idiosyncrasies are related to birthplace, socio-economic status, and education level.
Higher-level characteristics can be combined with the underlying low-level spectral information to improve the performance of “text independent” speaker recognition systems.
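One simple way to combine such information is score-level fusion, sketched below as a weighted average. This is only one possible scheme and is not drawn from the text above: the subsystem names, scores, and weights are all hypothetical, and the sketch assumes each subsystem's score has already been normalized to a common 0 to 1 range.

```python
def fuse_scores(scores, weights):
    """Weighted average of per-subsystem scores keyed by subsystem name."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must name the same subsystems")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Hypothetical normalized scores from three subsystems for one trial.
scores = {"spectral": 0.80, "prosodic": 0.60, "idiolect": 0.40}
# Hypothetical weights; the spectral subsystem dominates, reflecting its
# role as the main source of information.
weights = {"spectral": 0.60, "prosodic": 0.25, "idiolect": 0.15}

fused = fuse_scores(scores, weights)
```

In practice the weights would be tuned on held-out data, and the fused score would then be compared against a decision threshold in the same way as a single-system score.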