Gender equality in speech recognition inherently challenging
Voice recognition technology is less accurate for women than for men, partly because of how speech systems are designed and partly because of inherent physiological differences, according to a blog post by Delip Rao, co-founder of AI speech recognition startup R7 Speech Sciences.
The different error rates for speech samples from male and female speakers make it difficult to train AI systems that recognize both equally well, Rao writes, and the problem is often exacerbated by commonly used technologies such as MFCCs (Mel-frequency cepstral coefficients).
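For readers unfamiliar with the "Mel" in MFCC, the sketch below shows one widely used formula for converting frequency in hertz to the mel scale, which compresses higher frequencies to roughly mimic how pitch is perceived. It is an illustration only: the function names are hypothetical, other variants of the formula exist, and nothing here is taken from Rao's post or from any particular speech system.

```python
import math


def hz_to_mel(freq_hz):
    """Convert hertz to mels using a common (HTK-style) formula.

    Assumption: this is one popular variant, not the only definition.
    """
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


if __name__ == "__main__":
    # Equal steps in hertz become smaller steps in mels as frequency rises.
    for freq in (100, 200, 1000, 2000, 4000):
        print(f"{freq:>5} Hz -> {hz_to_mel(freq):7.1f} mel")
```

An MFCC front end typically spaces its filter bank evenly on this scale before taking a cepstral transform, so the mapping determines how finely different parts of the spectrum are represented.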
Mean fundamental frequency, or mean F0, which is related to the perception of pitch, is usually around 120 Hz for men and closer to 200 Hz for women. It can also depend on ethnicity, smoking, sickness, and other factors. Rao also notes that the notion of gender in mean F0 is limited to biological gender at puberty.
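To make the F0 figures concrete, here is a minimal sketch of how a fundamental frequency can be estimated from a waveform using autocorrelation. It runs on synthetic pure tones at the article's 120 Hz and 200 Hz values rather than real speech, and the function names, sample rate, and search range are all assumptions chosen for illustration. Production pitch trackers are considerably more robust than this.

```python
import math


def synth_tone(f0_hz, sample_rate=8000, duration_s=0.1):
    """Generate a pure sine tone (a stand-in for voiced speech)."""
    n = int(sample_rate * duration_s)
    return [math.sin(2 * math.pi * f0_hz * i / sample_rate) for i in range(n)]


def estimate_f0(samples, sample_rate=8000, fmin=60.0, fmax=400.0):
    """Estimate F0 as the lag with the largest unnormalized autocorrelation.

    Assumption: fmin and fmax are illustrative bounds on the search range.
    """
    lag_min = int(sample_rate / fmax)
    lag_max = int(sample_rate / fmin)
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Correlate the signal with a copy of itself shifted by `lag`.
        score = sum(
            samples[i] * samples[i + lag] for i in range(len(samples) - lag)
        )
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


if __name__ == "__main__":
    for true_f0 in (120.0, 200.0):
        estimate = estimate_f0(synth_tone(true_f0))
        print(f"true F0 {true_f0:.0f} Hz, estimated {estimate:.1f} Hz")
```

The estimate is quantized to whole-sample lags, so at this sample rate it should land within a few hertz of the true value. Averaging such estimates over the voiced frames of a recording gives a mean F0.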
“Speech systems designed without mindfulness to the extent of this problem can make an already hard problem worse,” he writes. “Fortunately, with recent deep models for speech, we can build models that directly learn from raw waveforms, throw a lot of data and compute at it, and hope the models have enough capacity to reliably encode class-specific variation. This is appealing but also sort of favors large companies than smaller startups that push out new technologies all the time. But with sufficient thought, many of these over-provisioned deep models may be replaced with simpler deep models.”
Kaggle Data Preparation Analyst Rachael Tatman told The Register that while MFCCs are not inherently less effective for modeling women’s speech, “there’s a slightly less robust acoustic signal for women, it’s more easily masked by noise, like a fan or traffic in the background, which makes it harder for speech recognition systems. That will affect whatever you use for your acoustic modelling, which is what MFCCs are used for.”
Rao suggests that, with the increasing popularity of voice-activated digital assistants like Apple’s Siri, women speech researchers should be consulted about the speech models in production and how to improve them.
Facial recognition systems have been shown to perform less accurately for both women and darker-skinned people, leading to consideration of the problem by a congressional subcommittee seeking to guide government application of AI.