Synthetic data model shows promise for biometric bias mitigation

The limitations of real-world biometric training datasets, including the introduction of bias through unbalanced demographic representation, are well established. Synthetic training data offers promise, but has its own limitations. A novel method of avoiding those limitations was presented at the Norwegian Biometrics Laboratory Annual Workshop 2024, hosted by the EAB earlier this month.
Pietro Melzi of the Autonomous University of Madrid presented the GANDiffFace model, which generates synthetic faces for the purpose of mitigating demographic bias in training data. The research project was a collaboration between UAM, secunet and Hochschule Darmstadt University of Applied Sciences.
Using generative data allows researchers to control the attributes of samples in the dataset, in addition to advantages for privacy, availability, and regulatory compliance. Generative Adversarial Networks (GANs), however, deliver synthetic datasets that incorporate biases found in the training data, and can fail to provide enough intra-class variation to train effective facial recognition.
Diffusion models generate a wider variety of images, so Melzi and his colleagues proposed the GANDiffFace model, which combines both kinds of models. It uses a latent space manipulation method previously proposed by researchers at Idiap. The team also used DreamBooth, which fine-tunes text-to-image models by binding new words to specific subjects.
Melzi described the details of the model’s development, and how it lowers the mean of the mated score distribution compared to a dataset composed only of GAN-generated images, bringing that distribution closer to those of datasets composed of real photos.
In the datasets traditionally used for training facial recognition, the demographic distribution is skewed towards Caucasians, and image quality also differs from one demographic to another, Melzi points out.
By using a dataset created with GANDiffFace, Melzi and his team were able to fine-tune the ArcFace model for significantly lower false match rates (FMRs) for different demographic groups.
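For readers unfamiliar with the metric, a false match rate is the share of non-mated comparisons (pairs of images from different people) that a system wrongly accepts as a match. The sketch below illustrates the calculation per demographic group. The group names, similarity scores and threshold are hypothetical values chosen for illustration, not figures from Melzi's presentation.

```python
def false_match_rate(non_mated_scores, threshold):
    """Fraction of non-mated comparison scores at or above the decision threshold."""
    if not non_mated_scores:
        raise ValueError("need at least one non-mated score")
    matches = sum(1 for score in non_mated_scores if score >= threshold)
    return matches / len(non_mated_scores)


# Hypothetical similarity scores for pairs of different people, by group
scores_by_group = {
    "group_a": [0.10, 0.22, 0.35, 0.41, 0.18],
    "group_b": [0.15, 0.52, 0.61, 0.30, 0.47],
}
THRESHOLD = 0.45  # assumed decision threshold, for illustration only

fmr_by_group = {
    group: false_match_rate(scores, THRESHOLD)
    for group, scores in scores_by_group.items()
}
print(fmr_by_group)
```

A gap between groups at the same threshold, as in this made-up data, is the kind of demographic differential that bias mitigation aims to shrink.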
Inaugural FRCSyn Challenge results
The FRCSyn Challenge was launched at WACV 2024 to interrogate whether synthetic data can replace real data for facial recognition training, whether it can mitigate known limitations in face biometrics, and what its limits are.
GANDiffFace was one of four databases made available to the 15 teams that entered the challenge. Most involve academic institutions, either on their own or in collaboration, but Facephi also appears among the top eight.
They were set several sub-tasks, and the trade-off between accuracy and fairness was measured by subtracting the standard deviation from the average accuracy.
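The trade-off measure described above can be sketched in a few lines. This is a minimal illustration, assuming the average and standard deviation are taken over per-group accuracy figures and that the population standard deviation is used; the article does not specify either detail, and the group names and accuracy values are hypothetical, not challenge results.

```python
from statistics import mean, pstdev


def tradeoff_score(group_accuracies):
    """Average accuracy minus the standard deviation of accuracy.

    Assumption: both statistics are computed over per-group accuracies,
    using the population standard deviation.
    """
    values = list(group_accuracies.values())
    return mean(values) - pstdev(values)


# Hypothetical verification accuracies (percent) for four groups
balanced = {"group_a": 90.0, "group_b": 90.0, "group_c": 90.0, "group_d": 90.0}
skewed = {"group_a": 98.0, "group_b": 94.0, "group_c": 86.0, "group_d": 82.0}

print(tradeoff_score(balanced))
print(tradeoff_score(skewed))
```

Both hypothetical systems have the same average accuracy, but the skewed one scores lower because its accuracy varies more between groups, so the measure rewards systems that are both accurate and consistent.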
The winning teams were able to reduce bias with synthetic data, but even more participants were able to mitigate bias with a combination of real and synthetic data. Likewise, the combination of real and synthetic data produced higher overall accuracy scores.
This shows the effectiveness of synthetic data for mitigating the limitations of face biometrics algorithms, when combined with real data, Melzi says.
A second edition of the FRCSyn Challenge will run later this year.