Innovatrics CEO advises careful use of synthetic data to improve biometrics, cut bias
Synthetic data can be used to improve biometric machine learning models and AI applications, says Innovatrics CEO Ján Lunter, but there are limits and reasons for caution.
These insights are from a research paper on the topic penned by Lunter. ‘Synthetic data: a real route to eliminating bias in biometrics’ was published in the 2023 volume of the journal Biometric Technology Today.
One of the most obvious potential benefits of synthetic data in biometric algorithm training is balancing training datasets that are made up disproportionately of white men, an imbalance that leads to demographic performance disparities, or ‘bias.’ Gaps in representation are often found across gender, race, and other minority groups, writes Lunter.
Synthetic data can also help address problems related to “cold starts” and insufficient quantities of data for training latent fingerprint biometric recognition systems, according to the paper.
In another example, Innovatrics has had success using only synthetic data to train OCR algorithms to read ID documents issued by different countries.
Overreliance on synthetic data could have negative consequences too. As it becomes cheaper and easier to obtain, organizations could be tempted to forgo data from real life, introducing a risk that their AI systems could drift further and further from reality.
“From our experience, synthetic data can be very beneficial for training but one has to be extremely cautious when using them in testing and product validation,” Lunter tells Biometric Update in an emailed statement.
“If testing and validation data is realistic and without bias, we see close to zero risk in using synthetic data in training.”
As the realism of synthetic data improves, Lunter also says it could be used in product testing and validation.
“The big advantage of synthetic data is that it can be easily shared across organizations and therefore can be independently verified on multiple levels, with multiple parties involved,” Lunter says. “Real data is typically secret and cannot be easily tested for bias by independent parties; as a result, it can surprisingly be even less realistic than synthetic data.”
Lunter advises organizations considering the use of synthetic data to build a strong culture among their employees, including legal teams, of understanding the risks associated with its use. Companies should also verify the value of synthetic data they outsource, and consider when real-world data is more appropriate.
Research has shown that solid biometric accuracy can be achieved with only synthetic data, but not accuracy high enough to compete with the state of the art.