IBM launches public data set to further research into diversity and facial biometrics
IBM has announced the launch of a new data set specifically created to further research into, and ultimately the development of, fair and accurate facial recognition algorithms by both the company and the broader artificial intelligence community. The announcement was made in a blog post by IBM Fellow and Manager of AI Tech Dr. John Smith, explaining that while data-driven deep learning methods are a strength of the technology, they can also be a weakness without sufficiently robust and diverse data sets.
The new “Diversity in Faces” biometric data set is made up of a million publicly available images, annotated according to 10 of the top coding schemes in industry literature, Smith told Biometric Update in an interview. IBM announced its intention last June to create a million-image data set to help understand how training data diversity affects algorithmic outcomes. Since then, attention to the fairness of facial biometric systems, or the discrepancies in matching accuracy they exhibit between different groups of people, has continued to increase. MIT researcher Joy Buolamwini said at the World Economic Forum in Davos that the striking improvement by IBM’s facial recognition algorithm in identifying females with dark skin shows the problem is a matter of prioritization.
With Diversity in Faces, IBM is demonstrating its priorities, according to Smith.
“We’ve been very focused on concerns around ensuring our services, including visual recognition and face recognition, are fair and accurate,” he says. “There’s really been a continuum of efforts here; however, with this data set release, we’re raising it up to another level. This is also an effort by us to galvanize the larger research community around this important topic.”
Diversity in Faces includes images from the YFCC-100M creative commons image set, which company analysis shows provides a more balanced distribution and broader coverage of facial images than previous sets. The images are augmented with coding schemes which mainly consist of objective measures such as craniofacial features, as well as some more subjective annotations, such as human predictions of age and gender.
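To make the distinction concrete, an annotated record in a data set of this kind might pair objective craniofacial measurements with subjective human labels. The following is a minimal, hypothetical sketch; the field names are illustrative and do not reflect IBM's actual Diversity in Faces schema.

```python
from dataclasses import dataclass

@dataclass
class FaceAnnotation:
    """One hypothetical annotated face image record."""
    image_url: str                    # YFCC-100M images are referenced by URL
    # Objective craniofacial measures (illustrative examples of ratio-based features)
    face_height_to_width: float
    nose_length_to_face_height: float
    # Subjective human annotations
    predicted_age: int
    predicted_gender: str

# Example record with made-up values
record = FaceAnnotation(
    image_url="https://example.org/photo.jpg",
    face_height_to_width=1.45,
    nose_length_to_face_height=0.32,
    predicted_age=34,
    predicted_gender="female",
)

print(record.face_height_to_width)
```

Separating the objective measures from the subjective labels in this way is what lets researchers ask whether a data set's coverage is balanced along dimensions that do not depend on annotator judgment.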
“These are some of the strongest coding schemes that we’ve identified in the scientific literature, all referenced work, but digging much more deeply into the dimensions of facial diversity that matter,” Smith explains.
Launching the new data set provides AI researchers with a “jumping-off point” for assessing the quality of data based on the elements that are most important for characterizing human faces.
“The key question is about the data that we use,” Smith tells Biometric Update. “How can we ensure that face image data is sufficiently diverse? How do we ensure that the systems that are trained from that data somehow reflect the distribution of faces we see in the world? How do we ensure they don’t have blind spots?”
IBM has engaged with Buolamwini and other researchers to try to understand the problems related to representation in data sets for some time. Moving beyond the recognition of a problem to the creation of a tool to learn more about how to solve that problem is the importance of Diversity in Faces, Smith says.
“There are many ways in which it’s apparent that the technologies in practice today are struggling to be fair and accurate. That said, I think until now there’s been no concerted, systematic effort to solve the problem. It’s been more about pointing out the problem. So that’s really what motivated us to create this particular data set, Diversity in Faces. This is not a call to action; this is action.”
Determining how to measure facial diversity and ensure balanced coverage, and then learning how that improves systems, is a necessary early step in the process of training fair systems, Smith contends. Getting better answers to the scientific questions behind facial recognition algorithms will ultimately yield fairer, more accurate systems.
“With these ten coding schemes, it’s a great start, but it’s not complete,” cautions Smith. “That’s why we feel there’s a lot of opportunity for us to build more on this, as well as the broader research community, which is why we’re releasing this publicly.”