Big jump in public face biometric dataset size

A large team of researchers, overwhelmingly from China, says it has created a new million-scale facial recognition benchmark. In a new paper, they claim to have built an automatically cleaned biometric dataset of 2 million identities among 42 million facial images.
The uncurated dataset holds 4 million celebrity identities among 260 million images. The newly proposed benchmark is called WebFace260M, and it is being described as the largest public face biometric dataset.
That is a significant differentiator. Public researchers have decried their disadvantage in dataset resources compared with private companies, particularly Facebook and Google. For all intents and purposes, both have unlimited image datasets.
The research paper says Google uses 200 million images of 8 million identities to train FaceNet, and Facebook has 500 million faces of 10 million identities.
Dataset size is a potent accelerator of biometrics innovation, and public researchers are worried about being shut out of the race.
The WebFace260M researchers, from Tsinghua University, Imperial College London and a Chinese startup, XForwardAI, claim that their dataset “shows enormous potential on standard, masked and unbiased face recognition scenarios.” It was cleaned with an AI tool they developed, Cleaning Automatically by Self-Training.
Jack Clark, co-founder of AI safety and research firm Anthropic, writing in his blog Import AI, says, “Models trained on the resulting dataset are pretty good.”
Clark also makes the point that facial recognition – especially masked facial recognition – is important to government surveillance agencies. Results like those of WebFace260M influence decisions about “how to surveil a population and how much budget to set aside for said surveillance.”
A dataset this size has more proximate dangers, of course. Great volume raises the odds of sweeping in privacy-restricted images, long a problem for datasets created by academics and businesses alike.
A site has been posted with project history and updated details.