Clearview patent on method for scaling biometric training dataset gets US notice of allowance
A patent for a method of biometric algorithm training filed by Clearview AI is on its way to being granted, after the company received a notice of allowance from the U.S. Patent and Trademark Office.
The patent for a ‘Scalable training data preparation pipeline and efficient distributed trainer for deep neural networks in facial recognition’ describes a method for building a training dataset by collecting and organizing images from the internet.
Clearview Vice President of Research Terence Liu explained to Biometric Update in an interview that face biometrics algorithms are trained by ingesting several images from each subject, and then organizing data from ingested images into “clusters” with other images from the same subject.
The patent, therefore, describes images being collected from public sources, grouped by identity and deduplicated before being used for training. Once the matching algorithm is improved, it can be used to find and add more images to each group for further training.
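The group-then-deduplicate step described above can be sketched in a few lines. This is a minimal illustration only: the article does not say how Clearview detects duplicates, so the exact-match content hash used here is an assumption, and `group_and_dedupe` and the sample records are hypothetical names and data.

```python
import hashlib
from collections import defaultdict


def group_and_dedupe(records):
    """Group (identity_label, image_bytes) pairs by label, dropping exact duplicates."""
    groups = defaultdict(list)
    seen = defaultdict(set)  # per-identity set of content hashes already kept
    for label, image_bytes in records:
        digest = hashlib.sha256(image_bytes).hexdigest()
        if digest in seen[label]:
            continue  # byte-identical copy is already in this identity's group
        seen[label].add(digest)
        groups[label].append(image_bytes)
    return dict(groups)


# Hypothetical input: labels and byte strings stand in for real scraped images.
records = [
    ("person_a", b"img-1"),
    ("person_a", b"img-1"),  # exact duplicate, should be dropped
    ("person_a", b"img-2"),
    ("person_b", b"img-3"),
]
grouped = group_and_dedupe(records)
```

A hash only catches byte-identical copies. Resized or re-encoded versions of the same photo would need some form of near-duplicate detection, which this sketch does not attempt.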
CEO Hoan Ton-That says the minimum number of images for an effective cluster appears to be around five.
Traditional data cleaning involves choosing the biggest cluster “and somehow just be okay with it,” Liu says, “and we found that by doing that you throw away a lot of data. You might not be easily feasible to keep one single identity based on similarity calculations, so we devised a way to make the best use of that data and to surface multiple facial clusters from each of the labeled identities, and then find some clever ways to recombine them to enlarge the variation inside of each facial cluster.”
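The difference between keeping only the biggest cluster and surfacing several clusters per labeled identity can be illustrated with a toy example. This is a sketch under stated assumptions, not Clearview's method: the greedy centroid clustering, the 0.8 similarity threshold and the two-dimensional embeddings are all invented for illustration, and the recombination step Liu mentions is not shown. The minimum cluster size of five comes from Ton-That's estimate above.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Element-wise mean (centroid) of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def greedy_clusters(embeddings, threshold=0.8):
    """Assign each embedding to the first cluster whose centroid is similar enough."""
    clusters = []
    for emb in embeddings:
        for members in clusters:
            if cosine(emb, mean_vector(members)) >= threshold:
                members.append(emb)
                break
        else:
            clusters.append([emb])  # nothing matched, so start a new cluster
    return clusters


def keep_largest(clusters):
    """The 'traditional' cleaning Liu describes: keep only the biggest cluster."""
    return [max(clusters, key=len)] if clusters else []


def keep_all_viable(clusters, min_size=5):
    """Keep every cluster that reaches the minimum useful size."""
    return [c for c in clusters if len(c) >= min_size]


# Toy embeddings for ONE labeled identity: two distinct looks plus one stray image.
frontal = [[1.0, 0.01 * i] for i in range(6)]
profile = [[0.01 * i, 1.0] for i in range(5)]
stray = [[-1.0, -1.0]]

clusters = greedy_clusters(frontal + profile + stray)
largest_only = keep_largest(clusters)
viable = keep_all_viable(clusters)
```

In this toy case, keeping only the largest cluster retains six images, while keeping every cluster of five or more retains eleven and discards only the stray.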
High volume, messy data
Ultimately, algorithm training and therefore training dataset composition are largely about volume.
“We have a lot of data coming into our pipeline and there’s different sorts maybe from Instagram or other places, and we started experimenting,” Liu recounts. “A big part was cleaning the data to just ingest it in a certain way that the trainer understands and can make best use of.”
“Through that process we gained a lot of practical understanding and methods that have proven to work the best,” he adds. “A lot have to do with how to cluster the data, how to combine the images belonging to the same person. How to clean, de-dupe, and merge, the identity labels belonging to different people. Because we’re dealing with a very messy, noisy form of data, in its raw form.”
The images are arranged into sub-groups by identity, introducing more intra-identity variation with each iteration. This variation allows the algorithm to improve its performance.
Ton-That points out that without careful sorting, more data can sometimes make the results worse.
“Having a way to clean images from the open internet has great implications for where the industry goes in terms of trying to get really large-scale databases,” he says.
For Clearview, the benefit is seen in the performance of the company’s algorithms, according to Ton-That. He says that with the technique being patented, each new model “would find these things on the edges, like blurry ones or different angles, and add it to the training set, and we would just see our overall scores go up on internal tests.”
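The loop Ton-That describes, in which a newer model pulls additional images into an existing group, might look roughly like the following. The article gives no implementation details, so this is a sketch under assumptions: `expand_cluster`, the 0.9 acceptance threshold and the toy two-dimensional embeddings are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def expand_cluster(cluster, candidates, threshold=0.9):
    """Add candidates whose similarity to the cluster centroid clears the threshold.

    Returns the enlarged cluster and the list of newly added embeddings.
    """
    centroid = [sum(col) / len(cluster) for col in zip(*cluster)]
    added = [c for c in candidates if cosine(c, centroid) >= threshold]
    return cluster + added, added


# Toy embeddings, assumed to come from the latest model.
cluster = [[1.0, 0.0], [0.9, 0.1]]
candidates = [
    [0.8, 0.3],  # a harder image of the same person (e.g. blurry or off-angle)
    [0.0, 1.0],  # a different person
]
expanded, added = expand_cluster(cluster, candidates)
```

Each retraining round would repeat this step with fresh embeddings, so images that an earlier model scored below the threshold can be picked up later.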
Collecting “messy” data is necessary to find these edge cases and increase variation. The deep neural networks also benefit from data drawn from different sources. Any one image source will not have enough variation, even if it looks to human eyes like it does, Liu says.
When asked about the potential for synthetic data to provide the same quality of data, Liu is skeptical. For face detection training, he says, synthetic data can be quite beneficial. However, “with facial recognition, the task is far more daunting, because the real-life variations are broader than synthetic data is able to capture.”
Research indicates that models trained only on synthetic data do not produce competitive results, Ton-That observes.
Clearview does use augmentation, which it has found improves facial recognition performance for faces occluded with sunglasses and masks.
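One generic way to augment training images for occlusion is to blank out the region a mask or pair of sunglasses would cover. The article does not describe how Clearview's augmentation works, so the sketch below is purely illustrative: the function names, the row fractions and the tiny 8x4 grayscale "image" are assumptions.

```python
def occlude(image, top_frac, bottom_frac, fill=0):
    """Return a copy of a 2D grayscale image with a horizontal band filled in.

    Rows from int(height * top_frac) up to, but not including,
    int(height * bottom_frac) are replaced with the fill value.
    """
    height = len(image)
    top = int(height * top_frac)
    bottom = int(height * bottom_frac)
    return [
        [fill] * len(row) if top <= r < bottom else list(row)
        for r, row in enumerate(image)
    ]


def simulate_mask(image):
    """Cover the lower half of the image, roughly where a face mask sits."""
    return occlude(image, 0.5, 1.0)


def simulate_sunglasses(image):
    """Cover a band in the upper half, roughly where the eyes sit."""
    return occlude(image, 0.25, 0.5)


# An 8-row, 4-column all-white image stands in for an aligned face crop.
face = [[255] * 4 for _ in range(8)]
masked = simulate_mask(face)
shaded = simulate_sunglasses(face)
```

The original image is left untouched, so both the clean and the occluded versions can be added to the same identity's group.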
The company’s main advantage is in the scale of its training database, however. The patent is a move to protect its method for building it.
“Incumbent companies sometimes wait and see with new technologies to see the viability and adoption of them in the marketplace, then copy innovations later once they have been proven to be valuable,” Ton-That says. “These patents help protect us against a potential future competitor who would like to copy our facial recognition search engine, or our method for creating a highly-accurate bias-free facial recognition algorithm from large scale public internet datasets.”
Coping with scale
The patent refers to “distributing feature centroid vectors in chunks to a plurality of graphics processing units (GPUs), wherein each chunk is distributed to one GPU.”
Distributing the heavy computing load in this way makes the method more efficient.
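The chunking idea in the patent language can be illustrated without any GPU code. In the sketch below, each "device" is simulated by an ordinary function call, and the per-chunk work is a simple dot-product lookup. That lookup is an assumption, since the article does not say what is computed on each chunk, and all function names and values are hypothetical.

```python
import math


def split_into_chunks(centroids, num_devices):
    """Split the centroid list into contiguous chunks, one chunk per device."""
    size = math.ceil(len(centroids) / num_devices)
    return [centroids[i * size:(i + 1) * size] for i in range(num_devices)]


def best_match_on_chunk(query, chunk, offset):
    """Work for one device: best (score, global_index) within its own chunk."""
    scores = [
        (sum(q * c for q, c in zip(query, centroid)), offset + i)
        for i, centroid in enumerate(chunk)
    ]
    return max(scores) if scores else (float("-inf"), -1)


def best_match(query, centroids, num_devices):
    """Combine the per-device partial results into one global best match."""
    partials = []
    offset = 0
    for chunk in split_into_chunks(centroids, num_devices):
        # On real hardware each of these calls would run on its own GPU.
        partials.append(best_match_on_chunk(query, chunk, offset))
        offset += len(chunk)
    return max(partials)


# Five toy identity centroids, split across two simulated devices.
centroids = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [-1.0, 0.0], [0.8, 0.6]]
query = [0.8, 0.6]
score, index = best_match(query, centroids, num_devices=2)
```

Each device only ever holds its own chunk, which is what lets a centroid table too large for one GPU's memory be spread across several. The answer is the same however many devices are used.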
Liu explains that the company has also described an approach for running large-scale data ingestion and the trainer in parallel, which may be included in the final version of the same patent. Large models and servers are needed not only for efficiency, but simply to fit all the data in.
Even so, each big training job takes a week or more if it is run on a cheaper instance, Liu estimates, and that is at the current database size.
Ambitions and conditions
Ton-That says Clearview’s current training set is 70 million images. He wants to get to 200 million or a billion faces using the tech covered in the patent.
Some of Clearview’s detractors will see that project as a threat to privacy, but Ton-That points out that facial recognition’s accuracy is also targeted by critics of the technology.
The potential for improved accuracy is obvious, and Ton-That emphasizes its importance particularly for setting policy around high-risk uses.
He does not believe this soon-to-be-patented technique will run afoul of current regulations, even in places with tighter controls than the United States. Clearview is protected by exemptions for law enforcement in the EU and for public data in Canada, he says.
For the time being, however, Clearview is “just not doing business in those countries.”
Ton-That also pushes back on the argument that using public data for algorithm training represents a threat to privacy.
“No-one’s harmed in the collection or making of the algorithm,” he asserts. Further, “there’s actually no personally-identifiable information in the training sets.”
The company’s agreement with the ACLU earlier this year applies only to the sale of its service that includes its database of images, not training data.
The FTC, which has expressed an interest in regulating facial recognition training data, is primarily concerned with deceptive trade practices, which puts uses of private data for unintended purposes in the agency’s scope.
“We haven’t had any issues or complaints around the use of our public data to train the algorithm at all,” Ton-That says.