Real data is dead? A half-million image biometric dataset says otherwise
A software firm says it has assembled a 500,000-photo dataset that is not only “legally clean” and suitable for biometric use, but also the largest such collection ever released.
Standard augmentation methods can boost the total to 2 million, according to vAIsual, which to date has concentrated on synthetic media.
The high-resolution, original photos of real people come with biometric releases allowing them to be used for AI training.
Trained professionals took the photos in a studio with a green-screen backdrop. Machine learning professionals sat in on the sessions to help capture the images best suited to machine learning. The consent and capture processes are depicted in a YouTube video.
This is a man-bites-dog story because the machine learning industry has been burned by one dataset snafu after another, each followed by demonstrated bias.
Despite vAIsual’s dataset, the trend is probably still toward synthetic subjects. That said, vAIsual could end up demonstrating that training databases of real people (which have their advantages) can be collected without the cost putting a company out of business.