Private medical record photos spotted in biometrics training dataset
Medical record photos are private — but that may not stop them from showing up in datasets used to train artificial intelligence (AI) and biometric systems, according to a story on Ars Technica.
A California artist who works with AI was shocked to discover that LAION-5B, a dataset scraped from publicly available images on the web, contained two post-op medical photos of her taken nearly a decade ago. The artist, who calls herself Lapine, said the photos were shot following procedures to treat dyskeratosis congenita, a genetic disorder that inhibits blood cell production in the bone marrow.
A signed release Lapine posted on Twitter clearly shows she did not consent to the photos being used anywhere outside her medical record. The surgeon who took the pictures died in 2018. How they got into LAION-5B is anyone’s guess. But one thing is certain: they are not the only sensitive biometric data in there. Ars Technica conducted a search to confirm that Lapine’s photos were indeed present in LAION-5B, and discovered “thousands of similar patient medical record photos in the data set, each of which may have a similar questionable ethical or legal status.” Furthermore, many of these may already have been integrated into commercial AI image synthesis services and used to train facial recognition algorithms.
LAION is a non-profit organization “aiming to make large-scale machine learning models, datasets and related code available to the general public.” Its datasets are composed of lists of URLs pointing to the original images; LAION does not actually host the images itself. Its website does offer brief instructions on how EU citizens can request takedowns in specific scenarios (e.g., when an image and a name are linked), but when Lapine posted a question about her problem to LAION’s Discord server, an engineer from the organization suggested she ask for the photos to be taken down at the source, the implication being that it was not LAION’s fault her pictures were out there to be scraped.
Lapine, for her part, still wants her photos removed from LAION-5B and has paused her work with AI for now, citing ethical concerns about what, or who, might end up in it. “Just because they scraped it from the web doesn’t mean it was supposed to be public information,” she says. “Or even on the web at all.”
The discovery comes weeks after AlgorithmWatch found that a facial recognition dataset of trans people remained available online for several years after controversy first erupted over its existence.