MIT AI training dataset pulled down for racist, sexist, vulgar labels as industry grapples with bias



A database used in training systems for tasks like facial biometrics and object recognition has been taken down by the Massachusetts Institute of Technology (MIT) after The Register reported it includes racist, misogynistic and vulgar images and labels.

The 80 Million Tiny Images training dataset was created in 2008 to help advance object detection technology, but contains images describing women, Black and Asian people in derogatory language, as well as close-up pictures of sexual organs labeled with offensive slang terminology.

A paper on the dataset by UnifyID AI Labs Chief Scientist Vinay Prabhu and University College Dublin PhD candidate Abeba Birhane has been submitted for presentation at a computer vision conference next year. The researchers found that each of nine derogatory terms was used to label more than a thousand images. Training neural networks on such a database would embed prejudice directly into the resulting systems, going beyond demographic performance differences to introduce a different kind of bias into AI.

MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) Professor of Electrical Engineering and Computer Science Antonio Torralba told The Register that in retrospect the school should have manually screened the labels used. He apologized on behalf of the lab and said the dataset has been taken down so the troubling content can be removed.

The school noted that, given the size of the database and its low-resolution “tiny” images, designed to run on the computing resources available when the dataset was created, manual inspection would likely be neither feasible nor effective at removing the offensive images.

“We therefore have decided to formally withdraw the dataset. It has been taken offline and it will not be put back online. We ask the community to refrain from using it in future and also delete any existing copies of the dataset that may have been downloaded,” the statement reads.

The dataset was scraped from Google Images, with the results divided into roughly 75,000 categories. Torralba said the scraping was performed by pasting more than 53,000 different nouns from WordNet into the image search engine. WordNet was built at Princeton’s Cognitive Science Laboratory to map the relationships between words, not specifically for association with images.

Even datasets purpose-built for training facial recognition systems have faced criticism for collecting images without consent, and even an IBM dataset created specifically to root out bias in AI has been targeted by litigation.

The debate over the role of imbalanced datasets in causing biased AI boiled over in a Twitter debate between Facebook Chief AI Scientist Yann LeCun and Google Ethical Artificial Intelligence Team Technical Co-Lead Timnit Gebru, summarized by Synced. The original point of contention was LeCun’s assertion that “ML systems are biased when data is biased,” to which Gebru responded that the problem extends beyond that to social and structural problems.

The University of Notre Dame has launched a new Tech Ethics Lab with IBM’s support to research issues like police use of facial recognition, The Washington Post reports.

IBM will invest $20 million over the next decade in the initiative, which seeks to apply ethics earlier in the development of new technologies.
