MIT AI training dataset pulled down for racist, sexist, vulgar labels as industry grapples with bias
A database used in training systems for tasks like facial biometrics and object recognition has been taken down by the Massachusetts Institute of Technology (MIT) after The Register reported it includes racist, misogynistic and vulgar images and labels.
The 80 Million Tiny Images training dataset was created in 2008 to help advance object detection technology, but it contains images whose labels describe women, Black and Asian people in derogatory language, as well as close-up pictures of sexual organs labeled with offensive slang terminology.
A paper on the dataset from startup UnifyID AI Labs Chief Scientist Vinay Prabhu and University College Dublin PhD candidate Abeba Birhane has been submitted to a computer vision conference for presentation next year. The researchers found that each of nine derogatory terms was used to label more than a thousand images. Training neural networks on such a database would build prejudice into the systems, and go beyond demographic performance differences to build a different kind of bias into AI.
MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) Professor of Electrical Engineering and Computer Science Antonio Torralba told The Register that in retrospect the school should have manually screened the labels used. He apologized on behalf of the lab and said the dataset has been taken down so the troubling content can be removed.
The school noted that, given the size of the database and its “Tiny” images, which were sized to suit the computing resources available when it was made, manual inspection may not be feasible or effective at removing the offensive images.
“We therefore have decided to formally withdraw the dataset. It has been taken offline and it will not be put back online. We ask the community to refrain from using it in future and also delete any existing copies of the dataset that may have been downloaded,” the statement reads.
The dataset was scraped from Google Images, with images divided into roughly 75,000 categories. Torralba said the scraping was performed by using more than 53,000 different nouns from WordNet as image search terms. WordNet was built at Princeton’s Cognitive Science Laboratory to examine the relationship between words, not specifically for association with images.
Datasets purpose-built for training facial recognition systems have also faced criticism for collecting images without consent, and even an IBM dataset created specifically to root out bias in AI has been targeted by litigation.
The debate over the role of imbalanced datasets in causing biased AI boiled over in a Twitter exchange between Facebook Chief AI Scientist Yann LeCun and Google Ethical Artificial Intelligence Team Technical Co-Lead Timnit Gebru, summarized by Synced. The original point of contention was LeCun’s assertion that “ML systems are biased when data is biased,” to which Gebru responded that the problem extends beyond that to social and structural problems.
The University of Notre Dame has launched a new Tech Ethics Lab with IBM’s support to research issues like police use of facial recognition, The Washington Post reports.
IBM will invest $20 million over the next decade in the initiative, which seeks to apply ethics earlier in the development of new technologies.