Ethically developing sets for face biometrics demands a community approach
In the physical world, events flow in one direction. A hiker is bitten by a snake, gets medical attention and is healed. The bandaging does not reach back to change the nature of the snake.
But Princeton University researchers trying to better understand the issue of bias in machine learning datasets used in face biometrics have found an unexpected loop effect in the creation of the sets.
They say in a new preprint research paper (which has yet to be peer reviewed) that developments after a dataset is assembled can create unexpected ethical consequences.
The team says makers of these specialized datasets must work with all stakeholders involved in the use of sets from the beginning of the process and throughout a dataset’s useful existence. Those listed include conference program committees and the research community itself.
The paper’s title says it all: “Mitigating dataset harms requires stewardship.” Ethical development can be foiled even when a dataset is used for ethical aims, because those uses were not anticipated by its developers.
Three biometrics datasets were studied: Labeled Faces in the Wild, MS-Celeb-1M and DukeMTMC. The researchers analyzed almost 1,000 papers that cited the trio and found that a range of factors can end up creating ethical quandaries where none seemed obvious during development.
Two of the datasets, DukeMTMC and MS-Celeb-1M, had been retracted over privacy, bias and other concerns prior to the Princeton team’s research efforts.
None of the factors are by themselves problematic. But by not considering them, some dataset developers are making their work less useful or even unusable.
Those factors are broad technological and social change, the building of derivative datasets and models, the clarity of licenses, and dataset-management practices.
The task of redirecting the AI industry will be harder than informing open-minded machine learning developers. Plenty of face biometrics researchers have abandoned standards and guidelines for data collection in order to make quick money off a growing industry.