Practical problems in biometric fairness, coming ISO standard explored in EAB webinar
New wrinkles in the problem of biometric bias, the development of standards for assessing it, and strategies for mitigating it were shared on the third day of presentations at the EAB’s recently completed event on Demographic Fairness in Biometric Systems.
Yevgeniy Sirotin, principal investigator and manager of SAIC’s Identity and Data Sciences Laboratory at the Maryland Test Facility (MdTF), presented findings from research conducted by a team he led along with Arun Vemury on ‘Demographic variation in the performance of biometric systems: insights gained from large-scale scenario testing.’
The presentation begins with a review of scenario testing, the approach used at the MdTF, and how it compares to technology testing (as performed by NIST, for example). The approach and thinking involved in scenario testing can help to “frame questions” around fairness in the use of algorithms, Sirotin suggests at the outset.
Questions have been raised about the fairness of system performance as biometrics have been rolled out in airports and other settings, but because these deployments are new, little information has been available to quantify the issues. Tests like the DHS Rallies held at the MdTF collect feedback in the form of participant surveys to evaluate user satisfaction alongside algorithmic effectiveness, and through repeated evaluations over time can yield some insight into the answers to those questions, Sirotin says.
MdTF papers have already explored the role of image acquisition in demographic differences in system performance, and the influence of demographics on false match rate (FMR) estimates for facial recognition systems, compared the differences between commercial face and iris biometrics systems in performance variance for different races and genders, and looked into the introduction of cognitive biases to human reviewers by algorithm outcomes.
In an unattended high-throughput scenario test of face biometric systems conducted during the pandemic, MdTF found that many combinations of image acquisition systems and matchers met the 95 percent true identification rate (TIR) the test aimed for across all racial groups. The median system recognized 93 percent of people overall, and the best system correctly identified 100 percent of participants. Most of the errors were not made by algorithms, but at the image acquisition stage.
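The distinction between acquisition-stage and matching-stage errors can be made concrete with a short sketch. This is a hypothetical illustration with made-up counts, not MdTF’s actual analysis code; the point is simply that end-to-end true identification rate (TIR) is the product of acquisition success and matching success, so a system can lose most of its TIR before the matcher ever runs.

```python
# Hypothetical sketch: decomposing true identification rate (TIR) per
# group into acquisition-stage and matching-stage success. All counts
# below are illustrative, not from the Rally results.

from dataclasses import dataclass

@dataclass
class GroupResult:
    n: int         # participants in the group
    acquired: int  # usable images captured by the acquisition system
    matched: int   # correctly identified among those with a capture

def tir_breakdown(g: GroupResult):
    acq_rate = g.acquired / g.n          # acquisition success rate
    match_rate = g.matched / g.acquired  # matching success, given a capture
    tir = g.matched / g.n                # end-to-end TIR
    return acq_rate, match_rate, tir

# Illustrative numbers in which nearly all end-to-end errors occur at
# the image acquisition stage, mirroring the pattern MdTF describes.
group = GroupResult(n=200, acquired=186, matched=184)
acq, match, tir = tir_breakdown(group)
print(f"acquisition {acq:.1%}, matching {match:.1%}, TIR {tir:.1%}")
```

Here the matcher succeeds on nearly 99 percent of captured images, yet the end-to-end TIR is only 92 percent, because acquisition failures dominate.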
In a second part of the test, people wore the face coverings they had brought from home as they went through the system. This revealed that commercial facial recognition systems uniformly performed worse for Black people wearing masks than for other groups, with larger gaps both in capturing images of masked Black participants and in matching them. The best-performing combination of image acquisition system and algorithm failed to reach the 95 percent TIR goal for Black people wearing masks, suggesting it may not be appropriate for a high-throughput scenario.
Changing the conditions of a system’s operation can make a previously fair system unfair, Sirotin points out.
He also spoke about watch-list identification scenarios, which have very different success criteria, as false negative matches can have major consequences.
The kinds of errors facial recognition systems make are inherently different from those of fingerprint or iris biometrics; ‘but why?’ Sirotin asks. The way people recognize faces, which has been found to be associated with activity in certain areas of the brain, may be influencing how we think facial recognition should work.
The presentation delves into the impact of small differences in false match rate for people of a particular race or gender when they are compared against datasets of different demographic compositions, and shows that the hazard of false positive identification can be unequal depending on dataset composition. Equal within-group error rates, even if achieved, will therefore not by themselves protect people from unequal treatment by biometric systems.
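The argument can be illustrated with a toy calculation. This is a hypothetical sketch with assumed error rates, not figures from the presentation; it rests on the common observation that impostor scores tend to be higher within a demographic group than across groups, so a probe’s expected number of false matches scales with how many same-group identities sit in the gallery.

```python
# Hypothetical illustration (assumed error rates, not presentation data):
# two probes face the same matcher with the same per-comparison FMRs,
# but different gallery compositions produce unequal false-match hazards.

def expected_false_matches(fmr_same, fmr_cross, gallery_same, gallery_cross):
    """Expected false positives for one probe searched against a gallery."""
    return fmr_same * gallery_same + fmr_cross * gallery_cross

# Assumed within-group and cross-group false match rates per comparison.
fmr_same, fmr_cross = 1e-4, 1e-5

# Same error rates; only the demographic makeup of the gallery differs.
majority = expected_false_matches(fmr_same, fmr_cross, gallery_same=9000, gallery_cross=1000)
minority = expected_false_matches(fmr_same, fmr_cross, gallery_same=1000, gallery_cross=9000)
print(majority, minority)  # the hazards differ by roughly a factor of five
```

Even though both probes see identical per-comparison error rates, the probe whose demographic group dominates the gallery accumulates several times more expected false matches, which is the unequal-hazard point the presentation makes.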
ISO standards and assessment metrics
Jacob Hasselgren and John Howard, also of the MdTF, spoke about the latest developments in the ISO 19795-10 standard for measuring performance across demographic groups. The researchers are editors of the standard, and contributed to an ISO technical report on “the differential impact of demographic factors in biometric recognition system performance,” which was approved for publication in January of this year.
The ISO 19795 series provides a framework for biometric system testing and evaluation, and Part 10, currently being drafted, applies to performance variations across demographic groups. The first draft is expected to be completed this summer, and the final version is anticipated for publication in 2023 or 2024.
The presentation detailed the scope, the current challenges, and the statistics that will be used in evaluating biometric systems’ performance with different groups. Other challenges in creating the standard include the limitations of demographic categories like ‘Black’ or ‘Asian,’ which can describe people from highly diverse ethnic backgrounds and group together people with widely ranging skin colors, and even how to judge statistical equality.
Various methods of assessing demographic differentials have been used, including area under curve (AUC) measurements, but Pereira points out that these methods assume different FMR policies for different demographic groups and can potentially hide some biases. Instead, he proposes a ‘fairness discrepancy rate.’
This is a similar concept, he says, to the ‘inequity measure’ explained by Patrick Grother in the previous EAB webinar session.
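A minimal sketch of the fairness discrepancy rate idea follows, under the assumption that it combines the largest between-group gaps in false match rate (FMR) and false non-match rate (FNMR) at a single shared decision threshold; the weighting parameter and the exact form here are illustrative rather than a definitive restatement of Pereira’s formulation.

```python
# Minimal sketch of a fairness-discrepancy-rate-style figure of merit.
# Assumption: per-group FMR and FNMR are measured at one shared
# threshold, and the metric penalizes the widest between-group gaps.

def fairness_discrepancy_rate(fmrs, fnmrs, alpha=0.5):
    """fmrs/fnmrs: per-group error rates at one common threshold.

    Returns a value in [0, 1]; 1.0 means no between-group discrepancy.
    alpha weights the FMR gap against the FNMR gap (illustrative choice).
    """
    fmr_gap = max(fmrs) - min(fmrs)     # widest FMR gap between groups
    fnmr_gap = max(fnmrs) - min(fnmrs)  # widest FNMR gap between groups
    return 1.0 - (alpha * fmr_gap + (1 - alpha) * fnmr_gap)

# Illustrative per-group rates at a common operating threshold.
print(fairness_discrepancy_rate(fmrs=[1e-3, 2e-3], fnmrs=[0.02, 0.05]))
```

Because the metric is computed at one shared threshold, it avoids the problem Pereira raises with AUC-style comparisons, which implicitly allow each demographic group its own FMR policy.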
Pereira also proposes a strategy for ‘patching’ scoring functions, either at testing time or when training the algorithm, and discusses its advantages and disadvantages.