New unsupervised AI method performs multimodal emotion recognition
Researchers from the University of Trento and Eurecat Centre Tecnològic have developed a new artificial intelligence method capable of performing unsupervised feature learning for multimodal emotion recognition (MER).
The system, detailed in a recent paper, is built on four sub-networks trained in an unsupervised manner, each processing one particular type of data: textual, visual (facial images and facial landmarks) or acoustic. The sub-networks are trained through pairwise contrastive learning.
In other words, it can combine face and voice biometrics with text to identify individuals’ emotions.
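The article does not reproduce the paper's exact loss, but pairwise contrastive learning between modalities is commonly implemented as an InfoNCE-style objective in which matching samples from two modalities are pulled together and mismatched ones pushed apart. The Python sketch below is illustrative only, under that assumption; the function name and temperature value are not from the paper.

```python
# Illustrative sketch (not the authors' code): an InfoNCE-style pairwise
# contrastive loss between embeddings of two modalities, e.g. text vs. audio.
import torch
import torch.nn.functional as F

def pairwise_contrastive_loss(emb_a: torch.Tensor,
                              emb_b: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
    """emb_a, emb_b: (batch, dim) embeddings of the same samples,
    produced by two different modality sub-networks."""
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    logits = emb_a @ emb_b.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(emb_a.size(0))      # matching pairs lie on the diagonal
    # Symmetric cross-entropy: modality A predicts its partner in B and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Example: random 128-d embeddings for a batch of 8 text/audio pairs.
loss = pairwise_contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))
```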
The authors used a different backbone for each modality, selected after a review of the state of the art to find the most appropriate model for each type of data.
As a result, the MTCNN algorithm was chosen for face biometrics, for instance, and a temporal convolutional network (TCN) for voice biometrics.
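To make the acoustic choice concrete, the sketch below shows a minimal TCN-style building block (a dilated, causal 1-D convolution with a residual connection). It is a generic illustration of the technique, not the configuration used in the paper.

```python
# Minimal illustration of a TCN-style block for the acoustic stream;
# the paper's actual backbone architecture and hyperparameters may differ.
import torch
import torch.nn as nn

class TCNBlock(nn.Module):
    """One dilated, causal 1-D convolution block with a residual connection."""
    def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation   # left padding keeps the block causal
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = nn.functional.pad(x, (self.pad, 0))   # pad only on the left (the past)
        return self.act(self.conv(y)) + x         # residual connection

# Example: a batch of 4 acoustic feature sequences, 64 channels, 100 frames.
block = TCNBlock(channels=64, dilation=2)
out = block(torch.randn(4, 64, 100))              # shape preserved: (4, 64, 100)
```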
As for the databases used in the experiments, the researchers selected RAVDESS for speech emotion recognition and CMU-MOSEI for facial emotion recognition.
“The success in [MER] primarily relies on the supervised learning paradigm,” explains the report.
“However, data annotation is expensive, time-consuming, and as emotion expression and perception depend on several factors (such as age, gender, culture) obtaining labels with a high reliability is hard.”
To circumvent these issues, the researchers focus on unsupervised feature learning for MER.
This method, the authors claim, is the first attempt at unsupervised feature learning in the MER literature. "Our end-to-end feature learning approach has several differences (and advantages) compared to existing MER methods."
First, the method is unsupervised, which means it works without data annotation. Second, it does not require spatial data augmentation, modality alignment, large batch sizes or a large number of epochs. Third, it applies data fusion only at inference and, finally, it does not require backbones pre-trained on emotion recognition tasks.
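The "fusion only at inference" point means each sub-network encodes its own modality independently and the embeddings are combined only when a prediction is needed. The sketch below illustrates one simple way to do that; the concatenation rule and the linear probe are assumptions for illustration, not the paper's exact procedure.

```python
# Hedged sketch of late fusion: per-modality embeddings are computed separately
# and combined only at inference time. Concatenation and the linear probe below
# are illustrative choices, not necessarily what the authors used.
import torch
import torch.nn as nn

def fuse_at_inference(per_modality_embeddings: list[torch.Tensor]) -> torch.Tensor:
    """Concatenate already-computed embeddings from each modality sub-network."""
    return torch.cat(per_modality_embeddings, dim=-1)

# Example: text, visual and acoustic embeddings for a batch of 8 samples.
text_z, visual_z, audio_z = torch.randn(8, 128), torch.randn(8, 128), torch.randn(8, 128)
fused = fuse_at_inference([text_z, visual_z, audio_z])   # shape (8, 384)

# A simple linear probe could then map the fused features to emotion classes.
probe = nn.Linear(fused.size(-1), 7)                      # e.g. 7 basic emotion classes
logits = probe(fused)
```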
“The experiments on benchmark datasets show that our method outperforms several baseline approaches and unsupervised learning methods applied in MER,” the paper reads.
Additionally, because it is an unsupervised feature learning method, the team believes the proposed approach is transferable to other domains without retraining.
“The proposed method keeps the modality pairings the same for all data (like emotions), and the way we learn the features gives equal importance to each modality,” the report concludes.
“An alternative could be having different modality pairings for different emotion classes. This will be further investigated as future work.”
Emotion recognition research is a hot topic, even beyond academia. For instance, in May the American Bar Association suggested it may welcome emotional AI as a tool for honing courtroom and marketing performance.
Article Topics
AI | algorithms | biometrics | biometrics research | emotion recognition | face biometrics | multimodal biometrics | voice biometrics