Dia-Localization combines data from different sources to jointly tackle two problems, each traditionally solved using a single modality. Combining the two modalities yields higher robustness than a current state-of-the-art audio-only speaker diarization system. Furthermore, the visual models and the output of the speaker diarization allow for a bi-modal localization of the speakers in the video. We view this system as a successful example of multimodal integration in computer science: a unimodal state-of-the-art (audio) system gains accuracy and extends its capabilities by adopting an additional modality (visual), at little incremental engineering or computation cost.
In this project, we extended our speaker diarization systems to estimate both acoustic and visual models as part of a joint unsupervised optimization. The speaker diarization system first automatically determines the number of speakers, segments the speech track, and clusters the segments by speaker (“Who spoke when?”); in a second step, visual activity levels are used to infer the location of the speakers in the video (“Where was the speaker?”). Incorporating video data in this way lowers speaker-diarization error.
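The second step can be sketched roughly as follows. This is a minimal illustrative example, not the actual system: it assumes the audio diarization has already produced per-frame speech indicators for each speaker, and that a per-frame visual-activity level is available for each candidate region of the video. Each speaker is then assigned to the region whose activity best matches their speaking pattern (here, by a simple dot-product score; the function and data names are hypothetical).

```python
def locate_speakers(speech, activity):
    """Assign each speaker to the video region with the best-matching activity.

    speech:   {speaker: [0/1 speech indicator per frame]}  (from audio diarization)
    activity: {region: [visual activity level per frame]}
    Returns   {speaker: region}.
    """
    locations = {}
    for spk, s in speech.items():
        # Score each region by how well its activity co-occurs with this
        # speaker's speech frames, and keep the best-scoring region.
        best = max(activity, key=lambda r: sum(a * b for a, b in zip(s, activity[r])))
        locations[spk] = best
    return locations

# Toy example: speaker A talks in the first half, B in the second.
speech = {"A": [1, 1, 0, 0], "B": [0, 0, 1, 1]}
activity = {"left": [0.9, 0.8, 0.1, 0.0], "right": [0.1, 0.0, 0.7, 0.9]}
print(locate_speakers(speech, activity))  # → {'A': 'left', 'B': 'right'}
```

A correlation measure normalized for overall region activity would be more robust than a raw dot product, but the principle is the same: speech timing from the audio modality is matched against motion in the visual modality.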
The experiments were performed on recordings of meetings in the AMI Meeting Corpus. We used annotated recordings from a single low-resolution camera and a single far-field microphone; however, the system is designed to handle an arbitrary number of cameras. This flexibility allows the system to be easily deployed in multiple contexts.
Multimodal diarization and localization has the potential to improve many existing speaker-diarization techniques, including discriminating between speech and nonspeech noise, quickly identifying speaker overlap, and identifying very short backchannel responses. These techniques also have many potential uses as a front-end processing step for higher-level analysis tasks, such as behavioral analysis (e.g., meeting dominance estimation).