Speaker diarization consists of segmenting and clustering a speech recording into speaker-homogeneous regions using an unsupervised algorithm. In other words, given an audio track of a meeting, a speaker-diarization system will automatically discriminate between and label the different speakers (“Who spoke when?”). This involves speech/non-speech detection (“When is there speech?”) and overlap detection and resolution (“Who is overlapping with whom?”), as well as assigning a consistent label to each speaker.
Knowing when each speaker is talking in an audio or video recording can be useful in and of itself, but it is also an important processing step in many tasks. For example, in the field of rich transcription, speaker diarization is used both as a stand-alone application that attributes speaker regions to an audio or video file and as a preprocessing step for speech recognition. Using diarization for speech recognition enables speaker-attributed speech-to-text and can be used as the basis for different modes of adaptation, e.g., vocal tract length normalization (VTLN) and speaker-model adaptation. This task has therefore become central in the speech-research community, as evidenced by its inclusion in NIST’s Rich Transcription Evaluation, in which ICSI has frequently participated.
ICSI has a long history of research in speaker diarization, repeatedly contributing to the state of the art and leading the field in taking on more complex tasks like nonscripted multi-party interaction. We have used or extended our speaker diarization systems to perform automated copyright-violation detection, video navigation and retrieval, and automatic interactional analysis.
Ongoing research in the Speech and Audio & Multimedia groups aims to improve the robustness and efficiency of current approaches, to further develop our online (real-time) diarization techniques (“Who is speaking now?”), and to integrate speaker diarization techniques into multimodal approaches for video analysis. In our video-analysis work on percepts, we are extending speaker diarization techniques to the classification of non-speech sounds.
A state-of-the-art video-search system being built by multiple institutions. ICSI contributes expertise in using audio concept detection for event identification and video categorization. (Part of IARPA’s ALADDIN program.)
A browser that presents users with the basic narrative elements of a sitcom — scenes, punchlines, dialogue segments, etc. — and a per-actor filter on top of a standard video player interface, so they can navigate to particular elements or moments they remember.
A tool that automatically generates a “diary” of a meeting based on recordings, allowing a user to navigate directly to the contributions of particular participants, to search dialog by keyword, or to save time by listening to just the interesting-looking parts.
A specialization framework to automatically map and execute computationally intensive Gaussian Mixture Model (GMM) training on an NVIDIA graphics processing unit from Python code, without significant loss in diarization accuracy.
A set of experiments to identify the most dominant people in meetings. This research also studied trade-offs among the system’s execution speed, accuracy-improving interventions, and the distance of the audio sensor from participants.
Uses video data to lower speaker-diarization error: features derived from overall visual activity levels make the algorithm more robust in the acoustic domain, and, in turn, multimodal speaker diarization supports visual speaker localization.
Methods and Tools
Some of the areas in which ICSI’s Speech and Audio & Multimedia groups have developed new techniques contributing to speaker diarization include:
Speech/Non-Speech Detection: Speech/nonspeech detection is used in diarization in multiple important ways, most notably to indicate when speech is occurring and to help predict speaker turns. Much current research focuses on resolving detection difficulties caused by environmental or background noise. Our system is part of the SHoUT Toolkit, which was developed by ICSI collaborator Marijn Huijbregts while he was a researcher at the University of Twente.
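As a toy illustration of the speech/non-speech idea (not the SHoUT detector itself, which is model-based and far more noise-robust), a minimal energy-threshold detector might look like the sketch below; the function name `energy_vad` and the thresholds are our own assumptions:

```python
import numpy as np

def energy_vad(signal, sample_rate=16000, frame_ms=25, hop_ms=10,
               threshold_db=-35.0):
    """Label each frame speech/non-speech by log-energy.

    Illustrative stand-in for a real detector: frames whose energy lies
    within `threshold_db` of the loudest frame are marked as speech.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    energies = np.array([
        np.sum(signal[i * hop:i * hop + frame_len] ** 2) + 1e-12
        for i in range(n_frames)
    ])
    log_e = 10.0 * np.log10(energies)
    return log_e > (log_e.max() + threshold_db)

# Toy input: 1 s of near-silence with a louder tone burst in the middle.
rng = np.random.default_rng(0)
sig = 0.001 * rng.standard_normal(16000)
sig[6000:10000] += 0.5 * np.sin(2 * np.pi * 220 * np.arange(4000) / 16000)
labels = energy_vad(sig)   # True where "speech" was detected
```

A real detector would add model-based scoring and temporal smoothing; the point here is only the frame/energy bookkeeping.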
Beamforming: We use beamforming in two ways for diarization: first, to amplify the loudest speaker, and second, for speaker localization. Our tool for this is BeamformIt, which was developed at ICSI by Xavier Anguera. BeamformIt can accept audio data from a variable number of input channels and computes a single output via a filter-and-sum beamforming technique.
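A much-simplified sketch of the idea behind filter-and-sum beamforming: estimate each channel’s delay against a reference channel by cross-correlation, align the channels, and average them. BeamformIt additionally weights channels and tracks delays over time; the `delay_and_sum` helper below is purely illustrative:

```python
import numpy as np

def delay_and_sum(channels, ref=0, max_delay=800):
    """Align each channel to a reference via cross-correlation, then average.

    Simplified delay-and-sum; a filter-and-sum system such as BeamformIt
    also applies per-channel weights and time-varying delay tracking.
    """
    ref_sig = channels[ref]
    aligned = []
    for ch in channels:
        # Delay estimate: lag of the cross-correlation peak near zero.
        corr = np.correlate(ch, ref_sig, mode="full")
        center = len(ch) - 1
        lo, hi = center - max_delay, center + max_delay + 1
        lag = np.argmax(corr[lo:hi]) + lo - center  # samples ch lags ref by
        aligned.append(np.roll(ch, -lag))
    return np.mean(aligned, axis=0)

# Two "microphones": the same broadband signal, the second delayed 50 samples.
rng = np.random.default_rng(0)
clean = rng.standard_normal(4000)
mic1 = clean.copy()
mic2 = np.roll(clean, 50)
out = delay_and_sum([mic1, mic2])   # recovers the aligned signal
```

Real meeting audio needs short-window, time-varying delay estimates rather than one global lag, but the peak-picking step is the same in spirit.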
Features for Diarization: We use a wide variety of features for diarization, including short-term features such as Mel Frequency Cepstral Coefficients (MFCCs), long-term prosodic features, delay features such as those generated through beamforming, and modulation-filtered spectrogram (MSG) features.
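For readers unfamiliar with the short-term features, here is a from-scratch NumPy sketch of the standard MFCC pipeline (pre-emphasis, framing, windowing, power spectrum, mel filterbank, log, DCT-II). The parameter values are common defaults, not necessarily those of ICSI’s front end:

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_ceps=13):
    """Standard MFCC pipeline in plain NumPy (illustrative defaults)."""
    # Pre-emphasis boosts high frequencies.
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Frame into overlapping windows and apply a Hamming window.
    n_frames = 1 + (len(sig) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = sig[idx] * np.hamming(n_fft)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular mel filterbank, equally spaced on the mel scale.
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_inv = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = mel_inv(np.linspace(mel(0.0), mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_mel = np.log(power @ fbank.T + 1e-10)
    # DCT-II decorrelates the log energies; keep the first n_ceps.
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_mels))
    return log_mel @ dct.T

sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
feats = mfcc(sig)   # one 13-dim cepstral vector per 10 ms frame
```

Delay and MSG features follow different pipelines; this sketch covers only the MFCC case.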
Segmentation/Clustering: Segmentation and clustering together form an unsupervised task in which the number of speakers is unknown in advance. Our system uses an agglomerative hierarchical clustering approach based on a Hidden Markov Model (HMM), which models the temporal structure of the acoustic observations, and Gaussian Mixture Models (GMMs), which model the multimodal characteristics of the data.
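In such agglomerative systems the merge decision is commonly made with a ΔBIC criterion: merge two clusters if one shared model explains their data better, after a complexity penalty, than two separate models. The sketch below illustrates the idea with a single full-covariance Gaussian per cluster instead of GMMs, and without the HMM re-segmentation pass, so it is a deliberate simplification of the actual system:

```python
import numpy as np

def delta_bic(x, y, lam=1.0):
    """ΔBIC merge score for two segments under single full-covariance
    Gaussians; a positive score favors merging (same speaker)."""
    def nll(z):  # 0.5 * N * log|Sigma|, up to constants shared by all models
        cov = np.cov(z, rowvar=False) + 1e-6 * np.eye(z.shape[1])
        return 0.5 * len(z) * np.linalg.slogdet(cov)[1]
    xy = np.vstack([x, y])
    d = x.shape[1]
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(len(xy))
    return (nll(x) + nll(y) + penalty) - nll(xy)

def agglomerate(segments, lam=1.0):
    """Greedily merge the pair with the best ΔBIC until none is positive;
    returns a cluster label per input segment."""
    clusters = [[i] for i in range(len(segments))]
    data = list(segments)
    while len(clusters) > 1:
        best, pair = 0.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = delta_bic(data[i], data[j], lam)
                if score > best:
                    best, pair = score, (i, j)
        if pair is None:
            break
        i, j = pair
        data[i] = np.vstack([data[i], data[j]])
        clusters[i] += clusters[j]
        del data[j], clusters[j]
    labels = np.empty(len(segments), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels

# Synthetic features: segments alternating between two "speakers".
rng = np.random.default_rng(1)
spk_a = lambda n: rng.normal(0.0, 1.0, (n, 4))
spk_b = lambda n: rng.normal(6.0, 1.0, (n, 4))
segs = [spk_a(200), spk_b(200), spk_a(200), spk_b(200)]
labels = agglomerate(segs)   # same-speaker segments share a label
```

The real system interleaves these merges with Viterbi re-segmentation under the HMM and re-trains GMMs after each merge.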
Online Diarization: Online diarization tries to answer the question “Who is speaking now?” in real time. We use pretrained Gaussian-based models for each speaker, as well as a model trained for nonspeech (or, alternatively, we include a real-time speech/nonspeech detector). This system is lightweight, and we have used it extensively in other experiments and applications.
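A minimal sketch of the online scheme: score each incoming feature frame against pretrained per-speaker models and smooth the decision over recent frames. Here each speaker is a single Gaussian rather than a GMM, the nonspeech model is omitted, and names like `OnlineDiarizer` are our own assumptions:

```python
import numpy as np

class OnlineDiarizer:
    """Frame-by-frame 'who is speaking now?' with pretrained models.

    One full-covariance Gaussian per speaker (the actual system uses GMMs
    plus a nonspeech model); decisions are smoothed by majority vote.
    """
    def __init__(self, models, smooth=20):
        # models: {name: (mean_vector, covariance_matrix)}
        self.models = {
            name: (mu, np.linalg.inv(cov), np.linalg.slogdet(cov)[1])
            for name, (mu, cov) in models.items()
        }
        self.smooth = smooth
        self.history = []

    def step(self, frame):
        # Gaussian log-likelihood of the frame under each speaker model.
        scores = {}
        for name, (mu, prec, logdet) in self.models.items():
            diff = frame - mu
            scores[name] = -0.5 * (logdet + diff @ prec @ diff)
        self.history.append(max(scores, key=scores.get))
        recent = self.history[-self.smooth:]
        return max(set(recent), key=recent.count)

# Two hypothetical pretrained speakers with well-separated means.
rng = np.random.default_rng(2)
models = {"alice": (np.zeros(3), np.eye(3)),
          "bob": (np.full(3, 5.0), np.eye(3))}
dia = OnlineDiarizer(models)
# Feed 30 frames of Alice, then 30 of Bob.
outs = [dia.step(rng.normal(0, 1, 3)) for _ in range(30)]
outs += [dia.step(rng.normal(5, 1, 3)) for _ in range(30)]
```

The majority-vote window trades latency for stability; a real system would also gate frames through the speech/nonspeech detector before scoring.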
Corpora for Speech Recognition and Speaker Diarization Research @ ICSI
ICSI Meeting Corpus: We produced an audio corpus of 40+ hours of multichannel studio-quality recordings of actual meetings, the largest of its kind at the time it was released.
- The corpus and transcripts are available through the Linguistic Data Consortium (LDC).
- A full description and documentation are available on the ICSI Meeting Corpus website.
AMI Meeting Corpus: ICSI is a member of the Augmented Multi-party Interaction (AMI) consortium, which produced an audio and video corpus of 100 hours of mostly scenario-driven meetings.
- The corpus, transcriptions, and annotations are available at the AMI Meeting Corpus website.
Researchers @ ICSI (Past and Current):
- Jitendra Ajmera
- Xavi Anguera
- Kofi Boakye
- Gerald Friedland
- Luke Gottlieb
- Yan Huang
- Marijn Huijbregts
- Bao-Lan Huynh
- David Imseng
- Adam Janin
- Mary Knox
- Nikki Mirghafori
- Nelson Morgan
- Jose-Manuel Pardo
- Beatriz Trueba-Hornero
- David Van Leeuwen
- Carlos Vaquero
- Oriol Vinyals
- Chuck Wooters
Collaborators @ Other Institutions:
- Jike Chong
- Henry Cook
- Armando Fox
- Daniel Gatica-Perez
- Ekaterina Gonina
- Javier Hernando
- Hayley Hung
- Shoaib Kamil
- Christian Müller
- David Patterson
ICSI’s speaker diarization work has been supported by the Parallel Computing Laboratory at the University of California at Berkeley (ParLab), by the Swiss National Center of Competence in Research on Interactive Multimodal Information Management (IM2), through the Augmented Multi-party Interaction project, and by a DARPA Robust Automatic Transcription of Speech (RATS) grant, among others. The opinions, findings, and conclusions described on this website are those of the researchers and do not necessarily reflect the views of the funders.