Speaker Diarization

Speaker diarization consists of segmenting and clustering a speech recording into speaker-homogeneous regions using an unsupervised algorithm. In other words, given an audio track of a meeting, a speaker-diarization system will automatically discriminate between and label the different speakers (“Who spoke when?”). This involves speech/non-speech detection (“When is there speech?”) and overlap detection and resolution (“Who is overlapping with whom?”), as well as speaker identification.

Knowing when each speaker is talking in an audio or video recording can be useful in and of itself, but it is also an important processing step in many tasks. For example, in the field of rich transcription, speaker diarization is used both as a stand-alone application that attributes speaker regions to an audio or video file and as a preprocessing step for speech recognition. Using diarization for speech recognition enables speaker-attributed speech-to-text and can be used as the basis for different modes of adaptation, e.g., vocal tract length normalization (VTLN) and speaker-model adaptation. This task has therefore become central in the speech-research community, as evidenced by its inclusion in NIST’s Rich Transcription Evaluation, in which ICSI has frequently participated.

ICSI has a long history of research in speaker diarization, repeatedly contributing to the state of the art and leading the field in taking on more complex tasks like nonscripted multi-party interaction. We have used or extended our speaker diarization systems to perform automated copyright-violation detection, video navigation and retrieval, and automatic interactional analysis.

Ongoing research in the Speech and Audio & Multimedia groups aims to improve the robustness and efficiency of current approaches, to further develop our online (real-time) diarization techniques (“Who is speaking now?”), and to integrate speaker diarization techniques into multimodal approaches for video analysis. In our video-analysis work on percepts, we are extending speaker diarization techniques to the classification of non-speech sounds.

Current Projects

AURORA: Content-Guided Search of Diverse Videos:

A state-of-the-art video-search system being built by multiple institutions. ICSI contributes expertise in using audio concept detection for event identification and video categorization. (Part of IARPA’s ALADDIN program.)

Recent Projects


A browser that presents users with the basic narrative elements of a sitcom — scenes, punchlines, dialogue segments, etc. — and a per-actor filter on top of a standard video player interface, so they can navigate to particular elements or moments they remember.

The Meeting Diarist:

A tool that automatically generates a “diary” of a meeting based on recordings, allowing a user to navigate directly to the contributions of particular participants, to search dialog by keyword, or to save time by listening to just the interesting-looking parts.

Fast Speaker Diarization using Python:

A specialization framework to automatically map and execute computationally intensive Gaussian Mixture Model (GMM) training on an NVIDIA graphics processing unit from Python code, without significant loss in diarization accuracy.

Dominance Estimation:

A set of experiments to identify the most dominant people in meetings. This research also studied performance trade-offs between the system’s execution speed, interventions to make it work better, and the distance of the audio sensor from participants.


Uses video data to lower speaker-diarization error, using features derived from overall visual activity levels to make the algorithm more robust in the acoustic domain — and, in turn, using multimodal speaker diarization to support visual speaker localization.

Methods and Tools

Some of the areas in which ICSI’s Speech and Audio & Multimedia groups have developed new techniques contributing to speaker diarization include:

Speech/Non-Speech Detection: Speech/non-speech detection serves diarization in several important ways, most notably by indicating when speech is occurring and by helping to predict speaker turns. Much current research focuses on resolving detection difficulties caused by environmental or background noise. Our system is part of the SHoUT Toolkit, developed by ICSI collaborator Marijn Huijbregts, then a researcher at the University of Twente.
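As an illustration of the basic idea (not the SHoUT Toolkit's actual algorithm, which is considerably more sophisticated), a minimal energy-based speech/non-speech detector can be sketched in a few lines of NumPy; the frame length, hop, and threshold values here are illustrative assumptions:

```python
import numpy as np

def energy_vad(signal, frame_len=400, hop=160, threshold_db=-35.0):
    """Label each frame as speech (True) or non-speech (False) by
    comparing its short-term energy, relative to the loudest frame,
    against a fixed dB threshold. Frame/hop of 400/160 samples
    correspond to 25 ms / 10 ms at 16 kHz."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    energies = np.array([
        np.mean(signal[i * hop:i * hop + frame_len] ** 2)
        for i in range(n_frames)
    ])
    ref = energies.max() + 1e-12          # loudest frame as reference
    db = 10.0 * np.log10(energies / ref + 1e-12)
    return db > threshold_db
```

Real systems replace the fixed threshold with trained speech and non-speech models, since a pure energy gate fails under background noise — exactly the difficulty the research above targets.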

Beamforming: We use beamforming in two ways for diarization: first, to amplify the loudest speaker, and second, for speaker localization. Our tool for this is BeamformIt, developed at ICSI by Xavier Anguera. BeamformIt accepts audio data from a variable number of input channels and computes a single output via a filter-and-sum beamforming technique.
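The simplest member of this family, delay-and-sum beamforming, conveys the core idea: estimate each channel's delay relative to a reference channel, align the channels, and average them so that the dominant source adds coherently while noise does not. The sketch below uses plain cross-correlation for delay estimation (BeamformIt itself uses more robust techniques such as GCC-PHAT with filtering):

```python
import numpy as np

def estimate_delay(ref, sig, max_lag=200):
    """Estimate the integer-sample delay of `sig` relative to `ref`
    via cross-correlation (positive = `sig` lags behind `ref`).
    The middle slice avoids samples wrapped around by np.roll."""
    lags = np.arange(-max_lag, max_lag + 1)
    corr = [np.sum(ref[max_lag:-max_lag] *
                   np.roll(sig, -lag)[max_lag:-max_lag])
            for lag in lags]
    return lags[int(np.argmax(corr))]

def delay_and_sum(channels, max_lag=200):
    """Align every channel to the first one and average them,
    boosting the dominant (loudest) source relative to noise."""
    ref = channels[0]
    aligned = [ref]
    for ch in channels[1:]:
        lag = estimate_delay(ref, ch, max_lag)
        aligned.append(np.roll(ch, -lag))   # circular shift; fine for a sketch
    return np.mean(aligned, axis=0)
```

The estimated per-channel delays are themselves useful: as noted under "Features for Diarization" below, time-delay features carry speaker-location information that complements acoustic features.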

Features for Diarization: We use a wide variety of features for diarization, including short-term features such as Mel Frequency Cepstral Coefficients (MFCCs), long-term features such as prosodics, delay features such as those generated through beamforming, and modulation-filtered spectrogram (MSG) features.
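To make the short-term features concrete, here is a compact, self-contained MFCC computation in NumPy (framing, Hamming window, power spectrum, mel filterbank, log, DCT-II). It is a textbook sketch for illustration, not ICSI's feature-extraction code; parameter defaults (16 kHz audio, 25 ms/10 ms frames, 26 filters, 13 coefficients) are common conventions, not values taken from the source:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fbank[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[i - 1, k] = (r - k) / max(r - c, 1)
    return fbank

def mfcc(signal, sr=16000, frame_len=400, hop=160,
         n_filters=26, n_ceps=13, n_fft=512):
    """Return an (n_frames, n_ceps) matrix of MFCC features."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    window = np.hamming(frame_len)
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n_fft, axis=1)) ** 2
    logmel = np.log(power @ mel_filterbank(n_filters, n_fft, sr).T + 1e-10)
    # DCT-II over the filter axis; keep the first n_ceps coefficients
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_filters)))
    return logmel @ dct.T
```

In a diarization system, feature matrices like this one are what the segmentation/clustering stage described below actually operates on.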

Segmentation/Clustering: Segmentation is an unsupervised clustering task in which the number of entities is unknown. Our system uses an agglomerative hierarchical clustering approach based on a Hidden Markov Model (HMM), which models the temporal structure of the acoustic observations, and Gaussian Mixture Models (GMMs), which model the multimodal characteristics of the data.
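A heavily simplified sketch of the agglomerative idea, under assumptions that depart from the full HMM/GMM system: each segment is modelled by a single full-covariance Gaussian rather than a GMM, there is no HMM realignment between merges, and the classic delta-BIC of Chen and Gopalakrishnan is used as the merge criterion. The pair with the lowest delta-BIC is merged until no pair looks like the same speaker:

```python
import numpy as np

def delta_bic(a, b, lam=1.0):
    """Delta-BIC for merging two segments modelled by single
    full-covariance Gaussians; negative suggests 'same speaker'."""
    n1, n2 = len(a), len(b)
    ab = np.vstack([a, b])
    d = ab.shape[1]
    def logdet(x):
        # small ridge keeps the covariance well-conditioned
        _, val = np.linalg.slogdet(np.cov(x, rowvar=False) + 1e-6 * np.eye(d))
        return val
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n1 + n2)
    return (0.5 * ((n1 + n2) * logdet(ab) - n1 * logdet(a) - n2 * logdet(b))
            - lam * penalty)

def agglomerate(segments, lam=1.0):
    """Greedily merge the segment pair with the lowest delta-BIC
    until every remaining pair looks like distinct speakers."""
    clusters = [np.asarray(s) for s in segments]
    while len(clusters) > 1:
        pairs = [(delta_bic(clusters[i], clusters[j], lam), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        best, i, j = min(pairs)
        if best >= 0:
            break                      # no pair passes the merge test
        clusters[i] = np.vstack([clusters[i], clusters[j]])
        del clusters[j]
    return clusters
```

Because the stopping criterion is built into the BIC penalty, the number of speakers never has to be specified in advance — the defining property of diarization as an unsupervised task.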

Online Diarization: Online diarization addresses the question “Who is speaking now?” in real time. We use pretrained Gaussian-based models for each speaker, as well as a model trained for non-speech (or, alternatively, we include a real-time speech/non-speech detector). The system is lightweight, and we have used it extensively in other experiments and applications.
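The core of such a system — scoring each incoming frame against a bank of pretrained models and reporting the best-scoring label — can be sketched as follows. This is an illustrative frame-by-frame decoder, not ICSI's implementation; a deployed system would also smooth the decisions over time (e.g., with a minimum-duration constraint):

```python
import numpy as np

def gmm_loglik(x, weights, means, variances):
    """Per-frame log-likelihood under a diagonal-covariance GMM.
    x: (T, d); weights: (M,); means, variances: (M, d)."""
    diff = x[:, None, :] - means[None, :, :]                      # (T, M, d)
    comp = (-0.5 * np.sum(diff ** 2 / variances
                          + np.log(2 * np.pi * variances), axis=2)
            + np.log(weights))                                    # (T, M)
    m = comp.max(axis=1, keepdims=True)                           # log-sum-exp
    return (m + np.log(np.sum(np.exp(comp - m), axis=1,
                              keepdims=True))).ravel()

def who_is_speaking(frame_feats, models):
    """Score each incoming frame against every pretrained model
    (speakers plus non-speech) and return the best label per frame.
    `models` maps label -> (weights, means, variances)."""
    labels = list(models)
    scores = np.stack([gmm_loglik(frame_feats, *models[k]) for k in labels])
    return [labels[i] for i in np.argmax(scores, axis=0)]
```

Because the models are fixed in advance, each frame costs only a handful of Gaussian evaluations, which is what makes the approach lightweight enough for real-time use.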

Corpora for Speech Recognition and Speaker Diarization Research @ ICSI

ICSI Meeting Corpus: We produced an audio corpus of 40+ hours of multichannel studio-quality recordings of actual meetings, the largest of its kind at the time it was released.

AMI Meeting Corpus: ICSI is a member of the Augmented Multi-party Interaction (AMI) consortium, which produced an audio and video corpus of 100 hours of mostly scenario-driven meetings.


Speaker diarization work at ICSI is a collaboration between the Speech and Audio & Multimedia research groups, as well as with researchers at UC Berkeley’s ParLab and other institutions.

Researchers @ ICSI (Past and Current):

  • Jitendra Ajmera
  • Xavi Anguera
  • Kofi Boakye
  • Gerald Friedland
  • Luke Gottlieb
  • Yan Huang
  • Marijn Huijbregts
  • Bao-Lan Huynh
  • David Imseng
  • Adam Janin
  • Mary Knox
  • Nikki Mirghafori
  • Nelson Morgan
  • Jose-Manuel Pardo
  • Beatriz Trueba-Hornero
  • David Van Leeuwen
  • Carlos Vaquero
  • Oriol Vinyals
  • Chuck Wooters

Collaborators @ Other Institutions:

  • Jike Chong
  • Henry Cook
  • Armando Fox
  • Daniel Gatica-Perez
  • Ekaterina Gonina
  • Javier Hernando
  • Hayley Hung
  • Shoaib Kamil
  • Christian Müller
  • David Patterson


ICSI’s speaker diarization work has been supported by the Parallel Computing Laboratory at the University of California at Berkeley (ParLab), by the Swiss National Center of Competence in Research on Interactive Multimodal Information Management (IM2), through the Augmented Multi-party Interaction project, and by a DARPA Robust Automatic Transcription of Speech (RATS) grant, among others. The opinions, findings, and conclusions described on this website are those of the researchers and do not necessarily reflect the views of the funders.