Corpora for AMM Research at ICSI
- Corpora from the Berkeley Multimodal Location Estimation Project
- Corpora for Scalable Big Data Analysis
- Corpora for Speech Recognition and Speaker Diarization Research
Corpus for the Ambulance Detection Task: We collected a set of videos containing ambulances from multiple cities, then trained an automatic system to identify which city an unknown ambulance video came from based on the sound of its siren.
- Training data: Index of links as a .txt file
- Test data: Index of links as a .txt file
- Both datasets: Index of links as a webpage
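The city-identification idea can be sketched as a simple nearest-centroid classifier: average the training feature vectors per city, then assign a new recording to the city with the closest centroid. The feature dimensions and city names below are invented placeholders; the actual system worked from features extracted from the videos' soundtracks.

```python
# Minimal nearest-centroid sketch of siren-based city identification.
# Feature vectors here are toy values (e.g., dominant pitch in Hz and
# sweep rate) -- NOT the features used in the real system.
from math import dist  # Euclidean distance, Python 3.8+


def centroid(vectors):
    """Mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))


def train(labeled):
    """labeled: {city: [feature_vector, ...]} -> {city: centroid}."""
    return {city: centroid(vecs) for city, vecs in labeled.items()}


def predict(model, vector):
    """Return the city whose centroid is nearest to the query vector."""
    return min(model, key=lambda city: dist(model[city], vector))


# Hypothetical training data: two cities, two siren recordings each.
training = {
    "Berlin":   [(435.0, 2.1), (440.0, 2.0)],
    "New York": [(700.0, 5.0), (710.0, 5.2)],
}
model = train(training)
print(predict(model, (705.0, 5.1)))  # -> New York
```

In practice the real task is much harder: sirens vary within a city, and recordings differ in quality, distance, and background noise, which is why a trained statistical model rather than a hand-set rule is needed.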
Corpus for the Indoor/Outdoor Detection Task: We collected a corpus of videos and tagged them for several features, then trained a system to automatically detect whether a novel video was recorded indoors or outdoors.
- Dataset: Tagged index of links as a webpage (Tags: indoor/outdoor/other, edited/unedited, genre, audio synced with video)
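A tagged index like the one above can be filtered programmatically. The actual index is a webpage; the sketch below assumes a hypothetical tab-separated export with one video per line (URL, indoor/outdoor/other tag, edited/unedited tag, genre) -- the field layout is an assumption, not the published format.

```python
# Filter a hypothetical tab-separated export of the tagged video index.
import csv
from io import StringIO

# Invented sample rows; the real index lists actual video links.
SAMPLE = """\
http://example.org/v1\tindoor\tunedited\thome video
http://example.org/v2\toutdoor\tedited\tsports
http://example.org/v3\tother\tunedited\tmusic
"""


def select(index_text, location_tag):
    """Return URLs whose location tag (indoor/outdoor/other) matches."""
    reader = csv.reader(StringIO(index_text), delimiter="\t")
    return [row[0] for row in reader if row[1] == location_tag]


print(select(SAMPLE, "outdoor"))  # -> ['http://example.org/v2']
```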
MediaEval 2014 Placing Task Dataset: A subset of the YLI corpus, provided for the MediaEval Benchmarking Initiative's Placing Task in 2014, 2015, and 2016.
Non-ICSI Corpora for Multimodal Location Estimation: Earlier MediaEval Placing Task datasets may be found on the MediaEval website.
YLI Corpus: Based on the Yahoo Flickr Creative Commons 100 Million (YFCC100M) dataset, a collection of 99.3 million images and 700 thousand videos from Flickr, compiled by Yahoo Labs. We are working with Lawrence Livermore National Laboratory to process the images and videos, computing frequently used audio and visual features and developing subcorpora for multimedia-analysis tasks.
- YFCC100M Dataset: The full multimedia dataset and metadata are available through Yahoo’s Webscope research-data portal.
- Features: Audio feature data for the full video dataset are available now, and visual feature data for images and video keyframes are being added.
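To illustrate the kind of low-level audio features computed over a corpus like this, the sketch below frames a signal and computes short-time energy and zero-crossing rate in plain Python. The frame size and toy signal are placeholders; the released feature sets use standard descriptors such as MFCCs rather than these two alone.

```python
# Illustrative frame-level audio features: short-time energy and
# zero-crossing rate. Toy values only; not the released feature set.
def frames(samples, size):
    """Split a sample sequence into non-overlapping frames of `size`."""
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]


def short_time_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs that change sign."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return crossings / (len(frame) - 1)


# Toy signal: alternating-sign samples give the maximum zero-crossing rate.
signal = [1.0, -1.0] * 8
for f in frames(signal, 8):
    print(short_time_energy(f), zero_crossing_rate(f))  # -> 1.0 1.0 (twice)
```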
ICSI Meeting Corpus: We produced an audio corpus of 40+ hours of multichannel studio-quality recordings of actual meetings, the largest of its kind at the time it was released.
- The corpus and transcripts are available through the Linguistic Data Consortium (LDC).
- A full description and documentation are available on the ICSI Meeting Corpus website.
AMI Meeting Corpus: ICSI is a member of the Augmented Multi-party Interaction (AMI) consortium, which produced an audio and video corpus of 100 hours of mostly scenario-driven meetings.
- The corpus, transcriptions, and annotations are available at the AMI Meeting Corpus website.
Non-ICSI Corpora for Speaker Diarization: Additional data used in our diarization work has come from, among other sources, the Rich Transcription Evaluation (RT Eval) run by the National Institute of Standards and Technology (NIST). Most of the RT Eval data was drawn from resources hosted by the LDC.