Scalable Big Data Analysis


Typically, scientists and engineers prefer to use high-level programming languages such as Python or MATLAB to conduct experiments, as they allow for the quick implementation of a novel idea. However, experiments on big data are often computationally very intensive, and therefore must be recoded into a low-level language by expert programmers to achieve sufficient performance. This process creates a gap between productivity and performance. In addition, there may be multiple strategies for mapping a particular computation onto parallel hardware, depending on the input data size and the hardware parameters, further exacerbating the problem.

Multimedia content analysis deals with one of the largest and fastest-growing bodies of data of any application area, owing to the steady upload of consumer-produced video, making it a natural test case for scalability efforts. In particular, social-media videos are increasingly being used for scientific research, as they allow us to observe and model many phenomena in, for example, the social sciences, economics, meteorology, and medicine. Content-analysis applications for this type of data are typically based on machine-learning techniques for classifying content and making predictions; building an accurate system may involve training on hundreds of thousands of examples, which can take days to process. It is therefore becoming increasingly necessary to parallelize these computationally demanding processes.

ICSI is building tools that generate optimized parallel implementations of multimedia content-analysis algorithms from Python code by mapping frequently occurring types of computations onto parallel platforms. These tools are designed to enable researchers and developers to prototype multimedia content-analysis algorithms at large scale. The ultimate aim of these projects is to combine the scalability of diverse parallel platforms with the productivity of high-level languages.
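The core idea behind such specialization tools can be sketched in a few lines: the same high-level operation is backed by several code variants, and one is chosen at call time based on properties of the input. The sketch below is purely illustrative (the names `specialized_sum` and `THRESHOLD` are invented for this example, and NumPy vectorization stands in for a real parallel backend); it is not code from any ICSI tool.

```python
import numpy as np

# Assumed crossover point between strategies; a real specializer would
# tune this from hardware parameters and profiling, not hard-code it.
THRESHOLD = 1_000

def sum_interpreted(xs):
    # Plain interpreted loop: low overhead, fine for small inputs.
    total = 0.0
    for x in xs:
        total += x
    return total

def sum_vectorized(xs):
    # Vectorized NumPy strategy, standing in for an optimized
    # parallel (e.g., GPU) implementation of the same computation.
    return float(np.sum(np.asarray(xs, dtype=np.float64)))

def specialized_sum(xs):
    # Dispatch on input size, mimicking how a specialization framework
    # might pick among implementations of one high-level operation.
    if len(xs) < THRESHOLD:
        return sum_interpreted(xs)
    return sum_vectorized(xs)
```

The caller always writes `specialized_sum(data)`; the choice of implementation, and any data movement it implies, stays hidden behind the high-level interface.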

Current Projects

SMASH (Scalable Multimedia content AnalysiS in a High-level language):

Developing tools for high-level analysis of large amounts of multimedia data, expanding on the PyCASP framework to support productive, efficient, portable, and scalable application development.

Recent Projects

PyCASP (Python-based Content Analysis using SPecialization):

A pattern-oriented, application-specific specialization framework that automatically generates optimized parallel implementations of multimedia content-analysis algorithms from high-level Python code.

Fast Speaker Diarization using Python:

A specialization framework to automatically map and execute computationally intensive Gaussian Mixture Model (GMM) training on an NVIDIA graphics processing unit from Python code, without significant loss in diarization accuracy.
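To make the workload concrete, the following is a minimal NumPy sketch of one EM iteration for a diagonal-covariance GMM, the kind of computation the diarization work offloads to a GPU. This is an illustrative CPU reference written for this page, not the ICSI framework's CUDA code; the function name and variance floor are assumptions of the sketch.

```python
import numpy as np

def gmm_em_step(X, weights, means, variances):
    """One EM iteration for a diagonal-covariance GMM.

    X: (N, D) data; weights: (K,); means, variances: (K, D).
    Returns updated (weights, means, variances).
    """
    N, D = X.shape
    # E-step: per-component Gaussian log densities, shape (N, K).
    diff = X[:, None, :] - means[None, :, :]              # (N, K, D)
    log_prob = -0.5 * (np.sum(diff**2 / variances[None], axis=2)
                       + np.sum(np.log(2 * np.pi * variances), axis=1))
    log_weighted = log_prob + np.log(weights)
    log_norm = np.logaddexp.reduce(log_weighted, axis=1, keepdims=True)
    resp = np.exp(log_weighted - log_norm)                # responsibilities
    # M-step: re-estimate parameters from soft counts.
    Nk = resp.sum(axis=0)                                 # (K,)
    new_weights = Nk / N
    new_means = (resp.T @ X) / Nk[:, None]
    new_vars = (resp.T @ X**2) / Nk[:, None] - new_means**2
    new_vars = np.maximum(new_vars, 1e-6)                 # variance floor
    return new_weights, new_means, new_vars
```

Every step is dense linear algebra over all N frames and K components at once, which is why the training maps so well onto a GPU's data-parallel hardware.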