Multimodal Location Estimation
Advances in multimedia content analysis have recently made it possible to bring together various strands of research to develop systems that can automatically estimate the location of consumer-produced media recordings (photo, video, or audio) that lack geo-location metadata such as GPS tags. We call this approach multimodal location estimation. Just as human analysts use multiple sources of information to determine geolocation, automatic systems can get better results by analyzing and combining clues across different sensory modalities than by examining only one stream of sensory input. We therefore think approaches to this task should be inherently multimodal.
Let’s imagine a video for which the location is unknown. Acoustic event detection on the audio track reveals a siren of a type usually found only on American police cars, and automatic language identification detects English spoken in a dialect identified with a Southern state. An image object recognizer finds several textures that are associated with a specific type of terrain, one whose vegetation is found only in humid, subtropical areas. Classification of the birdsong in the background indicates that the recording might be from the southern portion of the U.S. For a couple of frames, a building appears that matches Flickr photos of the Parthenon. Taken alone, that match would point to Athens, but Nashville has a full-scale replica of the Parthenon, and only that reading is consistent with the other clues. The combination of these clues is sufficient evidence to conclude that the video is from the Nashville, TN area. This idea is frequently used in crime novels and TV shows, where detectives use small clues from pictures or videos to figure out where a hostage is being held or where the bad guy is hiding.
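To make the clue-combination idea above concrete, here is a minimal sketch of one simple way to fuse per-modality evidence: treat each clue as an independent likelihood over candidate regions and multiply them (naive Bayes fusion in log space). This is an illustration under stated assumptions, not a description of any particular system. The candidate regions, the likelihood values, and the names `MODALITY_LIKELIHOODS` and `fuse` are all hypothetical and chosen only to mirror the example.

```python
import math

# Hypothetical likelihoods P(clue | region) for the imagined video.
# The numbers are made up for illustration; a real system would get
# them from trained classifiers for each modality.
MODALITY_LIKELIHOODS = {
    "siren (acoustic event)": {
        "Nashville, TN": 0.32, "Athens, Greece": 0.04,
        "Atlanta, GA": 0.32, "Seattle, WA": 0.32,
    },
    "dialect (language ID)": {
        "Nashville, TN": 0.45, "Athens, Greece": 0.02,
        "Atlanta, GA": 0.45, "Seattle, WA": 0.08,
    },
    "vegetation (visual texture)": {
        "Nashville, TN": 0.40, "Athens, Greece": 0.10,
        "Atlanta, GA": 0.40, "Seattle, WA": 0.10,
    },
    "birdsong (audio)": {
        "Nashville, TN": 0.40, "Athens, Greece": 0.05,
        "Atlanta, GA": 0.40, "Seattle, WA": 0.15,
    },
    "landmark match (visual)": {
        "Nashville, TN": 0.45, "Athens, Greece": 0.50,
        "Atlanta, GA": 0.025, "Seattle, WA": 0.025,
    },
}


def fuse(likelihoods, prior=None, floor=1e-9):
    """Combine per-modality likelihoods into a posterior over regions.

    Assumes the modalities are conditionally independent given the
    region, which is a simplification. Values are floored before
    taking logs so that a single zero cannot rule out a region.
    """
    regions = sorted(next(iter(likelihoods.values())))
    if prior is None:
        prior = {r: 1.0 / len(regions) for r in regions}  # uniform prior

    log_scores = {r: math.log(max(prior[r], floor)) for r in regions}
    for scores in likelihoods.values():
        for r in regions:
            log_scores[r] += math.log(max(scores[r], floor))

    # Normalize with the log-sum-exp trick for numerical stability.
    peak = max(log_scores.values())
    total = sum(math.exp(s - peak) for s in log_scores.values())
    return {r: math.exp(s - peak) / total for r, s in log_scores.items()}


if __name__ == "__main__":
    posterior = fuse(MODALITY_LIKELIHOODS)
    for region, p in sorted(posterior.items(), key=lambda kv: -kv[1]):
        print(f"{region:16s} {p:.4f}")
```

With these made-up numbers, the landmark match on its own slightly favors Athens, but the fused posterior favors Nashville because it is the only candidate that every clue supports. The independence assumption is the main design shortcut here; correlated clues (for example, two audio-derived cues) would be over-counted by this scheme.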
The current state of the art has not yet produced a system that can incorporate information from all of these modalities at once. However, systems are being built that combine information from two or more sources to achieve more accurate location results, and a new community is growing up among multimedia and machine-learning researchers interested in tackling this challenging but exciting problem. Its activities include the Placing Task in the MediaEval Benchmarking Initiative and a number of workshops at major conferences such as ACM Multimedia. Given the massive amount of training data available on the Internet, we think the budding location-estimation research field offers a chance for the multimedia community to tackle challenging machine-learning problems using increasingly heterogeneous input, which can lead to better understanding and more generalizable solutions.
ICSI has been a leader in this research area, producing or contributing to top-performing automatic location estimation algorithms that combine cues from visual, acoustic, and textual data, along with information from external sources. Our recent location-estimation work has also led to high-impact new projects exploring what these new abilities mean for online privacy, since they make it possible to locate even Internet users who are trying not to be found.
Building an automatic system that estimates the probable recording location of non-geotagged user-generated media content, using visual and acoustic features, textual tags, and information from external sources.
We used crowdsourced labor to provide a human baseline against which to evaluate automatic location-estimation systems. How well does a human do at this task? And what can that tell us about how to improve computational approaches?
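As a rough illustration of how a human baseline and an automatic system could be compared, the sketch below scores each location guess by its great-circle distance from the true recording location and summarizes each set of guesses by its median error. The choice of metric is our assumption for this example, since the text above does not specify one, and the function names `haversine_km` and `median_error_km` are hypothetical.

```python
import math
import statistics

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points
    given in decimal degrees, using the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def median_error_km(estimates, ground_truth):
    """Median distance error over paired lists of (lat, lon) tuples."""
    errors = [haversine_km(e[0], e[1], t[0], t[1])
              for e, t in zip(estimates, ground_truth)]
    return statistics.median(errors)


if __name__ == "__main__":
    # Made-up coordinates for illustration only.
    truth = [(36.16, -86.78), (37.98, 23.73), (47.61, -122.33)]
    human = [(36.00, -86.50), (38.50, 23.00), (45.50, -122.70)]
    system = [(33.75, -84.39), (37.90, 23.70), (47.60, -122.30)]
    print("human median error (km): ", round(median_error_km(human, truth), 1))
    print("system median error (km):", round(median_error_km(system, truth), 1))
```

The median is used here rather than the mean because a single guess on the wrong continent would otherwise dominate the summary.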