Interactive Systems Lab

The purpose of the people identification module is to continuously track and identify meeting participants within a room. To increase the robustness and efficiency of the identification process, we have taken a multimodal approach and integrated a number of recognizers that use audio and video information. The system consists of five components: people segmentation, color appearance ID, speaker ID, face ID and multimodal information fusion.
Identifying an object in a given image or image sequence requires an internal representation of that object. Given such a model, it could be used to locate and identify objects in one unified step. Unfortunately, the search space the recognizer would have to cover in each run is too large to meet the real-time requirements of an interactive system. We therefore use a motion-based preprocessing step to segment people from the background before we try to identify them.
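As an illustration of motion-based preprocessing, the following minimal sketch uses simple frame differencing on grayscale frames. It is an assumption for illustration only, since the page does not say which motion segmentation technique the system uses; the function names `motion_mask` and `bounding_box` and the threshold value are hypothetical.

```python
def motion_mask(prev_frame, curr_frame, threshold=25):
    """Mark pixels whose gray value changed by more than `threshold`.

    Frames are lists of rows of integer intensities (0..255).
    The threshold is an illustrative value, not taken from the system.
    """
    return [
        [1 if abs(c - p) > threshold else 0 for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev_frame, curr_frame)
    ]


def bounding_box(mask):
    """Return (top, left, bottom, right) of all moving pixels, or None."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, v in enumerate(row) if v]
    if not rows:
        return None
    return (min(rows), min(cols), max(rows), max(cols))


# A static 4x4 background and a frame in which a 2x2 region has changed.
background = [[0] * 4 for _ in range(4)]
frame = [row[:] for row in background]
for r in (1, 2):
    for c in (1, 2):
        frame[r][c] = 200

box = bounding_box(motion_mask(background, frame))
```

Only the region inside `box` would then be passed on to the identification stages, which keeps the search space small.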
Based on the segmentation produced by the people segmentation module, we create models for the different meeting participants using color histograms. Color histograms provide a stable object representation that is largely unaffected by occlusions or changes in view. A major obstacle to using color for object identification is that observed colors change with illumination.
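A color-histogram model and its matching step can be sketched as follows. This is a minimal sketch under stated assumptions: it builds the histogram over normalized chromaticities (r, g), which is one common way to reduce sensitivity to brightness changes, and it compares histograms by histogram intersection. Neither choice is confirmed by this page as the system's method, and the names `chromaticity_histogram`, `intersection` and `identify` are hypothetical.

```python
def chromaticity_histogram(pixels, bins=8):
    """Build a normalized 2D histogram over (r, g) chromaticities.

    `pixels` is an iterable of (R, G, B) tuples. Dividing by R+G+B
    discards overall brightness (an assumed normalization).
    """
    hist = [[0.0] * bins for _ in range(bins)]
    count = 0
    for r, g, b in pixels:
        total = r + g + b
        if total == 0:
            continue  # black pixels carry no chromaticity information
        ri = min(int(r / total * bins), bins - 1)
        gi = min(int(g / total * bins), bins - 1)
        hist[ri][gi] += 1.0
        count += 1
    if count:
        hist = [[v / count for v in row] for row in hist]
    return hist


def intersection(h1, h2):
    """Histogram intersection: 1.0 for identical normalized histograms."""
    return sum(min(a, b) for r1, r2 in zip(h1, h2) for a, b in zip(r1, r2))


def identify(pixels, models, bins=8):
    """Return the name of the stored model that best matches `pixels`."""
    query = chromaticity_histogram(pixels, bins)
    return max(models, key=lambda name: intersection(query, models[name]))


# Hypothetical participants modeled from a red and a blue shirt.
models = {
    "red": chromaticity_histogram([(200, 20, 20)] * 10),
    "blue": chromaticity_histogram([(20, 20, 200)] * 10),
}
# The same red shirt at half the brightness still matches the "red" model.
dim_red = [(100, 10, 10)] * 10
best = identify(dim_red, models)
```

Chromaticity normalization only compensates for changes in brightness; changes in the color of the light source would need additional handling.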







The speaker ID module has to determine which meeting participant is speaking at any given time, independent of what they are saying. This can be seen as a text-independent, closed-set speaker identification task. We assume that both convolutional and additive noise are consistent, except for occasional events such as a phone ringing or a door slamming. The limited training and test sets are collected in the same noise environment. Our experiments show that if training and testing are done under the same noise conditions, performance is comparable to that achieved on clean speech.
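The closed-set decision rule can be sketched as follows: each enrolled speaker gets a statistical model of their acoustic feature vectors, and a test segment is assigned to the speaker whose model gives it the highest average log-likelihood. This is a minimal sketch, not the system's implementation. It assumes a single diagonal-covariance Gaussian per speaker as a simplified stand-in for the mixture models often used for this task, and the two-dimensional feature vectors and speaker names are made up for illustration.

```python
import math


def train_model(frames):
    """Fit a diagonal Gaussian (mean, variance per dimension) to feature frames."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    # Floor the variance so a constant dimension cannot cause division by zero.
    var = [
        max(sum((f[d] - mean[d]) ** 2 for f in frames) / n, 1e-6)
        for d in range(dim)
    ]
    return mean, var


def log_likelihood(frames, model):
    """Average per-frame log-likelihood of `frames` under `model`."""
    mean, var = model
    total = 0.0
    for f in frames:
        for x, m, v in zip(f, mean, var):
            total += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
    return total / len(frames)


def identify_speaker(frames, models):
    """Closed-set decision: always return one of the enrolled speakers."""
    return max(models, key=lambda s: log_likelihood(frames, models[s]))


# Hypothetical enrollment data: two speakers with well-separated features.
models = {
    "alice": train_model([[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0], [0.1, 0.2]]),
    "bob": train_model([[5.0, 5.1], [5.2, 4.9], [4.9, 5.0], [5.1, 5.2]]),
}
who = identify_speaker([[0.05, 0.0], [0.1, 0.1]], models)
```

Because the models score acoustic features rather than words, the decision does not depend on what is said, which is what makes the task text-independent.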

 


