|
The purpose of the people identification
module is to continuously track and identify meeting participants
within a room. In order to increase the robustness and efficiency
of the identification process we have taken a multimodal approach
and integrated a number of recognizers that use audio and video
information. The system comprises five components: people
segmentation, color appearance ID, speaker ID, face ID and multimodal
information fusion.
The ability to identify an object in a given image or image sequence
requires the availability of an internal representation of said
object. Assuming that such a model is given, it could be used
to locate and identify objects in one unified step. Unfortunately,
the search space the recognizer would have to cover in each run
is too large to meet the real-time requirements of an interactive
system. We therefore use a motion-based preprocessing step to segment
people from the background before we try to identify them.
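As an illustration of such a motion-based preprocessing step, the sketch below uses simple frame differencing on grayscale frames. This is an assumption made for the example: the text does not state which motion cue the segmentation module relies on, and the names `motion_mask` and `bounding_box` as well as the threshold value are hypothetical.

```python
def motion_mask(prev_frame, curr_frame, threshold=25):
    """Mark pixels whose grayscale value changed by more than `threshold`.

    Frames are lists of rows of integers in 0..255. Frame differencing is
    an assumed technique here, not necessarily the one used by the module.
    """
    return [
        [1 if abs(curr - prev) > threshold else 0
         for prev, curr in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev_frame, curr_frame)
    ]


def bounding_box(mask):
    """Return (top, left, bottom, right) of the moving region, or None."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, value in enumerate(row) if value]
    if not rows:
        return None  # nothing moved between the two frames
    return (min(rows), min(cols), max(rows), max(cols))


# Toy example: a 2x2 bright patch appears in an otherwise static 4x4 frame.
previous = [[0] * 4 for _ in range(4)]
current = [row[:] for row in previous]
for r in (1, 2):
    for c in (1, 2):
        current[r][c] = 200

region = bounding_box(motion_mask(previous, current))
```

Restricting the later recognizers to `region` is what keeps the search space small enough for interactive use.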
Based on the segmentation derived by the people segmentation module,
we create models for the different meeting participants using color
histograms. Color histograms provide a stable object representation,
which is largely unaffected by occlusions or changes in view. A
major obstacle to using color for object identification is
that observed colors change with illumination.
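A minimal sketch of how a color-histogram model could be built and matched is given below. It assumes quantized RGB histograms compared by histogram intersection; the text does not specify the color space, the bin count, or the similarity measure, so these choices and the names `color_histogram`, `histogram_intersection` and `identify` are hypothetical.

```python
from collections import Counter


def color_histogram(pixels, bins_per_channel=4):
    """Build a normalized histogram over quantized (r, g, b) pixels.

    `pixels` is an iterable of (r, g, b) tuples with values in 0..255,
    for example the pixels inside a segmented person region.
    """
    counts = Counter(
        (r * bins_per_channel // 256,
         g * bins_per_channel // 256,
         b * bins_per_channel // 256)
        for r, g, b in pixels
    )
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot build a histogram from an empty region")
    return {color_bin: n / total for color_bin, n in counts.items()}


def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1 means the two normalized histograms agree."""
    return sum(min(weight, h2.get(color_bin, 0.0))
               for color_bin, weight in h1.items())


def identify(pixels, models):
    """Return the name of the stored model most similar to the region."""
    query = color_histogram(pixels)
    return max(models,
               key=lambda name: histogram_intersection(query, models[name]))


# Toy models: one participant wears red, the other blue (made-up pixels).
models = {
    "alice": color_histogram([(200, 20, 20)] * 8 + [(30, 30, 30)] * 2),
    "bob": color_histogram([(20, 20, 200)] * 8 + [(30, 30, 30)] * 2),
}
best = identify([(210, 10, 30)] * 6 + [(40, 40, 40)] * 4, models)
```

Because the histogram discards pixel positions, a partly occluded or rotated person still yields a similar distribution. A global change in lighting, however, shifts pixels into different bins, which is the illumination problem noted above.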
|
The speaker ID module must
determine which meeting participant
is speaking at any given time, independent of what is being said.
This can be seen as a text-independent closed-set speaker identification
task. We assume that both convolutional and additive noise are consistent,
except for occasional events such as a phone ringing or a door slamming.
The limited training and test sets are collected in the same noise
environment. Our experiments show that if training and testing
are done under the same noise conditions, the performance is comparable
to that achieved on clean speech.
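To make the closed-set, text-independent setting concrete, the sketch below scores an utterance against one model per enrolled participant and returns the best match. It assumes a single diagonal-covariance Gaussian per speaker over generic acoustic feature vectors; the text does not say which speaker model or features the module uses, and all names and numbers here are hypothetical.

```python
import math


def fit_diagonal_gaussian(frames):
    """Estimate a per-dimension mean and variance from feature frames."""
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    variances = [
        max(sum((f[d] - means[d]) ** 2 for f in frames) / n, 1e-6)  # floor
        for d in range(dim)
    ]
    return means, variances


def log_likelihood(frames, model):
    """Sum of per-frame log-likelihoods; frame order does not matter."""
    means, variances = model
    total = 0.0
    for frame in frames:
        for x, mean, var in zip(frame, means, variances):
            total -= 0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
    return total


def identify_speaker(frames, models):
    """Closed set: the answer is always one of the enrolled speakers."""
    return max(models, key=lambda name: log_likelihood(frames, models[name]))


# Toy 2-dimensional features, made up for the example.
speaker_models = {
    "alice": fit_diagonal_gaussian([[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0]]),
    "bob": fit_diagonal_gaussian([[5.0, 5.1], [4.9, 5.2], [5.1, 4.8]]),
}
speaker = identify_speaker([[5.0, 5.0], [4.8, 5.1]], speaker_models)
```

Summing frame scores without regard to their order is what makes the decision independent of the words spoken. Training and testing in the same noise environment means the models absorb the consistent channel and background noise instead of being degraded by it.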
|