|
|
Prior to computing the cross-correlations, the signals have to be windowed. Since speech signals are approximately stationary over a time period of around 20 ms, a significantly longer interval must be considered to avoid periodic secondary maxima in the correlations. A rectangular window with a length of 250 ms was found to be suitable.

Theoretically, as few as three microphones, and thus two delays, would be sufficient to exactly determine a talker's position in the above-mentioned half plane. Since we have a much larger number at our disposal, we use the redundant information to suppress the effects of noise by using all independent pairs of delays for localization on the same sound data. The median of the resulting series of localizations is used as the final value. This algorithm is repeated every 250 ms, which makes it possible to track a speaker moving in a natural way.

Beamforming

A one-dimensional microphone array consisting of 8 sensors is used. This allows signal recording in the half plane in front of the array, i.e. the difference in height of talkers is neglected. In order to steer the array towards a given spot, the differences in sound arrival time between the microphones are compensated for waves originating exactly from this location. Summing these aligned signals enhances the desired signal, while sound coming from other locations is not in phase and is therefore attenuated. This procedure is well known as delay-and-sum beamforming.

The coordinates of a point of interest are provided by the mechanisms described later. The characteristic delays for such a point are determined mathematically, assuming spherical wavefronts for the speech signal. (These delays are unique for a point in the area of interest, so it might be more appropriate to speak of spotforming.)

The microphone signals are sampled simultaneously at a rate of 16 kHz, and the compensation of the delays has to be carried out digitally. Hence the delays would have to be quantized in multiples of 1/16 ms, which causes too large a quantization error for delay estimation. This problem is solved by upsampling the input signals by a factor of eight using a computationally efficient technique.
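
The steering step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system's actual implementation: the microphone spacing of 5 cm, the speed of sound of 343 m/s, the example source position, and all function names are hypothetical. The interpolation filter used for the eightfold upsampling is not specified in the text, so the sketch only compares the delay quantization error at 16 kHz and at 128 kHz, and expects `delay_and_sum` to be given signals that are already at the rate used for the sample delays.

```python
import math

SPEED_OF_SOUND = 343.0   # m/s; assumed value for air at room temperature
FS = 16_000              # Hz; sampling rate stated in the text
UPSAMPLE = 8             # upsampling factor stated in the text


def mic_positions(n_mics=8, spacing=0.05):
    """x-coordinates (m) of a linear array centred on the origin.
    The 5 cm spacing is a placeholder, not a value from the text."""
    offset = (n_mics - 1) * spacing / 2
    return [i * spacing - offset for i in range(n_mics)]


def steering_delays(source_xy, mics_x):
    """Arrival delays (s) relative to the nearest microphone,
    for a spherical wave emitted at source_xy in the half plane y > 0."""
    sx, sy = source_xy
    dists = [math.hypot(sx - mx, sy) for mx in mics_x]
    nearest = min(dists)
    return [(d - nearest) / SPEED_OF_SOUND for d in dists]


def quantize(delays, rate):
    """Round each delay to a whole number of samples at the given rate."""
    return [round(d * rate) for d in delays]


def quantization_error(delays, rate):
    """Largest absolute rounding error (s) when delays are quantized at rate."""
    return max(abs(q / rate - d) for q, d in zip(quantize(delays, rate), delays))


def delay_and_sum(signals, sample_delays):
    """Advance each channel by its delay in samples and average the result."""
    length = min(len(s) - d for s, d in zip(signals, sample_delays))
    return [
        sum(s[n + d] for s, d in zip(signals, sample_delays)) / len(signals)
        for n in range(length)
    ]


mics = mic_positions()
delays = steering_delays((0.7, 1.5), mics)        # hypothetical talker position (m)
print(quantization_error(delays, FS))             # bounded by 1/(2*16000) s
print(quantization_error(delays, FS * UPSAMPLE))  # bounded by 1/(2*128000) s
```

Rounding to the nearest sample bounds the delay error by half a sample period, so moving from 16 kHz to 128 kHz tightens that bound by a factor of eight, from about 31 µs to about 4 µs.
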
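
Returning to the localization step described earlier (windowed cross-correlation, then a median over redundant estimates), a minimal sketch might look as follows. The function names are hypothetical, the cross-correlation is a plain time-domain search over a limited lag range, and taking the median separately per coordinate is an assumption, since the text does not say how the median of a series of positions is formed. The geometric step that turns a pair of delays into a position is omitted.

```python
import statistics

FS = 16_000                 # Hz; sampling rate stated in the text
WINDOW = int(0.250 * FS)    # 250 ms rectangular window = 4000 samples


def estimate_delay(x, y, max_lag):
    """Return the lag in samples that maximises the cross-correlation of
    x and y over [-max_lag, max_lag]; a positive lag means y lags behind x.
    The inputs are assumed to be already cut to one rectangular window."""
    n = min(len(x), len(y))
    best_lag, best_val = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        lo, hi = max(0, -lag), min(n, n - lag)
        val = sum(x[i] * y[i + lag] for i in range(lo, hi))
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag


def median_position(estimates):
    """Combine redundant (x, y) position estimates by a per-coordinate median
    (an assumed interpretation of 'median of the localizations')."""
    xs, ys = zip(*estimates)
    return statistics.median(xs), statistics.median(ys)


# Toy example: y is x delayed by 4 samples.
x = [0.0] * 50
x[20], x[23] = 1.0, -0.5
y = [0.0] * 4 + x[:-4]
print(estimate_delay(x, y, max_lag=10))
# One outlier among three position estimates has no effect on the median.
print(median_position([(1.0, 2.0), (1.1, 2.1), (5.0, -3.0)]))
```

The median is a natural choice here because a single grossly wrong localization, for example from a pair whose correlation peak was corrupted by noise, does not shift the result the way it would shift a mean.
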
Use with face tracker

With speech recognition systems constantly improving in performance, freedom from headsets and push-buttons to activate the recognizer is one of the most important issues in achieving user acceptance. Microphone arrays and beamforming can now deliver signals in which undesired jamming signals are suppressed, but they rely on knowledge of where the desired signal is in space, which is usually derived by identifying the loudest signal source. Knowing who is speaking to whom and where, and extracting this information, should however not depend on loudness but on the communication purpose. In this paper, we present acoustic and visual modules that use tracking of the face of a speaker of interest for sound source localization and beamforming for signal extraction. It is shown that a more accurate localization in space can be delivered visually than acoustically. Given a reliable fix, beamforming substantially improves recognition accuracy.
|
|