Our work targets a realistic
meeting scenario and covers the corresponding speech recognition problems,
the analysis of retrieval performance and the addition of non-keyword-based
features, the generation of readable summaries, and a practical
user interface.
The participants managed to show that keyword-based retrieval can
often be done successfully even when the speech recognizer has a
significant word error rate.
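As an illustration of the word error rate mentioned above, the following is a minimal sketch that computes it as the word-level edit distance (substitutions, deletions and insertions) divided by the number of reference words. The function name and the example sentences are hypothetical and are not taken from our system or data.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: two of seven reference words are misrecognized,
# yet the keyword "budget" is still present in the hypothesis.
reference = "the budget meeting is moved to friday"
hypothesis = "a budget meeting moved to friday"
wer = word_error_rate(reference, hypothesis)
keyword_found = "budget" in hypothesis.split()
```

The example shows why keyword retrieval can tolerate recognition errors: an error only hurts retrieval when it falls on the keyword itself.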
As already identified in previous work, meeting recognition is a
very challenging large-vocabulary continuous speech recognition (LVCSR)
task, comparable to Hub5 (Switchboard) and Hub4 (Broadcast
News). The difficulty has three main causes:
First, the conversational style: meetings consist of uninterrupted
continuous recordings with multiple speakers talking in a conversational
style. Second, the lack of training data: meeting data is highly
specialized depending on the topic and participants, so large
databases cannot be provided on demand.
As a consequence, our research has focused on the question of how
to build LVCSR systems for new tasks and languages using limited amounts
of training data.
Third, the degraded recording conditions: to minimize interference,
a clip-on lapel microphone was chosen instead of a close-talking
headset. However, this comes at the cost of significant channel
cross-talk. Quite often one can hear multiple speakers on a single
channel.
Acoustic and Language Model Adaptation
To build a speech recognition engine for the meeting
task, we combined a limited set of meeting data with English speech
and text data from various sources, namely Wall Street Journal (WSJ),
English Spontaneous Scheduling Task (ESST), Broadcast News (BN),
Crossfire and Newshour TV news shows.
The meeting data consists of
a number of internal group meeting recordings (about one hour long
each), of which 14 are used for experiments in this paper. A subset
of three meetings is chosen as the test set.
To achieve robust performance over a range of different tasks, we
trained our baseline system on Broadcast News (BN) using JRTk. The
system deploys a pentphone model with 6000 distributions sharing
2000 codebooks. There are about 105k Gaussians in the system. Vocal
Tract Length Normalization and cluster-based Cepstral Mean Normalization
are used to compensate for speaker and channel variations. Linear
Discriminant Analysis is applied to reduce feature dimensionality
to 42, followed by a diagonalization transform (Maximum Likelihood
Linear Transform).
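To make the normalization step more concrete, the following is a minimal sketch of cepstral mean normalization applied per cluster. It assumes that "cluster-based" means the mean is estimated over all frames assigned to one cluster (for example, all utterances grouped to one speaker or channel); the function names and the toy feature values are hypothetical and do not reflect the JRTk implementation.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-dimension mean computed over the given frames."""
    if not frames:
        return []
    dim = len(frames[0])
    count = len(frames)
    means = [sum(frame[d] for frame in frames) / count for d in range(dim)]
    return [[frame[d] - means[d] for d in range(dim)] for frame in frames]


def cluster_cmn(frames_by_cluster):
    """Normalize each cluster with its own mean (assumed clustering)."""
    return {
        cluster: cepstral_mean_normalize(frames)
        for cluster, frames in frames_by_cluster.items()
    }


# Hypothetical two-dimensional cepstral frames for two clusters.
features = {
    "cluster_a": [[1.0, 2.0], [3.0, 6.0]],
    "cluster_b": [[10.0, 0.0], [14.0, 4.0]],
}
normalized = cluster_cmn(features)
```

After normalization every cluster has zero mean in each dimension, which removes a constant channel offset in the cepstral domain.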
A 40k vocabulary and a trigram language model are used. The baseline
language model is trained on the Broadcast News (BN) corpus. The
error rates on the meeting data are quite high, but acoustic
and language model adaptation reduces them.
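One common form of language model adaptation is linear interpolation of a background trigram model with a model estimated on in-domain text. The sketch below illustrates that idea only; it is an assumption for illustration and not necessarily the adaptation method used in our system. The models are unsmoothed maximum-likelihood estimates, and the sentences, function names and interpolation weight are hypothetical.

```python
from collections import Counter


def train_trigram(sentences):
    """Collect trigram and context (bigram) counts from sentences."""
    trigrams, contexts = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        for i in range(2, len(words)):
            trigrams[(words[i - 2], words[i - 1], words[i])] += 1
            contexts[(words[i - 2], words[i - 1])] += 1
    return trigrams, contexts


def trigram_prob(model, w1, w2, w3):
    """Unsmoothed maximum-likelihood estimate of P(w3 | w1, w2)."""
    trigrams, contexts = model
    context_count = contexts[(w1, w2)]
    if context_count == 0:
        return 0.0
    return trigrams[(w1, w2, w3)] / context_count


def interpolated_prob(background, in_domain, w1, w2, w3, weight=0.5):
    """Mix both models; weight is the share given to in-domain data."""
    return (weight * trigram_prob(in_domain, w1, w2, w3)
            + (1.0 - weight) * trigram_prob(background, w1, w2, w3))


# Hypothetical one-sentence corpora standing in for BN and meeting text.
background_model = train_trigram(["the market fell today"])
meeting_model = train_trigram(["the market is open"])
p_is = interpolated_prob(background_model, meeting_model,
                         "the", "market", "is", weight=0.25)
```

In practice the interpolation weight would be tuned on held-out meeting data, and both component models would be smoothed.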
