Interactive Systems Lab

Our work targets a realistic meeting scenario: the corresponding speech recognition problems, the analysis of retrieval performance and the addition of non-keyword-based features, the generation of readable summaries, and a practical user interface.
The participants showed that keyword-based retrieval can often succeed even when the speech recognizer has a significant word error rate.
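To make the relationship between word error rate and keyword retrieval concrete, the following is a minimal sketch in Python. It computes the standard word error rate (word-level edit distance divided by reference length) and checks whether a keyword survives in an errorful transcript. The function names and the example sentences are illustrative assumptions, not part of the system described here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def keyword_hit(keyword: str, transcript: str) -> bool:
    """True if the keyword appears as a whole word in the transcript."""
    return keyword in transcript.split()


# Hypothetical example: two of seven words are misrecognized,
# yet the keyword "language" is still retrievable.
reference = "we should adapt the language model tomorrow"
hypothesis = "we should adopt a language model tomorrow"
wer = word_error_rate(reference, hypothesis)
found = keyword_hit("language", hypothesis)
```

In this toy case the transcript has a noticeable error rate, but the errors fall on words other than the keyword, which is the situation in which keyword-based retrieval still works.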

As previous work has already identified, meeting recognition is a very challenging LVCSR task, comparable to Hub5 (Switchboard) and Hub4 (Broadcast News). The difficulty has three main causes. First, the conversational style: meetings consist of uninterrupted continuous recordings with multiple speakers talking in a conversational style. Second, the lack of training data: meeting data is highly specialized, depending on the topic and participants, so large databases cannot be provided on demand.
As a consequence, our research has focused on the question of how to build LVCSR systems for new tasks and languages using limited amounts of training data.
Third, the degraded recording conditions: to minimize interference, a clip-on lapel microphone was chosen instead of a close-talking headset. However, this comes at the cost of significant channel cross-talk: quite often, multiple speakers can be heard on a single channel.
Acoustic and Language Model Adaptation
To build a speech recognition engine for the meeting task, we combined a limited set of meeting data with English speech and text data from various sources, namely Wall Street Journal (WSJ), English Spontaneous Scheduling Task (ESST), Broadcast News (BN), and the Crossfire and Newshour TV news shows.

The meeting data consists of a number of internal group meeting recordings (about one hour long each), of which 14 are used for experiments in this paper. A subset of three meetings is chosen as the test set.
To achieve robust performance over a range of different tasks, we trained our baseline system on Broadcast News (BN) using JRTk. The system uses a pentaphone model with 6000 distributions sharing 2000 codebooks. There are about 105k Gaussians in the system. Vocal Tract Length Normalization and cluster-based Cepstral Mean Normalization are used to compensate for speaker and channel variations. Linear Discriminant Analysis is applied to reduce the feature dimensionality to 42, followed by a diagonalization transform (Maximum Likelihood Linear Transform).
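As an illustration of the normalization step, the sketch below shows cepstral mean normalization applied per cluster in plain Python. It assumes that each cluster (for example, a group of utterances from one speaker or channel) is already given and that the mean is estimated over all frames in that cluster. The function names and the data layout are hypothetical and do not reflect the JRTk implementation.

```python
from typing import Dict, List, Sequence

Frame = List[float]


def cepstral_mean_normalize(frames: Sequence[Sequence[float]]) -> List[Frame]:
    """Subtract the per-dimension mean of the given frames from every frame."""
    if not frames:
        return []
    dim = len(frames[0])
    mean = [sum(frame[d] for frame in frames) / len(frames) for d in range(dim)]
    return [[frame[d] - mean[d] for d in range(dim)] for frame in frames]


def cluster_based_cmn(frames_by_cluster: Dict[str, List[Frame]]) -> Dict[str, List[Frame]]:
    """Apply mean normalization separately within each cluster (assumed given)."""
    return {
        cluster: cepstral_mean_normalize(frames)
        for cluster, frames in frames_by_cluster.items()
    }


# Hypothetical two-dimensional "cepstral" frames for two clusters.
features = {
    "cluster_a": [[1.0, 2.0], [3.0, 4.0]],
    "cluster_b": [[10.0, 0.0]],
}
normalized = cluster_based_cmn(features)
```

Removing a cluster-specific mean cancels constant offsets in the cepstral domain, which is why this kind of normalization helps with channel differences such as those between microphones.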
A 40k vocabulary and a trigram language model are used. The baseline language model is trained on the Broadcast News (BN) corpus. Error rates on the meeting data are quite high with this baseline, but acoustic and language model adaptation reduces them.
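The text above does not say which language model adaptation method was used. One common approach, shown here purely as an assumption, is linear interpolation between a background trigram model and a small in-domain trigram model. The sketch uses unsmoothed maximum-likelihood estimates and toy sentences; all names, data, and the interpolation weight are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Trigram = Tuple[str, str, str]


def train_trigram(sentences: List[List[str]]) -> Dict[Trigram, float]:
    """Maximum-likelihood trigram probabilities P(w3 | w1, w2), without smoothing."""
    trigram_counts: Counter = Counter()
    context_counts: Counter = Counter()
    for words in sentences:
        padded = ["<s>", "<s>"] + words + ["</s>"]
        for i in range(2, len(padded)):
            trigram_counts[(padded[i - 2], padded[i - 1], padded[i])] += 1
            context_counts[(padded[i - 2], padded[i - 1])] += 1
    return {t: c / context_counts[t[:2]] for t, c in trigram_counts.items()}


def interpolate(
    background: Dict[Trigram, float],
    in_domain: Dict[Trigram, float],
    trigram: Trigram,
    weight: float,
) -> float:
    """Linear interpolation: weight * P_in_domain + (1 - weight) * P_background."""
    return weight * in_domain.get(trigram, 0.0) + (1.0 - weight) * background.get(trigram, 0.0)


# Toy stand-ins for a large background corpus and a small meeting corpus.
background_lm = train_trigram([["the", "news", "tonight"], ["the", "news", "today"]])
meeting_lm = train_trigram([["the", "news", "corpus"]])

# A trigram unseen in the background data receives probability mass after adaptation.
p_adapted = interpolate(background_lm, meeting_lm, ("the", "news", "corpus"), weight=0.5)
```

In practice the interpolation weight would be tuned on held-out meeting data, and both component models would be smoothed so that unseen trigrams do not receive zero probability.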
