Automatic Clustering of Words

(bsuhm@cs.cmu.edu, ries@cs.cmu.edu)
We applied a data-driven method [Kneser,Ney 93] which naturally finds classes of words optimized for perplexity reduction. On our Scheduling databases (English, German, Spanish), linearly interpolating cluster bigram models with word bigram models reduces perplexity compared to word bigrams alone. Across languages, some of the clusters correspond to semantic or syntactic categories such as days of the week, months, numbers, or frequent expressions typical for our domain, such as being busy/available or expressing consent/disagreement.
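The interpolation described above can be sketched as follows. This is a minimal illustration, not our actual implementation: the toy corpus, the hand-written word classes, the add-alpha smoothing, and the fixed interpolation weight `lam` are all assumptions made for the example, and a closed vocabulary is assumed (every scored word was seen in training).

```python
import math
from collections import Counter

def bigram_counts(sentences, mapping=None):
    """Count histories and bigrams; optionally map each token to its class first."""
    hist, bi = Counter(), Counter()
    for sent in sentences:
        toks = ["<s>"] + list(sent)
        if mapping is not None:
            toks = [mapping.get(t, t) for t in toks]
        for prev, cur in zip(toks, toks[1:]):
            hist[prev] += 1
            bi[(prev, cur)] += 1
    return hist, bi

def smoothed(prev, cur, hist, bi, n_outcomes, alpha=0.1):
    """Add-alpha smoothed conditional probability (an assumed smoothing scheme)."""
    return (bi[(prev, cur)] + alpha) / (hist[prev] + alpha * n_outcomes)

def make_models(sentences, word2class, alpha=0.1):
    """Return a word bigram model, a class bigram model, and the vocabulary."""
    vocab = sorted({w for s in sentences for w in s})
    classes = sorted({word2class.get(w, w) for w in vocab})
    w_hist, w_bi = bigram_counts(sentences)
    c_hist, c_bi = bigram_counts(sentences, word2class)
    w_uni = Counter(w for s in sentences for w in s)
    c_uni = Counter(word2class.get(w, w) for s in sentences for w in s)
    class_size = Counter(word2class.get(w, w) for w in vocab)

    def p_word(prev, cur):
        return smoothed(prev, cur, w_hist, w_bi, len(vocab), alpha)

    def p_class(prev, cur):
        # p(cur | prev) = p(class(cur) | class(prev)) * p(cur | class(cur))
        c_prev, c_cur = word2class.get(prev, prev), word2class.get(cur, cur)
        p_transition = smoothed(c_prev, c_cur, c_hist, c_bi, len(classes), alpha)
        p_member = (w_uni[cur] + alpha) / (c_uni[c_cur] + alpha * class_size[c_cur])
        return p_transition * p_member

    return p_word, p_class, vocab

def interpolate(p_word, p_class, lam):
    """Linear interpolation: lam * word model + (1 - lam) * class model."""
    return lambda prev, cur: lam * p_word(prev, cur) + (1.0 - lam) * p_class(prev, cur)

def perplexity(sentences, prob):
    log_sum, n = 0.0, 0
    for sent in sentences:
        toks = ["<s>"] + list(sent)
        for prev, cur in zip(toks, toks[1:]):
            log_sum += math.log(prob(prev, cur))
            n += 1
    return math.exp(-log_sum / n)

# Toy data and hand-written classes, purely for illustration.
train = [["free", "on", "monday"], ["busy", "on", "tuesday"], ["free", "on", "friday"]]
held_out = [["busy", "on", "friday"]]
word2class = {"monday": "DAY", "tuesday": "DAY", "friday": "DAY",
              "free": "STATE", "busy": "STATE"}

p_word, p_class, vocab = make_models(train, word2class)
for lam in (1.0, 0.5):
    print(lam, perplexity(held_out, interpolate(p_word, p_class, lam)))
```

In practice the interpolation weight would be tuned on held-out data rather than fixed by hand.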

A recent reimplementation makes this method applicable to larger corpora such as Switchboard and selects the number of classes automatically.
Because it is faster, we could run more iterations of the model, which in turn improved the quality of the clusters. We also tried a second data-driven approach [Ries,Buo,Wang95] that calculates an explicit context vector of classes and uses the variance criterion to cluster words.
This approach proved less effective than [Kneser,Ney 93] in terms of perplexity reduction, but the freedom to define different contexts may help in other applications.

Word Phrase Language Models

(bsuhm@cs.cmu.edu, ries@cs.cmu.edu)
To take advantage of frequently occurring sequences of words, for instance "I'm out of town", we built language models which bundle sequences of words into phrases. The phrases are determined automatically to minimize perplexity on independent data, similar to the clustering approach mentioned above. Word phrase bigram models typically achieve perplexities and word error rates comparable to word trigram models trained on the same data.
A recent reimplementation generates a word phrase trigram model that achieves better results than the standard trigram model. In addition, common phrases and idioms of a given task are identified automatically.
We also applied other sequence-finding techniques such as Suhotin's criterion, multigram models, and mutual information parsing.
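As a rough illustration of how word sequences can be bundled into phrases, the sketch below greedily merges the adjacent word pair with the highest count-weighted pointwise mutual information. This scoring rule is a simple stand-in chosen for the example; it is not the perplexity-based criterion described above, and the function names, the `min_count` threshold, and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def best_pair(sentences, min_count=2):
    """Return the adjacent pair with the highest count-weighted PMI, or None."""
    uni = Counter(t for s in sentences for t in s)
    bi = Counter(p for s in sentences for p in zip(s, s[1:]))
    total = sum(uni.values())
    best, best_score = None, 0.0
    for (a, b), n in bi.items():
        if n < min_count:
            continue
        score = n * math.log(n * total / (uni[a] * uni[b]))
        if score > best_score:
            best, best_score = (a, b), score
    return best

def merge_pair(sentence, pair):
    """Replace each occurrence of the pair by a single phrase token."""
    out, i = [], 0
    while i < len(sentence):
        if i + 1 < len(sentence) and (sentence[i], sentence[i + 1]) == pair:
            out.append(sentence[i] + "_" + sentence[i + 1])
            i += 2
        else:
            out.append(sentence[i])
            i += 1
    return out

def find_phrases(sentences, n_merges=3):
    """Run a few greedy merge steps; return the rewritten corpus and the phrases."""
    phrases = []
    for _ in range(n_merges):
        pair = best_pair(sentences)
        if pair is None:
            break
        phrases.append("_".join(pair))
        sentences = [merge_pair(s, pair) for s in sentences]
    return sentences, phrases

corpus = [
    ["i'm", "out", "of", "town", "on", "monday"],
    ["she", "is", "out", "of", "town"],
    ["i'm", "free", "on", "monday"],
]
merged, phrases = find_phrases(corpus)
print(phrases)
print(merged)
```

A word bigram model trained on the merged corpus then sees longer spans of the original text, which is the intuition behind phrase bigrams approaching word trigram performance.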

Structural Model

(ries@cs.cmu.edu)
Since the estimation of a phrase language model is hard, we use a combination of word clustering and phrase modeling, which we call a structural model.
The clustering of words allows us to find more general sequences. If we have seen a sequence like "free on Wednesday", we also want to be able to recognize the phrase "busy on Thursday". The resulting corpus of sequences of word classes may in turn be treated as the input corpus to a new iteration of classification and sequence finding, and so forth.
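The generalization step can be illustrated with a small sketch. The word classes here are written by hand and the corpus is a toy example; both are assumptions for illustration, whereas in the structural model the classes come from automatic clustering.

```python
from collections import Counter

# Hypothetical word classes; in practice they would be found automatically.
WORD2CLASS = {
    "free": "STATE", "busy": "STATE",
    "wednesday": "DAY", "thursday": "DAY",
}

def to_classes(sentence, word2class):
    """Replace each word by its class; unclustered words stand for themselves."""
    return [word2class.get(w, w) for w in sentence]

def ngram_counts(sentences, n):
    counts = Counter()
    for s in sentences:
        for i in range(len(s) - n + 1):
            counts[tuple(s[i:i + n])] += 1
    return counts

corpus = [
    ["i", "am", "free", "on", "wednesday"],
    ["she", "is", "busy", "on", "thursday"],
]
class_corpus = [to_classes(s, WORD2CLASS) for s in corpus]

word_trigrams = ngram_counts(corpus, 3)
class_trigrams = ngram_counts(class_corpus, 3)

# At the word level the two sequences share no trigram; at the class level
# both contribute to the same sequence (STATE, on, DAY).
print(word_trigrams[("busy", "on", "wednesday")])
print(class_trigrams[("STATE", "on", "DAY")])
```

The class-level corpus `class_corpus` is what a further round of sequence finding and clustering would take as input.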
We achieved word accuracy improvements by rescoring n-best lists of our old bigram ESST system. Current work also shows a perplexity reduction compared to standard trigram models, and our first results on the Spanish Scheduling domain show that a trigram phrase model based on word classes reduces perplexity relative to our currently best trigram phrase model.

Fast Training of Maximum Entropy Language Models

(bsuhm@cs.cmu.edu)
We developed a fast training method for maximum entropy language models. The method applies cluster expansion to the computationally very expensive tasks of computing the partition function and determining the iterative scaling equations.
We applied this method to create a language model that adapts to the current topic, for improved language modeling of the Switchboard corpus. In this application, the proposed training method achieved a speed-up on the order of a factor of 200, and the topic-dependent language model reduced perplexity by 10% compared to a standard word bigram model.
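To see why the partition function is the expensive part, consider the following naive sketch of a conditional maximum entropy model. It does not implement the cluster expansion method; it only shows the brute-force computation that such a method is meant to speed up. The feature definitions (one bigram feature, one topic feature), the weights, and all names are hypothetical.

```python
import math

def features(history, word, topic):
    """Hypothetical binary features: one bigram feature and one topic feature."""
    return [("bigram", history, word), ("topic", topic, word)]

def score(history, word, topic, weights):
    """Sum of the weights of all active features (absent features weigh 0)."""
    return sum(weights.get(f, 0.0) for f in features(history, word, topic))

def partition(history, topic, vocab, weights):
    # Naive normalization: one pass over the whole vocabulary per context.
    # Iterative scaling repeats this for every training context in every
    # iteration, which is what makes training slow.
    return sum(math.exp(score(history, w, topic, weights)) for w in vocab)

def maxent_prob(history, word, topic, vocab, weights):
    z = partition(history, topic, vocab, weights)
    return math.exp(score(history, word, topic, weights)) / z

vocab = ["game", "song", "the"]
weights = {("topic", "sports", "game"): 1.5, ("bigram", "the", "game"): 0.5}

print(maxent_prob("the", "game", "sports", vocab, weights))
print(maxent_prob("the", "game", "music", vocab, weights))
```

With a realistic vocabulary and many distinct contexts, the sum in `partition` dominates the training cost.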

Morphological Language Models

(pgeutner@ira.uka.de)
Especially in languages with a large number of inflections and compound words (such as German, Spanish, Korean, and Japanese), vocabulary growth is immense when unrestricted speech recognition is desired. To limit this growth, base units other than whole words have been used both as recognition units and as base units for language model training.
Different decomposition methods based on linguistic knowledge have been applied to our German database, resulting in morpheme-based units. As a result, vocabulary growth can be limited, the rate of out-of-vocabulary words decreases, and overall recognition performance improves slightly.
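The effect of decomposition on vocabulary size can be illustrated with a deliberately crude sketch. The suffix list, the minimum stem length, and the word list below are hypothetical; the decomposition methods actually applied rely on linguistic knowledge rather than plain suffix stripping.

```python
# Hypothetical suffix inventory, ordered longest first.
SUFFIXES = ["en", "e", "s"]

def decompose(word, suffixes=SUFFIXES, min_stem=4):
    """Split a word into [stem, "+suffix"] if a listed suffix can be stripped."""
    for suf in suffixes:
        stem = word[: len(word) - len(suf)]
        if word.endswith(suf) and len(stem) >= min_stem:
            return [stem, "+" + suf]
    return [word]

words = [
    "termin", "termine", "terminen",
    "montag", "montags",
    "dienstag", "dienstags",
    "freitag", "freitags",
]

word_vocab = set(words)
unit_vocab = {unit for w in words for unit in decompose(w)}

print(len(word_vocab), sorted(word_vocab))
print(len(unit_vocab), sorted(unit_vocab))
```

Because inflected forms share stems and suffixes, the unit inventory grows more slowly than the word inventory as new forms are encountered.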
KEYWORDS language models, class models, automatic clustering, perplexity reduction, semantic/syntactic categories, word phrase models, interpolation, structural model, maximum entropy, morphology, vocabulary growth, new words, OOV words, inflections, compound words

page designed by: Céline Morel

 




