Speaker: David Reitter
Title: Aspects of a situationalized multimodal human-computer interface
Abstract: It is a common perception that some meetings are more effective than others. Meetings that involve the physical presence of participants allow them to rely on multiple communication channels (multimodality), among them natural language, eye gaze and body posture. When channels are missing, as in a conference call, communicative elements such as topic tracking (coherence) and turn-taking become harder to manage. The same holds for user interfaces. When users are restricted to a single channel that is limited in bandwidth, noisy, or otherwise error-prone, they encounter difficulties, for example when using a small-screen computer interface or a voice-based dialogue system over the phone. Humans usually integrate multimodal information without effort, which leads us to ask: can multimodality improve language-based interaction on bottleneck devices? The UI on the Fly project explores ways to do this. Today's user interfaces already offer different input and output methods (mouse and keyboard, screen and speakers), but our interfaces go further. They coordinate both input and output across channels, so that the channels can be used in parallel, and they convey complementary information as well as redundant information. For example, they can augment a graphical user interface (GUI) with helpful audio commentary. In mobile situations, such as driving, screen-based output may be simplified or eliminated entirely in response to the specific use situation. Similarly, the system can adapt to the needs of hard-of-hearing or visually impaired users. We address the adaptivity of the user interface with a dynamic generator.
Multimodal Functional Unification Grammar (MUG) is a unification-based formalism (based on FUF, Elhadad & Robin 1992) that provides the means to dynamically generate content coordinated across several communication modes. These currently include natural language, both on the screen and by voice, as well as GUI elements. The interface can adapt the content presented in each mode to the user's preferences and usage situation. The generation process satisfies the hard constraints defined by the dialogue act (the input) and the grammar, and finds the optimal (or a near-optimal) solution according to an objective function. This heuristic defines the trade-off between the predicted cognitive complexity of the output and its utility, following classical communication principles (cf. Grice's maxims). In this way, the system can select among the several possible output forms generated by the grammar. The input to the formalism is a compositional semantic representation, which suggests that it is suited to principle-based dialogue management rather than to hard-coded finite-state dialogue models. MUG aims to establish cross-modal coherence of output: we postulate that a user interface should be consistent, but not entirely redundant, in its simultaneous, mode-specific outputs. This means coordinating output on the lexical level and, to some extent, on the syntactic level. Dialogue acts should be coordinated, too. Emphasized screen objects (similar to pointing gestures in input) and deictic expressions both influence and depend on the salience of objects. We adopt Centering theory (Grosz, Joshi & Weinstein 1995) as a framework to model coherence in the generation of referring expressions. The concept is embodied in the MUG System, a new development environment that makes it easier to develop, debug and study unification grammars for generation.
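The generation step can be pictured with a minimal Python sketch. It is not the MUG implementation. Feature structures are plain nested dicts. Unification merges two structures and fails on conflicting atomic values, which plays the role of the hard constraints. Each output mode contributes a disjunction of alternatives, and the surviving combinations are ranked by a hypothetical objective: total utility minus a weight times total predicted complexity. Everything concrete here is invented for illustration: the grammar, the utility and complexity numbers, the weight of 0.5, and the names unify, generate, GRAMMAR and confirm_act. Real FUF/MUG grammars also involve paths, control annotations and far richer constraints, and this sketch simply enumerates every combination rather than using any particular search strategy.

```python
from itertools import product


def unify(a, b):
    """Unify two feature structures (nested dicts with atomic leaves).

    Returns a new merged structure, or None if the two conflict.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for key, value in b.items():
            if key in result:
                merged = unify(result[key], value)
                if merged is None:
                    return None  # conflicting values under the same feature
                result[key] = merged
            else:
                result[key] = value
        return result
    # Atomic values (or atom vs. dict) unify only if they are identical.
    return a if a == b else None


# Hypothetical grammar: one list of alternatives per output mode.
# Each alternative is (feature structure, utility, predicted complexity).
GRAMMAR = [
    [  # screen mode
        ({"screen": {"widget": "dialog"}, "situation": {"driving": "no"}}, 3, 2),
        ({"screen": {"widget": "none"}}, 0, 0),
    ],
    [  # voice mode
        ({"voice": {"form": "full-sentence"}}, 3, 4),
        ({"voice": {"form": "short"}}, 2, 1),
        ({"voice": {"form": "silent"}}, 0, 0),
    ],
]


def confirm_act(driving):
    """Hypothetical dialogue act: confirm a city, tagged with the use situation."""
    return {"act": "confirm",
            "object": {"type": "city", "name": "Boston"},
            "situation": {"driving": driving}}


def generate(act, grammar, weight=0.5):
    """Pick the cross-mode combination that unifies with the act and maximizes
    utility - weight * complexity (an assumed objective, for illustration)."""
    best, best_score = None, float("-inf")
    for combo in product(*grammar):
        fs, utility, complexity = act, 0, 0
        for features, u, c in combo:
            fs = unify(fs, features)
            if fs is None:
                break  # a hard constraint is violated
            utility += u
            complexity += c
        if fs is None:
            continue
        score = utility - weight * complexity
        if score > best_score:
            best, best_score = fs, score
    return best, best_score


if __name__ == "__main__":
    for driving in ("yes", "no"):
        fs, score = generate(confirm_act(driving), GRAMMAR)
        print(driving, fs["screen"], fs["voice"], score)
```

In the driving situation, unification rules out the on-screen dialog, and the objective then prefers the short voice prompt (score 1.5). When the user is not driving, the dialog combined with the short voice prompt wins (score 3.5). Separating hard constraints (unification) from soft preferences (the score) mirrors the division of labor described in the abstract.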