International Workshop on Spoken Language Translation

   
 

Online Evaluation Servers

Evaluation Server for development sets

This online evaluation server allows you to submit hypotheses for the development sets of each translation direction. The server calculates the relevant scores and saves your results to a database for later review.

You will need a general username and password to access the login page of the server. There you can register your own username and password to log in to the actual evaluation server.

The server accepts one hypothesis file per translation direction and development set. English hypotheses are automatically lowercased and their punctuation marks are removed before the scores are calculated (compare Evaluation Specification).
There is no preprocessing for Chinese hypotheses yet.


Example reference files/Automatic preprocessing script

The English SGML reference files for development set 1 (CSTAR 03) and development set 2 (IWSLT 04) are available here (for the tracks Chinese → English, Arabic → English and Korean → English) and here for Japanese → English.
These reference files are used by the online evaluation server. (Please use the same username/password as for the evaluation server.)

The automatic preprocessing applies the following Perl substitutions to each line and then lowercases it:

s/\.//g;
s/\?//g;
s/\!//g;
s/,/ /g;
s/:/ /g;
s/;/ /g;
s/-/ /g;
s/"/ /g;
s/\|/ /g;
print lc $_;     
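
To check locally what the server will score, the substitutions above can be mirrored in a few lines of Python. This is only a sketch: the function name preprocess_line is hypothetical, and it assumes that Python's str.lower() behaves like Perl's lc on the input, which holds for ASCII text. As in the script, ". ? !" are deleted, the remaining marks become spaces, and repeated spaces are not collapsed.

```python
# Characters the Perl script deletes outright, and characters it replaces
# with a single space. Apostrophes and all other characters are untouched.
_DELETED = ".?!"
_TO_SPACE = ',:;-"|'

_TABLE = {ord(ch): None for ch in _DELETED}
_TABLE.update({ord(ch): " " for ch in _TO_SPACE})


def preprocess_line(line: str) -> str:
    """Mirror the per-line substitutions, then lowercase (hypothetical helper)."""
    return line.translate(_TABLE).lower()


if __name__ == "__main__":
    print(preprocess_line("Hello, World! Is this-it?"))
```

Note that deleting periods also joins decimal numbers ("3.5" becomes "35"), exactly as the original substitutions do.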

Current limitations

  • The GTM score is not yet calculated.
  • The score calculation takes a rather long time, because several different scores are calculated at once.


>> Evaluation Server for development sets



Evaluation Server for test sets

This online evaluation server allows you to submit hypotheses for the TEST SETS of each translation direction. The server calculates the relevant scores and saves your results, but will not show them before August 22nd.

The server accepts one hypothesis file per translation direction and test set. Please select the appropriate data condition.
You can also select whether a submission should be considered for the evaluation campaign. Only one Main/Final submission is allowed per track and data condition; you will be able to delete unwanted submissions.

Submission of English hypothesis

  • The standard evaluation for English output is lowercased with punctuation marks removed (automatically preprocessed).
  • If your hypothesis contains mixed-case characters, an optional evaluation is also performed: mixed case with separated punctuation marks (automatically preprocessed).

Submission of Chinese hypothesis

  • The first evaluation for Chinese is based on the ASR segmentation (as provided), without punctuation marks (automatically preprocessed).
  • The second evaluation for Chinese uses character-based segmentation, without punctuation marks (automatically preprocessed).
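
The exact rules the server uses for character-based segmentation are not given here. As a rough illustration only, the sketch below drops whitespace and every character that Unicode classifies as punctuation, then separates the remaining characters with single spaces. The function name char_segment is hypothetical, and splitting embedded Latin words or digits into single characters is an assumption rather than documented server behavior.

```python
import unicodedata


def char_segment(line: str) -> str:
    """Re-segment a line into single characters, without punctuation.

    Assumption: "punctuation" means any character whose Unicode general
    category starts with "P" (this covers both ASCII and full-width marks).
    """
    kept = [
        ch
        for ch in line
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    ]
    return " ".join(kept)


if __name__ == "__main__":
    print(char_segment("你好，世界。"))
```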

Submission of ASR output

  • Please submit translations of ASR output in the following format (each test set has 506 sentences, so a complete file has 1517 lines):
                Line    1: 1st source sentence
                Line    2: Hypothesis for 1st source sentence
                Line    3:
                Line    4: 2nd source sentence
                Line    5: Hypothesis for 2nd source sentence
                Line    6:
                Line    7: 3rd source sentence
                Line    8: Hypothesis for 3rd source sentence
                Line    9:
                ...
                ...
                Line 1513: 505th source sentence
                Line 1514: Hypothesis for 505th source sentence
                Line 1515:
                Line 1516: 506th source sentence
                Line 1517: Hypothesis for 506th source sentence
                
    The source sentence is the sentence that was actually translated by your system, i.e. the path in the ASR lattice or the hypothesis from the n-best list (or any recombination). This makes it possible to compare the quality of the ASR output that was actually translated with the quality of the translation. Please leave lines 3, 6, 9, ..., 1515 blank.
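
A submission file in this layout can be assembled and sanity-checked before upload. The following is a minimal sketch; the helper names build_asr_submission and check_asr_submission are hypothetical, and it assumes each source sentence and each hypothesis fits on a single line. For n sentences the file has 3n - 1 lines, which gives 1517 lines for 506 sentences.

```python
def build_asr_submission(pairs):
    """Turn (source, hypothesis) pairs into the list of file lines.

    Each pair contributes a source line and a hypothesis line; pairs are
    separated by one blank line, with no blank line after the last pair.
    """
    lines = []
    for index, (source, hypothesis) in enumerate(pairs):
        if "\n" in source or "\n" in hypothesis:
            raise ValueError(f"pair {index + 1} spans more than one line")
        if index > 0:
            lines.append("")  # separator: lines 3, 6, 9, ...
        lines.append(source)
        lines.append(hypothesis)
    return lines


def check_asr_submission(lines, n_sentences=506):
    """Return True if the line count and blank separator lines are as expected."""
    if len(lines) != 3 * n_sentences - 1:
        return False
    # Every third line (1-based numbering) must be blank.
    return all(
        not line.strip()
        for number, line in enumerate(lines, start=1)
        if number % 3 == 0
    )


if __name__ == "__main__":
    demo = build_asr_submission([(f"source {i}", f"hypothesis {i}") for i in range(1, 507)])
    print(len(demo), check_asr_submission(demo))
```

Writing the result with "\n".join(lines) yields a file whose last line is the hypothesis for the 506th sentence.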

Scores

  • The GTM score is calculated using only one reference (the 1st reference for English).
  • The BLEU score is calculated according to the IBM scripts (the closest reference length is used for the length penalty).
  • The METEOR score is deficient for Chinese output.
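
To illustrate what "closest reference" means for the BLEU length penalty, the sketch below computes a corpus-level brevity penalty in which, for each segment, the reference whose length is nearest to the hypothesis length contributes to the effective reference length. It is not the IBM script itself: the function names are hypothetical, tokens are assumed to be whitespace-separated, and breaking ties in favor of the shorter reference is an assumption.

```python
import math


def closest_ref_length(hyp_len, ref_lens):
    """Pick the reference length nearest to hyp_len (ties: shorter, an assumption)."""
    return min(ref_lens, key=lambda ref_len: (abs(ref_len - hyp_len), ref_len))


def corpus_brevity_penalty(hypotheses, references):
    """Brevity penalty over a corpus.

    hypotheses: list of hypothesis strings
    references: list of lists of reference strings, one list per hypothesis
    """
    hyp_total = 0
    ref_total = 0
    for hyp, refs in zip(hypotheses, references):
        hyp_len = len(hyp.split())
        hyp_total += hyp_len
        ref_total += closest_ref_length(hyp_len, [len(r.split()) for r in refs])
    if hyp_total == 0:
        return 0.0
    if hyp_total > ref_total:
        return 1.0
    return math.exp(1.0 - ref_total / hyp_total)


if __name__ == "__main__":
    print(corpus_brevity_penalty(["a b"], [["a b c d", "a b c"]]))
```

The penalty only lowers the score when the hypotheses are, in total, shorter than their closest references.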

Problems

If for any reason you cannot or prefer not to use the evaluation server, please send your hypotheses to Chiori Hori at chiori@cs.cmu.edu and Matthias Eck at matteck@cs.cmu.edu.

>> Evaluation Server for test sets



Good luck with your submission to the IWSLT 2005 evaluation campaign!


 
        Copyright(c) 2005 interACT All rights reserved.