Online Evaluation Servers
Evaluation Server for development sets
This online evaluation server allows you to submit your hypotheses for the development sets of each translation direction.
The server will calculate the relevant scores and save your results to a database for later review.
You will need a general username and password to access the login page of the server.
There you can register your own username and password to login to the actual evaluation server.
The server offers the submission of a hypothesis for the different translation directions and development sets.
Your English hypothesis is automatically lowercased and punctuation marks are removed before the scores are calculated (compare Evaluation Specification).
There is no preprocessing for the Chinese hypothesis yet.
Example reference files/Automatic preprocessing script
The English SGML reference files for development set 1 (CSTAR 03) and development set 2 (IWSLT 04) are available here (for the tracks Chinese → English, Arabic → English and Korean → English) and here for Japanese → English. These reference files are used by the online evaluation server. (Please use the same username/password as for the evaluation server.)
The automatic preprocessing applies the following Perl substitutions to each line:
s/\.//g;
s/\?//g;
s/\!//g;
s/,/ /g;
s/:/ /g;
s/;/ /g;
s/-/ /g;
s/"/ /g;
s/\|/ /g;
print lc $_;
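To reproduce this preprocessing locally without Perl, the substitutions above can be sketched in Python. This is an illustrative port, not the server's own code: the function name `preprocess_english` and the two constants are hypothetical, and Python's `str.lower` is assumed to behave like Perl's `lc` for plain English text.

```python
# Characters the script deletes outright, and characters it replaces with a space.
DELETE = ".?!"
TO_SPACE = ',:;-"|'

# One translation table covering both groups (hypothetical helper, not server code).
_TABLE = str.maketrans(
    {**{c: None for c in DELETE}, **{c: " " for c in TO_SPACE}}
)


def preprocess_english(line: str) -> str:
    """Apply the listed substitutions to one line, then lowercase it."""
    return line.translate(_TABLE).lower()


if __name__ == "__main__":
    print(preprocess_english("Hello, World!"))
```

As in the Perl script, runs of spaces are not collapsed, so a comma followed by a space leaves two spaces in the output.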
Limitations right now
- The GTM score is not yet calculated.
- The score calculation takes a rather long time, because several different scores are calculated at once.
Evaluation Server for test sets
This online evaluation server allows you to submit your hypotheses for the TEST SETS of each translation direction.
The server will calculate the relevant scores and save your results (but will not show the results before August 22nd).
The server offers the submission of a hypothesis for the different translation directions and test sets. Please select the appropriate data condition.
You can also select whether your submission should be considered for the evaluation campaign. Only one Main/Final submission is allowed per track and data condition; you will be able to delete unwanted submissions.
Submission of English hypothesis
- The standard evaluation for English output is lowercased with punctuation marks removed (automatically preprocessed).
- If your hypothesis contains mixed-case characters, an optional evaluation will also be performed (mixed case with separated punctuation marks, automatically preprocessed).
Submission of Chinese hypothesis
- The first evaluation for Chinese will be based on the ASR segmentation (as provided) without punctuation marks (automatically preprocessed).
- The second evaluation for Chinese will be based on character-based segmentation without punctuation marks (automatically preprocessed).
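The character-based segmentation in the second evaluation can be approximated as follows. This is only a sketch under assumptions: the server's actual script is not shown here, the function name `char_segment` is hypothetical, and "punctuation" is assumed to mean any character in a Unicode punctuation category.

```python
import unicodedata


def char_segment(text: str) -> str:
    """Split text into single characters, dropping whitespace and punctuation."""
    kept = [
        ch
        for ch in text
        # Unicode categories starting with "P" are punctuation (assumed definition).
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    ]
    return " ".join(kept)


if __name__ == "__main__":
    print(char_segment("你好，世界。"))
```

Note that this sketch also splits any Latin words or digits into single characters, which may differ from what the server does.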
Submission of ASR output
Scores
- The GTM score is calculated based on only one reference (the 1st reference for English).
- The BLEU score is calculated according to the IBM-scripts (closest reference is considered for the Length Penalty).
- The METEOR score is deficient for Chinese output.
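To illustrate what "closest reference" means for the BLEU length penalty, here is a minimal sketch of the brevity penalty for a single segment. It is not the IBM script itself: the function names are hypothetical, and breaking ties in favour of the shorter reference is an assumption. In corpus-level BLEU the hypothesis lengths and the chosen reference lengths are typically summed over all segments before the penalty is applied.

```python
import math
from typing import Sequence


def closest_ref_length(hyp_len: int, ref_lens: Sequence[int]) -> int:
    """Pick the reference length nearest to the hypothesis length."""
    # Ties go to the shorter reference (assumption, not confirmed by the source).
    return min(ref_lens, key=lambda r: (abs(r - hyp_len), r))


def brevity_penalty(hyp_len: int, ref_lens: Sequence[int]) -> float:
    """Standard BLEU brevity penalty using the closest reference length."""
    if hyp_len == 0:
        return 0.0
    r = closest_ref_length(hyp_len, ref_lens)
    return 1.0 if hyp_len > r else math.exp(1.0 - r / hyp_len)


if __name__ == "__main__":
    print(brevity_penalty(9, [12, 10]))
```

With this choice, a hypothesis is only penalized when it is shorter than the reference that is nearest to it in length, rather than shorter than the shortest or the average reference.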
Problems
If you do not wish to use the evaluation server for any reason, please send your hypotheses to Chiori Hori at chiori@cs.cmu.edu and Matthias Eck at matteck@cs.cmu.edu.
Good luck with your submission to the evaluation campaign of IWSLT2005!