Evaluation Campaign >> Data Downloads
Username/Password
You will need a username and password to download the data. Please register here.
Overview
This webpage offers the following data:
- Training data: The same for the translation of manual transcriptions and the translation of ASR lattices and n-best lists.
- Development data: Different development sets for the translation of the manual transcription and the translation of ASR lattices.
- Test data: Will be available here on August 16th for both tracks.
The data can be used to train a system that translates manual transcriptions and to train a system that translates ASR output.
The training data is the same in both cases.
The development sets are different. Depending on the task, the source-language side of the development sets consists of:
- manual transcriptions if only manual transcriptions are translated.
- n-best lists and lattices from a Speech Recognizer if ASR output is translated.
Training Data
Contents
Every data package contains two files of 20,000 lines each, one for the source language and one for the target language.
Languages other than English are segmented based on the ASR output. Please note: The punctuation marks in the English files are not separated from the words. You are free to separate them in all tracks and conditions.
The Korean data was segmented using the ATR KOMA-HanTag software. This software is not publicly available.
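As an illustration of how the training files might be prepared, the following Python sketch pairs up the lines of a source and a target file and separates punctuation marks on the English side. It rests on assumptions not stated on this page: that line i of the source file corresponds to line i of the target file, that the files are UTF-8 encoded, and that the listed punctuation characters are the ones you want to split off. The function names are hypothetical.

```python
import re

# Assumption: the set of punctuation marks to split off; adjust as needed.
# Apostrophes are deliberately left alone so that "I'd" stays one token.
_PUNCT = re.compile(r'([.,!?;:"()])')


def separate_punctuation(line):
    """Put spaces around punctuation marks and collapse repeated whitespace."""
    spaced = _PUNCT.sub(r" \1 ", line)
    return " ".join(spaced.split())


def pair_lines(source_lines, target_lines):
    """Pair source and target lines by position (assumed line alignment)."""
    source = [line.rstrip("\n") for line in source_lines]
    target = [line.rstrip("\n") for line in target_lines]
    if len(source) != len(target):
        raise ValueError(
            "line count mismatch: %d source vs %d target"
            % (len(source), len(target))
        )
    return list(zip(source, target))


def load_parallel(source_path, target_path, encoding="utf-8"):
    """Read two files and return their paired lines.

    The encoding default is an assumption; check the actual files.
    """
    with open(source_path, encoding=encoding) as src, \
            open(target_path, encoding=encoding) as tgt:
        return pair_lines(src, tgt)
```

For a training package, `len(load_parallel(...))` should be 20,000 if both files are complete, and `separate_punctuation` can then be applied to the English side of each pair.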
Data Packages
Development Sets for Translation of Manual Transcription
Contents
Every data package contains two development sets: the test sets of CSTAR 2003 and IWSLT 2004, in source and target language, as plain text.
Languages other than English are segmented based on the ASR output.
Data Packages
Development Sets for Translation of Speech Recognizer output
Contents
Every data package contains two development sets: the test sets of CSTAR 2003 and IWSLT 2004, in source and target language. The source side is provided as Speech Recognizer output (lattice and n-best list).
Languages other than English are segmented based on the ASR output. No Korean lattices or n-best lists will be provided.
Data Packages
Speech Data
Contents
Each package contains the recorded speech for the development and test sets.
Chinese Speech Data
Japanese Speech Data