Data

Releases

Version 1.0 of the CLEF-CLSR 2006 English and Czech collections, containing interviews and additional resources, will be released soon.

For more details on how to participate and get the collections, click here.

Spoken Word Collections

An English and a Czech collection will be used in the track this year.

English Collection

This year we are pleased to make available additional English speech, extending the collection beyond 750 hours. We may also be able to provide word lattices and more accurate speech recognition.

The English interviews have been manually divided into topically coherent segments, and the task is to retrieve topically relevant segments. The following resources will be provided for each segment in a format similar to that of ordinary CLEF documents:

Spoken words from one-best Automatic Speech Recognition (ASR) transcripts, with a word error rate of approximately 35%. The average length of a segment is approximately 300 words (about 3 minutes).
Thesaurus terms, automatically assigned using text classification trained on hand-labeled segments from outside the test collection.
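
To make the word error rate figure concrete, the following is a minimal sketch of how such a rate is conventionally computed: the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the ASR output, divided by the number of reference words. The function name and the example sentences are illustrative assumptions, not part of the collection or of the recognizer used to produce it.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Illustrative (made-up) example: one substitution and one deletion
# against an 8-word reference.
reference = "the interview was recorded in nineteen ninety eight"
hypothesis = "the interview is recorded nineteen ninety eight"
print(word_error_rate(reference, hypothesis))
```

At a rate of approximately 35%, a segment of about 300 words would be expected to contain on the order of 105 word errors.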

In addition, the following resources will be available for each segment for contrastive conditions:

An average of 5 manually assigned thesaurus terms, one of which names a location and time (usually with one-year granularity).
A manually written three-sentence segment summary.

For each interview, a collection of metadata that describes that interview is also offered. This includes basic biographical details (e.g., interviewee name and birthdate) and a half-page free-text summary of the interview.

Czech Collection

The Czech interviews will use a no-boundary condition where relevance assessors mark only topically relevant "playback points" during assessment. Unlike the English collection, where the task is to retrieve segments, the task in Czech is to retrieve ranked lists of candidate playback points. The collection will be formatted as closely as possible to ordinary CLEF documents, and will contain around 500 hours of speech.

Topics

25 topics, written in the usual CLEF title, description, and narrative format, will be released for the English and Czech collections. Topics will be available in Czech, English, French, German, Russian, and Spanish. The creation of other topic languages should be arranged by those sites interested in using them.

Additional Resources

In order to facilitate broad participation, the basic test collection is formatted in the same way as a typical CLEF ad-hoc test collection. The following additional resources will also be available to support system development:

A number of representative training topics, with relevance judgments for the same collections of interviews that will be used in the evaluation.
Scripts for generating alternative relevance judgments for the training topics that can be used to support detailed failure analysis (English only).
Scripts for generating richer metadata for each segment using synonymy, part-whole, and is-a thesaurus relationships. This capability can be used with the automatically assigned thesaurus categories or (for contrastive runs) with the manually assigned thesaurus categories (English only).
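
As a rough illustration of what enriching segment metadata through thesaurus relationships can look like, the sketch below expands a segment's assigned terms by following synonymy, part-whole, and is-a links until no new terms are found. It is not the distributed scripts: the thesaurus layout, the relation names (`synonym`, `part_of`, `is_a`), the function name, and the toy entries are all assumptions made for this example.

```python
from collections import deque

# Hypothetical toy thesaurus. The entries and the dictionary layout are
# illustrative only and are not taken from the actual collection.
TOY_THESAURUS = {
    "Prague": {
        "synonym": ["Praha"],
        "part_of": ["Czechoslovakia"],
        "is_a": ["cities"],
    },
    "Czechoslovakia": {
        "part_of": ["Europe"],
        "is_a": ["countries"],
    },
}


def expand_terms(terms, thesaurus, relations=("synonym", "part_of", "is_a")):
    """Return the assigned terms plus every term reachable through the
    selected relations (breadth-first, following links transitively)."""
    expanded = set(terms)
    queue = deque(terms)
    while queue:
        term = queue.popleft()
        entry = thesaurus.get(term, {})
        for relation in relations:
            for related in entry.get(relation, []):
                if related not in expanded:
                    expanded.add(related)
                    queue.append(related)
    return expanded


# Expand a single assigned term using all three relation types.
print(sorted(expand_terms(["Prague"], TOY_THESAURUS)))

# Restrict the expansion to is-a links only.
print(sorted(expand_terms(["Prague"], TOY_THESAURUS, relations=("is_a",))))
```

Making the relation types a parameter allows each kind of link to be switched on or off independently, which is one way to compare their effect in separate runs.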