
Data

Releases

Version 2.0 of the CLEF-CLSR collection, containing the evaluation topics and additional resources, was released on April 15, 2005 (View Collection README).

Version 1.0 of the CLEF-CLSR collection, containing interviews and additional resources, was released on February 15, 2005 (View Collection README).

For more details on how to participate and get the collection, click here.

Spoken Word Collection

The collection to be searched includes interviews with about 300 people, totalling about 750 hours. The interviews have been manually divided into topically coherent segments, and the task is to retrieve topically relevant segments. The following resources will be provided for each segment, in a format similar to that of ordinary CLEF documents:

  • Spoken words from one-best Automatic Speech Recognition (ASR) transcripts, with a word error rate of approximately 35%. The average length of a segment is approximately 300 words (about 3 minutes).
  • Thesaurus terms, automatically assigned using text classification trained on hand-labeled segments from outside the test collection.

In addition, the following resources will be available for each segment for contrastive conditions:

  • An average of 5 manually assigned thesaurus terms, one of which names a location and time (usually with one-year granularity).
  • A manually written three-sentence segment summary.

Topics

Twenty-five topics will be provided, written in the usual CLEF title, description, and narrative format. Topics will be available in Czech, English, French, German, Russian, and Spanish. Other topic languages can be created upon request (if translators with the needed language skills are available).
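The title/description/narrative layout mentioned above is conventionally distributed as tagged text. A minimal sketch of reading one topic, assuming generic `num`/`title`/`desc`/`narr` tags (the actual CLEF tag names and topic numbering may differ):

```python
import re

# Hypothetical topic statement; field names and the topic id are
# illustrative only, not taken from the actual CLEF-CLSR topic set.
topic = """<top>
<num>CL0001</num>
<title>Child survivors in Sweden</title>
<desc>Find accounts by child survivors who reached Sweden.</desc>
<narr>Relevant segments describe the experiences of children
who were brought to Sweden during or after the war.</narr>
</top>"""

def parse_topic(text):
    """Return the tagged fields of a single topic statement."""
    fields = {}
    for tag in ("num", "title", "desc", "narr"):
        m = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        if m:
            fields[tag] = m.group(1).strip()
    return fields

t = parse_topic(topic)
print(t["num"], "-", t["title"])
```

Systems typically build their required run from the title and description fields only, keeping the narrative for assessors; whichever fields are used, the parse is the same.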

Additional Resources

In order to facilitate broad participation, the basic test collection is formatted in the same way as a typical CLEF ad hoc test collection. The following additional resources will also be available to support system development:

  • Around 40 representative training topics, with relevance judgments for the same collection of interviews that will be used in the evaluation.
  • Scripts for generating alternative relevance judgments for the training topics that can be used to support detailed failure analysis.
  • Scripts for generating richer metadata for each segment using synonymy, part-whole, and is-a thesaurus relationships. This capability can be used with the automatically assigned thesaurus categories or (for contrastive runs) with the manually assigned thesaurus categories.
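Relevance judgments for training topics of this kind are conventionally distributed in the TREC qrels layout (one judged topic/segment pair per line). A sketch of reading such a file, with the caveat that the exact file layout and identifier scheme shipped with this collection may differ:

```python
from collections import defaultdict

# Hypothetical qrels lines in the common TREC layout:
#   topic-id  iteration  segment-id  judgment
# Topic and segment ids here are invented for illustration.
sample_qrels = """\
CL0001 0 VHF00017-003 1
CL0001 0 VHF00021-001 0
CL0002 0 VHF00017-004 1
"""

def relevant_by_topic(qrels_text):
    """Map each topic id to the set of segment ids judged relevant."""
    relevant = defaultdict(set)
    for line in qrels_text.splitlines():
        topic, _iteration, segment, judgment = line.split()
        if int(judgment) > 0:
            relevant[topic].add(segment)
    return relevant

rel = relevant_by_topic(sample_qrels)
print({topic: sorted(segs) for topic, segs in rel.items()})
```

A table like this is the natural input for failure analysis: alternative judgment sets generated by the provided scripts can be loaded the same way and diffed topic by topic.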

A collection of metadata describing each interview is also offered. This includes basic biographical details (e.g., interviewee name and birth date) and a half-page free-text summary of the interview. Since this metadata is created manually, it may be used only for contrastive runs (i.e., not for the one required run).