Data

Releases

For more details on how to participate and get the collections, see Guidelines.

Spoken Word Collections

An English and a Czech collection are used in the track this year.

English Collection

The English collection includes 8,104 segments, 105 topics, and 105,076 relevance judgments to facilitate information retrieval experimentation. Segments are the unit of retrieval for English materials in the CLEF CL-SR evaluation. Interviews with survivors of the Holocaust were manually segmented to form topically coherent segments by subject matter experts at the Survivors of the Shoah Visual History Foundation.

Each segment is stored as one document in a consistent format, shown as follows:

<DOC>
</DOC> The tag for each document.

<DOCNO>
</DOCNO> Document identifier: VHF[IntCode]-[SegId].[SequenceNum]

<INTERVIEWDATA>
</INTERVIEWDATA> Metadata about the entire interview.

<NAME>
</NAME> Full name of every person mentioned.

<MANUALKEYWORD>
</MANUALKEYWORD> Thesaurus keywords assigned to the segment.

<SUMMARY>
</SUMMARY> 3-sentence segment summary.

<ASRTEXT2003A>
</ASRTEXT2003A> ASR transcript produced in 2003.

<ASRTEXT2004A>
</ASRTEXT2004A> ASR transcript produced in 2004.

<ASRTEXT2006A>
</ASRTEXT2006A> ASR transcript produced in 2006.

<ASRTEXT2006B>
</ASRTEXT2006B> ASR transcript produced in 2006.

<AUTOKEYWORD2004A1>
</AUTOKEYWORD2004A1> Thesaurus keywords from a kNN classifier.

<AUTOKEYWORD2004A2>
</AUTOKEYWORD2004A2> Thesaurus keywords from a second kNN classifier.

The collection can be indexed with any IR system that can handle documents in the standard format of a CLEF or TREC test collection. If retrieval based on the best available Automatic Speech Recognition (ASR) output is desired, index only the <ASRTEXT2006B> field (and, optionally, the automatic keyword fields based on it).
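As a rough illustration of restricting indexing to a single field, the following Python sketch pulls the document identifier and the <ASRTEXT2006B> text out of documents in the format above. It is a minimal sketch, not the track's own tooling: the function names `field` and `index_text`, the sample document, and its identifier are all hypothetical, and real files may need more robust handling (encodings, repeated fields, very large files).

```python
import re

# Matches one <DOC>...</DOC> block, including newlines inside it.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)


def field(doc_body, tag):
    """Return the stripped text of the first <tag>...</tag> pair, or '' if absent."""
    m = re.search(rf"<{tag}>(.*?)</{tag}>", doc_body, re.DOTALL)
    return m.group(1).strip() if m else ""


def index_text(collection_text, fields=("ASRTEXT2006B",)):
    """Map each DOCNO to the concatenated text of the selected fields only."""
    out = {}
    for m in DOC_RE.finditer(collection_text):
        body = m.group(1)
        docno = field(body, "DOCNO")
        texts = (field(body, f) for f in fields)
        out[docno] = " ".join(t for t in texts if t)
    return out


# Hypothetical document; the identifier and text are invented for illustration.
sample = """<DOC>
<DOCNO>VHF00001-000001.001</DOCNO>
<SUMMARY>Hypothetical summary text.</SUMMARY>
<ASRTEXT2004A>older transcript words</ASRTEXT2004A>
<ASRTEXT2006B>newer transcript words</ASRTEXT2006B>
</DOC>"""

print(index_text(sample))
```

Passing a longer tuple as `fields` would add other fields to the indexed text; which keyword fields correspond to which transcript is not assumed here.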

For questions regarding the English collection, please email Doug Oard.

Czech Collection

The Czech interviews use a no-boundary condition where relevance assessors mark only topically relevant "playback points" during assessment. Unlike the English collection, where the task is to retrieve segments, the task in Czech is to retrieve ranked lists of candidate playback points. The collection contains around 450 hours of speech in 354 interviews.

To facilitate initial exploration of this collection, a set of scripts is available for automatically generating document files in a standard form (similar to that used in other CLEF tracks) using a sliding window with a user-specified width and a user-specified start-time spacing.
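The sliding-window idea can be sketched in a few lines of Python. This is not the distributed script; it assumes, purely for illustration, that a transcript is available as a time-sorted list of (start time in seconds, word) pairs, and the function name `sliding_windows` is hypothetical.

```python
def sliding_windows(words, width, spacing):
    """Group time-stamped words into overlapping windows.

    words:   list of (start_time_seconds, token), sorted by time (assumed input form)
    width:   window length in seconds
    spacing: distance in seconds between consecutive window start times
    Returns a list of (window_start, text) for each non-empty window.
    """
    if not words:
        return []
    docs = []
    last_time = words[-1][0]
    start = 0
    while start <= last_time:
        # A word belongs to the window if it starts inside [start, start + width).
        tokens = [tok for t, tok in words if start <= t < start + width]
        if tokens:
            docs.append((start, " ".join(tokens)))
        start += spacing
    return docs


# Toy transcript with one word every 10 seconds.
words = [(0, "a"), (10, "b"), (20, "c"), (30, "d")]
for start, text in sliding_windows(words, width=20, spacing=10):
    print(start, text)
```

When the width exceeds the spacing, consecutive windows overlap, so the same word appears in more than one generated document; each window's start time is then a natural candidate playback point.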

Each document is formatted as follows:

<DOC>
</DOC> The tag for each document.

<DOCNO>
</DOCNO> Unique document identifier: VHF[IntCode]-[starting-time].

<INTERVIEWDATA>
</INTERVIEWDATA> The first name and last initial for the person being interviewed.

<ASRSYSTEM>
</ASRSYSTEM> Specification of the source of the transcript (either 2004 or 2006).

<CHANNEL>
</CHANNEL> Recorded channel (left or right) used to produce the transcript.

<ASRTEXT>
</ASRTEXT> The ASR transcript.

For questions regarding the Czech collection, please email Pavel Pecina.

Topics

For the English task, there are 105 topics on which the evaluation system should be run.
All topics for the English task are available in Czech, English, French, German, Dutch, and Spanish.
For the Czech task, there are 29 training topics and 118 topics on which the evaluation system should be run.
All topics for the Czech task are available in Czech and English.

These topic files are available in both the standard CLEF/TREC topic format and as XML. The format of each topic is consistent and shown as follows:

<top>
</top> The tag for each topic.

<num> Topic identifier.

<title> A concise representation of the topic.

<desc> A short description of the topic.

<narr> A more detailed description of the topic.
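A topic file in this layout can be read with a short Python sketch such as the one below. It assumes that each field runs from its opening tag to the next field tag (or to the end of the topic) and also tolerates closing tags as in an XML variant; the function name `parse_topics` and the sample topic are hypothetical, and real topic files may carry extra labels or attributes not handled here.

```python
import re

# One <top>...</top> block per topic.
TOP_RE = re.compile(r"<top>(.*?)</top>", re.DOTALL)

# A field starts at its tag and ends at the next opening or closing field tag,
# or at the end of the topic body.
FIELD_RE = re.compile(
    r"<(num|title|desc|narr)>(.*?)(?=</?(?:num|title|desc|narr)>|\Z)",
    re.DOTALL,
)


def parse_topics(text):
    """Return a list of dicts with keys among num, title, desc, narr."""
    topics = []
    for m in TOP_RE.finditer(text):
        fields = {
            name: " ".join(value.split())  # collapse line breaks and extra spaces
            for name, value in FIELD_RE.findall(m.group(1))
        }
        topics.append(fields)
    return topics


# Hypothetical topic, invented for illustration.
sample = """<top>
<num> 1001
<title> Example title
<desc> A short description.
<narr> A longer
narrative.
</top>"""

print(parse_topics(sample))
```

The title field alone gives a short query, while concatenating title and description gives a longer one; both can be built from the returned dictionaries.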

Additional Resources

In order to facilitate broad participation, the basic test collection is formatted in the same way as a typical CLEF ad-hoc test collection. The following additional resources will also be available to support system development:

A number of representative training topics, with relevance judgments for the same collections of interviews that will be used in the evaluation.
Scripts for generating alternative relevance judgments for the training topics that can be used to support detailed failure analysis (English only).
Scripts for generating richer metadata for each segment using synonymy, part-whole, and is-a thesaurus relationships. This capability can be used with the automatically assigned thesaurus categories or (for contrastive runs) with the manually assigned thesaurus categories (English only).