INEX 2009 QA Track (QA@INEX)

Overview

The INEX 2009 QA@INEX track will complement the Ad-Hoc tasks. It will use the same document collection while involving specific questions. The track aims to compare the performance of QA, XML/passage retrieval and automatic summarization systems on an encyclopedic resource such as Wikipedia. The track will consider two types of questions that extend the queries considered in the Ad-Hoc tasks. The first set will be factual questions, which require a single precise answer to be found in the corpus if it exists. The second set will consist of more complex questions whose answers require the aggregation of several passages. The passages need not be in a single document, so answers might involve multi-document aggregation. Participation of automatic summarization systems by passage extraction is therefore encouraged. This will be an opportunity to test XML/passage retrieval systems on advanced QA tasks.

The evaluation will use the INEX Ad-Hoc online interfaces. For both sets of questions, systems will have to provide a ranked list of non-overlapping relevant passages. These passages will be evaluated by participants following the same methodology used in the Ad-Hoc track. In the case of short answers, systems will also have to provide, for each passage, the position of the answer in the passage. The evaluation of factual questions will take into account the distance between the inferred answer and the real one. For aggregated answers, systems will provide a document with a maximum of 500 words made exclusively of passages from the document collection. These documents will be evaluated according to their overlap with relevant passages and their “last point of interest”, to be determined by participant evaluation. The last point of interest marks the place where the aggregated document becomes irrelevant, incomprehensible or redundant.

Test collection

As for the ad hoc track, we will use the INEX 2009 Wikipedia collection (without images). Each participating group will be asked to create a set of candidate questions, representative of a range of real user needs.

QA Task

The QA task to be performed by the participating groups of INEX 2009 is answering an academic question using the Wikipedia. The general process involves:

Like in the Ad-Hoc task, we regard as relevant passages segments that both

We rely on participants to propose a set of questions. For evaluation purposes, we require this year that the answer uses ONLY elements or passages previously extracted from the document collection. The correctness of answers will be established by participants exclusively on the basis of the support passages and documents. This implies that errors in the Wikipedia version used in INEX could give rise to acceptable answers.

Two kinds of answers will be considered:

  1. Short: a single entity (Noun Phrase, integer, float or date) similar to those used in Multiple Choice Question (MCQ) academic tests.
  2. Long: the answer is constructed by aggregating several previously retrieved passages.

Participants are required to submit at least one completely automatic run. However, manual runs are strongly encouraged. Runs that require human intervention at any level of the process are considered manual. These interventions should be clearly stated and documented.

Short answers

Motivation for the Task

The underlying scenario is to retrieve from the Wikipedia possible answers to MCQ type questions. By way of example, for a question like “what is LINUX?”, acceptable short answers could be “A computer operating system”, “an operating system kernel”, “an asteroid” ... We will also consider numeric answers and dates. The type of the expected answer (NP, integer, float or date) will be explicitly stated in the question.

Display

The results are presented as a ranked-list of answers together with an explanation passage or element involving the answer.

Users

Users view the result list top-down, one by one. The most probable answers should be ranked first.

What we hope to learn from this task is how advanced passage and XML element retrieval on Wikipedia can be useful to academic QA.

Results to Return

The participants should provide two types of results.
  1. A small ordered set (10) of non-overlapping XML elements or passages that contain a possible answer to the question. These passages will be evaluated by participants using the online interface and the methodology of the Ad-Hoc task.
  2. For each element or passage, the position of the answer in the passage. They will be evaluated by computing their distance to the answer.

Focused IR systems are encouraged to participate in this task by simply providing the shortest and most relevant extracted passages they retrieve.
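As an illustration of the "small ordered set of non-overlapping passages" requirement, here is a minimal sketch of a greedy selection step. The Passage record, its character offsets and its score field are assumptions made for this example; the track does not prescribe how passages are represented internally or how they are scored.

```python
from typing import NamedTuple


class Passage(NamedTuple):
    """Hypothetical candidate passage produced by a retrieval system."""
    doc: str      # document identifier
    start: int    # character offset where the passage starts in the document
    end: int      # character offset just past the end of the passage
    score: float  # retrieval score, higher is better


def overlaps(a: Passage, b: Passage) -> bool:
    """Two passages overlap if they share characters of the same document."""
    return a.doc == b.doc and a.start < b.end and b.start < a.end


def top_non_overlapping(candidates: list[Passage], limit: int = 10) -> list[Passage]:
    """Greedily keep the best-scoring passages that do not overlap earlier picks."""
    selected: list[Passage] = []
    for cand in sorted(candidates, key=lambda p: p.score, reverse=True):
        if all(not overlaps(cand, kept) for kept in selected):
            selected.append(cand)
        if len(selected) == limit:
            break
    return selected
```

The returned list is already in rank order, so rank 1 is its first item. Greedy selection is only one possible choice; a system could equally merge overlapping candidates instead of discarding the weaker one.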

Long answers

Motivation for the Task

The scenario underlying the complex question task is to write a scholarly definition or synthesis on a topic that does not constitute a specific entry in the English Wikipedia. The answer needs to be built by aggregation of relevant XML elements or passages retrieved from different documents. For example, a long answer to “How does LINUX work?” could explain its connections with UNIX, the GNU project, several systems running it, different distributions ...

Display

A comprehensible agglomeration of relevant non-overlapping XML elements or passages.

Users

Users will read the generated document until they find a non-relevant passage, an incoherence or a redundancy.

The aggregated answers will be evaluated according to the way they overlap with relevant passages (number of them, vocabulary and bi-grams included or missing) and the “last point of interest” marked by evaluators. By combining these measures, we expect to take into account both the informative content and the readability of the aggregated answers. What we hope to learn from this task are ideas on how to combine QA, XML element/passage retrieval and automatic summarization by passage extraction to enhance Wikipedia quality, in particular by providing tools to detect redundancies and discrepancies.
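To make the idea of bi-gram overlap concrete, the sketch below computes the share of bi-grams from the relevant passages that also occur in an aggregated answer. It is an illustrative assumption, not the official measure: the tokenization (lowercasing and whitespace splitting) and the function names are chosen for this example only.

```python
def bigrams(text: str) -> set[tuple[str, str]]:
    """Set of adjacent word pairs, using naive lowercase whitespace tokens."""
    tokens = text.lower().split()
    return set(zip(tokens, tokens[1:]))


def bigram_recall(answer: str, relevant_passages: list[str]) -> float:
    """Fraction of bi-grams from the relevant passages found in the answer."""
    reference: set[tuple[str, str]] = set()
    for passage in relevant_passages:
        reference |= bigrams(passage)
    if not reference:
        return 0.0  # nothing to compare against
    return len(reference & bigrams(answer)) / len(reference)
```

The bi-grams counted as "missing" are then simply those in the reference set that are absent from the answer.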

Results to Return

Two types of results should be provided.
  1. Up to 1000 non-overlapping XML elements or passages that contain some information relevant to the query.
  2. An aggregated document exclusively made of passages previously selected.
Automatic summarization systems by extraction are strongly encouraged to participate.
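A very simple way to obtain an aggregated document from the ranked passages is to concatenate them in rank order until the 500-word limit is reached. The sketch below does this under stated assumptions: words are counted by whitespace splitting, verbatim duplicates are skipped, and the function name is hypothetical. It makes no attempt at coherence or redundancy detection, which is exactly what summarization systems are expected to improve on.

```python
def build_summary(passages: list[str], max_words: int = 500) -> str:
    """Concatenate ranked passages, skipping any that would exceed max_words."""
    chosen: list[str] = []
    used = 0
    for text in passages:
        n_words = len(text.split())
        if text in chosen or used + n_words > max_words:
            continue  # skip verbatim duplicates and passages that do not fit
        chosen.append(text)
        used += n_words
    return "\n".join(chosen)
```

Because only whole passages are appended, the result stays made exclusively of previously selected passages.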

Result Submission

Fact sheet:

Resources

A baseline XML-element retrieval system powered by Indri will be available online with a standard CGI interface. The index will cover all words (no stop list, no stemming) and XML tags. Participants who do not wish to build their own index can use this one, either by downloading it or by querying it online (contact eric.sanjuan@univ-avignon.fr).

Relevance assessments

Relevance assessments will be conducted by participating groups.

They will be done on-line.

Short answers

Each assessor will have a pool of support passages to analyze. Only the question and the support passage of the answer will be displayed, not the answer.

A maximum of 3 consecutive sentences or a complete table will be displayed. If the assessor finds an answer to the question in the passage, s/he will mark its position. We emphasize that this should be an answer based on the INEX version of Wikipedia and should not rely on the assessor's personal knowledge. Therefore erroneous answers found in Wikipedia will be accepted if the passage clearly answers the question without any doubt. Conversely, the passage could contain the right answer without explaining it, i.e. the answer appears out of context or requires some extra knowledge to be identified. In this case no answer should be marked.

Systems will be ranked according to:

Long answers

Each assessor will have to evaluate a pool of answers of a maximum of 500 words each. These answers will be an agglomeration of Wikipedia passages, but the source articles will not be shown.

Evaluators will have to mark:

  1. The “last point of interest”, i.e. the first point after which the text becomes out of context because of:
  2. All relevant passages in the text, even if they are redundant. However, each marked passage should be syntactically correct and free of unresolved anaphora.

Systems will be ranked according to the:

We plan to compute almost all the similarities described in the paper “Automatic Summary Evaluation without Human Models”, presented at TAC 2008 by Annie Louis and Ani Nenkova. In addition, we also plan to compute similarities based on n-grams.
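The measures in that paper include divergences between the word distribution of a summary and that of the source text, such as the Jensen-Shannon divergence. The sketch below shows one such computation on unigram distributions. Which measures will actually be used, and how text is tokenized and smoothed, is not specified here; the lowercase whitespace tokenization and the function names are assumptions made for this example.

```python
import math
from collections import Counter


def distribution(text: str) -> dict[str, float]:
    """Relative frequency of each lowercase whitespace token in the text."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def js_divergence(p: dict[str, float], q: dict[str, float]) -> float:
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    divergence = 0.0
    for word in set(p) | set(q):
        pw, qw = p.get(word, 0.0), q.get(word, 0.0)
        mw = (pw + qw) / 2  # mixture distribution
        if pw > 0:
            divergence += 0.5 * pw * math.log2(pw / mw)
        if qw > 0:
            divergence += 0.5 * qw * math.log2(qw / mw)
    return divergence
```

A lower divergence between an aggregated answer and the pool of relevant passages indicates closer vocabulary use.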

Release of questions and result formats (24/July/2009)

Questions

The set of questions is finally released for this task. It can be found here.

There is a total of 231 questions, all related to INEX ad-hoc topics. Therefore, in theory, answers should only appear in passages relevant to at least one ad-hoc topic, and participants are encouraged to use all information available on these topics.

There are three types of questions: short_single, short_multiple and long. The 151 questions labeled short_single or short_multiple both require short answers, which are passages of at most 50 words together with an offset indicating the position of the answer. The only difference is that short_single questions should have a single correct answer, whereas short_multiple questions admit multiple answers. For both short types, participants should give their results as a ranked list of at most 10 passages from the corpus, each with an offset indicating the position of the answer. The passages have to be self-contained and allow evaluators to decide whether the answer is correct. Assessment for short questions will take into account the presence of the correct answer within this list and its rank. The set of 151 short type questions can be found here.

Long type questions require long answers of up to 500 words that should be self-contained summaries made of up to 1000 passages (up to 50 words each) extracted exclusively from the INEX 2009 corpus. For the evaluation, we need the ranked list of text passages from which the answer is built and the summary that is generated. We have selected a set of 80 long type questions that require such answers. This subset can be found here.
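The size limits stated above (10 passages for short questions, 1000 for long ones, 50 words per passage, 500 words per summary) can be checked before submission. The following is a minimal sketch of such a check; the function name, the message wording and the whitespace-based word count are assumptions, and the organizers may count words differently.

```python
from typing import Optional

# Limits taken from the task description.
MAX_PASSAGES = {"short_single": 10, "short_multiple": 10, "long": 1000}
MAX_PASSAGE_WORDS = 50
MAX_SUMMARY_WORDS = 500


def check_question(qtype: str, passages: list[str],
                   summary: Optional[str] = None) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems: list[str] = []
    limit = MAX_PASSAGES[qtype]
    if len(passages) > limit:
        problems.append(f"too many passages: {len(passages)} > {limit}")
    for rank, text in enumerate(passages, start=1):
        n_words = len(text.split())
        if n_words > MAX_PASSAGE_WORDS:
            problems.append(
                f"passage at rank {rank} has {n_words} words (limit {MAX_PASSAGE_WORDS})")
    if qtype == "long":
        if summary is None:
            problems.append("long questions need a summary")
        elif len(summary.split()) > MAX_SUMMARY_WORDS:
            problems.append(f"summary exceeds {MAX_SUMMARY_WORDS} words")
    return problems
```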

Participants can choose to work on all types of questions or only on one of these two subgroups.

XML format for results

The results should be presented in a single XML file (UTF-8 encoding) following this DTD:

<!DOCTYPE inex-answer-qa-file [
<!ENTITY lt "lessthan">
<!ENTITY gt "greaterthan">
<!ENTITY amp "ampersand">
<!ELEMENT inex-answer-qa-file (question+)>
<!ELEMENT question (answer+, summary?)>
<!ATTLIST question id CDATA #REQUIRED>
<!ELEMENT answer (#PCDATA)>
<!ATTLIST answer rank CDATA #REQUIRED>
<!ATTLIST answer doc CDATA #REQUIRED>
<!ATTLIST answer offset CDATA #IMPLIED>
<!ELEMENT summary (#PCDATA)>
]>

Therefore the XML file should have this form:
<inex-answer-qa-file>
<question id="XXX">
<answer rank="1" doc="xxxx" offset="O">
passage_text
</answer>
...
<answer rank="N" doc="yyyy" offset="O">
passage_text
</answer>
</question>
<question id="YYYY">
<answer rank="1" doc="xxxx">
passage_text
</answer>
...
<answer rank="L" doc="yyyy">
passage_text
</answer>
<summary>
summary_text
</summary>
</question>
</inex-answer-qa-file>

where the passage_text values are the plain passages of up to 50 words that contain the answer (for short answers) or are used to build the summary (for long answers). The summary field is only required for long answers; summary_text has to be made of passages from the preceding answer list and is limited to 500 words.
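As a rough illustration of how a run could be serialized in this format, the sketch below builds the file with Python's standard xml.etree.ElementTree module. The input dictionaries, their keys and the sample identifiers are hypothetical; only the element and attribute names come from the DTD. The offset attribute is written only when a value is supplied, and the summary element only when a summary is present.

```python
import xml.etree.ElementTree as ET


def build_results(questions: list[dict]) -> ET.ElementTree:
    """Build an inex-answer-qa-file tree from a list of question dictionaries.

    Each dictionary is assumed to have the keys "id" and "answers", plus an
    optional "summary". Each answer has "doc" and "text", plus an optional
    "offset". Ranks are assigned from the order of the answers.
    """
    root = ET.Element("inex-answer-qa-file")
    for question in questions:
        q_el = ET.SubElement(root, "question", {"id": question["id"]})
        for rank, answer in enumerate(question["answers"], start=1):
            attrs = {"rank": str(rank), "doc": answer["doc"]}
            if "offset" in answer:
                attrs["offset"] = str(answer["offset"])
            a_el = ET.SubElement(q_el, "answer", attrs)
            a_el.text = answer["text"]
        if "summary" in question:
            ET.SubElement(q_el, "summary").text = question["summary"]
    return ET.ElementTree(root)


if __name__ == "__main__":
    run = build_results([
        {"id": "example-short",  # hypothetical question id
         "answers": [{"doc": "12345", "offset": 12,
                      "text": "Linux is an operating system kernel"}]},
    ])
    run.write("run.xml", encoding="utf-8", xml_declaration=True)
```

ElementTree escapes characters such as & and < in passage text when writing, so passages can be passed in as plain strings.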

The attributes are the following:

Participants who cannot provide such an XML file with all required fields should contact the organizers.

Updated Schedule

The new QA schedule follows that of the ad hoc task.

1/Jun/2009 Declaration of intent
15/Jun/2009 Submission deadline for candidate questions; Result Submission Specification
25/Jul/2009 Release of final set of questions
8/Sep/2009 Submission deadline for Results (short and long answers)
14/Oct/2009 Submission deadline for relevance assessments
9/Nov/2009 Release of QA evaluation results
6-10/Dec/2009 INEX Workshop in Brisbane, Australia

Organizers

Patrice Bellot
University of Avignon

Véronique Moriceau
LIMSI-CNRS, University Paris-Sud 11

Eric SanJuan
University of Avignon
eric.sanjuan@univ-avignon.fr

Xavier Tannier
LIMSI-CNRS, University Paris-Sud 11