The QA@INEX track of INEX 2009 will complement the Ad-Hoc tasks. It will use the same document collection but involve specific questions. The track aims to compare the performance of QA, XML/passage retrieval and automatic summarization systems on an encyclopedic resource such as Wikipedia. The track will consider two types of questions that extend the queries considered in the Ad-Hoc tasks. The first set will be factual questions, which require a single precise answer to be found in the corpus if it exists. The second set will consist of more complex questions whose answers require the aggregation of several passages. The passages need not be in a single document, so answers might involve multi-document aggregation. Participation of automatic summarization systems by passage extraction is therefore encouraged. This will be an opportunity to test XML/passage retrieval systems on advanced QA tasks.
The evaluation will use the INEX Ad-Hoc online interfaces. For both sets of questions, systems will have to provide a ranked list of non-overlapping relevant passages. These passages will be evaluated by participants following the same methodology used in the Ad-Hoc track. In the case of short answers, systems will also have to provide, for each passage, the position of the answer in the passage. The evaluation of factual questions will take into account the distance between the inferred answer and the real one. For aggregated answers, systems will provide a document with a maximum of 500 words made exclusively of passages from the document collection. These documents will be evaluated according to their overlap with relevant passages and their “last point of interest”, to be determined by participant evaluation. The last point of interest marks the place where the aggregated document becomes irrelevant, incomprehensible or redundant.
As in the Ad-Hoc track, we will use the INEX 2009 Wikipedia collection (without images). Each participating group will be asked to create a set of candidate questions, representative of a range of real user needs.
The QA task to be performed by the participating groups of INEX 2009 is answering an academic question using the Wikipedia. The general process involves:
Like in the Ad-Hoc task, we regard as relevant passages segments that both
We rely on participants to propose a set of questions. For evaluation purposes, we require this year that the answer uses ONLY elements or passages previously extracted from the document collection. The correctness of answers will be established by participants exclusively on the basis of the support passages and documents. This implies that errors in the Wikipedia version used in INEX could give rise to acceptable answers.
Two kinds of answers will be considered:
Participants are required to submit at least one completely automatic run. However, manual runs are strongly encouraged. Runs that require human intervention at any level of the process are considered manual. These interventions should be clearly stated and documented.
What we hope to learn from this task is how advanced passage and
XML element retrieval on Wikipedia can be useful to academic QA.
For each element or passage, systems must give the position of the answer in the passage. These positions will be evaluated by computing their distance to the answer.
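The distance-based evaluation of answer positions can be illustrated with a minimal sketch. The track description does not state the unit of the offset (characters or words) or how the distance enters the final score, so the absolute difference used here, and the function names, are assumptions for illustration only.

```python
from typing import List, Optional


def offset_distance(submitted_offset: int, assessed_offset: int) -> int:
    """Absolute distance between a submitted position and an assessed one.

    Assumption: both offsets are counted in the same unit from the start
    of the passage; the source does not specify the unit.
    """
    return abs(submitted_offset - assessed_offset)


def best_distance(submitted_offset: int, assessed_offsets: List[int]) -> Optional[int]:
    """Smallest distance to any position marked by an assessor.

    Returns None when the assessor marked no answer in the passage.
    """
    if not assessed_offsets:
        return None
    return min(offset_distance(submitted_offset, a) for a in assessed_offsets)
```

A smaller value means the system pointed closer to the place where the assessor found the answer.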
Focused IR systems are encouraged to participate in this task by simply providing the shortest and most relevant extracted passages they retrieve.
The aggregated answers will be evaluated according to the way they overlap with relevant passages (number of them, vocabulary and bi-grams included or missing) and the “last point of interest” marked by evaluators. By combining these measures, we expect to take into account both the informative content and the readability of the aggregated answers. What we hope to learn from this task are ideas on how to combine QA, XML element/passage retrieval and automatic summarization by passage extraction to enhance Wikipedia quality, in particular by providing tools to detect redundancies and discrepancies.
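The overlap side of this evaluation can be sketched as follows. The exact measures used by the organizers are not given here, so the recall-style ratios, the simple word tokenizer and all function names below are assumptions; the sketch only shows one plausible way to count included passages, vocabulary and bi-grams.

```python
import re
from typing import Dict, List, Set, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase word tokens (an assumed, deliberately simple tokenizer)."""
    return re.findall(r"\w+", text.lower())


def bigrams(tokens: List[str]) -> Set[Tuple[str, str]]:
    """Set of adjacent token pairs."""
    return set(zip(tokens, tokens[1:]))


def overlap_report(summary: str, relevant_passages: List[str]) -> Dict[str, object]:
    """Compare an aggregated answer with the passages judged relevant."""
    summary_tokens = tokenize(summary)
    summary_vocab = set(summary_tokens)
    summary_bigrams = bigrams(summary_tokens)
    summary_text = " " + " ".join(summary_tokens) + " "

    ref_vocab: Set[str] = set()
    ref_bigrams: Set[Tuple[str, str]] = set()
    included = 0
    for passage in relevant_passages:
        tokens = tokenize(passage)
        ref_vocab.update(tokens)
        ref_bigrams.update(bigrams(tokens))
        # A passage counts as included if its token sequence occurs in the summary.
        if tokens and " " + " ".join(tokens) + " " in summary_text:
            included += 1

    def recall(found: set, reference: set) -> float:
        return len(found & reference) / len(reference) if reference else 0.0

    return {
        "passages_included": included,
        "vocabulary_recall": recall(summary_vocab, ref_vocab),
        "bigram_recall": recall(summary_bigrams, ref_bigrams),
        "missing_vocabulary": sorted(ref_vocab - summary_vocab),
    }
```

The readability side, the “last point of interest”, comes from human evaluators and cannot be computed this way.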
A baseline XML-element retrieval system powered by Indri will be available online with a standard CGI interface. The index will cover all words (no stop list, no stemming) and XML tags. Participants who do not wish to build their own index can use this one, either by downloading it or by using it online (contact eric.sanjuan@univ-avignon.fr).
Relevance assessments will be conducted by participating groups.
They will be done on-line.
Each assessor will have a pool of support passages to analyze. Only the question and the support passage of the answer will be displayed, not the answer.
A maximum of 3 consecutive sentences or a complete table will be displayed. If the assessor finds an answer to the question in the passage, s/he will mark its position. We emphasize that this should be an answer based on the INEX version of Wikipedia and should not rely on the assessor's personal knowledge. Therefore, erroneous answers found in Wikipedia will be accepted if the passage clearly answers the question without any doubt. Conversely, the passage could contain the right answer without explaining it, i.e. the answer appears out of context or requires some extra knowledge to be identified. In this case no answer should be marked.
Systems will be ranked according to:
Each assessor will have to evaluate a pool of answers of a maximum of 500 words each. These answers will be an agglomeration of Wikipedia passages, but the source articles will not be shown.
Evaluators will have to mark:
Systems will be ranked according to the:
The set of questions is finally released for this task. It can be found here.
There is a total of 231 questions, all related to INEX Ad-Hoc topics. Therefore, in theory, answers should only appear in passages relevant to at least one Ad-Hoc topic, and participants are encouraged to use all information available on these topics.
There are three types of questions: short_single, short_multiple and long. There are 151 questions labeled short_single or short_multiple. Both types require short answers, which are passages of a maximum of 50 words together with an offset indicating the position of the answer. The only difference between them is that short_single questions should have a single correct answer, whereas short_multiple questions admit multiple answers. For both short types, participants should give their results as a ranked list of at most 10 passages from the corpus, each with an offset indicating the position of the answer. The passages have to be self-contained and allow evaluators to decide whether the answer is correct. Assessment of short questions will take into account the presence of the correct answer within this list and its rank. The set of 151 short type questions can be found here.
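The limits for short answers can be checked before submission with a small sketch such as the one below. The `ShortAnswer` record and the `validate_short_run` function are hypothetical names introduced for illustration; the requirement that ranks be distinct and that offsets be non-negative are assumptions, since the unit of the offset is not specified here.

```python
from typing import List, NamedTuple

MAX_PASSAGES = 10  # ranked list of at most 10 passages per short question
MAX_WORDS = 50     # each passage has at most 50 words


class ShortAnswer(NamedTuple):
    rank: int
    doc: str
    offset: int
    passage: str


def validate_short_run(answers: List[ShortAnswer]) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    if len(answers) > MAX_PASSAGES:
        problems.append(
            f"{len(answers)} passages submitted, maximum is {MAX_PASSAGES}"
        )
    ranks = [a.rank for a in answers]
    if len(set(ranks)) != len(ranks):
        problems.append("ranks are not distinct")  # assumption: one passage per rank
    for a in answers:
        n_words = len(a.passage.split())
        if n_words > MAX_WORDS:
            problems.append(
                f"rank {a.rank}: passage has {n_words} words, maximum is {MAX_WORDS}"
            )
        if a.offset < 0:
            problems.append(f"rank {a.rank}: negative offset {a.offset}")
    return problems
```

Words are counted here by splitting on whitespace, which may differ from the counting used by the organizers.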
Long type questions require long answers of up to 500 words that should be self contained summaries made of up to 1000 passages (up to 50 words each) extracted exclusively from the INEX 2009 corpus. For the evaluation, we need the ranked list of text passages from which the answer is built and the summary that is generated. We have selected a set of 80 long type questions that require such answers. This subset can be found here.
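The constraints on long answers lend themselves to a similar pre-submission check. The sketch below is an assumption-laden illustration: the function names are hypothetical, words are counted by whitespace splitting, and "made of passages" is tested by exact string matching after whitespace normalization, which may be stricter or looser than what the organizers apply.

```python
from typing import List

MAX_SUMMARY_WORDS = 500   # long answers have up to 500 words
MAX_PASSAGES = 1000       # built from up to 1000 passages
MAX_PASSAGE_WORDS = 50    # of up to 50 words each


def uncovered_text(summary: str, passages: List[str]) -> str:
    """Return the part of the summary that is not copied from any listed passage."""
    remaining = " ".join(summary.split())
    # Remove longer passages first so that shorter ones cannot split them.
    for passage in sorted(passages, key=len, reverse=True):
        normalized = " ".join(passage.split())
        if normalized:
            # The sentinel keeps text on both sides of a removed passage apart.
            remaining = remaining.replace(normalized, "\x00")
    return " ".join(remaining.replace("\x00", " ").split())


def validate_long_answer(summary: str, passages: List[str]) -> List[str]:
    """Return a list of problems; an empty list means no issue found."""
    problems = []
    n_words = len(summary.split())
    if n_words > MAX_SUMMARY_WORDS:
        problems.append(
            f"summary has {n_words} words, maximum is {MAX_SUMMARY_WORDS}"
        )
    if len(passages) > MAX_PASSAGES:
        problems.append(
            f"{len(passages)} passages listed, maximum is {MAX_PASSAGES}"
        )
    for i, passage in enumerate(passages, start=1):
        if len(passage.split()) > MAX_PASSAGE_WORDS:
            problems.append(f"passage {i} exceeds {MAX_PASSAGE_WORDS} words")
    leftover = uncovered_text(summary, passages)
    if leftover:
        problems.append(f"summary text not found in the passage list: {leftover!r}")
    return problems
```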
Participants can choose to work on all types of questions or only on one of these two subgroups.
The results should be presented in a single XML file (UTF-8 encoding) following this DTD:
<!DOCTYPE inex-answer-qa-file [
Therefore the XML file should have this form:
<question id="XXX">
<answer rank="1" doc="xxxx" offset="O">
passage_text
</answer>
...
<answer rank="N" doc="yyyy" offset="O">
passage_text
</answer>
</question>
<question id="YYYY">
<answer rank="1" doc="xxxx">
passage_text
</answer>
...
<answer rank="L" doc="yyyy">
passage_text
</answer>
<summary>
summary_text
</summary>
</question>
where each passage_text value is a plain passage of up to 50 words, either containing the answer (for short answers) or used to build the summary (for long answers). The summary field is only required for long answers; the summary_text has to be made of passages from the preceding answer list and is limited to 500 words.
The attributes are the following:
Participants that cannot provide such XML file with all required fields should contact the organizers.
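As an illustration, a run file of this shape could be produced with Python's standard `xml.etree.ElementTree` module. This is a sketch under assumptions: the root element name is taken from the DOCTYPE line, any attributes the DTD may require on the root are omitted because they are not reproduced here, and the input dictionary layout and the `build_run` name are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_run(questions):
    """Build the XML tree for a run.

    `questions` is a list of dicts (hypothetical layout):
      {"id": ..., "answers": [{"doc": ..., "passage": ..., "offset": ...}, ...],
       "summary": ...}
    "offset" is given for short answers only, "summary" for long answers only.
    """
    # Root name assumed from the DOCTYPE declaration; root attributes omitted.
    root = ET.Element("inex-answer-qa-file")
    for q in questions:
        q_el = ET.SubElement(root, "question", id=str(q["id"]))
        # Answers are assumed to be already sorted by decreasing relevance.
        for rank, ans in enumerate(q["answers"], start=1):
            attrs = {"rank": str(rank), "doc": str(ans["doc"])}
            if "offset" in ans:
                attrs["offset"] = str(ans["offset"])
            a_el = ET.SubElement(q_el, "answer", attrs)
            a_el.text = ans["passage"]
        if "summary" in q:
            ET.SubElement(q_el, "summary").text = q["summary"]
    return root


def write_run(root, path):
    """Write the run as a UTF-8 encoded XML file."""
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)
```

`ElementTree` does not emit the DOCTYPE declaration, so it would have to be added separately if the submission system validates against the DTD.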
The new QA schedule follows that of the Ad-Hoc task:
| Date | Event |
|---|---|
| 1/Jun/2009 | Declaration of intent |
| 15/Jun/2009 | Submission deadline for candidate questions; Result Submission Specification |
| 25/Jul/2009 | Release of final set of questions |
| 8/Sep/2009 | Submission deadline for Results (short and long answers) |
| 14/Oct/2009 | Submission deadline for relevance assessments |
| 9/Nov/2009 | Release of QA evaluation results |
| 6-10/Dec/2009 | INEX Workshop in Brisbane, Australia |
Patrice Bellot
University of Avignon
Véronique Moriceau
LIMSI-CNRS, University Paris-Sud 11
Eric SanJuan
University of Avignon
eric.sanjuan@univ-avignon.fr
Xavier Tannier
LIMSI-CNRS, University Paris-Sud 11