The INEX QA track (QA@INEX) aims to evaluate a complex question-answering task. In such a task, the set of questions is composed of factoid, precise questions that expect short answers, as well as more complex questions that can be answered by several sentences or by an aggregation of texts from different documents. Question-answering, XML/passage retrieval and automatic summarization are combined in order to get closer to real information needs. The 2010 edition of the track is based on the groundwork carried out in 2009 to determine the sub-tasks and a novel evaluation methodology.
In 2009-2010, the track aims to compare the performance of QA, XML/passage retrieval and automatic summarization systems on the Wikipedia. Two types of questions are considered. The first type consists of factual questions, each requiring a single precise answer to be found in the corpus if it exists. The second type consists of more complex questions whose answers require the aggregation of several passages. The passages need not be in a single document, so answering may involve multi-document aggregation. Participation of automatic summarization systems by passage extraction is therefore encouraged. This is an opportunity to test XML/passage retrieval systems on advanced QA tasks.
For both sets of questions, systems have to provide a ranked list of relevant passages. In the case of short answers, systems also have to provide, for each passage, the position of the answer in the passage. The evaluation of factual questions takes into account the distance between the inferred answer and the real one. For aggregated answers, systems will provide a document of at most 500 words made exclusively of passages from the document collection. These documents are evaluated according to their overlap with relevant passages and their "last point of interest", to be determined by participant evaluation. The last point of interest marks the place where the aggregated document becomes irrelevant, incomprehensible or redundant.
Relevance assessments will be conducted by participating groups and organizers.
As for the ad hoc track, we use the INEX 2009 Wikipedia collection (without images). Each participating group is asked to create a set of candidate questions, representative of a range of real user needs.
The QA task to be performed by the participating groups of INEX 2009-2010 is answering an academic question using the Wikipedia. The general process involves:
Like in the Ad-Hoc task, we regard as relevant passages segments that both
For evaluation purposes, we require that the answer uses ONLY elements or passages previously extracted from the document collection. The correctness of answers is established by participants exclusively based on the support passages and documents. This implies that errors in the Wikipedia version used in INEX could give rise to acceptable answers.
Two kinds of answers are considered:
Participants are required to submit at least one completely automatic run. However, manual runs are strongly encouraged. Runs that require human intervention at any level of the process are considered manual. These interventions should be clearly stated and documented.
The underlying scenario is to retrieve from the Wikipedia possible answers to MCQ type questions. By way of example, for a question like "what is LINUX?", acceptable short answers could be "a computer operating system", "an operating system kernel", "an asteroid" ... We will also consider numeric answers and dates.
The results are presented as a ranked list of answers, each with an explanation passage or element containing the answer.
What we hope to learn from this task is how advanced passage and XML element retrieval on Wikipedia can be useful to academic QA.
For each element or passage, the position of the answer in the passage.
Focused IR systems are encouraged to participate in this task by simply providing the shortest and most relevant extracted passages they retrieve.
Each assessor will have a pool of support passages to analyze. Only the question and the support passage of the answer will be displayed, not the answer.
A maximum of 3 consecutive sentences or a complete table will be displayed. If the assessor finds an answer to the question in the passage, s/he will mark its position. We emphasize that this should be an answer based on the INEX version of Wikipedia and should not rely on the assessor's personal knowledge. Therefore, erroneous answers found in the Wikipedia will be accepted if the passage clearly answers the question without any doubt. Conversely, the passage could contain the right answer without explaining it, i.e. the answer appears out of context or requires some extra knowledge to be identified. In this case no answer should be marked.
Systems will be ranked according to:
The scenario underlying the complex question task is to write a scholarly definition or synthesis on a topic that does not constitute a specific entry in the English Wikipedia. The answer needs to be built by aggregation of relevant XML elements or passages retrieved from different documents. For example, a long answer to "How does LINUX work?" could explain its connections with UNIX, the GNU project, several systems running it, different distributions ...
The aggregated answers will be evaluated according to the way they overlap with relevant passages (number of them, vocabulary and bi-grams included or missing) and the "last point of interest" marked by evaluators. By combining these measures, we expect to take into account both the informative content and the readability of the aggregated answers. What we hope to learn from this task are ideas on how to combine QA, XML element/passage retrieval and automatic summarization by passage extraction to enhance Wikipedia quality, in particular by providing tools to detect redundancies and discrepancies.
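The overlap part of this evaluation can be illustrated with a minimal sketch. The exact measure is not specified here, so everything below is an assumption made for illustration: the tokenization, the function names `tokens`, `bigrams` and `overlap_scores`, and the choice of reporting the fraction of the relevant passages' vocabulary and bi-grams that also appear in the aggregated answer. It is not the organizers' scoring code.

```python
import re


def tokens(text: str) -> list[str]:
    """Lowercase alphanumeric tokens (assumed tokenization)."""
    return re.findall(r"[^\W_]+", text.lower())


def bigrams(toks: list[str]) -> set[tuple[str, str]]:
    """Set of adjacent token pairs."""
    return set(zip(toks, toks[1:]))


def overlap_scores(answer: str, relevant_passages: list[str]) -> dict[str, float]:
    """Fraction of reference vocabulary and bi-grams covered by the answer.

    Hypothetical illustration only: the official measure may weight or
    combine these quantities differently.
    """
    answer_tokens = tokens(answer)
    answer_vocab = set(answer_tokens)
    answer_bigrams = bigrams(answer_tokens)

    ref_vocab: set[str] = set()
    ref_bigrams: set[tuple[str, str]] = set()
    for passage in relevant_passages:
        passage_tokens = tokens(passage)
        ref_vocab |= set(passage_tokens)
        ref_bigrams |= bigrams(passage_tokens)

    def coverage(found: set, reference: set) -> float:
        # Share of reference items that also occur in the answer.
        return len(found & reference) / len(reference) if reference else 0.0

    return {
        "vocabulary": coverage(answer_vocab, ref_vocab),
        "bigrams": coverage(answer_bigrams, ref_bigrams),
    }
```

An answer that copies a relevant passage verbatim covers all of its vocabulary and bi-grams, while an answer that reuses the same words in a different order keeps the vocabulary score but loses bi-gram coverage. This is one way such a measure can reflect both informative content and the integrity of the extracted passages.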
Automatic summarization systems by extraction are strongly encouraged to participate.
Each assessor will have to evaluate a pool of answers of a maximum of 500 words each. These answers will be an agglomeration of Wikipedia passages.
Evaluators will have to mark:
Systems will be ranked according to the:
A baseline XML-element retrieval system powered by Indri is available online with a standard CGI interface. The index covers all words (no stop list, no stemming) and some XML tags. Participants who do not wish to build their own index can use this one, either by downloading it or by querying it online (More information here or contact firstname.lastname@example.org).
The set of questions is here.
There are three types of questions: short_single, short_multiple and long. The questions labelled short_single or short_multiple number 195, and both types require short answers, which are passages of at most 50 words (a word being a string of alphanumeric characters without spaces or punctuation), together with an offset indicating the position of the answer. The only difference between the two is that short_single questions should have a single correct answer, whereas short_multiple questions admit multiple answers.
Long type questions require long answers of up to 500 words that should be self-contained summaries made of passages extracted exclusively from the INEX 2009 corpus. We have selected a set of 150 long type questions that require such answers.
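The word limits above can be checked before submission with a short script. This is a minimal sketch: it assumes that a word is a maximal run of alphanumeric characters, following the definition given for short answers, and the names `count_words`, `within_limit` and `MAX_WORDS` are hypothetical helpers, not part of any official tool.

```python
import re

# Assumption: a "word" is a maximal run of alphanumeric characters
# (Unicode letters and digits, underscore excluded).
WORD_RE = re.compile(r"[^\W_]+")

# Limits taken from the task description: 50 words for short answers,
# 500 words for long answers.
MAX_WORDS = {"short_single": 50, "short_multiple": 50, "long": 500}


def count_words(text: str) -> int:
    """Count maximal alphanumeric runs in text."""
    return len(WORD_RE.findall(text))


def within_limit(text: str, question_type: str) -> bool:
    """Return True if text respects the word limit for question_type."""
    return count_words(text) <= MAX_WORDS[question_type]
```

Under this assumption a hyphenated term such as "thirty-five" counts as two words, so a passage close to the limit is best checked with the same definition the evaluation uses.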
Here:<qid> Q0 <file> <rank> <rsv> <run_id> <column_7> <column_8> <column_9>
1 Q0 3005204 1 0.9999 I10UniXRun1 The Alfred Noble Prize is an award presented by the combined engineering societies of the United States, given each year to a person not over thirty-five for a paper published in one of the journals of the participating societies.
1 Q0 3005204 2 0.9998 I10UniXRun1 The prize was established in 1929 in honor of Alfred Noble, Past President of the American Society of Civil Engineers.
1 Q0 3005204 3 0.9997 I10UniXRun1 It has no connection to the Nobel Prize , although the two are often confused due to their similar spellings.
In this format, passages are given as an offset and a length, both counted in characters with respect to the textual content (ignoring all tags) of the XML file. File offsets start counting at 0 (zero). The previous example would be the following in FOL format:
1 Q0 3005204 1 0.9999 I10UniXRun1 256 230
1 Q0 3005204 2 0.9998 I10UniXRun1 488 118
1 Q0 3005204 3 0.9997 I10UniXRun1 609 109
The results are from article 3005204. The first passage starts at offset 256 (so 256 characters beyond the first character) and has a length of 230 characters.
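The offset and length convention of the FOL format can be sketched as follows. The code assumes that `doc_text` already holds the textual content of the XML file with all tags removed; how that text is obtained is left out. The function names `to_fol`, `from_fol` and `format_fol_line` are hypothetical and serve only as an illustration.

```python
def to_fol(doc_text: str, passage: str, start_from: int = 0) -> tuple[int, int]:
    """Return (offset, length) of passage within the tag-free doc_text.

    Offsets are zero-based character positions, as in the FOL format.
    """
    offset = doc_text.find(passage, start_from)
    if offset < 0:
        raise ValueError("passage not found in document text")
    return offset, len(passage)


def from_fol(doc_text: str, offset: int, length: int) -> str:
    """Recover the passage text designated by an (offset, length) pair."""
    return doc_text[offset:offset + length]


def format_fol_line(qid: str, file_id: str, rank: int, rsv: float,
                    run_id: str, offset: int, length: int) -> str:
    """Build one result line: <qid> Q0 <file> <rank> <rsv> <run_id> <offset> <length>."""
    return f"{qid} Q0 {file_id} {rank} {rsv:.4f} {run_id} {offset} {length}"
```

For instance, `format_fol_line("1", "3005204", 1, 0.9999, "I10UniXRun1", 256, 230)` reproduces the first line of the example. Searching with `str.find` returns the first occurrence only, so a passage that repeats in the document needs `start_from` to select the intended one.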
For short type questions, we use an extra field that indicates the position of the answer in the passage. This position is given by counting the number of words before the detected answer. It is therefore an offset in words rather than in characters. Both the text passage and FOL formats can be used. The previous example would be in this format:
In the case of text passage format, it will look like:
1 Q0 3005204 1 0.9999 I10UniXRun1 The Alfred Noble Prize is an award presented by the combined engineering societies of the United States, given each year to a person not over thirty-five for a paper published in one of the journals of the participating societies. 2
1 Q0 3005204 2 0.9998 I10UniXRun1 The prize was established in 1929 in honor of Alfred Noble, Past President of the American Society of Civil Engineers. 10
1 Q0 3005204 3 0.9997 I10UniXRun1 It has no connection to the Nobel Prize , although the two are often confused due to their similar spellings. 7
Whereas in the case of fol format, we will have:
1 Q0 3005204 1 0.9999 I10UniXRun1 256 230 2
1 Q0 3005204 2 0.9998 I10UniXRun1 488 118 10
1 Q0 3005204 3 0.9997 I10UniXRun1 609 109 7
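Reading FOL result lines, with or without the extra answer-position field, can be sketched as below. The names `FolResult`, `parse_fol_line` and `words_from_answer` are hypothetical. The sketch also assumes that words are whitespace-separated tokens when counting the answer offset, which the task description does not state explicitly.

```python
from typing import NamedTuple, Optional


class FolResult(NamedTuple):
    qid: str
    file: str
    rank: int
    rsv: float
    run_id: str
    offset: int                        # zero-based character offset
    length: int                        # passage length in characters
    answer_word_offset: Optional[int]  # words before the answer, short questions only


def parse_fol_line(line: str) -> FolResult:
    """Parse '<qid> Q0 <file> <rank> <rsv> <run_id> <offset> <length> [<word_offset>]'."""
    fields = line.split()
    if len(fields) not in (8, 9) or fields[1] != "Q0":
        raise ValueError(f"not a FOL result line: {line!r}")
    qid, _, file_id, rank, rsv, run_id, offset, length = fields[:8]
    word_offset = int(fields[8]) if len(fields) == 9 else None
    return FolResult(qid, file_id, int(rank), float(rsv), run_id,
                     int(offset), int(length), word_offset)


def words_from_answer(passage: str, word_offset: int) -> list[str]:
    """Return the passage words starting at the answer position.

    Assumption: words are whitespace-separated tokens.
    """
    return passage.split()[word_offset:]
```

A line with eight fields is treated as a long-answer result and yields `answer_word_offset=None`, while a ninth field is read as the word offset of the short answer.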
|15/Sep/2010||Submission deadline for candidate questions|
|15/Oct/2010||Release of final set of questions available here.|
|13/Nov/2010||Submission deadline for Results (short and long answers)|
|15/Nov/2010||Release of QA semi-automatic evaluation results by organizers|
|13-15/Dec/2010||INEX Workshop in Amsterdam|
|17/Jan/2010||Release of manual evaluation by participants|
University of Avignon
LIMSI-CNRS, University Paris-Sud 11