INEX 2010 QA Track (QA@INEX)

Overview

The INEX QA track (QA@INEX) aims to evaluate a complex question-answering task. In such a task, the set of questions is composed of factoid, precise questions that expect short answers, as well as more complex questions that can be answered by several sentences or by an aggregation of texts from different documents. Question-answering, XML/passage retrieval and automatic summarization are combined in order to get closer to real information needs. The 2010 edition of the track is based on the groundwork carried out in 2009 to determine the sub-tasks and a novel evaluation methodology.

In 2009-2010, the track aims to compare the performance of QA, XML/passage retrieval and automatic summarization systems on the Wikipedia. Two types of questions are considered. The first type consists of factual questions that require a single precise answer, to be found in the corpus if it exists. The second type consists of more complex questions whose answers require the aggregation of several passages. These passages need not come from a single document, so answering may involve multi-document aggregation. Participation of automatic summarization systems based on passage extraction is therefore encouraged. This is an opportunity to test XML/passage retrieval systems on advanced QA tasks.

For both sets of questions, systems have to provide a ranked list of relevant passages. In the case of short answers, systems also have to provide, for each passage, the position of the answer within it. The evaluation of factual questions takes into account the distance between the inferred answer and the real one. For aggregated answers, systems will provide a document of at most 500 words made exclusively of passages from the document collection. These documents are evaluated according to their overlap with relevant passages and their "last point of interest", determined by participant evaluation. The last point of interest marks the place where the aggregated document becomes irrelevant, incomprehensible or redundant.

Relevance assessments will be conducted by participating groups and organizers.

Test collection

As in the ad hoc track, we use the INEX 2009 Wikipedia collection (without images). Each participating group is asked to create a set of candidate questions, representative of a range of real user needs.

QA Tasks

The QA task to be performed by the participating groups of INEX 2009-2010 is answering an academic question using the Wikipedia. The general process involves:

As in the Ad Hoc task, we regard as relevant passages those segments that both

For evaluation purposes, we require that the answer uses ONLY elements or passages previously extracted from the document collection. The correctness of answers is established by participants exclusively on the basis of the support passages and documents. This implies that errors in the Wikipedia version used at INEX could give rise to acceptable answers.

Two kinds of answers are considered:

  1. Short: a single entity (noun phrase, integer, float or date), similar to those used in multiple-choice question (MCQ) academic tests.
  2. Long: the answer is constructed by aggregating several previously retrieved passages.

Participants are required to submit at least one completely automatic run; manual runs are nevertheless strongly encouraged. A run is considered manual if it involves human intervention at any stage of the process. Such interventions should be clearly stated and documented.

Short answers

Motivation for the Task

The underlying scenario is to retrieve from the Wikipedia possible answers to MCQ-type questions. For example, for a question like "What is LINUX?", acceptable short answers could be "a computer operating system", "an operating system kernel", "an asteroid" ... We also consider numeric answers and dates.

The results are presented as a ranked list of answers, each together with an explanation passage or element containing the answer.

What we hope to learn from this task is how advanced passage and XML element retrieval on Wikipedia can be useful to academic QA.

Results to Return

The participants should provide two types of results.
  1. A small ordered set (10) of non-overlapping XML elements or passages that contain a possible answer to the question.
  2. For each element or passage, the position of the answer in the passage.

Focused IR systems are encouraged to participate in this task by simply providing the shortest, most relevant passages they retrieve.

Relevance assessments

Each assessor will have a pool of support passages to analyze. Only the question and the support passage of the answer will be displayed, not the answer.

A maximum of 3 consecutive sentences or a complete table will be displayed. If the assessor finds an answer to the question in the passage, s/he will mark its position. We emphasize that this should be an answer based on the INEX version of Wikipedia and should not rely on the assessor's personal knowledge. Therefore, erroneous answers found in Wikipedia will be accepted if the passage clearly answers the question without any doubt. Conversely, the passage may contain the right answer without explaining it, i.e. the answer appears out of context or requires some extra knowledge to be identified. In this case no answer should be marked.

Systems will be ranked according to:

Long answers

Motivation for the Task

The scenario underlying the complex question task is to write a scholarly definition or synthesis on a topic that does not constitute a specific entry in the English Wikipedia. The answer needs to be built by aggregating relevant XML elements or passages retrieved from different documents. For example, a long answer to "How does LINUX work?" could explain its connections with UNIX, the GNU project, several systems running it, different distributions ...

The aggregated answers will be evaluated according to the way they overlap with relevant passages (number of them, vocabulary and bi-grams included or missing) and the "last point of interest" marked by evaluators. By combining these measures, we expect to take into account both the informative content and the readability of the aggregated answers. What we hope to learn from this task are ideas on how to combine QA, XML element/passage retrieval and automatic summarization by passage extraction to enhance Wikipedia quality, in particular by providing tools to detect redundancies and discrepancies.
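The following is a minimal sketch of the kind of overlap measure described above: it computes vocabulary (unigram) and bi-gram overlap between an aggregated answer and a set of relevant reference passages. The tokenization, the function names and the simple recall-style ratios are illustrative assumptions, not the official evaluation measures.

# Sketch: vocabulary and bi-gram overlap between an aggregated answer and
# reference passages (illustrative only; not the official QA@INEX measure).
import re

def tokens(text):
    # Lowercased alphanumeric tokens.
    return re.findall(r"\w+", text.lower())

def ngrams(toks, n):
    return set(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

def overlap(answer, reference_passages, n):
    answer_ngrams = ngrams(tokens(answer), n)
    reference_ngrams = set()
    for passage in reference_passages:
        reference_ngrams |= ngrams(tokens(passage), n)
    if not reference_ngrams:
        return 0.0
    # Fraction of reference n-grams covered by the aggregated answer.
    return len(answer_ngrams & reference_ngrams) / len(reference_ngrams)

answer = "LINUX is an operating system kernel first developed as a clone of UNIX."
references = ["LINUX is a UNIX-like operating system kernel.",
              "The GNU project provides much of the software distributed with it."]
print("vocabulary overlap:", overlap(answer, references, 1))
print("bi-gram overlap:", overlap(answer, references, 2))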

Results to Return

A short summary of fewer than 500 words, made exclusively of aggregated passages extracted from the Wikipedia corpus.

Automatic summarization systems by extraction are strongly encouraged to participate.
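As a rough illustration of how such a summary can be assembled, the sketch below greedily concatenates retrieved passages, highest score first, until the 500-word budget is reached. The scored passages are placeholders; a real run would take them from an actual retrieval or summarization system.

# Sketch: building a long answer by aggregating retrieved passages under the
# 500-word budget (greedy, highest score first; illustrative only).
def aggregate(scored_passages, max_words=500):
    # scored_passages: list of (score, passage_text) pairs from a retrieval run.
    summary, used = [], 0
    for _, passage in sorted(scored_passages, key=lambda p: p[0], reverse=True):
        length = len(passage.split())
        if used + length > max_words:
            continue  # skip passages that would exceed the word budget
        summary.append(passage)
        used += length
    return " ".join(summary)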

Relevance assessments

Each assessor will have to evaluate a pool of answers of a maximum of 500 words each. These answers will be aggregations of Wikipedia passages.

Evaluators will have to mark:

  1. The "last point of interest", i.e. the first point after which the text becomes out of context because it is irrelevant, incomprehensible or redundant;
  2. all relevant passages in the text, even if they are redundant.

Systems will be ranked according to the:

Resources

Baseline system

A baseline XML-element retrieval system powered by Indri is available online with a standard CGI interface. The index covers all words (no stop list, no stemming) and some XML tags. Participants who do not wish to build their own index can use this one, either by downloading it or by querying it online (more information here, or contact eric.sanjuan@univ-avignon.fr).

Test set of questions

The set of questions is here.

There are three types of questions: short_single, short_multiple and long. There are 195 questions labelled short_single or short_multiple; both require short answers that are passages of a maximum of 50 words (a word being a string of alphanumeric characters without spaces or punctuation), together with an offset indicating the position of the answer. The only difference between the short_single and short_multiple questions is that single type questions should have a single correct answer, whereas multiple type questions admit several answers.

Long type questions require long answers of up to 500 words that should be self-contained summaries made of passages extracted exclusively from the INEX 2009 corpus. We have selected a set of 150 long type questions that require such answers.

Result Submission

Fact sheet

Format for results

We have simplified submission formats, following the INEX ad-hoc task submission format, which is a variant of the familiar TREC format with additional fields:
<qid> Q0 <file> <rank> <rsv> <run_id> <column_7> <column_8> <column_9>

The first six fields follow the standard TREC format. The remaining three columns depend on the question type (short or long) and on the chosen format (text passage or offset).
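For concreteness, the sketch below writes one result line in this layout. The field values are invented examples; only the column order follows the format, and the trailing columns are filled according to the chosen variant described below.

# Sketch: writing one result line in the submission layout above
# (invented example values; trailing columns depend on the chosen variant).
def format_line(qid, file_id, rank, rsv, run_id, extra_columns):
    # extra_columns holds the trailing fields: the passage text or the
    # offset/length pair, plus the answer position for short questions.
    fields = [str(qid), "Q0", str(file_id), str(rank), f"{rsv:.4f}", run_id]
    fields.extend(str(c) for c in extra_columns)
    return " ".join(fields)

print(format_line(1, "3005204", 1, 0.9999, "I10UniXRun1", [256, 230]))
# -> 1 Q0 3005204 1 0.9999 I10UniXRun1 256 230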

Textual content:

Raw text is given without XML tags and without formatting characters (avoid "\n", "\r", "\l"). The resulting word sequence has to appear in the file indicated in the third field. This is an example of such output:
1 Q0 3005204 1 0.9999 I10UniXRun1 The Alfred Noble Prize is an award presented by the combined engineering societies of the United States, given each year to a person not over thirty-five for a paper published in one of the journals of the participating societies. 
1 Q0 3005204 2 0.9998 I10UniXRun1 The prize was established in 1929 in honor of Alfred Noble, Past President of the American Society of Civil Engineers.
1 Q0 3005204 3 0.9997 I10UniXRun1 It has no connection to the Nobel Prize , although the two are often confused due to their similar spellings.

File Offset Length format (FOL)

In this format, passages are given as an offset and a length, calculated in characters with respect to the textual content (ignoring all tags) of the XML file. File offsets start counting at 0 (zero). The previous example would look as follows in FOL format:

1 Q0 3005204 1 0.9999 I10UniXRun1 256 230
1 Q0 3005204 2 0.9998 I10UniXRun1 488 118
1 Q0 3005204 3 0.9997 I10UniXRun1 609 109
The results are from article 3005204. The first passage starts at offset 256 (i.e. at the 257th character of the file's textual content) and has a length of 230 characters.
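One way to obtain such coordinates, assuming each article is available as an XML file, is to strip the tags and locate the passage in the remaining text. The sketch below does this with Python's standard library; the exact whitespace handling of the official text extraction may differ, so treat it as an approximation.

# Sketch: computing FOL (offset, length) coordinates of a passage, counted in
# characters over the tag-stripped textual content of an XML file, from 0.
import xml.etree.ElementTree as ET

def textual_content(xml_path):
    # Concatenate all text nodes of the document, dropping the tags.
    root = ET.parse(xml_path).getroot()
    return "".join(root.itertext())

def fol_coordinates(xml_path, passage):
    text = textual_content(xml_path)
    offset = text.find(passage)  # -1 if the passage does not occur verbatim
    return offset, len(passage)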

Format for short type questions

In the case of short type questions, we use an extra field that indicates the position of the answer in the passage. This position is given by counting the number of words before the detected answer; it is therefore an offset in words rather than characters. Both the text passage and FOL formats can be used.

In the text passage format, the previous example would look like:

1 Q0 3005204 1 0.9999 I10UniXRun1 The Alfred Noble Prize is an award presented by the combined engineering societies of the United States, given each year to a person not over thirty-five for a paper published in one of the journals of the participating societies. 2 
1 Q0 3005204 2 0.9998 I10UniXRun1 The prize was established in 1929 in honor of Alfred Noble, Past President of the American Society of Civil Engineers. 10
1 Q0 3005204 3 0.9997 I10UniXRun1 It has no connection to the Nobel Prize , although the two are often confused due to their similar spellings. 7

Whereas in the FOL format, we will have:

1 Q0 3005204 1 0.9999 I10UniXRun1 256 230 2
1 Q0 3005204 2 0.9998 I10UniXRun1 488 118 10
1 Q0 3005204 3 0.9997 I10UniXRun1 609 109 7
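A simple way to produce this last column is to count the words that precede the answer inside the passage. The sketch below assumes a word is any maximal alphanumeric string; the official tokenization may differ slightly.

# Sketch: computing the answer position, in words, within a passage
# (assumes a word is a maximal alphanumeric string).
import re

def word_offset(passage, answer):
    idx = passage.find(answer)
    if idx < 0:
        return None  # answer string not found verbatim in the passage
    # Number of words strictly before the answer.
    return len(re.findall(r"\w+", passage[:idx]))

passage = ("The Alfred Noble Prize is an award presented by the combined "
           "engineering societies of the United States.")
print(word_offset(passage, "award"))  # 6 words precede "award" here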

Schedule

15/Sep/2010 Submission deadline for candidate questions
15/Oct/2010 Release of final set of questions available here.
13/Nov/2010 Submission deadline for Results (short and long answers)
15/Nov/2010 Release of QA semi-automatic evaluation results by organizers
13-15/Dec/2010 INEX Workshop in Amsterdam
17/Jan/2011 Release of manual evaluation by participants

Organizers

Patrice Bellot
University of Avignon

Veronique Moriceau
LIMSI-CNRS, University Paris-Sud 11

Eric SanJuan
University of Avignon
eric.sanjuan@univ-avignon.fr

Xavier Tannier
LIMSI-CNRS, University Paris-Sud 11