INEX 2010 Data-Centric Track


Current approaches to keyword search on XML data fall into two broad classes: one for document-centric XML, where the structure is simple and long text fields predominate; the other for data-centric XML, where the structure is very rich and carries important information about objects and their relationships. In previous years, INEX focused on comparing retrieval approaches for document-centric XML, while most research on data-centric XML retrieval could not make use of such a standard evaluation methodology. This new track aims to provide a common forum for researchers and users to compare retrieval techniques on data-centric XML, and thus to promote research in this field.


The track uses a newly built IMDB data collection. It consists of information about more than 1,590,000 movies and people involved in movies, e.g. actors/actresses, directors, producers and so on. Each object is richly structured. For example, each movie has a title, rating, directors, actors, plot, keywords, genres, release dates, trivia, etc., and each person has a name, birth date, biography, filmography, and so on.

2010 IMDB Collection (1.4GB)
Information courtesy of The Internet Movie Database. Used with permission.
This collection was converted into XML from the IMDB plain text files available on the web site, dated 2010-4-23.
It is available for personal and non-commercial use. See the IMDb Licence.


Each participating group will be asked to create a set of candidate topics, representative of a range of real user needs. Both Content Only (CO) and Content And Structure (CAS) variants of the information need are requested. Additionally, a set of real user queries will be randomly selected from a search engine's query log to supplement the candidate topics in case there are not enough of them or they turn out to be heavily biased.

2010 Data-Centric Track Topics now available for download.


In its first year, the track focuses on ad hoc retrieval from XML data. Each XML document is typically modeled as a rooted, node-labeled tree. An answer to a keyword query is defined as a set of "closely related" nodes that are "collectively relevant" to the query. Each result can therefore be specified as a collection of subtrees from one or more XML documents that are related and collectively cover the relevant information. The task is to return a ranked list of results (collections of subtrees) estimated to be relevant to the user's information need. The content of the collections of subtrees may not overlap. This is similar to the focused task in the ad hoc track, but it uses a data-centric XML collection and allows a result (i.e. a collection of subtrees) to be constructed from different parts of a single document or even from multiple documents.
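
The no-overlap requirement can be illustrated with a small check. This is only a sketch under stated assumptions: each subtree is represented as a hypothetical (document id, root path) pair, the root path is a tuple of steps from the document root, and two subtrees are taken to overlap when they lie in the same document and one root is the other root or one of its ancestors. The function names are illustrative and not part of the track.

```python
def subtrees_overlap(a, b):
    """Return True if two subtrees share content.

    Each subtree is a (document id, root path) pair, where the root path is a
    tuple of steps such as ("article[1]", "bdy[1]", "sec[2]"). This
    representation is an assumption made for the sketch.
    """
    doc_a, path_a = a
    doc_b, path_b = b
    if doc_a != doc_b:
        return False
    # Compare whole steps, not characters, so sec[1] is not confused with sec[10].
    shorter = min(len(path_a), len(path_b))
    return path_a[:shorter] == path_b[:shorter]


def has_overlap(subtrees):
    """Check every pair of subtrees in a list, e.g. all subtrees of one topic."""
    return any(
        subtrees_overlap(a, b)
        for i, a in enumerate(subtrees)
        for b in subtrees[i + 1:]
    )
```

Comparing step by step rather than by string prefix matters here, because a plain string comparison would wrongly treat sibling elements with similar indexes as nested.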

Participants may submit up to 10 runs. Each run can contain a maximum of 1000 results per topic, ordered by decreasing value of relevance. All runs may use any fields of the topics, but only runs using either the <title>, or <castitle>, or a combination of them will be regarded as truly automatic. The results of one run must be contained in one submission file (i.e. up to 10 files can be submitted in total).
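
A run can be checked against the per-topic limits before submission. The sketch below assumes a run has already been reduced to (topic id, score) pairs, one per result and in submission order; that input shape and the function name `check_run` are hypothetical. Tied scores are accepted here, which is an assumption about how "decreasing" is meant.

```python
from collections import defaultdict

MAX_RESULTS_PER_TOPIC = 1000  # limit stated in the guidelines above


def check_run(results):
    """Return a list of problems found in one run (empty if none).

    `results` is an iterable of (topic id, score) pairs in submission order,
    one pair per result. This input shape is an assumption of the sketch.
    """
    by_topic = defaultdict(list)
    for topic, score in results:
        by_topic[topic].append(score)

    problems = []
    for topic, scores in by_topic.items():
        if len(scores) > MAX_RESULTS_PER_TOPIC:
            problems.append(f"topic {topic}: more than {MAX_RESULTS_PER_TOPIC} results")
        # Each score must be at least as large as the one that follows it.
        if any(earlier < later for earlier, later in zip(scores, scores[1:])):
            problems.append(f"topic {topic}: scores are not in decreasing order")
    return problems
```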

Submission Format

For relevance assessments and the evaluation of the results, we require submission files to be in the format described in this section. The submission format is a variant of the familiar TREC format. The submission system will have a form requesting information about the runs.

A run may contain a maximum of 1000 results for each topic. A result can be one or more subtrees from a single XML document or from multiple XML documents. A subtree is specified by its root node, which is uniquely identified by its element path in the XML document tree. The standard TREC format is extended with one additional field for specifying each result subtree:

<qid> Q0 <file> <rank> <rsv> <run_id> <column_7>

Here <column_7> holds the element path of the subtree's root node, which follows this grammar:

Path          ::= '/' ElementNode Path | '/' ElementNode | '/' AttributeNode
ElementNode   ::= ElementName Index
AttributeNode ::= '@' AttributeName
Index         ::= '[' integer ']'

For example, the path /article[1]/body[1]/section[1]/p[1] identifies the element found by starting at the document root, selecting the first "article" element, then within that the first "body" element, within that the first "section" element, and finally within that the first "p" element. Important: XPath counts elements starting from 1 and takes the element type into account, e.g. if a section had a title and two paragraphs then their paths would be given as: title[1], p[1] and p[2].
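
The grammar can be checked mechanically. The following is a minimal sketch, not an official validator: the function name `parse_path` is hypothetical, the set of characters allowed in element and attribute names is an assumption (the grammar leaves ElementName and AttributeName undefined), and indexes are required to be 1 or greater because counting starts from 1.

```python
import re

# One element step: a name followed by a 1-based index, e.g. "sec[2]".
# The allowed name characters are an assumption; the grammar does not define them.
_ELEMENT = re.compile(r"([A-Za-z_][\w.-]*)\[([1-9][0-9]*)\]")
_ATTRIBUTE = re.compile(r"@[A-Za-z_][\w.-]*")


def parse_path(path):
    """Split a result path into steps, raising ValueError if it is malformed.

    Element steps become (name, index) tuples. An attribute step, which the
    grammar only allows as the last step, becomes ("@name", None).
    """
    if not path.startswith("/"):
        raise ValueError(f"path must start with '/': {path!r}")
    parts = path[1:].split("/")
    steps = []
    for position, part in enumerate(parts):
        element = _ELEMENT.fullmatch(part)
        if element:
            steps.append((element.group(1), int(element.group(2))))
        elif _ATTRIBUTE.fullmatch(part) and position == len(parts) - 1:
            steps.append((part, None))
        else:
            raise ValueError(f"bad step {part!r} in {path!r}")
    return steps
```

For instance, `parse_path("/article[1]/bdy[1]/sec[2]/p[1]")` yields four element steps, while a path with a missing index such as `/article/bdy[1]` is rejected.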

An example submission is:
1 Q0 9996 1 0.9999 I09UniXRun1 /article[1]/bdy[1]/sec[1]
1 Q0 9996 1 0.9999 I09UniXRun1 /article[1]/bdy[1]/sec[2]/p[1]
1 Q0 9888 1 0.9999 I09UniXRun1 /article[1]/bdy[1]/sec[3]
1 Q0 9997 2 0.9998 I09UniXRun1 /article[1]/bdy[1]/sec[2]
1 Q0 9989 3 0.9997 I09UniXRun1 /article[1]/bdy[1]/sec[3]/p[1]

This example contains three results; rows that share the same rank belong to the same result. The first result contains the first section and the first paragraph of the second section from 9996.xml, and the third section from 9888.xml. The second result consists only of the second section in 9997.xml, and the third result consists of the first paragraph of the third section from 9989.xml.
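
Reading a submission file back into results can be sketched as follows. The function name `group_results` is hypothetical, and grouping rows by topic id and rank is an interpretation of the example above rather than a rule spelled out in the format description.

```python
from collections import defaultdict


def group_results(lines):
    """Group submission rows into results keyed by (topic id, rank).

    Each result is a list of (file, path) subtrees in the order they appear.
    Treating rows with the same rank as one result is an assumption drawn
    from the example submission.
    """
    results = defaultdict(list)
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 7 or fields[1] != "Q0":
            raise ValueError(f"malformed row: {line!r}")
        qid, _, file_id, rank, _rsv, _run_id, path = fields
        results[(qid, int(rank))].append((file_id, path))
    return dict(results)


example = """\
1 Q0 9996 1 0.9999 I09UniXRun1 /article[1]/bdy[1]/sec[1]
1 Q0 9996 1 0.9999 I09UniXRun1 /article[1]/bdy[1]/sec[2]/p[1]
1 Q0 9888 1 0.9999 I09UniXRun1 /article[1]/bdy[1]/sec[3]
1 Q0 9997 2 0.9998 I09UniXRun1 /article[1]/bdy[1]/sec[2]
1 Q0 9989 3 0.9997 I09UniXRun1 /article[1]/bdy[1]/sec[3]/p[1]
""".splitlines()

# Print each result with the subtrees it contains.
for (qid, rank), subtrees in sorted(group_results(example).items()):
    print(qid, rank, subtrees)
```

Applied to the example submission, this groups the five rows into three results for topic 1, with three, one and one subtrees respectively.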

Submission Procedure

An online submission tool will be provided closer to the submission deadline.

Relevance Assessments

Relevance assessment will be conducted by participating groups using the INEX assessment system.


The effectiveness of the retrieval results submitted by the participants will be evaluated based on their overlap with the parts of the data collection judged relevant by the assessors. The track is likely to use the same metrics as the Ad Hoc track.


May 25: IMDB data collection ready and available for download
July 5: Topic submission deadline
July 10: Topics and Result submission specification distributed
Sep 10: Run submission deadline
Sep 20-Oct 20: Relevance assessment
Nov 5: Release of assessments and results


Qiuyue Wang
Renmin University of China

Andrew Trotman
University of Otago