INEX 2010 Data-Centric Track
Current approaches to keyword search on XML data fall into two broad classes: those for document-centric XML, where the structure is simple and long text fields predominate; and those for data-centric XML, where the structure is very rich and carries important information about objects and their relationships. In previous years, INEX has focused on comparing retrieval approaches for document-centric XML, while most research on data-centric XML retrieval could not make use of such a standard evaluation methodology. This new track aims to provide a common forum for researchers and users to compare different retrieval techniques on data-centric XML, thus promoting research in this field.
The track uses the IMDB data collection newly built from www.imdb.com. It consists of information about more than 1,590,000 movies and people involved in movies, e.g. actors/actresses, directors, producers and so on. Each object is richly structured. For example, each movie has title, rating, directors, actors, plot, keywords, genres, release dates, trivia, etc.; and each person has name, birth date, biography, filmography, and so on.
2010 IMDB Collection (1.4GB)
Information courtesy of The Internet Movie Database (http://www.imdb.com). Used with permission.
This collection is built from the IMDB plain text files available from the web site, dated 2010-04-23, and converted into XML.
It is available for personal and non-commercial use. See the IMDb Licence.
Each participating group will be asked to create a set of candidate topics, representative of a range of real user needs. Both Content Only (CO) and Content And Structure (CAS) variants of the information need are requested. Additionally, a set of real user queries will be randomly selected from a search engine's query log to supplement the candidate topics, in case there are not enough of them or they turn out to be heavily biased.
2010 Data-Centric Track Topics now available for download.
In its first year, the track focuses on ad hoc retrieval from XML data. Each XML document is typically modeled as a rooted, node-labeled tree. An answer to a keyword query is defined as a set of "closely related" nodes that are "collectively relevant" to the query. Each result can thus be specified as a collection of subtrees from one or more XML documents that are related and collectively cover the relevant information. The task is to return a ranked list of results (collections of subtrees) estimated relevant to the user's information need. The content of the returned subtree collections must not overlap. This is similar to the focused task in the Ad Hoc track, but uses a data-centric XML collection and allows a result (i.e. a collection of subtrees) to be constructed from different parts of a single document or even from multiple documents.
Participants may submit up to 10 runs. Each run can contain a maximum of 1000 results per topic, ordered by decreasing value of relevance. All runs may use any fields of the topics, but only runs using either the <title>, or <castitle>, or a combination of them will be regarded as truly automatic. The results of one run must be contained in one submission file (i.e. up to 10 files can be submitted in total).
For relevance assessments and the evaluation of the results we require submission files to be in the format described in this section. The submission format is a variant of the familiar TREC format. The submission system will have a form requesting information about the runs:
A run may contain a maximum of 1000 results for each topic. A result can be one or more subtrees from a single or multiple XML documents. A subtree can be specified with its root node, which is uniquely identified by its element path in the XML document tree. The standard TREC format is extended with one additional field for specifying each result subtree:
<qid> Q0 <file> <rank> <rsv> <run_id> <column_7>
- The participant ID of the submitting institute (available on the INEX web-site).
- An indication of whether the query was constructed automatically or manually from the topic.
- The used topic fields.
- Finally, each submitted run must contain a description of the retrieval approach applied to generate the search results.
- the first column is the topic number.
- the second column is the query number within that topic. This is currently unused and should always be Q0.
- the third column is the file name (without .xml) from which a result subtree is retrieved.
- the fourth column is the rank of the result. Note that a result may consist of one or more subtrees, so there can be multiple rows with the same rank if these subtrees belong to the same result.
- the fifth column shows the score (integer or floating point) that generated the ranking. This score MUST be in descending (non-increasing) order and is important to include so that we can handle tied scores (for a given run) in a uniform fashion (the evaluation routines rank documents from these scores, not from your ranks). If you want the precise ranking you submit to be evaluated, the SCORES must reflect that ranking.
- the sixth column is called the "run tag" and should be a unique identifier for your group AND for the method used. That is, each run should have a different tag that identifies the group and the method that produced the run. Please change the tag from year to year, since often we compare across years (for graphs and such) and having the same name show up for both years is confusing. Also run tags must contain 12 or fewer letters and numbers, with *NO* punctuation, to facilitate labeling graphs with the tags.
- the seventh column gives the element path of the root node of a result subtree. Element paths are given in XPath syntax. To be more precise, only fully specified paths are allowed, as described by the following grammar:
Path ::= '/' ElementNode Path | '/' ElementNode | '/' AttributeNode
ElementNode ::= ElementName Index
AttributeNode ::= '@' AttributeName
Index ::= '[' integer ']'
For example, the path /article[1]/bdy[1]/sec[1]/p[1] identifies the element which can be found if we start at the document root, select the first "article" element, then within that select the first "bdy" (body) element, within which we select the first "sec" (section) element, and finally within that element we select the first "p" (paragraph) element. Important: XPath counts elements starting with 1 and takes into account the element type, e.g. if a section had a title and two paragraphs then their paths would be given as: title[1], p[1] and p[2].
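Such fully specified paths can be generated with a short traversal that keeps a per-tag counter at each level. A minimal sketch using Python's standard xml.etree.ElementTree (the element names in the sample document are illustrative, not taken from the IMDB DTD):

```python
import xml.etree.ElementTree as ET

def indexed_paths(elem, prefix=""):
    """Yield the fully specified path (1-based, counted per element
    type) of every element below `elem`, in document order."""
    counts = {}  # occurrences of each child tag seen so far
    for child in elem:
        counts[child.tag] = counts.get(child.tag, 0) + 1
        path = f"{prefix}/{child.tag}[{counts[child.tag]}]"
        yield path
        yield from indexed_paths(child, path)

doc = ET.fromstring(
    "<article><bdy><sec><title>t</title><p>a</p><p>b</p></sec></bdy></article>")
for p in indexed_paths(doc, "/article[1]"):
    print(p)
# Prints five paths, ending with /article[1]/bdy[1]/sec[1]/p[2]
```

Note that the title and the two paragraphs get title[1], p[1] and p[2]: the index counts siblings of the same element type, not all siblings.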
An example submission is:
1 Q0 9996 1 0.9999 I09UniXRun1 /article[1]/bdy[1]/sec[1]
1 Q0 9996 1 0.9999 I09UniXRun1 /article[1]/bdy[1]/sec[2]/p[1]
1 Q0 9888 1 0.9999 I09UniXRun1 /article[1]/bdy[1]/sec[3]
1 Q0 9997 2 0.9998 I09UniXRun1 /article[1]/bdy[1]/sec[2]
1 Q0 9989 3 0.9997 I09UniXRun1 /article[1]/bdy[1]/sec[3]/p[1]
Here are three results. The first result contains the first section and the first paragraph of the second section from 9996.xml, together with the third section from 9888.xml. The second result consists only of the second section of 9997.xml, and the third result consists of the first paragraph of the third section of 9989.xml.
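Grouping such rows into ranked results can be sketched as follows. This is an illustration of the format rules above (rows sharing a topic and rank form one result; scores must be non-increasing within a topic), not the official evaluation tooling:

```python
def group_results(lines):
    """Group submission rows into results keyed by (topic, rank).

    Each row has the form: <qid> Q0 <file> <rank> <rsv> <run_id> <path>.
    Rows sharing a topic and rank form one result: a collection of
    subtrees, possibly drawn from several files.
    """
    results = {}
    last_score = {}  # per-topic check that scores never increase
    for line in lines:
        qid, q0, fname, rank, rsv, run_id, path = line.split()
        rsv = float(rsv)
        assert q0 == "Q0", "second column must always be Q0"
        assert rsv <= last_score.get(qid, float("inf")), \
            "scores must be non-increasing within a topic"
        last_score[qid] = rsv
        results.setdefault((qid, int(rank)), []).append((fname, path))
    return results

run = [
    "1 Q0 9996 1 0.9999 I09UniXRun1 /article[1]/bdy[1]/sec[1]",
    "1 Q0 9888 1 0.9999 I09UniXRun1 /article[1]/bdy[1]/sec[3]",
    "1 Q0 9997 2 0.9998 I09UniXRun1 /article[1]/bdy[1]/sec[2]",
]
grouped = group_results(run)
# the result ranked 1 for topic 1 spans two files: 9996.xml and 9888.xml
```

Because the evaluation routines rank results by score rather than by the rank column, a checker like this one can also catch runs whose rank and score columns disagree before submission.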
An online submission tool will be provided closer to the submission deadline.
Relevance assessment will be conducted by participating groups using the INEX assessment system.
The evaluation of the effectiveness of the retrieval results submitted by the participants will be based on their overlap with the parts of the data collection judged relevant by the assessors. The same metrics as in the Ad Hoc track are likely to be used.
May 25: IMDB data collection ready and available for download
July 5: Topic submission deadline
July 10: Topics and result submission specification distributed
Sep 10: Run submission deadline
Sep 20 - Oct 20: Relevance assessment
Nov 5: Release of assessments and results
Renmin University of China
University of Otago