INEX 2009 Efficiency Track

Overview

For 2009, the INEX Efficiency track enters its second round. As in 2008, we intend to provide a common forum for evaluating both the effectiveness and the efficiency of XML ranked retrieval approaches on real data and real queries. As opposed to the purely synthetic XMark or XBench settings still prevalent in this area, the Efficiency track continues the INEX tradition of using a rich pool of manually assessed relevance judgments for measuring retrieval effectiveness. One of the main goals is to attract more groups from the DB community to INEX, so that effectiveness/efficiency tradeoffs in XML ranked retrieval can be studied by a broad audience from both the DB and IR communities. The Efficiency track significantly extends the Ad-Hoc track by systematically investigating different types of queries and retrieval tasks (see below), including a distinct call for particularly difficult topics with high-dimensional structural conditions. Moreover, we expect the new 2009 Wikipedia collection, consisting of 50.7 GB of XML sources with more than 2.6 million documents and more than 1.4 billion elements, to pose major challenges to the scalability of any ranked XML retrieval approach. In summary, our goals for 2009 are:

Document Collection

The Efficiency track will use the INEX 2009 Wikipedia collection, which is newly introduced in INEX 2009. This collection consists of 50.7 GB of XML-ified Wikipedia articles, with more than 2.6 million documents and more than 1.4 billion elements.

Just like in previous years, this Wikipedia-based document collection has a rather irregular path structure, which is particularly challenging for indexing techniques that rely on path summaries.

Unlike the old collections, the new one contains a variety of layout-related and semantic structures, which may give rise to much more intriguing structural query conditions.

There is no DTD available for the INEX Wikipedia 2009 collection. See the collection summary for a comprehensive list of tags.

Topics

The final set of 230 type (A) and (B) topics is available now!

Valid run submissions may contain any subset of these topics, but we encourage people to process the entire batch of 230 topics.

Retrieval Tasks

For 2009, we split our focus onto two main tasks:

Both tasks will use the following three distinct sets of query types (each representing a slightly different retrieval challenge):

Type (A) queries have some full-text predicates such as phrases, mandatory terms (+), and negations (-). Although participants are encouraged to use these full-text hints, they are not mandatory.

Type (B) queries will be derived from automatic query expansions over the CO formulations of the type (A) Ad-Hoc topics and will be evaluated on the basis of the 2009 Ad-Hoc track assessments.

Both type (B) and (C) queries will require IR-style, non-conjunctive (a.k.a. "andish") query evaluations that can either preselect the most significant query conditions or dynamically relax both the structure-related and content-related conditions at query processing time in order to achieve good recall values. All queries will be provided in the NEXI query language (in both their CO and CAS formulations) and, exclusively for the Efficiency track, in their corresponding XPath 2.0 Full-Text syntax.
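To illustrate what a non-conjunctive ("andish") evaluation means in practice, here is a minimal sketch. It assumes a hypothetical in-memory inverted index (`postings`) with made-up element identifiers and scores; it is not how TopX or any participating system actually works. An element is ranked by the sum of the scores of the query terms it matches, so elements that miss some terms can still be returned.

```python
from collections import defaultdict

def andish_rank(query_terms, postings, k=15):
    """Rank elements by the summed scores of the query terms they match.

    postings: hypothetical inverted index, term -> {element_id: score}.
    An element does not have to match every term (non-conjunctive).
    """
    totals = defaultdict(float)
    for term in query_terms:
        for element_id, score in postings.get(term, {}).items():
            totals[element_id] += score
    # Best score first; ties are broken by element id for determinism.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]

# Made-up postings for illustration only.
postings = {
    "planck": {"e1": 1.0, "e2": 0.5},
    "entropy": {"e1": 0.5, "e3": 0.75},
    "nobel": {"e2": 0.25},
}
top = andish_rank(["planck", "entropy", "nobel"], postings, k=2)
```

Note that none of the three elements matches all three terms, yet all of them receive a score; a strictly conjunctive evaluation would return nothing here.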

New Topics

A distinct call for structure-enhanced (type C) topics has been issued, aiming to collect a number of particularly difficult topics with higher-dimensional structural conditions (see below for an example).

    <topic id="2009eff001">
    <title>
      "Max Planck" category "Quantum Physicist" "Nobel Laureates" 
      publications "Entropy and Temperature of Radiant Heat" 
      doctoral students
    </title>
    <castitle>
      //article//person//scientist[
        about(.//header//categories//category, "Quantum Physicist") 
        and about(.//header//categories//category, "Nobel Laureates") 
        and about(.//header//title, "Max Planck") 
        and about(.//sec, Publications "Entropy and Temperature of Radiant Heat")
      ]//doctoral_students//person//scientist
    </castitle>
    <description>
      I am looking for doctoral students of the quantum physicist
      and Nobel laureate Max Planck. Other persons or elements not mentioning
      doctoral students of Max Planck are not relevant.
    </description>
    <narrative>
      I am looking for doctoral students of the quantum physicist and Nobel laureate 
      Max Planck, who themselves became famous scientists.
      Articles with "Max Planck" in the title, "quantum physicist" and "Nobel laureate" 
      as category and a publication about "Entropy and Temperature of Radiant Heat" 
      are expected to be the best sources for this information.
    </narrative>
    </topic>
    

Because of their high-dimensional nature, type (C) topics may be quite specific, i.e., they may aim at just a few target elements. Thus, a good idea for topic development is to start with a CO query, have a look at the XML structure of one or more good result pages, and then specialize the CO query into a CAS query with more structural components. Please feel free to explore the new collection and try out new topics in our TopX 2.0 interface (using the NEXI syntax).

Please note that NEXI imposes more restrictions than XPath on the formulation of branching path queries and the nesting of predicates. Branching path queries are therefore best formulated using multiple 'about' operators per predicate, each addressing a different branch of the document, connected by 'and' (see the above example query).

Since NEXI is a subset of XPath Full-Text, please also validate your NEXI syntax before submitting your candidate topic, using the online NEXI parser available at: http://www.inex.otago.ac.nz/tracks/adhoc/nexiparser.asp

We will then aim to provide a corresponding translation of all NEXI queries into XPath 2.0 Full-Text for the actual task, such that systems using either language can run the final queries.

All Efficiency track submissions will be evaluated in VCAS mode (see http://www.inex.otago.ac.nz/tracks/adhoc/nexi/nexi.pdf), i.e., path conditions (as well as content conditions) will be evaluated in a vague interpretation. This means that a result element may still be considered relevant even if one or more tags and/or keywords of the query are missing. Thus, as in any INEX evaluation, it will be up to the peer assessors to decide on the relevance of individual result elements.

Please do not use link elements or very small layout elements such as 'collectionlink' or 'emph' elements as target elements of the query.

Assessments

Each participating group will have to evaluate a few of the newly developed type (C) topics. Experience from the Ad-Hoc track shows that relevance assessments take one person about one day per topic.
(This step was skipped due to lack of topic submissions!)

Assessments for type (A) and (B) topics will be reused from the current 2009 Ad-Hoc evaluation.

Reference Engine

We are currently working on making a light-weight version of TopX 2.0 available as a reference engine for both Windows and Unix-like environments, which should also be compilable with minimum effort under different environments. TopX currently does not support distributed search and is intended for use as a baseline on a single-node retrieval system.

We will try to make the TopX download available before the release of the run evaluations.

Run Submissions

The run submission interface will open shortly after the Efficiency topic development ends and will remain open from September 7 to September 14. We thus intentionally provide a narrower time frame for run submissions than the Ad-Hoc track.

Just like in 2008, the Efficiency track particularly encourages the use of top-k style query engines. The result submission format includes options for marking runs as top-15, top-150, and top-1500, using either a Focused non-overlapping element retrieval, a Thorough (incl. overlap) element retrieval, or an entire Article retrieval mode.
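The difference between a Thorough and a Focused result list can be sketched as follows. This is only an illustration under stated assumptions: the file identifiers and element paths are made up, and the function `focused_top_k` is a hypothetical helper, not part of any INEX tool. It walks a ranked list best-first, drops every element that overlaps (is an ancestor or descendant of, or identical to) an already kept element in the same file, and stops after k results.

```python
def is_ancestor_or_self(path_a, path_b):
    """True if path_a equals path_b or is an ancestor of it."""
    # Appending "/" avoids treating sec[2] as an ancestor of sec[20].
    return path_b == path_a or path_b.startswith(path_a + "/")

def focused_top_k(ranked, k):
    """Keep the k best results such that no two kept elements overlap.

    ranked: list of (file, path) pairs, best result first.
    """
    kept = []
    for file, path in ranked:
        overlaps = any(
            f == file
            and (is_ancestor_or_self(p, path) or is_ancestor_or_self(path, p))
            for f, p in kept
        )
        if not overlaps:
            kept.append((file, path))
        if len(kept) == k:
            break
    return kept

# Made-up ranked list (a Thorough-style list that contains overlap).
ranked = [
    ("9996", "/article[1]/bdy[1]/sec[2]"),
    ("9996", "/article[1]/bdy[1]/sec[2]/p[1]"),  # descendant of the first result
    ("9996", "/article[1]/bdy[1]"),              # ancestor of the first result
    ("1234", "/article[1]/bdy[1]/sec[1]"),
]
focused = focused_top_k(ranked, k=15)
```

Here the second and third entries are dropped because they overlap with the best-ranked element of the same file, leaving two non-overlapping results.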

Automatic runs may use any of the title fields, including the NEXI CO, CAS, or XPath titles, and even keywords from the narrative or description fields.

At least one automatic and sequential run with topics being processed one-by-one is mandatory.
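A sequential run with per-topic timings could be driven by a loop like the following. This is a rough sketch only: `evaluate` stands for whatever function a participant's engine exposes to answer one topic (a hypothetical name), and the choice of `time.perf_counter` for wall-clock time and `time.process_time` for CPU time is an assumption, not a measurement method prescribed by the track.

```python
import time

def run_sequentially(topic_ids, evaluate):
    """Process topics one by one and record per-topic timings in milliseconds."""
    timings = {}
    for topic_id in topic_ids:
        wall_start = time.perf_counter()   # wall-clock timer
        cpu_start = time.process_time()    # CPU time of this process
        results = evaluate(topic_id)
        cpu_ms = (time.process_time() - cpu_start) * 1000.0
        total_ms = (time.perf_counter() - wall_start) * 1000.0
        timings[topic_id] = {
            "results": results,
            "total_time_ms": total_ms,
            "cpu_time_ms": cpu_ms,
        }
    return timings

def fake_evaluate(topic_id):
    """Stand-in for a real engine; returns a made-up ranked list."""
    return [("9996", "/article[1]")]

timings = run_sequentially(["2009eff001"], fake_evaluate)
```

The recorded values correspond to the `total_time_ms` and `cpu_time_ms` attributes of the submission format.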

Furthermore, participants are invited to submit as many runs in different retrieval modes as possible. Both the Ad-Hoc and Budget-constrained tasks permit (and encourage) all of the above combinations of retrieval modes.

Submission format

For relevance assessments and the evaluation of the results, we require submission files to be in the XML format described in this section (similar to our 2008 setting). The submission format for all retrieval modes is defined in the following DTD:
<!ELEMENT efficiency-submission (topic-fields, 
                           general_description, 
                           ranking_description, 
                           indexing_description, 
                           caching_description, 
                           topic+)> 
<!ATTLIST efficiency-submission 
  participant-id CDATA #REQUIRED 
  run-id         CDATA #REQUIRED
  task           (adhoc | budget10 | budget100 | budget1000 | budget10000) #REQUIRED
  type           (focused | thorough | article) #REQUIRED
  query          (automatic | manual) #REQUIRED
  sequential     (yes|no) #REQUIRED
  no_cpu         CDATA #IMPLIED
  ram            CDATA #IMPLIED
  no_nodes       CDATA #IMPLIED
  hardware_cost  CDATA #IMPLIED
  hardware_year  CDATA #IMPLIED
  topk           (15 | 150 | 1500) #IMPLIED
  index_size_bytes CDATA #IMPLIED
  indexing_time_sec CDATA #IMPLIED
  baseline_time_ms CDATA #IMPLIED
>
<!ELEMENT topic-fields EMPTY>
<!ATTLIST topic-fields
  co_title       (yes|no) #REQUIRED
  cas_title      (yes|no) #REQUIRED
  xpath_title    (yes|no) #REQUIRED
  text_predicates (yes|no) #REQUIRED
  description    (yes|no) #REQUIRED
  narrative      (yes|no) #REQUIRED
>
<!ELEMENT general_description  (#PCDATA)>
<!ELEMENT ranking_description  (#PCDATA)>
<!ELEMENT indexing_description (#PCDATA)>
<!ELEMENT caching_description  (#PCDATA)>
<!ELEMENT topic (result*)> 
<!ATTLIST topic 
  topic-id       CDATA #REQUIRED
  total_time_ms  CDATA #REQUIRED
  cpu_time_ms    CDATA #IMPLIED
  io_time_ms     CDATA #IMPLIED
  io_bytes       CDATA #IMPLIED
>
<!ELEMENT result (file, path, rank, rsv?)> 
<!ELEMENT file   (#PCDATA)>
<!ELEMENT path   (#PCDATA)>
<!ELEMENT rank   (#PCDATA)>
<!ELEMENT rsv    (#PCDATA)>
    

Explanation of DTD fields:

Providing CPU and I/O times (and I/O bytes) is optional for each topic. Also, it is sufficient to provide a ranked list of matching elements along with their XPath identifiers. Ranks are mandatory but score values are optional. Note that in article retrieval mode, the element path of every result should be /article[1].

Run submissions not matching this DTD will be automatically discarded.
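As an illustration of the submission format, a run file could be assembled with Python's standard `xml.etree.ElementTree` module roughly as follows. This is a minimal sketch: the participant id, run id, timings, and result paths are made-up placeholders, `build_submission` is a hypothetical helper, and the code does not validate its output against the DTD, so a separate validation step is still advisable.

```python
import xml.etree.ElementTree as ET

DESCRIPTIONS = ("general_description", "ranking_description",
                "indexing_description", "caching_description")

def build_submission(run_attrs, field_flags, descriptions, topics):
    """Assemble a submission tree in the element order required by the DTD."""
    root = ET.Element("efficiency-submission", run_attrs)
    ET.SubElement(root, "topic-fields", field_flags)
    for name in DESCRIPTIONS:
        ET.SubElement(root, name).text = descriptions.get(name, "")
    for topic in topics:
        topic_el = ET.SubElement(root, "topic", {
            "topic-id": topic["id"],
            "total_time_ms": str(topic["total_time_ms"]),
        })
        # Ranks are mandatory; the optional rsv element is omitted here.
        for rank, (file, path) in enumerate(topic["results"], start=1):
            result_el = ET.SubElement(topic_el, "result")
            ET.SubElement(result_el, "file").text = file
            ET.SubElement(result_el, "path").text = path
            ET.SubElement(result_el, "rank").text = str(rank)
    return root

# All values below are placeholders for illustration.
root = build_submission(
    run_attrs={"participant-id": "42", "run-id": "example-run",
               "task": "adhoc", "type": "article", "query": "automatic",
               "sequential": "yes", "topk": "15"},
    field_flags={"co_title": "yes", "cas_title": "no", "xpath_title": "no",
                 "text_predicates": "no", "description": "no",
                 "narrative": "no"},
    descriptions={"general_description": "Example run."},
    topics=[{"id": "2009eff001", "total_time_ms": 12.5,
             "results": [("9996", "/article[1]")]}],
)
xml_text = ET.tostring(root, encoding="unicode")
```

In article retrieval mode, as in this example, every result path is /article[1].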

Metrics

Schedule

08/Jul/2009  Release of topic creation guidelines for type (C) queries
28/Aug/2009  Submission deadline for the new, type (C), candidate topics, formulated as NEXI or XPath 2.0 Full-Text
07/Sep/2009  Release of final set of type (A), (B), and (C) queries in both the NEXI & XPath syntax
14/Sep/2009  Submission deadline for all Efficiency track runs (one file per run for the Ad-Hoc and Budget tasks, and types A, B & C)
28/Sep/2009  Beginning of relevance assessments for type (C) queries
26/Oct/2009  Submission deadline for relevance assessments for type (C) queries (skipped due to lack of topic submissions!)
02/Nov/2009  Release of all Efficiency track evaluation results (Ad-Hoc and Budget tasks, types A, B & C)
23/Nov/2009  Submission deadline for papers for the pre-proceedings (all tracks)
30/Nov/2009  Release of the workshop pre-proceedings (all tracks)
06-10/Dec/2009  INEX Workshop in Brisbane, Australia

Organizers

Martin Theobald
Max Planck Institute for Informatics
martin.theobald@mpi-inf.mpg.de

Ralf Schenkel
MMCI, Saarland University
schenkel@mpi-inf.mpg.de