For 2009, the INEX Efficiency track continues into its second round. Just like in 2008, we intend to provide a common forum for the evaluation of both the effectiveness and efficiency of XML ranked retrieval approaches on real data and real queries. As opposed to the purely synthetic XMark or XBench settings still prevalent in this area, the Efficiency track continues the INEX tradition using a rich pool of manually assessed relevance judgments for measuring retrieval effectiveness. One of the main goals is to attract more groups from the DB community to INEX, thus being able to study effectiveness/efficiency tradeoffs in XML ranked retrieval for a broad audience from both the DB and IR communities. The Efficiency track significantly extends the Ad-Hoc track by systematically investigating different types of queries and retrieval tasks (see below), including a distinct call for specifically difficult topics with high-dimensional structural conditions. Moreover, we expect the new 2009 Wikipedia collection consisting of 50.7 GB XML sources with more than 2.6 million documents and more than 1.4 billion elements to pose major challenges to the scalability of any ranked XML retrieval approach. In summary, our goals for 2009 are:
The Efficiency track will use the INEX 2009 Wikipedia collection, which was newly introduced at INEX 2009. This collection consists of 50.7 GB of XML-ified Wikipedia articles, with more than 2.6 million documents and more than 1.4 billion elements.
Just like in previous years, this Wikipedia-based document collection has a rather irregular path structure, which is particularly challenging for indexing techniques relying on path summaries.
Unlike the old collections, the new one has a variety of layout-related and semantic structures, which may give rise to much more intriguing structural query conditions.
There is no DTD available for the INEX Wikipedia 2009 collection. See the collection summary for a comprehensive list of tags.
The final set of 230 type (A) and (B) topics is available now!
Valid run submissions may contain any subset of these topics, but we encourage people to process the entire batch of 230 topics.
For 2009, we split our focus onto two main tasks:
Both tasks will use the following three distinct sets of query types (each representing a slightly different retrieval challenge):
Type (A) queries have some full-text predicates such as phrases, mandatory terms (+), and negations (-). Although participants are encouraged to use these full-text hints, they are not mandatory.
Type (B) queries will be derived from automatic query expansions over the CO formulations of the type (A) Ad-Hoc topics and will be evaluated on the basis of the 2009 Ad-Hoc track assessments.
Both type (B) and (C) queries will require IR-style, non-conjunctive (a.k.a. "andish") query evaluations that can either preselect the most significant query conditions or dynamically relax both the structure-related and content-related conditions at query processing time in order to achieve good recall values. All queries will be provided in the NEXI query language (in both their CO and CAS formulations) and, exclusively for the Efficiency track, in their corresponding XPath 2.0 Full-Text syntax.
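To illustrate the difference between conjunctive and "andish" evaluation, here is a minimal Python sketch. It is not the evaluation method prescribed by the track; the function name `andish_rank`, the condition labels, and the additive scoring are assumptions chosen purely for illustration. An element is kept as soon as it matches at least one condition, and its score is the sum of the scores of the conditions it does match.

```python
def andish_rank(candidates, conditions):
    """Rank elements non-conjunctively.

    candidates: dict mapping element id -> dict of {condition: score};
                a condition missing from the inner dict was not matched.
    conditions: list of all query conditions (content or structure).
    """
    ranked = []
    for elem, scores in candidates.items():
        # Unmatched conditions contribute 0 instead of disqualifying the element.
        total = sum(scores.get(c, 0.0) for c in conditions)
        if total > 0:
            ranked.append((elem, total))
    # Highest score first; ties broken by element id for determinism.
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked


# Hypothetical per-condition scores for three candidate elements.
conditions = ["planck", "entropy", "tag:sec"]
candidates = {
    "e1": {"planck": 0.5, "entropy": 0.25, "tag:sec": 0.25},  # matches everything
    "e2": {"planck": 0.75},                                   # matches one keyword only
    "e3": {},                                                 # matches nothing
}

ranking = andish_rank(candidates, conditions)
for elem, score in ranking:
    print(elem, score)
```

A strictly conjunctive evaluation would return only `e1` here; the relaxed variant also returns `e2` at a lower rank, which is how such evaluations trade precision for recall.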
A distinct call for structure-enhanced (type C) topics has been issued, aiming to collect a number of specifically difficult topics with higher-dimensional structural conditions (see below for an example).
<topic id="2009eff001">
<title>
"Max Planck" category "Quantum Physicist" "Nobel Laureates"
publications "Entropy and Temperature of Radiant Heat"
doctoral students
</title>
<castitle>
//article//person//scientist[
about(.//header//categories//category, "Quantum Physicist")
and about(.//header//categories//category, "Nobel Laureates")
and about(.//header//title, "Max Planck")
and about(.//sec, Publications "Entropy and Temperature of Radiant Heat")
]//doctoral_students//person//scientist
</castitle>
<description>
I am looking for doctoral students of the quantum physicist
and Nobel laureate Max Planck. Other persons or elements not mentioning
doctoral students of Max Planck are not relevant.
</description>
<narrative>
I am looking for doctoral students of the quantum physicist and Nobel laureate
Max Planck, who themselves became famous scientists.
Articles with "Max Planck" in the title, "quantum physicist" and "Nobel laureate"
as category and a publication about "Entropy and Temperature of Radiant Heat"
are expected to be the best sources for this information.
</narrative>
</topic>
Because of their high-dimensional nature, type (C) topics may be quite specific, i.e., they may aim at just a few target elements. Thus, a good idea for topic development is to start with a CO query, have a look at the XML structure of one or more good result pages, and then specialize the CO query into a CAS query with more structural components. Please feel free to explore the new collection and try out new topics in our TopX 2.0 interface (using the NEXI syntax).
Please note that NEXI imposes more restrictions than XPath on the formulation of branching path queries and the nesting of predicates, such that branching path queries are best formulated using multiple 'about' operators per predicate, each addressing a different branch of the document and being connected by an 'and' (see the above example query).
As NEXI is clearly a subset of XPath Full-Text, please also validate your NEXI syntax (before submitting your candidate topic) using the online NEXI parser available at: http://www.inex.otago.ac.nz/tracks/adhoc/nexiparser.asp
We will then aim to provide a corresponding translation of all NEXI queries into XPath 2.0 Full-Text for the actual task, such that systems using either language can run the final queries.
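As a rough aid when drafting a branching CAS query, the following Python sketch lists the individual 'about' clauses of a NEXI expression, one per branch. It is only a regular-expression approximation and not a NEXI parser: the helper name `about_clauses` is hypothetical, and the pattern assumes that neither the path nor the keyword list contains parentheses.

```python
import re

# Matches about(<relative path>, <keywords>) without nested parentheses.
ABOUT_RE = re.compile(r'about\(\s*([^,()]+?)\s*,\s*([^()]*?)\s*\)')


def about_clauses(nexi):
    """Return a list of (path, keywords) pairs, one per about() clause."""
    return [(m.group(1), m.group(2)) for m in ABOUT_RE.finditer(nexi)]


# The castitle of the example topic above.
query = '''//article//person//scientist[
about(.//header//categories//category, "Quantum Physicist")
and about(.//header//categories//category, "Nobel Laureates")
and about(.//header//title, "Max Planck")
and about(.//sec, Publications "Entropy and Temperature of Radiant Heat")
]//doctoral_students//person//scientist'''

clauses = about_clauses(query)
for path, keywords in clauses:
    print(path, '->', keywords)
```

Each printed pair corresponds to one branch of the document that the predicate addresses. The official online parser remains the authority on whether a query is valid NEXI.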
All Efficiency track submissions will be evaluated in VCAS mode (see http://www.inex.otago.ac.nz/tracks/adhoc/nexi/nexi.pdf), i.e., path conditions (as well as content conditions) will be interpreted vaguely: a result element may still be considered relevant even if it does not match one or more tags and/or keywords of the query. Thus, as in any INEX evaluation, it will be up to the peer assessors to decide on the relevance of individual result elements.
Please do not use link elements or very small layout elements such as 'collectionlink' or 'emph' elements as target elements of the query.
Each participating group will have to evaluate a few of the newly developed type (C) topics. Experience from the Ad-Hoc track shows that relevance assessments take one person about one day per topic.
skipped due to lack of topic submissions!
Assessments for type (A) and (B) topics will be reused from the current 2009 Ad-Hoc evaluation.
We are currently working on making a light-weight version of TopX 2.0 available as a reference engine for both Windows and Unix-like environments, which should also be compilable with minimum effort under different environments. TopX currently does not support distributed search and is intended for use as a baseline on a single-node retrieval system.
We will try to make the TopX download available before the release of the run evaluations.
Just like in 2008, the Efficiency track particularly encourages the use of top-k style query engines. The result submission format includes options for marking runs as top-15, top-150, and top-1500, using either a Focused non-overlapping element retrieval, a Thorough (incl. overlap) element retrieval, or an entire Article retrieval mode.
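The three result cut-offs can be pictured with a small Python sketch. It only shows the final truncation of an already scored candidate list to k results; it does not model the early-termination techniques that make a top-k engine efficient. The helper `top_k` and the synthetic scores are assumptions for illustration.

```python
import heapq


def top_k(scored_elements, k):
    """Return the k highest-scoring (element_id, score) pairs, best first."""
    return heapq.nlargest(k, scored_elements, key=lambda pair: pair[1])


# Synthetic scores for 2000 hypothetical elements: since 37 and 2000 are
# coprime, the scores form a permutation of 0..1999 (all distinct).
scored = [(f"e{i}", (i * 37) % 2000) for i in range(2000)]

# One ranked list per cut-off allowed by the submission format.
runs = {k: top_k(scored, k) for k in (15, 150, 1500)}
for k, results in runs.items():
    print(k, len(results), results[0])
```

Because the shorter lists are prefixes of the longer ones, a system could in principle derive a top-15 and a top-150 run from one top-1500 computation, although the running times reported for those runs would then not reflect a dedicated top-15 or top-150 evaluation.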
Automatic runs may use any of the title fields (the NEXI CO, NEXI CAS, or XPath titles), and even keywords from the narrative or description fields.
At least one automatic and sequential run with topics being processed one-by-one is mandatory.
Furthermore, participants are invited to submit as many runs in different retrieval modes as possible. Both the Ad-Hoc and Budget-constrained tasks permit (and encourage) all of the above combinations of retrieval modes.
<!ELEMENT efficiency-submission (topic-fields,
general_description,
ranking_description,
indexing_description,
caching_description,
topic+)>
<!ATTLIST efficiency-submission
participant-id CDATA #REQUIRED
run-id CDATA #REQUIRED
task (adhoc | budget10 | budget100 | budget1000 | budget10000) #REQUIRED
type (focused | thorough | article) #REQUIRED
query (automatic | manual) #REQUIRED
sequential (yes|no) #REQUIRED
no_cpu CDATA #IMPLIED
ram CDATA #IMPLIED
no_nodes CDATA #IMPLIED
hardware_cost CDATA #IMPLIED
hardware_year CDATA #IMPLIED
topk (15 | 150 | 1500) #IMPLIED
index_size_bytes CDATA #IMPLIED
indexing_time_sec CDATA #IMPLIED
baseline_time_ms CDATA #IMPLIED
>
<!ELEMENT topic-fields EMPTY>
<!ATTLIST topic-fields
co_title (yes|no) #REQUIRED
cas_title (yes|no) #REQUIRED
xpath_title (yes|no) #REQUIRED
text_predicates (yes|no) #REQUIRED
description (yes|no) #REQUIRED
narrative (yes|no) #REQUIRED
>
<!ELEMENT general_description (#PCDATA)>
<!ELEMENT ranking_description (#PCDATA)>
<!ELEMENT indexing_description (#PCDATA)>
<!ELEMENT caching_description (#PCDATA)>
<!ELEMENT topic (result*)>
<!ATTLIST topic
topic-id CDATA #REQUIRED
total_time_ms CDATA #REQUIRED
cpu_time_ms CDATA #IMPLIED
io_time_ms CDATA #IMPLIED
io_bytes CDATA #IMPLIED
>
<!ELEMENT result (file, path, rank, rsv?)>
<!ELEMENT file (#PCDATA)>
<!ELEMENT path (#PCDATA)>
<!ELEMENT rank (#PCDATA)>
<!ELEMENT rsv (#PCDATA)>
Explanation of DTD fields:
Providing CPU and I/O times (and I/O bytes) is optional for each topic. Also, it is sufficient to provide a ranked list of matching elements along with their XPath identifiers. Ranks are mandatory, but score values are optional. Note that in article retrieval mode, the path of every result element should be /article[1].
Run submissions not matching this DTD will be automatically discarded.
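A run file following the element order of the DTD above could be assembled along these lines. This is a minimal Python sketch using the standard `xml.etree.ElementTree` module, not an official submission tool. The function `build_run` and all concrete values (participant id, run id, document id, timing, descriptions) are placeholders invented for illustration, and the sketch does not validate the output against the DTD.

```python
import xml.etree.ElementTree as ET


def build_run(participant_id, run_id, topics):
    """Build an efficiency-submission tree.

    topics: list of (topic_id, total_time_ms, results), where results is a
            ranked list of (file, path, rsv) tuples, best result first.
    """
    root = ET.Element("efficiency-submission", {
        "participant-id": participant_id,
        "run-id": run_id,
        "task": "adhoc",
        "type": "article",
        "query": "automatic",
        "sequential": "yes",
        "topk": "15",
    })
    # Child order must follow the content model declared in the DTD.
    ET.SubElement(root, "topic-fields", {
        "co_title": "yes", "cas_title": "no", "xpath_title": "no",
        "text_predicates": "no", "description": "no", "narrative": "no",
    })
    for name in ("general_description", "ranking_description",
                 "indexing_description", "caching_description"):
        ET.SubElement(root, name).text = "placeholder description"
    for topic_id, total_ms, results in topics:
        topic = ET.SubElement(root, "topic", {
            "topic-id": topic_id,
            "total_time_ms": str(total_ms),  # the only required timing
        })
        for rank, (file, path, rsv) in enumerate(results, start=1):
            result = ET.SubElement(topic, "result")
            ET.SubElement(result, "file").text = file
            ET.SubElement(result, "path").text = path
            ET.SubElement(result, "rank").text = str(rank)
            ET.SubElement(result, "rsv").text = str(rsv)  # optional score
    return root


# Placeholder data: one topic with a single article-mode result.
run = build_run("99", "example-run-1",
                [("2009eff001", 42, [("12345", "/article[1]", 0.87)])])
print(ET.tostring(run, encoding="unicode"))
```

In article retrieval mode every result path is /article[1], as in the example; for Focused or Thorough runs the path would instead be the XPath identifier of the retrieved element.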
| 08/Jul/2009 | Release of topic creation guidelines for type (C) queries |
| 28/Aug/2009 | Submission deadline for the new, type (C), candidate topics, formulated as NEXI or XPath 2.0 Full-Text |
| 7/Sep/2009 | Release of final set of type (A) and (B) topics |
| 14/Sep/2009 | Submission deadline for all Efficiency track runs (one file per run for the Ad-Hoc and Budget tasks, and types A, B) |
| 28/Sep/2009 | |
| 26/Oct/2009 | |
| 2/Nov/2009 | Release of all Efficiency track evaluation results (Ad-Hoc and Budget tasks, types A, B) |
| 23/Nov/2009 | Submission deadline for papers for the pre-proceedings (all tracks) |
| 30/Nov/2009 | Release of the workshop pre-proceedings (all tracks) |
| 6-10/Dec/2009 | INEX Workshop in Brisbane, Australia |
Martin Theobald
Max Planck Institute for Informatics
martin.theobald@mpi-inf.mpg.de
Ralf Schenkel
MMCI, Saarland University
schenkel@mpi-inf.mpg.de