INEX 2010 XML Mining Track
The XML Mining track explores two
main ideas: (1) identifying key problems and new challenges of the emerging
field of mining semi-structured documents, and (2) studying and assessing the potential
of supervised, semi-supervised and unsupervised learning techniques for
classification and clustering of semi-structured documents.
The track consists of clustering and
classification tasks. The clustering task requires participants to group documents
into clusters without any knowledge of cluster labels using an unsupervised
learning algorithm. The classification task requires participants to assign
documents to known classes using a supervised or semi-supervised learning
algorithm and a training set.
The clustering task provides a novel
platform to measure the quality of clustering methods for collection selection.
It uses the relevance judgements from the ad hoc track to determine how well
relevant documents are clustered.
The XML Mining task at INEX 2010
brings together researchers from the Data Mining, Information Retrieval, Machine Learning and XML fields. It allows participants to
evaluate XML mining methods against a real use case and with significant
volumes of data.
The INEX XML Wikipedia collection is
a marked-up version of the Wikipedia corpus. The mark-up includes named
entities and document structure such as document sections, tables and
The classification and clustering
tasks use a 144,625-document subset of the INEX 2010 collection that has been
pre-processed to provide various representations of the documents.
Representations are available as a vector space representation of terms,
frequent bi-grams, XML tags, trees, links and named entities. The collection is
also available in XML format and text-only format.
The 144,625-document subset
is determined by the INEX 2010 ad hoc reference run. The reference run will contain most of the relevant documents from the manual ad hoc evaluations.
The data collection specification explains these various
Data Collection (144,625 Documents)
Tags and Trees
Bag of Bi-grams
Bag of Stemmed Words
Bag of Stemmed Words and Bi-grams (concatenated)
Please note that none of
these data representations use the information within the category tag in
the XML files. These tags were omitted during pre-processing. If you are using
your own pre-processing methods, please omit these tags from the dataset.
Training Data Set (20% of the labels) (for classification training)
Classification Task and Runs
The classification task in INEX 2010
evaluates supervised or semi-supervised learning solutions against the ground
truth categories by using standard evaluation criteria such as F-score. You can
submit several classification solutions.
The categories this year have been
extracted from the 22/06/2010 dump of Wikipedia. Each document can belong
to one or more of the 36 categories, resulting in a multi-label classification
task. The categories have been extracted from the
1st and 2nd levels of the Main Topic Classifications in Wikipedia.
The Wikipedia categories form a graph.
The shortest path(s) from a document to any one of the highest level categories
in Main Topic Classifications were used to determine categories for a document.
Following the shortest path is motivated by Occam's Razor:
the simplest explanation is often the correct one. This approach extracts
categories from the noisy Wikipedia category graph, where documents belong to
many nonsensical categories. Only categories containing more than 3000
documents were used.
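The shortest-path rule described above can be sketched as a breadth-first search over the category graph. This is only an illustration under stated assumptions, not the organisers' extraction code: the function name `nearest_top_categories`, the `parents` mapping and the toy category names are hypothetical, and the restriction to two levels and the 3000-document threshold are not modelled.

```python
from collections import deque


def nearest_top_categories(doc_categories, parents, top_level):
    """Return the top-level categories reached by a shortest path.

    doc_categories: categories the document is directly assigned to
    parents:        dict mapping a category to its parent categories
    top_level:      set of target (highest level) categories
    """
    frontier = deque((cat, 0) for cat in doc_categories)
    seen = set(doc_categories)
    best_depth = None  # depth at which the first target was reached
    found = set()
    while frontier:
        cat, depth = frontier.popleft()
        if best_depth is not None and depth > best_depth:
            break  # every remaining path is longer than the shortest one
        if cat in top_level:
            best_depth = depth
            found.add(cat)
            continue
        for parent in parents.get(cat, ()):
            if parent not in seen:
                seen.add(parent)
                frontier.append((parent, depth + 1))
    return found


# Toy category graph (hypothetical names, not the real Wikipedia graph).
parents = {
    "Jazz": ["Music"],
    "Music": ["Arts"],
    "Musicians": ["People", "Music"],
    "Painters": ["People"],
}
top_level = {"Arts", "People"}

closest = nearest_top_categories(["Jazz", "Musicians"], parents, top_level)
tied = nearest_top_categories(["Music", "Painters"], parents, top_level)
```

In the toy graph, "People" is one step closer than "Arts" for the first document, so only "People" is kept. The second document reaches both targets at the same distance, which yields two labels and illustrates how the rule can produce a multi-label assignment.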
The goal is to predict, for each document not in the training set, which of
the 36 categories it belongs to. The submissions may
contain multiple labels per document. The labels in the test set are
multi-label as well.
Link to sample classification submission file
Clustering Task and Runs
The clustering task in INEX 2010
evaluates unsupervised learning solutions against the ground truth categories
by using standard evaluation criteria such as Purity, Entropy, F-score,
Normalised Mutual Information and others. It also evaluates unsupervised
learning approaches in the context of XML information retrieval. It determines
the quality of clusters relative to the optimal collection selection goal,
given a set of queries using manual query assessments from the ad hoc track.
The clustering track explicitly
tests the Jardine and van Rijsbergen cluster
hypothesis (1971), which states that documents that cluster together have a
similar relevance to a given query. The task is to split the XML Mining
collection of 144,625 documents into disjoint clusters for collection
selection. If the cluster hypothesis holds true, and if suitable clustering can
be achieved, then a clustering solution will minimise the number of clusters
that need to be searched to satisfy any given query. There are important
practical reasons for performing collection selection. If only a small fraction
of clusters (and hence documents) needs to be searched, then the throughput of
an information retrieval system will be greatly improved.
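The collection selection idea can be illustrated with a small sketch that counts how many clusters must be searched to reach all documents relevant to one query. This is a simplified illustration, not the track's official measure: the function name `clusters_to_search`, the `recall_target` parameter and the toy data are assumptions, and the clusters are visited in the best possible order (most relevant documents first).

```python
from collections import Counter


def clusters_to_search(assignment, relevant_docs, recall_target=1.0):
    """Count the clusters needed to cover a share of the relevant documents.

    assignment:    dict mapping document id -> cluster id
    relevant_docs: ids of the documents judged relevant to one query
    recall_target: fraction of relevant documents that must be covered
    """
    per_cluster = Counter(
        assignment[doc] for doc in relevant_docs if doc in assignment
    )
    total = sum(per_cluster.values())
    if total == 0:
        return 0  # no relevant document is in the collection
    needed = recall_target * total
    covered = 0
    # Visit clusters in decreasing order of relevant documents (ideal order).
    for searched, (_, count) in enumerate(per_cluster.most_common(), start=1):
        covered += count
        if covered >= needed:
            return searched
    return len(per_cluster)


# Toy clustering of six documents into three clusters (hypothetical ids).
assignment = {"d1": "A", "d2": "A", "d3": "B", "d4": "C", "d5": "C", "d6": "C"}

scattered = clusters_to_search(assignment, {"d1", "d2", "d3"})
compact = clusters_to_search(assignment, {"d4", "d5", "d6"})
```

When the relevant documents sit in a single cluster, one cluster suffices; when they are scattered, more clusters have to be searched, which is the cost a good clustering solution is meant to reduce.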
This task imposes a restriction in
terms of the number of clusters. You can submit several clustering solutions of
different numbers of clusters: 50, 100, 200, 500 and 1000. Each solution may
deviate from one of these target numbers by up to +/- 5%. Any solution whose
number of clusters falls outside these ranges will not be considered in the
evaluation. You are also encouraged to submit clusters of homogeneous size.
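A quick way to check a solution against the restriction above before submitting is sketched below. The function name `matching_target` is hypothetical, and treating the 5% boundary as inclusive is an assumption; integer arithmetic is used so that the boundary cases are not affected by floating-point rounding.

```python
ALLOWED_TARGETS = (50, 100, 200, 500, 1000)


def matching_target(num_clusters, tolerance_percent=5):
    """Return the target that num_clusters is close enough to, else None."""
    for target in ALLOWED_TARGETS:
        # |n - target| <= tolerance% of target, written without floats.
        if abs(num_clusters - target) * 100 <= tolerance_percent * target:
            return target
    return None


accepted = matching_target(105)   # within 5% of 100
rejected = matching_target(106)   # just outside every permitted range
```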
The submission file should begin
with 4 comment lines, each consisting of a leading # followed by text. The first
comment line contains the team name. The second comment line contains a
description of the approach employed. The third comment line contains a
description of the hardware the approach was run on. The fourth comment line
contains the runtime of the clustering solution. Each of the remaining lines
consists of a single document id, cluster
id pair, one line per document.
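The layout described above can be sketched as a small writer and reader. This is an illustration only: the function names, the example header values and the document ids are hypothetical, and the comma as the exact delimiter between document id and cluster id is an assumption that should be checked against the sample submission file.

```python
import io


def write_submission(out, team, approach, hardware, runtime, assignments):
    """Write four '#' comment lines followed by one id pair per document."""
    for comment in (team, approach, hardware, runtime):
        out.write(f"# {comment}\n")
    for doc_id, cluster_id in assignments:
        out.write(f"{doc_id},{cluster_id}\n")


def read_submission(lines):
    """Parse a submission into (comment lines, {doc id: cluster id})."""
    header, assignments = [], {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            header.append(line[1:].strip())
            continue
        doc_id, cluster_id = (field.strip() for field in line.split(","))
        assignments[doc_id] = cluster_id
    if len(header) != 4:
        raise ValueError(f"expected 4 comment lines, found {len(header)}")
    return header, assignments


# Round trip through an in-memory file with made-up values.
buffer = io.StringIO()
write_submission(
    buffer,
    team="Example Team",
    approach="k-means on bag of stemmed words",
    hardware="single desktop machine",
    runtime="2 hours",
    assignments=[("1001", "7"), ("1002", "7"), ("1003", "12")],
)
buffer.seek(0)
header, parsed = read_submission(buffer)
```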
Link to sample clustering submission file
30/Aug/2010 Release of various data representations for clustering
30/Oct/2010 Submission deadline for clustering and classification results
05/Nov/2010 Release of classification and classes-to-clusters mapping results
22/Nov/2010 Submission deadline for papers for pre-proceedings
13-15 Dec 2010 Release of collection selection results at the Workshop
13-15 Dec 2010 INEX Workshop in Amsterdam
The classification solution for the
test dataset will be evaluated using standard criteria such as F-score
and others to determine the quality of classes.
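For a multi-label task, an F-score can be averaged in more than one way. The sketch below computes micro- and macro-averaged F1 from per-label counts. Which averaging the track uses is not stated here, so both are shown as an assumption; the function names and the toy labels are hypothetical.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def multilabel_f1(truth, predicted, labels):
    """Return (micro F1, macro F1) for multi-label predictions.

    truth, predicted: dicts mapping document id -> set of labels
    labels:           all labels considered in the evaluation
    """
    counts = {label: [0, 0, 0] for label in labels}  # tp, fp, fn per label
    for doc_id, true_labels in truth.items():
        predicted_labels = predicted.get(doc_id, set())
        for label in labels:
            in_truth = label in true_labels
            in_prediction = label in predicted_labels
            if in_truth and in_prediction:
                counts[label][0] += 1
            elif in_prediction:
                counts[label][1] += 1
            elif in_truth:
                counts[label][2] += 1
    # Macro: average the per-label scores. Micro: pool the counts first.
    macro = sum(f1(*c) for c in counts.values()) / len(labels)
    totals = [sum(c[i] for c in counts.values()) for i in range(3)]
    micro = f1(*totals)
    return micro, macro


truth = {"d1": {"Arts", "People"}, "d2": {"Science"}, "d3": {"Arts"}}
predicted = {"d1": {"Arts"}, "d2": {"Science", "Arts"}, "d3": {"People"}}
micro, macro = multilabel_f1(truth, predicted, ["Arts", "People", "Science"])
```

Macro averaging weights every category equally, while micro averaging weights every label decision equally, so the two can diverge when category sizes are unbalanced.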
The clustering solution (containing
50, 100, 200, 500 or 1000 clusters) will be evaluated by using the
standard criteria such as Purity, Entropy, F-score, Normalised Mutual
Information and others to determine the quality of clusters. This evaluation
utilises the classes-to-clusters mapping which assumes that the classification
of the documents in the collection is known (i.e., each document has a class
label(s)). The clustering solutions are then evaluated with respect to this
predefined classification. It is important to note that the class labels are
not used in the process of clustering, but only for the purpose of evaluation
of the clustering results. The clustering solutions will also be evaluated to
determine the quality of clusters relative to the optimal collection selection
goal, given a set of queries. Better clustering solutions in this context will
tend to (on average) group together relevant results for (previously unseen) ad
hoc queries. Real ad hoc retrieval queries and their manual assessment results
will be utilised in this evaluation. This novel approach evaluates the
clustering solutions relative to a very specific objective - clustering a large
document collection in an optimal manner in order to satisfy queries while
minimising the search space. Results of the second evaluation will be released at
the INEX workshop in December.
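Two of the class-based criteria named above, Purity and Entropy, can be sketched as follows. This is a minimal illustration that assumes a single class label per document, which is a simplification of the multi-label ground truth; the function name, the unnormalised form of entropy and the toy data are assumptions rather than the track's evaluation code.

```python
import math
from collections import Counter, defaultdict


def purity_and_entropy(clusters, classes):
    """Return (purity, size-weighted entropy in bits) of a clustering.

    clusters: dict mapping document id -> cluster id
    classes:  dict mapping document id -> class label (one per document)
    """
    members = defaultdict(Counter)  # cluster id -> class label counts
    for doc_id, cluster_id in clusters.items():
        members[cluster_id][classes[doc_id]] += 1
    n = len(clusters)
    # Purity: share of documents that belong to their cluster's majority class.
    purity = sum(max(counts.values()) for counts in members.values()) / n
    # Entropy: class entropy of each cluster, weighted by cluster size.
    entropy = 0.0
    for counts in members.values():
        size = sum(counts.values())
        cluster_entropy = -sum(
            (k / size) * math.log2(k / size) for k in counts.values()
        )
        entropy += (size / n) * cluster_entropy
    return purity, entropy


clusters = {"d1": 0, "d2": 0, "d3": 0, "d4": 1, "d5": 1, "d6": 1}
classes = {
    "d1": "Arts", "d2": "Arts", "d3": "Science",
    "d4": "Science", "d5": "Science", "d6": "Science",
}
purity, entropy = purity_and_entropy(clusters, classes)
```

The class labels appear only in the evaluation function and never in the construction of `clusters`, matching the note above that labels are used for evaluation alone. Higher purity and lower entropy both indicate clusters that align more closely with the classes.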
Queensland University of Technology
Chris De Vries
Queensland University of Technology
University of Calabria
Queensland University of Technology