INEX 2009 XML Mining Track

Overview

Classification Task

The goal of the challenge is to identify the different Machine Learning (ML) methods proposed so far for structured data, to assess the potential of these methods for dealing with generic ML tasks in the structured domain, to identify the new challenges of this emerging field, and to foster research in this domain. Structured data appears in many different domains. We focus here on XML document collections, and we are organizing this challenge in cooperation with the INEX initiative. To pursue these goals, the challenge aims to bring together ML, Information Retrieval (IR) and XML researchers.
Results of the track will be presented at the INEX workshop, which offers a unique opportunity to bring together researchers from the IR/XML and ML communities. For details of the task, see the task web site: http://xmlmining.lip6.fr

Clustering Task

The INEX 2009 clustering task is an evaluation forum that provides a platform to measure the performance of clustering methods for collection selection on a real-world Wikipedia collection. It brings together researchers from the Information Retrieval, Data Mining, Machine Learning and XML fields, and allows participants to evaluate clustering methods against a real use case and with significant volumes of data. It is a lightweight task that requires you to submit only the clustering solutions. The task is designed to facilitate participation with minimal effort by providing not only raw data, but also pre-processed data that can easily be used by existing clustering software.

The clustering task in INEX 2009 evaluates unsupervised machine learning solutions against the ground truth categories by using standard evaluation criteria such as Entropy, F-score, Normalised Mutual Information and others. The clustering task in INEX 2009 will also evaluate unsupervised machine learning in the context of XML information retrieval. This year we are also running a novel evaluation task to determine the quality of clusters relative to the optimal collection selection goal, given a set of queries using manual query assessments from the INEX Ad Hoc track.

The clustering track will explicitly test the Jardine and van Rijsbergen cluster hypothesis (1971), which states that documents that cluster together have a similar relevance to a given query. The task is to split the English Wikipedia collection, which is 60 gigabytes in size and contains around 2.7 million documents in XML format, into disjoint clusters for collection selection. If the cluster hypothesis holds true, and if suitable clustering can be achieved, then a clustering solution will minimise the number of clusters that need to be searched to satisfy any given query. There are important practical reasons for performing collection selection on a very large corpus: if only a small fraction of clusters (and hence documents) needs to be searched, then the throughput of an information retrieval system will be greatly improved.

Data

The INEX XML Wikipedia collection is a marked-up version of the Wikipedia documents. The mark-up includes, for instance, explicit tagging of named entities. The entire document collection is available in XML format and in text-only format if you wish to try different representation approaches. A subset of the collection containing about 50,000 documents (of the INEX 2009 corpus) is also provided for clustering by teams that are unable to process such a large data collection.

In order to enable participation with minimal overhead in data preparation, the collection has been pre-processed to provide various representations of the documents: for instance, a bag-of-words representation of terms and frequent phrases in a document, and frequencies of various XML structures in the form of tags, trees, links, named entities, etc.

Here is the data specification file that explains these various representations.

Link to Data Collection specification

Large Data Collection

2009 Tags and Trees
2009 Links
2009 Entities
2009 Bag of Words Bigrams
2009 Bag of Words Stemmed Words
2009 Bag of Words Stemmed Bigrams

Small Data Collection

2009 Tags and Trees Small
2009 Links Small
2009 Entities Small
2009 Bag of Words Bigrams Small
2009 Bag of Words Stemmed Words Small
2009 Bag of Words Stemmed Bigrams Small

Please note that none of these data representations have used the information within the category tag in the XML files. These tags were omitted during pre-processing. If you are using your own pre-processing methods, please omit these tags from the dataset.
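
For teams doing their own pre-processing, the following is a minimal sketch of how such tags could be dropped from a document before feature extraction. It assumes the element is literally named "category" and uses an invented sample document; check the element name against the actual collection files before relying on it.

```python
import xml.etree.ElementTree as ET

def strip_elements(root, tag="category"):
    """Remove every descendant element named `tag`, in place.

    The tag name "category" is an assumption about the collection's markup.
    Note that ElementTree also drops any tail text attached to a removed element.
    Returns the number of elements removed.
    """
    removed = 0
    # Walk a snapshot of all elements so that removal during the loop is safe.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == tag:
                parent.remove(child)
                removed += 1
    return removed

# Hypothetical sample document, not taken from the INEX collection.
sample = (
    "<article><title>Brisbane</title><category>Cities</category>"
    "<body><p>Text</p><category>Queensland</category></body></article>"
)
root = ET.fromstring(sample)
n_removed = strip_elements(root)
cleaned = ET.tostring(root, encoding="unicode")
print(n_removed, cleaned)
```

The same function can be applied to each parsed file in turn before building bag-of-words or structural features.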

Tasks and Runs

The task is to use unsupervised learning techniques to group the documents into clusters. You can submit several clustering solutions with different numbers of clusters: 100, 500, 1000, 2500, 5000 and 10000.
The first line of the submission file should contain the number of clusters and the name of the corpus (INEX 2009 dataset or INEX 2009 subset data). Each following line should contain a document id and its cluster id.
You are also allowed to submit a clustering solution with multi-label categories. In this case, each line of the submission file should include the document id followed by all of its cluster ids: document id, cluster1 id, cluster2 id, ...
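
As an illustration of the layout described above, here is a minimal sketch that renders a submission as text. The function name, the single-line header, and the space separator are assumptions made for this example; the sample submission files linked below remain the authoritative reference for the exact format.

```python
def format_submission(assignments, num_clusters, corpus_name, sep=" "):
    """Render a clustering solution as submission text.

    assignments maps a document id to a list of one or more cluster ids
    (more than one only for multi-label solutions).
    The header layout and the separator are assumptions, not the official spec.
    """
    lines = [f"{num_clusters}{sep}{corpus_name}"]
    for doc_id in sorted(assignments):
        fields = [str(doc_id)] + [str(c) for c in assignments[doc_id]]
        lines.append(sep.join(fields))
    return "\n".join(lines) + "\n"

# Hypothetical document ids and cluster ids.
example = {"1001": [3], "1002": [7, 12]}
text = format_submission(example, 100, "INEX 2009 subset data")
print(text, end="")
```

Writing the returned string to a file with open(path, "w") yields one submission file per clustering solution.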

Link to sample submission files

Evaluation

The clustering solutions will be evaluated in two ways.
Firstly, the clustering solutions will be evaluated using standard criteria such as Entropy, F-score, Normalised Mutual Information and others to determine the quality of the clusters. This evaluation uses a classes-to-clusters mapping, which assumes that the classification of the documents in the collection is known (i.e., each document has one or more class labels). The clustering solutions are then evaluated with respect to this predefined classification. It is important to note that the class labels are not used in the process of clustering, but only for evaluating the clustering results. These evaluation results will be provided online on an ongoing basis, along the same lines as the Netflix Prize, starting from mid-September.
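
To make the classes-to-clusters idea concrete, the sketch below computes two of the named criteria for single-label data: the size-weighted entropy of classes within clusters, and Normalised Mutual Information. The exact definitions used by the track are not given here, so the choices in this example (base-2 logarithms, NMI normalised by the arithmetic mean of the two entropies) are assumptions.

```python
import math
from collections import Counter

def _entropy(counts, total):
    """Shannon entropy (base 2) of a distribution given as raw counts."""
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def cluster_entropy(clusters, classes):
    """Size-weighted mean entropy of class labels inside each cluster (lower is better)."""
    n = len(clusters)
    by_cluster = {}
    for k, c in zip(clusters, classes):
        by_cluster.setdefault(k, Counter())[c] += 1
    total = 0.0
    for counts in by_cluster.values():
        size = sum(counts.values())
        total += (size / n) * _entropy(counts.values(), size)
    return total

def nmi(clusters, classes):
    """Mutual information divided by the mean of the cluster and class entropies.

    This normalisation is one common variant, assumed here for illustration.
    """
    n = len(clusters)
    joint = Counter(zip(clusters, classes))
    ck, cc = Counter(clusters), Counter(classes)
    mi = sum((nkc / n) * math.log2(nkc * n / (ck[k] * cc[c]))
             for (k, c), nkc in joint.items())
    denom = (_entropy(ck.values(), n) + _entropy(cc.values(), n)) / 2
    # With a single cluster and a single class both entropies are zero.
    return mi / denom if denom > 0 else 1.0

perfect = nmi([0, 0, 1, 1], ["a", "a", "b", "b"])
unrelated = nmi([0, 0, 1, 1], ["a", "b", "a", "b"])
mixed_entropy = cluster_entropy([0, 0, 1, 1], ["a", "b", "a", "b"])
```

A clustering that reproduces the classes exactly scores an NMI of 1 and an entropy of 0, while one that is independent of the classes scores an NMI of 0.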
Secondly, the clustering solutions will be evaluated to determine the quality of the clusters relative to the optimal collection selection goal, given a set of queries. Better clustering solutions in this context will tend, on average, to group together the relevant results for previously unseen ad hoc queries. Real ad hoc retrieval queries and their manual assessment results will be used in this evaluation. This novel approach evaluates the clustering solutions relative to a very specific objective: clustering a large document collection in an optimal manner in order to satisfy queries while minimising the search space. Results of the second evaluation will be released at the INEX workshop in December.
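
One way to picture the collection selection objective is to count how many clusters must be searched before a query's relevant documents are found. The sketch below is only an illustration under stated assumptions, not the track's official measure: it visits clusters in an oracle order (most relevant documents first), and the function name, the recall_target parameter and the sample data are hypothetical.

```python
from collections import Counter

def clusters_needed(doc_to_cluster, relevant_docs, recall_target=1.0):
    """Number of clusters searched, in oracle order, to reach recall_target.

    doc_to_cluster maps a document id to its (single) cluster id.
    relevant_docs lists the documents assessed as relevant to one query.
    """
    per_cluster = Counter(
        doc_to_cluster[d] for d in relevant_docs if d in doc_to_cluster
    )
    total = sum(per_cluster.values())
    if total == 0:
        return 0  # no relevant documents are present in the clustering
    needed = recall_target * total
    found = 0
    # Visit clusters holding the most relevant documents first.
    for searched, (_, count) in enumerate(per_cluster.most_common(), start=1):
        found += count
        if found >= needed:
            return searched
    return len(per_cluster)

# Hypothetical clustering and relevance assessments for one query.
doc_to_cluster = {"d1": 0, "d2": 0, "d3": 1, "d4": 2, "d5": 2, "d6": 2}
relevant = ["d4", "d5", "d6", "d1"]
full = clusters_needed(doc_to_cluster, relevant)
partial = clusters_needed(doc_to_cluster, relevant, recall_target=0.75)
```

Averaging such a count over many queries gives a rough sense of how well a clustering concentrates relevant documents; a smaller value means a smaller search space.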

Results

Online Clustering Evaluation Website

Schedule

30 Jul 2009     Release of various data representations for clustering
15 Oct 2009     Online Evaluation System -  First Release of clustering labels results
2 Nov 2009      Closing of Online Evaluation System -  Submission deadline for clustering results
5 Nov 2009      Release of classes-to-clusters mapping results in Workshop
23 Nov 2009     Submission deadline for papers for pre-proceedings
6-10 Dec 2009   Release of collection selection results in Workshop
6-10 Dec 2009   INEX Workshop in Marburg, Brisbane, Australia

Organisers

Classification Task

Ludovic Denoyer
University Paris 6
ludovic.denoyer@lip6.fr

Patrick Gallinari
University Paris 6
patrick.gallinari@lip6.fr

Clustering Task

Richi Nayak
Queensland University of Technology
r.nayak@qut.edu.au

Chris De Vries
Queensland University of Technology
chris@de-vries.ws

Sangeetha Kutty
Queensland University of Technology
s.kutty@qut.edu.au