The INEX 2009 clustering task is an evaluation forum that provides a platform to measure the performance of clustering methods for collection selection on a real-world Wikipedia collection. It brings together researchers
from the Information Retrieval, Data Mining, Machine Learning and XML fields, and allows participants to evaluate clustering methods against a real use case with significant volumes of data. It is a lightweight task: you are required to submit only your clustering
solutions. The task is designed to facilitate participation with minimal effort by providing not only raw data, but also pre-processed data that can easily be used by existing clustering software.
The clustering task in INEX 2009 evaluates unsupervised machine learning solutions against the ground truth categories by using standard evaluation criteria such as Entropy, F-score, Normalised Mutual Information and others. It
will also evaluate unsupervised machine learning in the context of XML information retrieval. This year we are also running a novel evaluation task to determine the quality of clusters relative to the optimal collection selection goal, given a set of queries,
using manual query assessments from the INEX Ad Hoc track.
The clustering track will explicitly test the Jardine and van Rijsbergen cluster hypothesis (1971), which states that documents that cluster together have a similar relevance to a given query. The task is to split the English Wikipedia collection, which is 60 gigabytes
in size and contains around 2.7 million documents in XML format, into disjoint clusters for collection selection. If the cluster hypothesis holds true, and if suitable clustering can be achieved, then a clustering solution will minimise the number of clusters that
need to be searched to satisfy any given query. There are important practical reasons for performing collection selection on a very large corpus. If only a small fraction of clusters (hence documents) need to be searched, then the throughput of an information
retrieval system will be greatly improved.
Data
The INEX XML Wikipedia collection is a marked-up version of the Wikipedia documents. The mark-up includes, for instance, explicit tagging of named entities. The collection can also be viewed through derived representations: a bag-of-words representation of the terms and frequent phrases
in a document, and frequencies of various XML structures in the form of tags, trees, links and named entities. The entire document collection is also available in XML format and in text-only format if you wish to try different representation approaches. For
teams that are unable to process such a large data collection, a subset of the INEX 2009 corpus containing about 50,000 documents is also provided for clustering.
To enable participation with minimal data-preparation overhead, the collection has been pre-processed to provide various representations of the documents: for instance, a bag-of-words representation of the terms and frequent phrases in a document, and
frequencies of various XML structures in the form of tags, trees, links, named entities, etc.
Here is the data specification file that explains these various representations.
Link to Data Collection specification
Large Data Collection
2009 Tags and Trees
2009 Links
2009 Entities
2009 Bag of Words Bigrams
2009 Bag of Words Stemmed Words
2009 Bag of Words Stemmed Bigrams
Small Data Collection
2009 Tags and Trees Small
2009 Links Small
2009 Entities Small
2009 Bag of Words Bigrams Small
2009 Bag of Words Stemmed Words Small
2009 Bag of Words Stemmed Bigrams Small
Please note that none of these data representations have used the information within the category tag in the XML files. These tags were omitted during pre-processing. If you are using your own pre-processing methods, please omit these tags from the dataset.
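If you pre-process the XML yourself, the category information can be removed before any features are extracted. The following is a minimal sketch using Python's standard library. It assumes the element is literally named `category` and that each document parses as well-formed XML; the function name `strip_elements` and the sample document are illustrative only and are not part of the collection's tooling.

```python
import xml.etree.ElementTree as ET


def strip_elements(xml_text: str, tag: str = "category") -> str:
    """Return xml_text with every element named `tag` removed.

    Note: removing an element with ElementTree also drops any text
    that directly follows it (its "tail").
    """
    root = ET.fromstring(xml_text)
    # Collect (parent, child) pairs first so the tree is not
    # modified while it is being traversed.
    to_remove = [
        (parent, child)
        for parent in root.iter()
        for child in list(parent)
        if child.tag == tag
    ]
    for parent, child in to_remove:
        parent.remove(child)
    return ET.tostring(root, encoding="unicode")


# Hypothetical document; the real element layout may differ.
sample = (
    "<article><title>Clustering</title>"
    "<category>Data mining</category>"
    "<bdy><p>Text</p></bdy></article>"
)
cleaned = strip_elements(sample)
```

Check the element name against the actual files before relying on this, since the tag name used here is an assumption.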
Tasks and Runs
The task is to use unsupervised classification techniques to group the documents into clusters. You can submit several clustering solutions with different numbers of clusters: 100, 500, 1000, 2500, 5000 and 10000.
The first line of the submission file should contain the number of clusters and the name of the corpus (INEX 2009 dataset or INEX 2009 subset data). Each subsequent line should contain a document id and its cluster id.
You are also allowed to submit a clustering solution with multi-label categories. In this case, each line of the submission file should include the document id, cluster1 id, cluster2 id,...
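The layout described above could be produced along the following lines. This is a sketch under stated assumptions, not the official format: it assumes the cluster count and corpus name share the first line, that fields are separated by single spaces, and that multi-label documents simply list further cluster ids on the same line. The sample submission files remain the authoritative reference, and the function name `format_submission` is hypothetical.

```python
from typing import Dict, List

# Cluster counts accepted by the task, as listed above.
ALLOWED_SIZES = {100, 500, 1000, 2500, 5000, 10000}


def format_submission(
    assignments: Dict[str, List[int]], num_clusters: int, corpus: str
) -> str:
    """Build the text of a submission file.

    assignments maps a document id to one cluster id (single-label)
    or to several cluster ids (multi-label).
    """
    if num_clusters not in ALLOWED_SIZES:
        raise ValueError(f"unsupported number of clusters: {num_clusters}")
    used = {c for ids in assignments.values() for c in ids}
    if len(used) > num_clusters:
        raise ValueError("more distinct cluster ids than declared clusters")

    # Assumed header layout: "<number of clusters> <corpus name>".
    lines = [f"{num_clusters} {corpus}"]
    for doc_id in sorted(assignments):
        cluster_ids = assignments[doc_id]
        if not cluster_ids:
            raise ValueError(f"document {doc_id} has no cluster")
        # Assumed row layout: document id followed by its cluster id(s).
        lines.append(" ".join([doc_id, *map(str, cluster_ids)]))
    return "\n".join(lines) + "\n"


text = format_submission({"12": [3], "7": [1, 2]}, 100, "INEX 2009 subset data")
```

Writing `text` to a file gives one header line followed by one line per document.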
Link to sample submission files
Evaluation
The clustering solutions will be evaluated in two ways.
Firstly, each clustering solution will be evaluated using standard criteria such as Entropy, F-score, Normalised Mutual Information and others to determine the quality of the clusters. This evaluation utilises a classes-to-clusters mapping, which assumes
that the classification of the documents in the collection is known (i.e., each document has one or more class labels). The clustering solutions are then evaluated with respect to this predefined classification. It is important to note that the class labels are not
used in the process of clustering, but only for the purpose of evaluating the clustering results. These evaluation results will be provided online on an ongoing basis, along the same lines as Netflix, starting from mid-September.
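To make the classes-to-clusters comparison concrete, here is a small sketch of two of the criteria named above, Entropy and Normalised Mutual Information, for the single-label case. It is illustrative and is not the track's evaluation code. The exact definitions used by the track are not given here, so the sketch assumes natural logarithms, a size-weighted average for cluster entropy, and the arithmetic mean of the two entropies as the NMI normaliser.

```python
import math
from collections import Counter, defaultdict
from typing import Hashable, Iterable, Sequence


def _entropy(counts: Iterable[int], total: int) -> float:
    """Shannon entropy (natural log) of a distribution given as raw counts."""
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def cluster_entropy(classes: Sequence[Hashable], clusters: Sequence[Hashable]) -> float:
    """Size-weighted average class entropy within each cluster (lower is better)."""
    per_cluster = defaultdict(Counter)
    for cls, clu in zip(classes, clusters):
        per_cluster[clu][cls] += 1
    n = len(classes)
    return sum(
        (sum(c.values()) / n) * _entropy(c.values(), sum(c.values()))
        for c in per_cluster.values()
    )


def nmi(classes: Sequence[Hashable], clusters: Sequence[Hashable]) -> float:
    """Mutual information divided by the mean of the two entropies (assumed normaliser)."""
    n = len(classes)
    joint = Counter(zip(clusters, classes))
    cluster_sizes = Counter(clusters)
    class_sizes = Counter(classes)
    mi = sum(
        (c / n) * math.log(c * n / (cluster_sizes[k] * class_sizes[g]))
        for (k, g), c in joint.items()
    )
    denom = (_entropy(cluster_sizes.values(), n) + _entropy(class_sizes.values(), n)) / 2
    # Convention: two trivial partitions are treated as identical.
    return mi / denom if denom > 0 else 1.0


labels = ["a", "a", "b", "b"]
perfect = [0, 0, 1, 1]   # clusters coincide with the classes
mixed = [0, 1, 0, 1]     # every cluster holds one document of each class
```

With these inputs, `perfect` yields zero entropy and an NMI of 1, while `mixed` yields an NMI of 0.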
Secondly, the clustering solutions will be evaluated to determine the quality of the clusters relative to the optimal collection selection goal, given a set of queries. Better clustering solutions in this context will tend to (on average) group together relevant
results for (previously unseen) ad hoc queries. Real ad hoc retrieval queries and their manual assessment results will be utilised in this evaluation. This novel approach evaluates the clustering solutions relative to a very specific objective: clustering
a large document collection in an optimal manner in order to satisfy queries while minimising the search space. Results of the second evaluation will be released at the INEX workshop in December.
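The intuition behind this second evaluation can be illustrated with a simple count: for one query, how many clusters must be searched to reach the relevant documents if clusters are visited in the best possible order? The sketch below is not the official measure, which is not specified here. The oracle ordering (most relevant documents first), the `recall_target` parameter and the function name `clusters_needed` are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Set


def clusters_needed(
    doc_to_cluster: Dict[str, int], relevant: Set[str], recall_target: float = 1.0
) -> int:
    """Number of clusters to search, in oracle order, to reach the recall target."""
    # Count the relevant documents that fall into each cluster.
    hits = Counter(doc_to_cluster[d] for d in relevant if d in doc_to_cluster)
    total = sum(hits.values())
    if total == 0:
        return 0
    found = 0
    # Visit clusters holding the most relevant documents first.
    for rank, (_, count) in enumerate(hits.most_common(), start=1):
        found += count
        if found >= recall_target * total:
            return rank
    return len(hits)


# Hypothetical assignment of four documents to three clusters.
doc_to_cluster = {"d1": 0, "d2": 0, "d3": 1, "d4": 2}
relevant = {"d1", "d2", "d3"}
full = clusters_needed(doc_to_cluster, relevant)
half = clusters_needed(doc_to_cluster, relevant, recall_target=0.5)
```

A clustering that concentrates each query's relevant documents in fewer clusters gives smaller values, which is the behaviour the cluster hypothesis predicts for a good solution.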
Results
Online Clustering Evaluation Website