INEX 2010 XML Mining Track

Overview

The XML Mining track explores two main ideas: (1) identifying key problems and new challenges of the emerging field of mining semi-structured documents, and (2) studying and assessing the potential of supervised, semi-supervised and unsupervised learning techniques for classification and clustering of semi-structured documents.

The track consists of clustering and classification tasks. The clustering task requires participants to group documents into clusters without any knowledge of cluster labels using an unsupervised learning algorithm. The classification task requires the participants to label the documents into known classes using a supervised or semi-supervised learning algorithm and a training set.

The clustering task provides a novel platform to measure the quality of clustering methods for collection selection. It uses the relevance judgements from the ad hoc track to determine how well relevant documents are clustered.

The XML Mining task at INEX 2010 brings together researchers from Data Mining, Information Retrieval, Machine Learning and XML fields. It allows participants to evaluate XML mining methods against a real use case and with significant volumes of data.

Data

The INEX XML Wikipedia collection is a marked-up version of the Wikipedia corpus. The mark-up includes named entities and document structure such as document sections, tables and hyperlinks.

The classification and clustering tasks use a 144,625 document subset of the INEX 2010 collection that has been pre-processed to provide various representations of the documents. Representations are available as a vector space representation of terms, frequent bi-grams, XML tags, trees, links and named entities. The collection is also available in XML format and text-only format.

The 144,625 document subset is determined by the INEX 2010 ad hoc reference run. The reference run will contain most of the relevant documents from the manual ad hoc evaluations.

The data collection specification explains these various representations.

Data Collection (144,625 Documents)

Tags and Trees
Links
Entities
Bag of Bi-grams
Bag of Stemmed Words
Bag of Stemmed Words and Bi-grams (concatenated)

Please note that none of these data representations have used the information within the category tag in the XML files. These tags were omitted during pre-processing. If you are using your own pre-processing methods, please omit these tags from the dataset.

Training Data Set (20% of the labels) (for classification training)

Training Labels

Classification Task and Runs

The classification task in INEX 2010 evaluates supervised or semi-supervised learning solutions against the ground truth categories by using standard evaluation criteria such as F-score. You can submit several classification solutions.

The categories this year have been extracted from the 22/06/2010 dump of Wikipedia. Each document can belong to one or more of the 36 categories, resulting in a multi-label classification task. The categories have been extracted from the 1st and 2nd levels of the Main Topic Classifications in Wikipedia. The Wikipedia categories form a graph. The shortest path(s) from a document to any one of the highest level categories in Main Topic Classifications were used to determine the categories for a document. Following the shortest path is motivated by Occam's Razor: the simplest explanation is often the correct one. This approach extracts categories from the noisy Wikipedia category graph, where documents belong to many nonsensical categories. Only categories containing more than 3000 documents were used.
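The shortest-path idea can be illustrated with a breadth-first search over the category graph. This is only a minimal sketch, not the organisers' extraction code: the function name nearest_top_categories, the parents mapping (node to its parent categories) and the toy category names are all assumptions made for illustration, and the 3000-document threshold is not modelled.

```python
from collections import deque

def nearest_top_categories(start, parents, top_level):
    """Return the top-level categories at minimum distance from `start`.

    `parents` maps a document or category to its parent categories
    (hypothetical structure); `top_level` is the set of target categories.
    """
    frontier = deque([(start, 0)])
    seen = {start}
    found = set()
    best = None  # distance at which the first target category was reached
    while frontier:
        node, dist = frontier.popleft()
        if best is not None and dist > best:
            break  # every remaining path is longer than the shortest one
        if node in top_level:
            found.add(node)
            best = dist
            continue
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                frontier.append((parent, dist + 1))
    return found

# Toy graph: "Music" is two steps away, "History" is three steps away.
parents = {
    "doc": ["Jazz albums", "1959 works"],
    "Jazz albums": ["Music"],
    "1959 works": ["Works by year"],
    "Works by year": ["History"],
}
labels = nearest_top_categories("doc", parents, {"Music", "History"})
```

In the toy graph only "Music" is returned, because the path to "History" is longer. When several target categories are reached at the same minimum distance, all of them are kept, which is one way a document can end up with multiple labels.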

The goal is to predict which of the 36 categories each document outside the training set belongs to. The submissions may contain multiple labels per document. The labels in the test set are multi-label as well.
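Because both predictions and ground truth are multi-label, the F-score can be averaged in more than one way. The sketch below computes micro-averaged and macro-averaged F1 over label sets. It is illustrative only: the track description does not state which averaging is used, and the function names and toy data are assumptions.

```python
def f1(tp, fp, fn):
    """F1 from true positive, false positive and false negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def multilabel_f1(truth, predicted, categories):
    """Return (micro_f1, macro_f1) for dicts mapping doc id -> set of labels."""
    per_category = []
    total_tp = total_fp = total_fn = 0
    for category in categories:
        tp = fp = fn = 0
        for doc, true_labels in truth.items():
            in_true = category in true_labels
            in_pred = category in predicted.get(doc, set())
            if in_true and in_pred:
                tp += 1
            elif in_pred:
                fp += 1
            elif in_true:
                fn += 1
        per_category.append(f1(tp, fp, fn))
        total_tp, total_fp, total_fn = total_tp + tp, total_fp + fp, total_fn + fn
    micro = f1(total_tp, total_fp, total_fn)        # pools counts over categories
    macro = sum(per_category) / len(per_category)   # averages per-category F1
    return micro, macro

truth = {1: {"A", "B"}, 2: {"A"}, 3: {"B"}}
predicted = {1: {"A"}, 2: {"A", "B"}, 3: {"B"}}
micro, macro = multilabel_f1(truth, predicted, ["A", "B"])
```

Micro-averaging weights every document-label decision equally, while macro-averaging weights every category equally, so the two can diverge when category sizes are unbalanced.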

Link to sample classification submission file

Clustering Task and Runs

The clustering task in INEX 2010 evaluates unsupervised learning solutions against the ground truth categories by using standard evaluation criteria such as Purity, Entropy, F-score, Normalised Mutual Information and others. It also evaluates unsupervised learning approaches in the context of XML information retrieval. It determines the quality of clusters relative to the optimal collection selection goal, given a set of queries using manual query assessments from the ad hoc track.
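As a rough illustration of the criteria named above, the following sketch computes Purity, Entropy and Normalised Mutual Information for a clustering against class labels. It makes simplifying assumptions that are not taken from the track: each document has a single class label (the track's labels are multi-label), NMI is normalised by the arithmetic mean of the two entropies (other normalisations exist), and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def _class_counts_per_cluster(clusters, labels):
    """Map each cluster id to a Counter of the class labels it contains."""
    counts = defaultdict(Counter)
    for doc, cluster in clusters.items():
        counts[cluster][labels[doc]] += 1
    return counts

def purity(clusters, labels):
    """Fraction of documents that fall in the majority class of their cluster."""
    counts = _class_counts_per_cluster(clusters, labels)
    return sum(max(c.values()) for c in counts.values()) / len(clusters)

def entropy(clusters, labels):
    """Size-weighted average class entropy (in bits) of the clusters."""
    n = len(clusters)
    total = 0.0
    for c in _class_counts_per_cluster(clusters, labels).values():
        size = sum(c.values())
        h = -sum(v / size * math.log2(v / size) for v in c.values())
        total += size / n * h
    return total

def nmi(clusters, labels):
    """Mutual information divided by the mean of cluster and class entropy."""
    n = len(clusters)
    joint = Counter((clusters[d], labels[d]) for d in clusters)
    n_cluster = Counter(clusters.values())
    n_class = Counter(labels[d] for d in clusters)
    mi = sum(nij / n * math.log(n * nij / (n_cluster[c] * n_class[k]))
             for (c, k), nij in joint.items())
    h_cluster = -sum(v / n * math.log(v / n) for v in n_cluster.values())
    h_class = -sum(v / n * math.log(v / n) for v in n_class.values())
    mean_h = (h_cluster + h_class) / 2
    return mi / mean_h if mean_h > 0 else 1.0

labels = {1: "x", 2: "x", 3: "y", 4: "y"}
perfect = {1: 0, 2: 0, 3: 1, 4: 1}   # clusters match the classes exactly
lumped = {1: 0, 2: 0, 3: 0, 4: 0}    # everything in a single cluster
```

On the toy data, the perfect clustering scores a purity of 1 and an entropy of 0, while the single-cluster solution scores a purity of 0.5 and an entropy of 1 bit. Higher Purity and NMI are better; lower Entropy is better.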

The clustering track explicitly tests the Jardine and van Rijsbergen cluster hypothesis (1971), which states that documents that cluster together have a similar relevance to a given query. The task is to split the XML Mining collection of 144,625 documents into disjoint clusters for collection selection. If the cluster hypothesis holds true, and if suitable clustering can be achieved, then a clustering solution will minimise the number of clusters that need to be searched to satisfy any given query. There are important practical reasons for performing collection selection. If only a small fraction of clusters (and hence documents) needs to be searched, then the throughput of an information retrieval system will be greatly improved.
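The intuition behind collection selection can be shown with a toy cost calculation: for one query, count how many clusters contain at least one relevant document and what fraction of the collection those clusters hold. This is a simplified sketch under stated assumptions, not the measure used by the track; the function name search_cost and the toy data are hypothetical.

```python
from collections import Counter

def search_cost(clusters, relevant):
    """Return (clusters_to_search, fraction_of_documents_searched).

    `clusters` maps doc id -> cluster id; `relevant` is the set of doc ids
    judged relevant to one query. A cluster must be searched if it holds
    at least one relevant document.
    """
    sizes = Counter(clusters.values())
    hit = {clusters[doc] for doc in relevant if doc in clusters}
    docs_searched = sum(sizes[c] for c in hit)
    return len(hit), docs_searched / len(clusters)

relevant = {1, 2}
grouped = {1: "a", 2: "a", 3: "b", 4: "b", 5: "c", 6: "c"}    # relevant docs together
scattered = {1: "a", 2: "b", 3: "a", 4: "b", 5: "c", 6: "c"}  # relevant docs split

grouped_cost = search_cost(grouped, relevant)      # one cluster, a third of the documents
scattered_cost = search_cost(scattered, relevant)  # two clusters, two thirds of the documents
```

When the relevant documents sit in one cluster, only that cluster has to be searched; when they are scattered, more clusters and more documents must be visited to find the same results.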

This task imposes a restriction on the number of clusters. You can submit several clustering solutions with different numbers of clusters: 50, 100, 200, 500 and 1000. A variation of +/- 5% around each of these numbers is allowed. Any solution whose number of clusters falls outside 50, 100, 200, 500 or 1000 with +/- 5% variation will not be considered in the evaluation. You are also encouraged to submit clusters of homogeneous size.
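A submission can be checked against this restriction before it is sent. The sketch below is a minimal example that treats the +/- 5% bounds as inclusive, which is an assumption; the names ALLOWED_TARGETS and matching_target are hypothetical.

```python
ALLOWED_TARGETS = (50, 100, 200, 500, 1000)

def matching_target(num_clusters, tolerance_percent=5):
    """Return the allowed target that num_clusters falls within, else None.

    Integer arithmetic avoids floating-point rounding at the boundaries.
    Bounds are treated as inclusive (an assumption).
    """
    for target in ALLOWED_TARGETS:
        if abs(num_clusters - target) * 100 <= tolerance_percent * target:
            return target
    return None

assert matching_target(105) == 100    # exactly +5% of 100
assert matching_target(106) is None   # just outside every allowed range
```

For example, a solution with 48 clusters counts towards the 50-cluster setting, while one with 47 clusters matches none of the settings.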

The submission file should begin with four comment lines, each consisting of a leading # followed by text. The first comment line contains the team name. The second comment line contains a description of the approach employed. The third comment line contains a description of the hardware the approach was run on. The fourth comment line contains the runtime of the clustering solution. Each of the remaining lines consists of a single document id, cluster id pair, one for each document.
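The format described above can be written and read back with a few lines of code. This sketch assumes that the document id and cluster id are separated by a comma with no other delimiters; the sample submission file remains the authoritative reference. The function names and the example values are hypothetical.

```python
import io

def write_submission(out, team, approach, hardware, runtime, assignments):
    """Write four '#' comment lines, then one 'doc_id,cluster_id' line per document."""
    for text in (team, approach, hardware, runtime):
        out.write(f"# {text}\n")
    for doc_id, cluster_id in assignments.items():
        out.write(f"{doc_id},{cluster_id}\n")

def read_submission(lines):
    """Parse a submission; return (comment_texts, {doc_id: cluster_id})."""
    comments, assignments = [], {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            comments.append(line[1:].strip())
            continue
        doc_id, cluster_id = (part.strip() for part in line.split(","))
        if doc_id in assignments:
            raise ValueError(f"document {doc_id} assigned more than once")
        assignments[doc_id] = cluster_id
    if len(comments) != 4:
        raise ValueError("expected exactly four comment lines")
    return comments, assignments

# Round trip through an in-memory buffer with made-up values.
buffer = io.StringIO()
write_submission(buffer, "Example Team", "k-means on stemmed words",
                 "8-core workstation", "2 hours", {"1001": "7", "1002": "3"})
comments, assignments = read_submission(buffer.getvalue().splitlines())
```

The reader rejects duplicate document ids because the clusters are required to be disjoint, so each document should appear exactly once.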

Link to sample clustering submission file

Schedule

30/Aug/2010 Release of various data representations for clustering
30/Oct/2010 Submission deadline for clustering and classification results
05/Nov/2010 Release of classification and classes-to-clusters mapping results
22/Nov/2010 Submission deadline for papers for pre-proceedings
13-15 Dec 2010 Release of collection selection results at the Workshop
13-15 Dec 2010 INEX Workshop in Amsterdam

Evaluation

The classification solution for the test dataset will be evaluated by using the standard criteria such as F-score and others to determine the quality of classes.

The clustering solution (containing one of 50, 100, 200, 500 or 1000 clusters) will be evaluated by using the standard criteria such as Purity, Entropy, F-score, Normalised Mutual Information and others to determine the quality of clusters. This evaluation utilises the classes-to-clusters mapping, which assumes that the classification of the documents in the collection is known (i.e., each document has one or more class labels). The clustering solutions are then evaluated with respect to this predefined classification. It is important to note that the class labels are not used in the process of clustering, but only for the purpose of evaluating the clustering results.

The clustering solutions will also be evaluated to determine the quality of clusters relative to the optimal collection selection goal, given a set of queries. Better clustering solutions in this context will tend to (on average) group together relevant results for (previously unseen) ad hoc queries. Real ad hoc retrieval queries and their manual assessment results will be utilised in this evaluation. This novel approach evaluates the clustering solutions relative to a very specific objective: clustering a large document collection in an optimal manner in order to satisfy queries while minimising the search space. Results of this second evaluation will be released at the INEX workshop in December.

Organisers

Richi Nayak
Queensland University of Technology
r.nayak@qut.edu.au

Chris De Vries
Queensland University of Technology
chris@de-vries.id.au

Andrea Tagarelli
University of Calabria
tagarelli@deis.unical.it

Sangeetha Kutty
Queensland University of Technology
s.kutty@qut.edu.au