INEX 2009 Book Track

Overview

The goal of the Book Track is to promote interdisciplinary research into techniques that support users in reading, searching, and navigating the full texts of digitized books, and to provide a forum for the exchange of research ideas and contributions. Focusing on topics of interest in the fields of information retrieval (IR), human-computer interaction (HCI), digital libraries (DL), and eBooks, the track in 2009 will explore the following four tasks:
  1. Book Retrieval: Constructing a reading list for selected Wikipedia articles using domain-specific full-text search methods on a collection of over 50,000 digitized books,
  2. Focused Book Search: Applying focused retrieval approaches to digitized books to return users relevant book parts,
  3. Active Reading: Conducting user studies into active reading, i.e., exploring how and why readers use eBooks in specific scenarios with a focus on eBook usability, and
  4. Structure Extraction: Building navigation tools for digitized books by constructing hyperlinked tables of contents from OCR text and layout information for a sample of 1,000 books.

Book corpus

The track builds on a collection of digitized books, provided by Microsoft Live Book Search and the Internet Archive (for non-commercial purposes only). The corpus consists of over 50,000 digitized out-of-copyright books. The OCR content of the books is stored in an XML format, referred to as BookML. Most books also have an associated metadata file (*.mrc), which contains publication (author, title, etc.) and classification information in MAchine-Readable Cataloging (MARC) record format.
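
Since only most books come with a metadata file, a first step when working with the corpus is often to find out which books have one. The following is a minimal sketch of that check, not part of the official tooling. It assumes (hypothetically) that each book's BookML file has an .xml extension and that its MARC file sits next to it with the same base name and an .mrc extension; the actual layout of the distributed corpus may differ, and the function names are illustrative only.

```python
from pathlib import Path


def pair_books_with_metadata(corpus_root):
    """Map each BookML file to its MARC file, or to None if none is found.

    Assumption (not confirmed by the track description): the metadata file
    shares the book file's directory and base name, e.g. book1.xml/book1.mrc.
    """
    pairs = {}
    for xml_path in sorted(Path(corpus_root).rglob("*.xml")):
        mrc_path = xml_path.with_suffix(".mrc")
        pairs[xml_path] = mrc_path if mrc_path.is_file() else None
    return pairs


def books_without_metadata(pairs):
    """Return the BookML paths that have no associated MARC file."""
    return [xml_path for xml_path, mrc_path in pairs.items() if mrc_path is None]
```

Calling pair_books_with_metadata on the corpus directory and passing the result to books_without_metadata would list the books for which publication and classification information is unavailable, so that a system can fall back on the OCR text alone.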

To access the corpus, participants must first complete the Book Corpus License Agreement and return it by registered post. The return address is given on page 1 of the License document. Once access is authorized, participants can either download the collection from www.booksearch.org.uk (using their INEX username and password) or receive it on a USB 2.0 HDD (at a cost of about 70 GBP); details can be found on page 2 of the License document.

Resources

Participants will have access to a dedicated Book Track server, which is planned to offer a number of services.

Schedule

Book Retrieval and Focused Book Search Tasks:

May 15           Book corpus ready and available for download
June 22          Topic creation guidelines distributed
July 6           Topic submission deadline
July 10          Topics and task descriptions distributed
Sep 10           Run submissions deadline
Sep 21 - Oct 18  Relevance assessments
Oct 30           Release of assessments and results
Nov 23           Papers due for the INEX 2009 workshop

Active Reading Task:

July 15          Deadline for setup: Bookshelf and user tasks
Sep 15           Submission deadline for user study results
Oct 20           Distribution of collected data
Nov 23           Papers due for the INEX 2009 workshop

Structure Extraction Task:

May 8            Registration deadline
June 24          Submissions due
June 26          Start of the groundtruth annotation
July 10          Groundtruth annotation due
July 26-29       Result announcement and competition report presentation at ICDAR 2009 [participation and attendance are welcome but not required]
Nov 23           Papers due for the INEX 2009 workshop

Organizers

Gabriella Kazai
Microsoft Research Cambridge
gabkaz@microsoft.com

Antoine Doucet
University of Caen
doucet@info.unicaen.fr

Monica Landoni
University of Lugano
monica.landoni@unisi.ch

Marijn Koolen
University of Amsterdam
M.H.A.Koolen@uva.nl