Making Legacy Documents Available on the Web
Project Overview

This project aims to develop document image analysis techniques for making legacy or classical Indian language documents web accessible, so that the classical Indian knowledge bank, currently available only in paper form, can become widely accessible through digital libraries.

Techniques for providing intelligent electronic access to images of documents that do not exist in electronic form can ensure wider dissemination and use of classical paper-based information. In the Indian context this is especially important: a large volume of old printed texts in a variety of languages, as well as handwritten manuscripts, cannot easily be converted to an editable electronic format because reliable OCR technology is lacking. Making these documents web accessible ensures wider access to, and easier dissemination of, this classical knowledge bank. Documents in Indian languages that are not currently published through electronic means (minor languages) can also be made widely accessible through image-based intelligent access mechanisms. Availability of documents through low-cost information kiosks connected to the Internet can make heritage literature widely available even in the interior of the country. This technology can also be used for intelligent management of the contents of scanned handwritten or typed official records.

We Investigated

  • Techniques for decomposing documents (which come in a variety of formats) into logical components.
  • Representation of documents in terms of logical components, so that documents are available on the web not as raw images but as collections of logically related components.
  • Encoding schemes for these components, so that compression efficiency can be maximized by making use of the logical structure of the document.
  • Transcoding of document images and image components, so that information delivery can be maximized under given constraints on the bandwidth and the presentation device.
  • Techniques for automatic grouping of document components, so that a logically correlated presentation scheme can be generated.
  • Development of an appropriate presentation engine incorporating:
    • A user-friendly front-end interface.
    • Image-based indexing schemes, specially designed for document images, at the back end for efficient query-based retrieval.
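
To make the ideas of component-based representation and bandwidth-constrained transcoding more concrete, the following is a minimal Python sketch, not the project's actual implementation. It assumes a page has already been decomposed into logical components, each available in several pre-encoded variants. The class names (`Variant`, `Component`), the function `select_variants`, the variant labels, and all byte sizes are hypothetical and chosen only for illustration. The selection strategy is a simple greedy upgrade, used here in place of whatever scheme the project adopted.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    """One encoded version of a component (hypothetical structure)."""
    label: str        # illustrative name, e.g. "low" or "high"
    size_bytes: int   # encoded size of this version
    quality: int      # higher means better visual quality


@dataclass
class Component:
    """A logical component of a page image (hypothetical structure)."""
    kind: str         # e.g. "heading", "text_block", "figure"
    bbox: tuple       # (x, y, width, height) in page pixels
    variants: list    # list of Variant objects


def select_variants(components, budget_bytes):
    """Pick one variant per component under a byte budget.

    Greedy sketch: start with the lowest-quality variant of every
    component, then upgrade components in reading order for as long as
    the extra bytes still fit. The lowest-quality baseline is always
    kept, even if it alone exceeds the budget.
    """
    # Order each component's variants from lowest to highest quality.
    ladders = [sorted(c.variants, key=lambda v: v.quality) for c in components]
    choice = [0] * len(components)
    used = sum(ladder[0].size_bytes for ladder in ladders)

    upgraded = True
    while upgraded:
        upgraded = False
        for i, ladder in enumerate(ladders):
            nxt = choice[i] + 1
            if nxt < len(ladder):
                extra = ladder[nxt].size_bytes - ladder[choice[i]].size_bytes
                if used + extra <= budget_bytes:
                    choice[i] = nxt
                    used += extra
                    upgraded = True

    return [ladder[k] for ladder, k in zip(ladders, choice)], used


# Illustrative page with made-up sizes.
page = [
    Component("heading", (40, 30, 520, 40),
              [Variant("low", 2000, 1), Variant("high", 5000, 2)]),
    Component("text_block", (40, 90, 520, 400),
              [Variant("low", 10000, 1), Variant("high", 40000, 2)]),
    Component("figure", (40, 510, 520, 300),
              [Variant("low", 8000, 1), Variant("high", 60000, 2)]),
]

chosen, total = select_variants(page, budget_bytes=60000)
for comp, var in zip(page, chosen):
    print(comp.kind, var.label, var.size_bytes)
print("total bytes:", total)
```

Because each component carries its own variants, a low-bandwidth client can receive readable text at full quality while a large figure is delivered in reduced form, which is the kind of trade-off that a raw page image does not allow.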

Prof. Shantanu Chaudhary

Kavita, Ritu, Ruchi

Ministry of Communication and Information Technology

Participating Institutes
  • IIT-Delhi
  • CEERI Pilani
  • IIT Bombay
  • Jadavpur University
  • IIIT Hyderabad
  • CDAC-Kolkata