Development of Robust Document Analysis & Recognition System for Printed Indian Scripts (OCR)
Project Overview

The objective of this project is to develop robust OCR systems for printed Indian scripts that deliver the performance needed to convert legacy printed documents into an electronically accessible format. The system specifications are as follows:

  • OCR systems will be developed for the following scripts
    • Bangla, Devanagari, Malayalam, Gujarati, Telugu, Tamil, Oriya, Tibetan, Gurmukhi, Kannada
    • Font and point-size independent recognition capability
  • Input
    • Scanned Pages of books published after 1950
    • For each script, 25 books published at different times over the last 50 years will be considered. Each book is expected to have 300 pages on average, giving roughly 7,500 pages per script. These pages are expected to be representative of the page quality that the developed OCR systems will be able to handle. They are also expected to contain representative examples of scripts (fonts and sizes) and layout patterns (including graphics and image components). These pages will form the annotated corpus for development and testing.
  • Output
    • XML/HTML representation of the pages with appropriate tags so that layout and font information along with graphics and image component can be retained to the maximum possible extent.
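
As an illustration of the kind of tagged output described above, the following minimal Python sketch builds an XML page record that keeps block positions and font information. The element and attribute names (`page`, `block`, `font`, `size`), the function names and the sample values are assumptions made for this example; the project's actual tagging scheme is not specified here.

```python
import xml.etree.ElementTree as ET


def build_page_xml(page_number, blocks):
    """Build an XML tree for one OCR'ed page.

    `blocks` is a list of dicts with hypothetical keys: "type" ("text" or
    "image"), "bbox" (x, y, width, height) and, for text blocks only,
    "font", "size" and "text".
    """
    page = ET.Element("page", number=str(page_number))
    for block in blocks:
        x, y, w, h = block["bbox"]
        node = ET.SubElement(
            page, "block",
            type=block["type"], x=str(x), y=str(y), width=str(w), height=str(h),
        )
        if block["type"] == "text":
            # Font information is retained as attributes on text blocks.
            node.set("font", block["font"])
            node.set("size", str(block["size"]))
            node.text = block["text"]
    return page


def page_to_string(page):
    """Serialize the page tree to a Unicode string."""
    return ET.tostring(page, encoding="unicode")


# Sample data, invented for illustration only.
sample_blocks = [
    {"type": "text", "bbox": (40, 60, 500, 30),
     "font": "ExampleFont", "size": 12, "text": "भारत"},
    {"type": "image", "bbox": (40, 120, 300, 200)},
]

print(page_to_string(build_page_xml(12, sample_blocks)))
```

Keeping the bounding box and font on every block is one simple way to let a presentation layer rebuild the page layout; an HTML rendering could be generated from the same tree.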


The technology needed to meet the system specification can be broadly categorized into three classes. These are complementary technologies that will be combined to develop a robust OCR system:

  • Script Independent Technology
    • Scanning Parameter Determination
    • Preprocessing (skew correction, enhancement, noise removal)
    • Page Segmentation & Layout Analysis and labeling of components. This task involves text and non-text separation.
    • Line segmentation, word segmentation and word representation scheme in a script-independent fashion
  • Script Dependent Technology
    • Line segmentation and word-level segmentation using script-dependent features
    • Character Segmentation and Zoning for specific scripts
    • Recognition engine for the complete character set of these scripts. The recognition engine will involve feature selection, a classifier and OCR-linked post-processing
  • Supporting Technology and Resource Development
    • Annotated corpus preparation for development and testing of OCR for the above-mentioned scripts
    • Development of tagging structure and representation scheme for the electronic representation of the OCR’ed document in XML format.
    • Development of the software architecture, the component model, integration scheme, presentation engine and user interface design.
    • Integration scheme for post-processing using language based resources.
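
The script-independent line and word segmentation steps listed above are often approached with projection profiles. The sketch below is one minimal way to do this on a binarized page (1 = ink, 0 = background). It is not the project's implementation, and the function names and the `min_ink` and `min_gap` thresholds are hypothetical.

```python
def horizontal_profile(image):
    """Count ink pixels (value 1) in each row of a binary image."""
    return [sum(row) for row in image]


def segment_lines(image, min_ink=1):
    """Return (top, bottom) row ranges, bottom exclusive, one per text line.

    A row belongs to a text line when it holds at least `min_ink` ink pixels.
    """
    lines = []
    start = None
    for y, ink in enumerate(horizontal_profile(image)):
        if ink >= min_ink and start is None:
            start = y                      # a new text line begins
        elif ink < min_ink and start is not None:
            lines.append((start, y))       # the current line ends
            start = None
    if start is not None:
        lines.append((start, len(image)))  # line touches the bottom edge
    return lines


def segment_words(image, top, bottom, min_gap=2):
    """Return (left, right) column ranges, right exclusive, for one line.

    Runs of blank columns shorter than `min_gap` are treated as gaps
    inside a word rather than gaps between words.
    """
    width = len(image[0])
    profile = [sum(image[y][x] for y in range(top, bottom)) for x in range(width)]
    words = []
    start = None      # left edge of the current word
    last_ink = None   # most recent column that contained ink
    for x, ink in enumerate(profile):
        if ink > 0:
            if start is None:
                start = x
            elif x - last_ink - 1 >= min_gap:
                words.append((start, last_ink + 1))
                start = x
            last_ink = x
    if start is not None:
        words.append((start, last_ink + 1))
    return words


# Tiny synthetic page: two text lines, the first containing two words.
page = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 0, 0, 0, 1, 1],
    [1, 1, 0, 1, 1, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]

for top, bottom in segment_lines(page):
    print((top, bottom), segment_words(page, top, bottom))
```

This approach assumes the page has already been deskewed and binarized, since a tilted page smears the row profile and merges adjacent lines.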

Performance:

  • Expected accuracy is 99% for page segmentation (text and non-text separation)
  • Character-level recognition accuracy after post-processing depends on the quality of the page, the quality of printing, etc. The OCR systems are expected to provide a minimum of 95% recognition accuracy for different types of documents
  • Word-level recognition accuracy is between 85% and 90% for all scripts
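
To make the character-level and word-level targets above concrete, the following sketch computes both accuracies from a reference transcription and an OCR output using edit distance. This is one common way to define the metrics, assumed here for illustration; the project's exact evaluation protocol is not stated. Characters are taken to be Unicode code points, which is a simplification for Indian scripts where one visual unit can span several code points. All function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def accuracy(ref, hyp):
    """Accuracy = 1 - (edit distance / reference length), floored at 0."""
    if not ref:
        return 1.0 if not hyp else 0.0
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))


def character_accuracy(ref_text, ocr_text):
    """Accuracy over Unicode code points (assumed unit of a 'character')."""
    return accuracy(ref_text, ocr_text)


def word_accuracy(ref_text, ocr_text):
    """Accuracy over whitespace-separated words."""
    return accuracy(ref_text.split(), ocr_text.split())


reference = "the quick brown fox"
ocr_output = "the quick brovn fox"   # one misrecognized character

print(f"character accuracy: {character_accuracy(reference, ocr_output):.3f}")
print(f"word accuracy:      {word_accuracy(reference, ocr_output):.3f}")
```

With these definitions, a single wrong character in this four-word, 19-character line gives a character accuracy of about 94.7% but a word accuracy of 75%, which is consistent with word-level targets being set lower than character-level ones.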


Script        Maximum accuracy for print      Maximum accuracy for print
              material, 1950-75               material, post 1975
Bangla        97%                             99%
Devanagari    97%                             99%
Telugu        96%                             98%
Tamil         96%                             98%
Gurmukhi      98%                             99%
Malayalam     96%                             98%
Oriya         96%                             98%
Gujarati      96%                             98%
Tibetan       96%                             97%
Kannada       96%                             98%

Supervisors
Prof. Shantanu Chaudhary

Members
Ritu, Anju, Renu, Kavita

Sponsors
Ministry of Communication and Information Technology

Participating Institutes
  • IIT-Delhi
  • CEERI Pilani
  • IIT Bombay
  • Jadavpur University
  • IIIT Hyderabad
  • CDAC-Kolkata