Friday, October 23, 2009

Innovative OCR Correction

The National Library of Australia has implemented an innovative approach that balances the cost of OCR correction against the user's need to search the full text of historical newspapers: it enlists the help of the users themselves. When undertaking a large historical digitization project, publishers often face decisions about how much full-text OCR correction to undertake. With projects such as historical newspaper collections, it is highly desirable for the user to be able to search the full text of the archive for people, places or other factual information. The user's success is largely influenced by the accuracy of the underlying text extracted by the OCR engine. The accuracy of this extraction ultimately depends on the quality of the original source, which varies widely across the centuries.


The National Library of Australia, along with the Australian State and Territorial Libraries, has created the Australian Newspaper project. Over 4 million newspaper articles are currently available in the archive and are full-text searchable. To overcome the high cost of OCR correction, the project lets users correct the underlying text themselves. This approach has resulted in an impressive 3.4 million lines of electronic text corrected in over 150,000 articles. This community effort will surely benefit searchers for ages to come.
