Wednesday, October 28, 2009

Getting into the (xml)Flow of Things

By now there is widespread acceptance that XML's tagging and indexing capability is a powerful tool for leveraging a publisher's valuable content assets. Just as important is implementing a publishing workflow built on a content management system that stores documents in a native XML format. The goal is a workflow in which data is not only created and tagged in XML but also stored as native XML, so that it can be repurposed as needed without transforming it on the way into and out of the CMS.


Consider the challenges presented by the following typical workflow. When content is tagged in a rich XML schema but stored in a relational database, the first step is to transform the data from XML so that it can be stored in relational tables. Once it is stored, repurposing the data for publication, say on the web, requires another conversion to recreate the XML. This laborious back-and-forth of repeated transforms never results in a timely or high-quality production process.
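To make the double transform concrete, here is a minimal sketch in Python using only the standard library. The element names (article, title, para), the two-table layout, and the function names are assumptions made up for illustration; a real publishing schema would be far richer, which is exactly what makes these two steps so costly.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical article markup; the element names are illustrative only.
SOURCE_XML = (
    "<article id='a1'>"
    "<title>Flood Relief</title>"
    "<para>First paragraph.</para>"
    "<para>Second paragraph.</para>"
    "</article>"
)


def shred(conn, xml_text):
    """Transform 1: flatten the XML tree into relational rows."""
    root = ET.fromstring(xml_text)
    conn.execute(
        "INSERT INTO article (id, title) VALUES (?, ?)",
        (root.get("id"), root.findtext("title")),
    )
    for seq, para in enumerate(root.findall("para")):
        conn.execute(
            "INSERT INTO para (article_id, seq, body) VALUES (?, ?, ?)",
            (root.get("id"), seq, para.text),
        )


def rebuild(conn, article_id):
    """Transform 2: reassemble XML from the rows for publication."""
    title = conn.execute(
        "SELECT title FROM article WHERE id = ?", (article_id,)
    ).fetchone()[0]
    root = ET.Element("article", id=article_id)
    ET.SubElement(root, "title").text = title
    rows = conn.execute(
        "SELECT body FROM para WHERE article_id = ? ORDER BY seq",
        (article_id,),
    )
    for (body,) in rows:
        ET.SubElement(root, "para").text = body
    return ET.tostring(root, encoding="unicode")


def round_trip(xml_text):
    """Run both transforms against an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE article (id TEXT PRIMARY KEY, title TEXT)")
    conn.execute("CREATE TABLE para (article_id TEXT, seq INTEGER, body TEXT)")
    shred(conn, xml_text)
    return rebuild(conn, ET.fromstring(xml_text).get("id"))
```

Even in this toy case, every new element or attribute in the source means changing the tables, the shredding code, and the rebuilding code together. A native XML store avoids both functions entirely.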


Certainly, just getting the data into the relational database can be a long process to begin with. But consider the challenge of receiving XML data from multiple, even hundreds of, sources on a daily basis. The data must first be standardized, which is a huge undertaking. In his post "The First Step's a Doozy," Dave Kellogg, CEO of MarkLogic, describes Step 1, loading content into the relational database system, as a daunting challenge.


An end-to-end publishing workflow realizes its full potential only when it is built around content management that does not merely "handle" XML as another data type but employs a central native XML repository. Once the XML is flowing in this manner, publishers can turn content that was cumbersome to repurpose into an asset that is easy to assemble in any form desired.

Monday, October 26, 2009

Practical Application of XQuery

End-to-end XML-based publishing workflows, teamed with XML content management systems, have made it possible for publishers to distribute custom-published college course materials to university students in a variety of formats. Applications built on XQuery, a language designed to query repositories of XML data, allow college professors to search, manipulate, and assemble content into custom course materials for distribution to their students. The previously cumbersome process of custom coursepack printing becomes straightforward: course material can be provided in eBook or print formats, and local content (PDF, Word, etc.) can be included for a truly customized package.


At a recent XML-in-Practice conference, a joint presentation by representatives from John Wiley & Sons and McGraw-Hill demonstrated their implementations of web-based custom publishing solutions utilizing XQuery on MarkLogic XML Server systems. Wiley's product, Custom Select, allows the user to search and select Wiley content at a section or chapter level and then customize the output with a cover, arrange the order of the content, and upload local content. The resulting custom course material can then be previewed and submitted for printing or for the creation of eBooks. McGraw-Hill's implementation will provide the same level of functionality.


XQuery is particularly well suited to this application because it can search, extract, and manipulate XML data from documents across many types of data sources. For more information on XQuery, see "XQuery 1.0: An XML Query Language" by the World Wide Web Consortium (W3C) or the XQuery Wikipedia entry.
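The search-and-assemble pattern can be sketched without an XQuery engine. The example below is not XQuery and is not how Wiley or McGraw-Hill built their products; it is a rough Python analogue using the standard library, with made-up book markup (book, chapter, section, title) and hypothetical function names, meant only to show the shape of the operation: find matching sections across several documents, then combine them into one new document.

```python
import xml.etree.ElementTree as ET

# Hypothetical source books; markup and titles are invented for illustration.
BOOKS = [
    "<book title='Biology Basics'><chapter>"
    "<section><title>Cell Structure</title><p>Cells have membranes.</p></section>"
    "<section><title>Genetics</title><p>Genes carry traits.</p></section>"
    "</chapter></book>",
    "<book title='Lab Methods'><chapter>"
    "<section><title>Staining Cell Samples</title><p>Apply the dye.</p></section>"
    "</chapter></book>",
]


def search_sections(books_xml, keyword):
    """Return every <section> whose title contains the keyword (case-insensitive)."""
    matches = []
    for xml_text in books_xml:
        book = ET.fromstring(xml_text)
        for section in book.iter("section"):
            title = section.findtext("title", default="")
            if keyword.lower() in title.lower():
                matches.append(section)
    return matches


def assemble_coursepack(course_title, sections):
    """Combine the selected sections, in the order given, into one new document."""
    pack = ET.Element("coursepack", title=course_title)
    for section in sections:
        pack.append(section)
    return pack


if __name__ == "__main__":
    pack = assemble_coursepack("Bio 101", search_sections(BOOKS, "cell"))
    print(ET.tostring(pack, encoding="unicode"))
```

In a native XML repository the loop over documents would be replaced by a single query against the whole collection, which is where a language such as XQuery earns its keep.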

Friday, October 23, 2009

Innovative OCR Correction

The National Library of Australia has implemented an innovative approach, aided by the efforts of the users themselves, to balancing the cost of OCR correction against the user's need to search the full text of historical newspapers. When undertaking a large historical digitization project, publishers often face decisions about how much full-text OCR correction to undertake. With projects such as historical newspaper collections, it is highly desirable for the user to be able to search the full text of the archive for people, places, or other factual information. The user's success is largely determined by the accuracy of the underlying text extracted by the OCR engine, and that accuracy ultimately depends on the quality of the original source, which varies widely across the centuries.


The National Library of Australia, along with the Australian State and Territorial Libraries, has created the Australian Newspaper project. Over 4 million newspaper articles are currently available in the archive and are full-text searchable. To overcome the high cost of OCR correction, the project lets users correct the underlying text themselves. This approach has resulted in an impressive 3.4 million lines of electronic text corrected in over 150,000 articles. This community effort will surely benefit searchers for ages to come.

Wednesday, October 21, 2009

ePub Supported eReader Introduced This Week

The ePub digital book standard gets a big boost this week with the introduction of Barnes & Noble's new eReader. The BN Nook supports the ePub standard, which instantly makes available over 500,000 free books from Google Books. The Google books are already showing up on BN.com shelves. They are not available on the rival Kindle from Amazon, which uses a proprietary format. In addition to the free books, BN has over 500,000 more books available for its new reader.

ePub is an XML-based format composed of open standards from the IDPF (International Digital Publishing Forum), the trade and standards association for the digital publishing industry. The format allows publishers to produce and distribute their content in one format and provides consumers with interoperability across a number of devices, including the new Nook and the Sony Portable Readers.
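For a sense of what "XML format" means here: an ePub file is a zip archive whose first entry is an uncompressed file named mimetype, and whose META-INF/container.xml points to the XML package document describing the book. The Python sketch below builds only that outer skeleton and reads the pointer back. The path OEBPS/content.opf and the function names are assumptions chosen for illustration, and the result is not a complete, valid ePub since it contains no package document or content.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

CONTAINER_NS = "urn:oasis:names:tc:opendocument:xmlns:container"

# The package path below is a common convention, assumed here for illustration.
CONTAINER_XML = f"""<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="{CONTAINER_NS}">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf"
              media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def build_skeleton():
    """Return the bytes of a zip holding only the ePub container skeleton."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # The mimetype entry must come first and must be stored uncompressed.
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/container.xml", CONTAINER_XML,
                    compress_type=zipfile.ZIP_DEFLATED)
    return buf.getvalue()


def find_package_path(epub_bytes):
    """Read container.xml and return the path of the package document."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as zf:
        root = ET.fromstring(zf.read("META-INF/container.xml"))
    rootfile = root.find("c:rootfiles/c:rootfile", {"c": CONTAINER_NS})
    return rootfile.get("full-path")
```

Because every layer is a zip entry or an XML file, any device or tool that understands these open standards can locate and read the content, which is the interoperability the format promises.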