Google Acquisition Will Help Correct Errors in Scanned Works

Last Updated Sep 17, 2009 5:27 PM EDT

One of the issues swirling around the Google Book Search (GBS) case is the relatively high rate of errors in the images of the millions of old books Google has scanned.

Google's Engineering Director for GBS, Daniel Clancy, addressed this earlier this month when he told Bnet, "I liken our approach to scanning as similar to a progressive JPG. If you consider the 'image' to be all the world's books and you consider a 'fully rendered image' as each of these books being in a digital form at some desired quality level, then our approach is to be able to get all of the books scanned with the knowledge that there is some statistical likelihood that any individual book and/or page might have an error.

"Our approach allows a large portion of the 'full image' to be rendered as soon as possible with the ability to fill in details in the image over time."

This week, Google took an additional step toward speeding up error correction by acquiring reCAPTCHA, the company that owns a system, created by researchers at Carnegie Mellon, that helps improve the scanning process. (CAPTCHA stands for "Completely Automated Public Turing Test to Tell Computers and Humans Apart," and has proved to be an effective tool against spam, bots and the like.)

Back in August 2008, we noted here that CAPTCHAs had the potential to substantially improve the scanning process, because the Carnegie Mellon team had noticed that there were parallels between CAPTCHAs and the problem words in scanned works: In both cases, the letters were distorted so much that computers weren't capable of recognizing the word. So, they created reCAPTCHA to convert words that weren't recognized by character recognition software into CAPTCHAs.
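The core idea can be sketched in a few lines. This is a hypothetical illustration of the control-word scheme described above, not reCAPTCHA's actual implementation: each challenge pairs a word the OCR software already knows with one it could not read, a user's guess for the unknown word counts only if they also typed the known word correctly, and a transcription is accepted once enough independent users agree. All names and the agreement threshold here are invented for the example.

```python
from collections import defaultdict, Counter

# Hypothetical sketch (names invented): tally of crowd transcriptions
# for each word the OCR software could not recognize.
votes = defaultdict(Counter)  # unknown word id -> {transcription: count}

def record_response(unknown_id, control_answer, control_truth, unknown_answer):
    """Trust a user's guess for the unknown word only when they
    also solved the known control word correctly."""
    if control_answer.strip().lower() == control_truth.lower():
        votes[unknown_id][unknown_answer.strip().lower()] += 1

def resolve(unknown_id, threshold=3):
    """Return the leading transcription once enough users agree."""
    tally = votes[unknown_id]
    if not tally:
        return None
    word, count = tally.most_common(1)[0]
    return word if count >= threshold else None

# Three users solve the control word and agree on the unknown word.
for _ in range(3):
    record_response("scan-0042", "upon", "upon", "morning")
print(resolve("scan-0042"))  # -> morning
```

A response with a wrong control answer is simply discarded, so random or malicious input never contaminates the tally; agreement among independent users is what promotes a guess to accepted text.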

The system has proved useful in efforts to digitize old, damaged texts and newspapers.

In announcing the acquisition yesterday, Luis von Ahn, co-founder of reCAPTCHA, and Will Cathcart, Google Product Manager, said in a joint statement: "Computers find it hard to recognize these words because the ink and paper have degraded over time, but by typing them in as a CAPTCHA, crowds teach computers to read the scanned text.

"In this way, reCAPTCHA's unique technology improves the process that converts scanned images into plain text, known as Optical Character Recognition (OCR). This technology also powers large scale text scanning projects like Google Books and Google News Archive Search. Having the text version of documents is important because plain text can be searched, easily rendered on mobile devices and displayed to visually impaired users. So we'll be applying the technology within Google not only to increase fraud and spam protection for Google products but also to improve our books and newspaper scanning process."

This represents a major step forward in the GBS project, I believe. If you squint just a bit, you can begin to see the meta-image Clancy describes as a progressive JPG coming more clearly into view.

Please check out Erik Sherman's post on this acquisition at Bnet Technology.

Related links:
Google's Chief Engineer Explains the Book Search Initiative
Life After Newspapers in the Digital City

  • David Weir

    David Weir is a veteran journalist who has worked at Rolling Stone, California, Mother Jones, Business 2.0, SunDance, the Stanford Social Innovation Review, MyWire, 7x7, and the Center for Investigative Reporting, which he cofounded in 1977. He's also been a content executive at KQED, Wired Digital, and Excite@Home. David has published hundreds of articles and three books, including "Raising Hell: How the Center for Investigative Reporting Gets Its Story," and has been teaching journalism for more than 20 years at U.C. Berkeley, San Francisco State University, and Stanford.