Google's Chief Engineer Explains the Book Search Initiative

Last Updated Sep 2, 2009 10:55 AM EDT

Recently, what had been a loose coalition of authors, librarians, and book publishers opposed to the proposed settlement of a class-action lawsuit against Google Book Search brought some heavy hitters to the table.
Amazon, Microsoft, and Yahoo announced they are joining the coalition, apparently hoping to increase pressure on the Antitrust Division of the Department of Justice, which is investigating whether the proposed deal between Google and certain copyright holders would result in a virtual monopoly over the emerging eBook market.

Opposition to the deal appears to be growing in some parts of Europe, as well, although the European Union's media commissioner recently expressed her support for Google's effort.

Lost in all the legal wrangling over this case, however, is the sheer scope and ambition of what Google is trying to do by scanning almost every book published in the U.S. over the past 86 years.
Daniel J. Clancy, PhD, is the Engineering Director for the Google Book Search Project, and he agreed to share with Bnet some of the details about how the company is approaching the problems inherent in a project of such magnitude, including the probability of errors in the scanning process:
"The quality of our scans is something that is very important to Google and we have spent a great deal of time both in developing and improving the technology we use as well as auditing and assessing the final quality of the content we are serving. We have lots of data and analysis about every type of conceivable error that occurs within Google Book Search, including the frequency and the source.

"We monitor this closely and also analyze the costs associated with the various approaches to achieving the desired quality levels. The approach to scanning that we have developed is based upon this analysis combined with the scope of the project that we are undertaking. Due to the scale and scope of the project it has caused us to challenge lots of different assumptions to develop an approach that will allow us to consider scanning what eventually will be tens of millions of books.

"Consider the following thought exercise: Suppose you had two different approaches to scanning and QA. One of them (technology A) gets 90 percent of the books perfect and the other gets 99 percent of the books perfect (technology B). However, Option B costs 5 times more than option A. So, what approach would you use to scanning assuming that you have a fixed budget?

"If you choose option A, you can scan 5 times more books. However, 10 percent of all of the books will not be perfect. So, if you had the budget to scan 1 million books, 900,000 would be perfect, and 100,000 would not be perfect. If you choose option B, then you end up with 198,000 perfect books and 2,000 that are less than perfect.

"Obviously, Google chose option A. We explicitly did not want to be in the business of deciding which books had value and which did not. In terms of the public domain books, there are countless stories of individuals who have found nuggets of information that they did not know existed. These were not in "seminal" works, but rather were often in obscure works. As such, if we had chosen approach B, we would have probably had one-fifth as many of these stories.

"(But) in actuality the trade off is much more complex. The analogy I use is the comparison of a 'traditional JPG' compression versus 'progressive JPG.' We all remember how JPG files used to render across our screen's a pixel at a time from bottom to top. As each line of the image was rendered, it became 'perfect;' however, you could not get a sense of the whole picture until almost the entire picture was rendered.

"Of course, progressive JPG provides you with a version of the image that is at first a little fuzzy, and then the image sharpens as more of the image is rendered.

"I liken our approach to scanning as similar to progressive JPG. If you consider the 'image' to be all the world's books and you consider a 'fully rendered image' as each of these books being in a digital form at some desired quality level, then our approach is to be able to get all of the books scanned with the knowledge that there is some statistical likelihood that any individual book and/or page might have an error.

"Our approach allows a large portion of the 'full image' to be rendered as soon as possible with the ability to fill in details in the image over time.

"In the thought exercise above, I used cost as a fixed number. You also can do the same exercise using time. Google has been scanning at significant scale and we clearly are not even half way done. As such, lets assume that our initiative takes 10 to 15 years. If you slowed this down by a factor of 5 now you are talking about a 50 to 75 year project until you are done.

"We spend a great deal of effort trying to detect and correct problems. However, we are not able to detect all of the problems that may occur. As the content is used, it is also possible to allow users to identify where the problems are. For problems that negatively impact users, it will be highlighted quickly and can be fixed. On the other extreme, if a page from a book has a problem and no one ever looks at this page over a 20 year period, then the impact of this error is marginal, if not zero.

"There are other challenges including meta-data quality as well as 'product functionality.' All of these things take resources. For meta-data, sometimes the problem is in the source of our meta-data and sometimes it is in how we process or display this meta-data.

"However, in each case I believe that part of the answer is that it will take time. I fear that often people have taken for granted the ambition of the Google Books project and then when there is an issue they see it as a symbol of either Google not caring or Google not trying hard enough.

"We are constantly trying to think of different ways to address a wide variety of problems in this space including meta-data accuracy, clustering, using user feedback, multi-volume works, public domain determination, search, presentation of results, crawling the web for book references, and many other issues."
If you've read this far into this post, you probably now know more than 99 percent of Google's users about what the search company is trying to do, the approach it is taking, and some of the problems it is encountering. Because of the significance of Google's effort to convert old and out-of-print books into digital versions, I'll return to this topic in the coming days and weeks, including a look at the quality of the meta-data in the first ten million books Google has scanned and converted so far.
Related posts:
Aug. 26 Google Offers Free Downloads of a Million Books
Aug. 5 The Google Book Search Case -- for Dummies

(Note: Daniel Clancy made most of these remarks in a slightly different form on the private Read-20 list about a week ago. He and I have collaboratively edited them into this form for the Bnet readership. -- D.W.)

  • David Weir

    David Weir is a veteran journalist who has worked at Rolling Stone, California, Mother Jones, Business 2.0, SunDance, the Stanford Social Innovation Review, MyWire, 7x7, and the Center for Investigative Reporting, which he cofounded in 1977. He’s also been a content executive at KQED, Wired Digital, Salon.com, and Excite@Home. David has published hundreds of articles and three books, including "Raising Hell: How the Center for Investigative Reporting Gets Its Story," and has been teaching journalism for more than 20 years at U.C. Berkeley, San Francisco State University, and Stanford.