Application 20100040287 is titled Segmenting Printed Media Pages Into Articles. The company has already shown its interest in getting periodicals online as part of Google Books. But there are two problems. The technical one is tricky, as the application describes:
Complex printed media material, such as a newspaper, often involve columns of body text, headlines, graphic images, multiple font sizes, comprising multiple articles and logical elements in close proximity to each other, on a single page. Attempts to utilize optical character recognition in such situations are typically inadequate resulting in a wide range of multiple errors, including, for example, the inability to properly associate text from multiple columns as being from the same article, mis-associating text areas without an associated headline or those articles which cross page boundaries, and classifying large headline fonts as a graphic image.

The application describes how Google would detect blocks of text and determine how they fit together into articles. The implications are clear. Once Google could break scanned magazines and newspapers down into individual articles, it could then store and serve up these articles, perhaps using optical character recognition to create text file versions and then using the context for search as well as advertising.
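To make the segmentation problem concrete, here is a toy sketch of one way detected text blocks might be grouped into articles. It is not the method in Google's application, which this post does not spell out. It is a simple stand-in heuristic: treat large-font blocks as headlines, then attach each body block to the nearest headline above it that overlaps it horizontally. The `Block` type, the `group_into_articles` function, the 18-point headline threshold and the sample page are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A detected text region on a scanned page (hypothetical structure)."""
    text: str
    x: float          # left edge
    y: float          # top edge; larger values are lower on the page
    width: float
    font_size: float


def overlaps_horizontally(a: Block, b: Block) -> bool:
    """True when the two blocks share some horizontal extent."""
    return a.x < b.x + b.width and b.x < a.x + a.width


def group_into_articles(blocks, headline_min_font=18.0):
    """Attach each body block to the closest overlapping headline above it.

    Returns (articles, orphans): articles maps headline text to body texts
    in column order; orphans are body blocks with no headline above them,
    such as text continued from another page.
    """
    headlines = [b for b in blocks if b.font_size >= headline_min_font]
    bodies = [b for b in blocks if b.font_size < headline_min_font]
    articles = {h.text: [] for h in headlines}
    orphans = []
    # Read column by column: left to right, then top to bottom.
    for body in sorted(bodies, key=lambda b: (b.x, b.y)):
        candidates = [
            h for h in headlines
            if h.y < body.y and overlaps_horizontally(h, body)
        ]
        if not candidates:
            orphans.append(body.text)
            continue
        nearest = max(candidates, key=lambda h: h.y)  # lowest headline above
        articles[nearest.text].append(body.text)
    return articles, orphans


# Made-up page: a two-column story, a one-column story, and a stray column.
page = [
    Block("Council Approves Budget", x=0, y=0, width=200, font_size=24),
    Block("budget, column 1", x=0, y=40, width=95, font_size=10),
    Block("budget, column 2", x=105, y=40, width=95, font_size=10),
    Block("Rain Expected", x=210, y=0, width=100, font_size=20),
    Block("weather, column 1", x=210, y=40, width=100, font_size=10),
    Block("continued from page 1", x=320, y=40, width=80, font_size=10),
]

articles, orphans = group_into_articles(page)
print(articles)
print(orphans)
```

Even this toy version runs into the failures the application lists: the stray column with no headline ends up as an orphan, and a headline misread as a graphic would orphan every column under it.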
There's just one legal problem: New York Times Co., et al. v. Jonathan Tasini, et al. In what is usually called the Tasini case, freelance writers sued the New York Times and other print publications for licensing individual articles to database companies without permission from the writers, who retained the copyright on the articles. One of the main turning points was that the publishers had explicit permission only to include the articles in the print publication. Copyright law did not allow the publishers to break their publications up and make the articles accessible to readers out of the original context.
Google's patent application describes a process that would do exactly that. If the company received permission from the appropriate rights holders, that would be legally possible. But going through years of magazines to determine who exactly could legally give permission would be extremely difficult and time-consuming. Google could do as it did with scanning books: act and wait to be sued. But chances are good that the company would waltz right into another lawsuit, as it has with Google Books, only this time the precedent, a clear Supreme Court decision, would wipe away much of the legal ambiguity Google might want to claim. And freelance writers have shown themselves ready to call their lawyers.
[UPDATE: It dawned on me that I had missed an extra twist on the legal front. In the Google Books case, the publishers could also bring suit, and bring their larger-than-freelance resources to bear, because Google was potentially infringing their rights as well. If Google goes back far enough in magazine and newspaper archives, to before publishers often demanded and got extensive rights, then by breaking out individual articles it would be dealing only with the freelance writers, most of whom have not registered copyright on their articles. That means most of the writers would not have legal standing to bring a suit. Even if the freelancers registered copyright after the infringement, they'd be limited to seeking only the "profits" from use of their material and couldn't even sue for legal fees. That would effectively leave Google free to use the material, knowing that the writers could not afford to challenge the company in court. As for the small portion of writers who had registered their copyright, Google has plenty of money to fight them in court.]