Google, Publishers, Fated To Be Foes?
The Monday Note covers the intersection of media and technology and the shift in business models. It is jointly edited by Frédéric Filloux, a Paris-based journalist, and Jean-Louis Gassée, a Silicon Valley veteran and currently a general partner at the venture capital firm Allegis Capital in Palo Alto. Their column appears on CBSNews.com each Monday.
Could Google and publishers one day understand each other? Frankly, I doubt it. Two weeks ago I was in Hyderabad for the joint assembly of the World Association of Newspapers and the World Editors Forum. There, Google-bashing was the life of the party. As I wrote in last week's Monday Note (see The Misdirected Revolt of the Dinosaurs), the climax was the "debate" between WAN's president, Gavin O'Reilly, and Google's top lawyer, David Drummond. One comes from Alpha Centauri, the other from, say, Pandora. For those who want to get to the bottom of the argument, the publishers' statement is here and Google's legal defense is here.
In a nutshell, publishers keep complaining about Google's relentless copyright violations. Tireless Google robots crawl the Internet, indexing and displaying snippets in Google News, without paying a red cent for the content they post. As a result, said O'Reilly, "Google makes tons of money on our back."
Drummond's reply: "We send online news publishers of all types a billion clicks a month from Google News and more than 3 billion additional visits from Search and other Google services. That's about 100,000 business opportunities - to serve ads or offer subscriptions - every minute. And we don't charge anything for that!" He added that Google's practices were fully compliant with the Fair Use principle.
Fair Use is "tired rhetoric," snapped O'Reilly.
At this point the discussion gets technical. And interesting. At stake is a crucial evolution of copyright, from a binary form (authorized vs. forbidden) to a fuzzier concept (use is allowed, but restrictions apply). This evolution of copyright is tied to Creative Commons, the framework created by law professor Lawrence Lessig, which defines a sort of shape-adjustable notion of intellectual property.
Here is the (first) catch: How do you translate an intellectual construct, such as flexible copyright, into a computer protocol? In Hyderabad, publishers reignited a nerdy quarrel over the best way to protect their news material: the robots.txt vs. ACAP issue. Non-techies, please stay with me, I'll do it in plain English here (and I'll pursue it in French next January on Slate.fr).
Robots.txt is a 1994 protocol (four years before Google was incorporated), from the early days of the Internet. It works like this:
Say I'm an online publisher. In the tree structure of my site, I decide to open selected branches (directories) to search engines' robot crawlers. The result of the crawl can be regurgitated by aggregators such as Google News. But for various reasons, such as copyright restrictions on material I don't own, parts of my site need to be kept out of Google's sight.
To prevent all unwanted crawling, I just insert two lines of code in a robots.txt file at the root of my site:
User-agent: *
Disallow: /
The first line names the robot I want to exclude ("*" means all of them) and the second line specifies the directories I want to protect ("/" means the entire site). For example:
User-agent: Googlebot
Disallow: /sport-foot-ligue1/
Disallow: /sport-football/
Disallow: /sport-rugby-top14/
Disallow: /sport-rugby/
Here, the site of the French daily Le Monde prevents Google's indexing robot from crawling its football and rugby directories.
It's as simple as that. To get an idea of the various protection policies implemented by news sites, just append "/robots.txt" to the site's address. Example: http://www.timesonline.co.uk/robots.txt. There you see the list of all the robots the London Times wants to "disallow." Interestingly enough, even though Rupert Murdoch is at the forefront of an anti-Google crusade, his British media property is not excluding Google at all; same for The Australian, another historic Murdoch property, which is rather robot-tolerant (see for yourself). I love such duplicity - sorry, pragmatism. (Actually, the fight is about a MySpace-related advertising contract.)
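For the technically inclined, this gatekeeping is just as easy to exercise programmatically. Here is a minimal sketch using Python's standard-library robots.txt parser, pointed at the Times file cited above (the path being tested is purely illustrative, not an actual Times directory):

from urllib.robotparser import RobotFileParser

# Fetch and parse the Times' robots.txt (the file cited above).
rp = RobotFileParser()
rp.set_url("http://www.timesonline.co.uk/robots.txt")
rp.read()

# A polite crawler asks before fetching: may "Googlebot" read this path?
print(rp.can_fetch("Googlebot", "http://www.timesonline.co.uk/tol/sport/"))

A well-behaved bot runs exactly this check before touching a page; a badly behaved one simply doesn't, which is why robots.txt is a convention, not an enforcement mechanism.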
Facing our clunky but straightforward robots.txt protocol is a much more modern one: ACAP, the Automated Content Access Protocol, created in 2006. More importantly, it is backed by 150 publishers and by the WAN.
Here we are.
ACAP and robots.txt look similar: lines of simple code, placed in the right spot, that name the bot(s) and the directories to be excluded. Except that ACAP is far more sophisticated. Specifically, it can (a hedged sketch follows the list):
- tell the robot how many lines of an article it is allowed to suck in
- assign a specific abstract (snippet) to be taken by the bot
- say at what time the bot can crawl which parts of the site, for instance "0700-1230 GMT"
- set the rate at which it crawls
- block links to a part of the site
- set a time limit on the validity of the abstract
- decide which countries (IP ranges) get to see what (here comes the balkanization of the Internet: bad idea)
… etc.
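To give non-techies a feel for it, here is a hedged sketch of ACAP-style lines as they appear in the robots.txt files of participating sites. The crawl verbs below are the documented ones; the commented-out line merely gestures at the usage qualifiers, whose exact syntax I won't reproduce from memory:

# Core ACAP crawl verbs, layered on top of a classic robots.txt:
ACAP-crawler: *
ACAP-allow-crawl: /news/
ACAP-disallow-crawl: /archives/
# On top of these, ACAP defines usage terms covering the list above -
# snippet length, crawl windows, abstract expiry, per-country rules.
# Illustrative only (not the spec's actual syntax):
# ACAP-disallow-present-snippet: /news/ max-length=3-lines

The point is not the syntax but the granularity: each line is a clause of a contract, expressed in a form a machine can read.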
Which one is best? ACAP, in theory. It dramatically increases the granularity of the terms of usage for any given contract. To get a full and, I think, balanced perspective, read this detailed article at Search Engine Land.
But here is the second catch: Google loftily ignores ACAP; the company's position is that the robots.txt protocol does enough to protect content. Hence the WAN president's ire.
I asked François Bourdoncle, CEO of the French search engine Exalead, for his view of the discord. In 2007, Exalead became the technical partner of the publishing consortium that wanted a better system than robots.txt (Exalead built the prototype pro bono). If we consider the best protocol to be the one that is most widely adopted, then ACAP is toast: its version 1.1 has been adopted by 1,250 publishers, compared to the 20,000 sources that feed Google News.
François Bourdoncle offers the best analogy to describe the antagonism between the online media and Google: "It is the craftsmen of the information world vs. the industrialists," he says. On one side, the publishers: they manage thousands of documents on each of their Web sites, and they have signed complicated copyright contracts with clauses defining every nuance of authors' protection. On the other side, the likes of Google, where the unit of measurement is the billion documents. There is no room for finesse here. The problem is one of massive processing, one that can only be dealt with through powerful algorithms - "The Google Way." Publishers want to define the number of lines a bot can draw out of a story? Google will say: I want to be the only one who decides what my search or crawl results (in Google News) actually look like; if site X wants abstracts limited to 3 lines and site Y agrees to 9, that'll be a mess (a toy sketch of that mess follows below). When the Googleplex geeks decide it's time, they'll upgrade the robots.txt protocol to bring it closer to ACAP - and keep the widely adopted protocol their own.
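To make the industrialist's objection concrete, here is a toy Python sketch (site names and per-site terms are invented for illustration). With robots.txt, one rule fits all; with ACAP-style terms, each of the 20,000 sources carries its own contract, and the aggregator must look it up for every document:

# Toy illustration - site names and per-site terms are invented.

# The robots.txt world: one global rule for everyone.
GLOBAL_SNIPPET_LINES = 3

# The ACAP world: every source imposes its own terms.
acap_terms = {
    "site-x.example": {"snippet_lines": 3, "crawl_window": "0700-1230 GMT"},
    "site-y.example": {"snippet_lines": 9, "expires_days": 30},
    # ... one entry per source, times 20,000 sources ...
}

def make_snippet(source, article_lines):
    # Look up and honor each site's terms instead of applying
    # one rule uniformly across billions of documents.
    terms = acap_terms.get(source, {})
    limit = terms.get("snippet_lines", GLOBAL_SNIPPET_LINES)
    return article_lines[:limit]

print(make_snippet("site-y.example", ["line %d" % i for i in range(1, 12)]))

Multiply that dictionary by every clause ACAP can express, and you get the bookkeeping Google says it refuses to do.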
Fact is, Google is playing bad politics here. It is stunning to see such a deployment of raw brainpower so badly mess up the relationship with a partner as important as the media industry. Here are some measures Google should consider to lower the tension:
- Robots.txt is an old thing. OK, it does the job, after a fashion, but Google should adopt ACAP pronto.
- Alternatively, it should work out something close to ACAP together with the publishers. Contrary to what the WAN says, it won't change the deteriorating economics of online news. Still, it would be a welcome symbolic gesture.
- Google should quickly organize a serious gathering at the Googleplex to listen to the publishers' positions on copyright, but also on traffic, revenue sharing, and pay walls. Every major news organization in the world has plenty of smart people managing big news sites who don't carry an anti-Google bias. They should be asked to come up with real proposals and be allowed to expect real answers.
The worst mistake Google could make at this stage is to keep ignoring publishers' claims. Every news organization gets it: Google now rules the online publishing world. But with dominance come obligations. Displaying magnanimity could be a good tactical move, because a new factor has emerged: Microsoft's search engine Bing, which hopes to capitalize on the ailing publishing world's anger. Googleplex's engineers should integrate that into their master algorithm.
By Jean-Louis Gassée and Frédéric Filloux
Special to CBSNews.com