September 2, 2009 11:33 AM
Gmail Outage Blamed on Miscalculation
(CNET)
This story was written by CNET's Tom Krazit
Google's nearly two-hour Gmail outage Tuesday was the result of a miscalculation regarding the capacity of its system, the company said late Tuesday.
Gmail may be out of beta, but it wasn't ready for prime time Tuesday.
Gmail was down from about 12:30 p.m. PDT Tuesday to about 2:30 p.m. PDT, affecting millions of Gmail customers who depend on the service for everything from fantasy football roster updates to business-critical information. The problem was caused by a classic cascade in which servers became overwhelmed with traffic in rapid succession.
According to Google, the problem began when it took several Gmail servers offline for maintenance, a routine procedure that normally is transparent to users. However, the twist this time around was that Google had made some changes to the routers that direct Gmail traffic to servers in hopes of improving reliability, and those changes backfired.
"As we now know, we had slightly underestimated the load which some recent changes (ironically, some designed to improve service availability) placed on the request routers -- servers which direct web queries to the appropriate Gmail server for response," Google said in a post to its Gmail blog late Tuesday.
"At about 12:30 p.m. Pacific a few of the request routers became overloaded and in effect told the rest of the system 'stop sending us traffic, we're too slow!' This transferred the load onto the remaining request routers, causing a few more of them to also become overloaded, and within minutes nearly all of the request routers were overloaded," wrote Ben Treynor, vice president of engineering and site reliability czar.
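The cascade Treynor describes can be illustrated with a toy model. The sketch below is not Google's system: the function name `simulate_cascade`, the router capacities, and the load figures are all hypothetical. It assumes traffic is spread evenly across the routers still in service and that any router whose share exceeds its capacity drops out, pushing its traffic onto the rest.

```python
def simulate_cascade(capacities, total_load):
    """Return the number of routers in service after each round.

    capacities: hypothetical per-router capacity (requests/sec).
    total_load: total traffic, assumed to be split evenly across
                whichever routers are still in service.
    """
    active = list(capacities)
    rounds = [len(active)]
    while active:
        share = total_load / len(active)
        # Routers whose share exceeds their capacity stop taking traffic.
        survivors = [c for c in active if c >= share]
        if len(survivors) == len(active):
            break  # every remaining router can cope; the system is stable
        active = survivors
        rounds.append(len(active))
    return rounds


# Ten hypothetical routers with capacities 100, 105, ..., 145.
caps = list(range(100, 150, 5))
print(simulate_cascade(caps, 1000))  # [10]
print(simulate_cascade(caps, 1060))  # [10, 8, 3, 0]
```

In this toy model a 6 percent underestimate of the load is enough to take out two routers, then five more, then the rest, which mirrors the "a few, then a few more, then nearly all" pattern in the quote.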
Google fixed the problem by allocating traffic across the rest of its prodigious network, a luxury that it enjoys given the resources it has put in place to operate the world's leading search engine. But what's next?
Google said it would focus on making sure that the request routers have sufficient headroom to handle future spikes in demand, as well as figuring out a way to make sure that problems in one sector can be isolated without bringing down the entire service. "We'll be hard at work over the next few weeks implementing these and other Gmail reliability improvements -- Gmail remains more than 99.9% available to all users, and we're committed to keeping events like today's notable for their rarity," Treynor wrote.
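"Headroom" here can be read as simple arithmetic: how many routers can be out of service before the remainder can no longer carry peak traffic. The sketch below makes that concrete under stated assumptions. The function name `tolerable_failures` and all figures are hypothetical, and it assumes identical routers with evenly balanced load, which the article does not confirm.

```python
import math


def tolerable_failures(num_routers, capacity_per_router, peak_load):
    """How many routers can be offline before the rest are overloaded.

    Assumes identical routers and evenly balanced traffic.
    """
    # Minimum number of routers needed to carry the peak load.
    needed = math.ceil(peak_load / capacity_per_router)
    return max(num_routers - needed, 0)


# Hypothetical figures: 10 routers facing a peak of 1,000 requests/sec.
print(tolerable_failures(10, 120, 1000))  # 1
print(tolerable_failures(10, 200, 1000))  # 5
```

With the smaller capacity, taking two routers offline for maintenance would already exceed what the remaining eight can handle.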
Several Google Apps customers who use Gmail for internal e-mail at their businesses and organizations did not return calls Tuesday seeking information on the degree to which they were affected, making it difficult to know the magnitude of the failure. However, Google has put an awful lot of time and money this year behind promoting Gmail as a back-end e-mail software alternative to products from Microsoft and IBM, and embarrassments like this will not help it sell the service to other organizations.
"We know how many people rely on Gmail for personal and professional communications, and we take it very seriously when there's a problem with the service," Treynor wrote. "Thus, right up front, I'd like to apologize to all of you -- today's outage was a Big Deal, and we're treating it as such."