RealTime IT News

Disaster Prep in the Age of Web 2.0

SAN FRANCISCO -- When you do business on the Internet, redundancy is everything, according to Artur Bergman. But striving for "five-nines" uptime can be a waste of money.

Bergman, director of engineering for Wikia, shared some straight talk about operations at the Web 2.0 Expo, held this week in San Francisco. While his remarks reflected the Web 2.0 culture of good-enough applications and end-user debugging, he gave good advice for any company with a Web presence.

In July 2007, a surge on the power grid caused generators at 365 Main, a datacenter that serves many Web 2.0 startups in the area, to fail. No problem; they have eight backup generators, two more to back up the eight, and a flywheel that uses residual energy to maintain power until the generators come on.

Nevertheless, the entire system failed, leaving a multitude of sites, including Craigslist and TypePad, dark for up to 12 hours.

"The overarching complexity of putting [the datacenter's backup system] together made a failure that was 200 percent more than it needed to be," Bergman told the audience.

Still, he said, the resulting site outages weren't 365 Main's fault. "It was all the companies that use it as their only datacenter," Bergman said. "You have to expect that things fail: Your datacenter will fail."

He questioned whether today's crop of nimble startups is prepared for a disaster, such as an earthquake. Some serve applications and the domain name from the same servers, so, if the servers fail, they won't even be able to communicate with users.

Another problem with startups, he said, is that operations staff can't quantify how their work contributes to -- or takes away from -- the bottom line.

For example, he said that most companies don't need to shoot for 99.999 percent uptime, the famous "five nines." In fact, he pointed out, some of the largest sites, such as World of Warcraft, offer an estimated 97 to 98 percent uptime.

"They've trained their users to accept the downtime. Don't aim higher than you need to," Bergman said. "The higher you aim, the more complex a system you need to build -- and the harder it will fail."
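To put those percentages in perspective, the short Python sketch below converts an uptime figure into the downtime it allows over a year. It is an illustration only, not something Bergman presented; the function name and the 365-day year are assumptions made for the example.

```python
# Illustrative sketch: translate an uptime percentage into allowed downtime.
# Assumes a 365-day year; the function name is hypothetical.
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_hours_per_year(uptime_percent: float) -> float:
    """Return the hours of downtime a given uptime percentage allows in a year."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return HOURS_PER_YEAR * (1 - uptime_percent / 100)


if __name__ == "__main__":
    # The figures cited in the talk: 97 to 98 percent versus "five nines."
    for pct in (97.0, 98.0, 99.999):
        hours = downtime_hours_per_year(pct)
        print(f"{pct}% uptime allows {hours:.2f} hours ({hours * 60:.1f} minutes) of downtime a year")
```

By this arithmetic, 97 percent uptime leaves room for roughly 263 hours of downtime a year, while five nines leaves a little over five minutes.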

Ops engineers also should be able to calculate how much a page view costs the company, and the cost of a specific feature. They also should try to measure the value of reliability. For example, a problem at Wikia that slowed 25 percent of the pages caused a loss of 50 percent of users.

So the cost benefit is real, even if it's difficult to estimate. Costs also go up when the system is unreliable -- during an outage, the company spends more on tech support and its call center.

"You need to be able to provide these hard numbers to the business side, so they can make the decisions," Bergman explained.
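The kind of hard numbers described here can start from very simple arithmetic. The Python sketch below is one hypothetical way to frame it; the function names, the cost model and every dollar figure are invented for illustration and do not come from Bergman or Wikia.

```python
# Illustrative sketch only: all names and figures below are hypothetical.


def cost_per_page_view(monthly_ops_cost: float, monthly_page_views: int) -> float:
    """Operations spending divided by the page views it served."""
    if monthly_page_views <= 0:
        raise ValueError("monthly_page_views must be positive")
    return monthly_ops_cost / monthly_page_views


def outage_cost(lost_page_views: int, revenue_per_page_view: float,
                extra_support_cost: float) -> float:
    """Revenue lost to missing page views plus added support spending."""
    return lost_page_views * revenue_per_page_view + extra_support_cost


if __name__ == "__main__":
    # Assumed example: $50,000 a month in operations, 100 million page views.
    per_view = cost_per_page_view(50_000, 100_000_000)
    print(f"Cost per page view: ${per_view:.4f}")

    # Assumed example: an outage loses 2 million page views worth $0.002 each
    # and adds $1,500 in tech support and call center costs.
    total = outage_cost(2_000_000, 0.002, 1_500)
    print(f"Estimated outage cost: ${total:,.2f}")
```

A real model would need the company's own traffic, revenue and support data; the point is only that both quantities reduce to figures the business side can compare.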

Companies serious about doing business on the Web need to hire engineers to handle operations, Bergman believes. "You don't want sysops. You can't maintain 50 servers the way you maintained a mail server in college."

He drew on his own extensive operations experience to offer tips for companies to prepare for an outage and to deal with one when it inevitably happens.

Rule No. 1 is to understand the system, whether you write it or buy it. "We should not accept voodoo when we deal with engineering systems," he said. "The company's livelihood is that stack running 24/7, and it's your responsibility." He prefers open source software, because if there's a problem, he can look at the source code.

If there's a problem, stop, think and look before you start tinkering. All staffers involved, including management, should talk internally to figure out what's wrong before anyone begins work on fixing it.

"Make plans, divide and conquer," Bergman said. "You treat it as a project with different branches and continuously report on what's going on."

Finally, if you think the problem will take a while and you have more than two or three people, send a couple home to sleep.

Bergman indirectly addressed one tenet of Web 2.0 development, "Release early and often." The idea is to continuously offer new features and functionality in Web-based applications, so that users can provide feedback, and the company can watch to see which are most popular.

Often, these betas are half-baked. While end users have been more than willing to roll the dice with underdone apps, it's a problem if bugs get baked in.

"Bugs don't go away," Bergman said. "It will come back -- probably at 11 p.m. on Christmas Eve. Your software engineers should be on secondary call, especially on holidays. They need to learn to share the pain they create by releasing buggy software."