Disaster Prep in the Age of Web 2.0

SAN FRANCISCO — When you do business on the Internet, redundancy
is everything, according to Artur Bergman. But striving for
“five-nines” uptime can be a waste of money.

Bergman, director of engineering for Wikia, shared some straight
talk about operations at the Web 2.0 Expo, held this week in San
Francisco. While his remarks reflected the Web 2.0 culture of
good-enough applications and end-user debugging, he gave good advice
for any company with a Web presence.

In July 2007, a surge on the power grid caused generators at 365
Main, a datacenter that serves many Web 2.0 startups in the area, to
fail. No problem; they have eight backup generators, two more to back
up the eight, and a flywheel that uses residual energy to maintain
power until the generators come on.

Nevertheless, the entire system failed, leaving a multitude of
sites, including Craigslist and TypePad, dark for up to 12 hours.

“The overarching complexity of putting [the datacenter’s backup
system] together made a failure that was 200 percent more than it
needed to be,” Bergman told the audience.

Still, he said, the resulting site outages weren’t 365 Main’s fault. “It was all the
companies that use it as their only datacenter,” Bergman said. “You have
to expect that things fail: Your datacenter will fail.”

He questioned whether today’s crop of nimble startups is prepared
for a disaster, such as an earthquake. Some serve applications and
the domain name from the same servers, so, if the servers fail, they
won’t even be able to communicate with users.

Another problem with startups, he said, is that operations staff
can’t quantify how their work contributes to — or takes away from —
the bottom line.

For example, he said that most companies don’t need
to shoot for 99.999 percent uptime, the famous “five nines.” In fact,
he pointed out, some of the largest sites, such as World of Warcraft,
offer an estimated 97 to 98 percent uptime.

“They’ve trained their users to accept the downtime. Don’t aim higher than you need to,” Bergman said. “The higher you aim, the more complex a system you need to
build — and the harder it will fail.”
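To put those percentages in perspective, the short sketch below converts an uptime figure into the hours of downtime it allows per year. The function name and the 365-day year are assumptions made for illustration; they do not come from Bergman's talk.

```python
# Illustrative only: converts an uptime percentage into allowed downtime.
# Assumes a 365-day year (8,760 hours); the function name is hypothetical.
HOURS_PER_YEAR = 365 * 24


def downtime_hours_per_year(uptime_percent: float) -> float:
    """Return the hours of downtime per year implied by an uptime percentage."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return (1 - uptime_percent / 100) * HOURS_PER_YEAR


if __name__ == "__main__":
    # The 97 and 98 percent figures and "five nines" are the ones cited above.
    for pct in (97.0, 98.0, 99.999):
        hours = downtime_hours_per_year(pct)
        print(f"{pct}% uptime allows {hours:.2f} hours of downtime per year")
```

Under these assumptions, 97 percent uptime allows roughly 263 hours of downtime a year, while five nines allows only about five minutes.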

Ops engineers also should be able to calculate how much a page
view costs the company, and the cost of a specific feature. They also
should try to measure the value of reliability. For example, a
problem at Wikia that slowed 25 percent of the pages caused a
loss of 50 percent of users.

So the cost benefit is real, even if it’s difficult to estimate. Costs also go up when the system is unreliable — during an outage, the company spends more on
tech support and its call center.
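As a rough illustration of the arithmetic involved, the sketch below computes a cost per page view and the revenue put at risk when a share of users leaves. All names and dollar figures are hypothetical, and the revenue estimate assumes revenue scales linearly with the number of users, which may not hold for a real business. Only the 50 percent user loss comes from the Wikia example above.

```python
# Illustrative only: every name and dollar figure here is hypothetical.


def cost_per_page_view(monthly_ops_cost: float, monthly_page_views: int) -> float:
    """Return the operations cost attributed to a single page view."""
    if monthly_page_views <= 0:
        raise ValueError("monthly_page_views must be positive")
    return monthly_ops_cost / monthly_page_views


def revenue_at_risk(monthly_revenue: float, fraction_of_users_lost: float) -> float:
    """Estimate lost revenue, assuming revenue scales linearly with users."""
    if not 0 <= fraction_of_users_lost <= 1:
        raise ValueError("fraction_of_users_lost must be between 0 and 1")
    return monthly_revenue * fraction_of_users_lost


if __name__ == "__main__":
    # Hypothetical inputs: $50,000 in monthly ops spending, 10 million page views.
    print(cost_per_page_view(50_000, 10_000_000))
    # Hypothetical $200,000 monthly revenue; 0.5 is the user loss cited above.
    print(revenue_at_risk(200_000, 0.5))
```

Even crude estimates like these give the business side something concrete to weigh against the cost of added reliability.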

“You need to be able to provide these hard numbers to the business
side, so they can make the decisions,” Bergman explained.

Companies serious about doing business on the Web need to hire
engineers to handle operations, Bergman believes. “You don’t want
sysops. You can’t maintain 50 servers the way you maintained a mail
server in college.”

He drew on his own extensive operations experience to offer tips
for companies to prepare for an outage and to deal with one when it
inevitably happens.

Rule No. 1 is to understand the system, whether you write it or buy
it. “We should not accept voodoo when we deal with engineering
systems,” he said. “The company’s livelihood is that stack running
24/7, and it’s your responsibility.” He prefers open source software,
because if there’s a problem, he can look at the source code.

If there’s a problem, stop, think and look before you start
tinkering. All staffers involved, including management, should talk
internally to figure out what’s wrong before anyone begins work on
fixing it.

“Make plans, divide and conquer,” Bergman said. “You treat it as a project
with different branches and continuously report on what’s going on.”

Finally, if you think the problem will take a while and you have
more than two or three people, send a couple home to sleep.

Bergman indirectly addressed one tenet of Web 2.0 development,
“Release early and often.” The idea is to continuously offer new
features and functionality in Web-based applications, so that users
can provide feedback, and the company can watch to see which are most
popular.

Often, these betas are half-baked. While end users have been
more than willing to roll the dice with underdone apps, it’s a
problem if bugs get baked in.

“Bugs don’t go away,” Bergman said. “It will come back — probably
at 11 p.m. on Christmas Eve. Your software engineers should be on
secondary call, especially on holidays. They need to learn to share
the pain they create by releasing buggy software.”
