Disaster Tolerant Unix: An Ounce of Prevention

Of the many lessons learned from Sept. 11, an enduring one was the need for
disaster planning. From the need for fast emergency exits to vigilant
security to redundant IT systems, planning ahead for the worst has gained
companies’ attention.

One critical area that businesses need to address is the vast amounts of
data stored on their servers. For top Unix vendors like HP
and Sun Microsystems, the challenge is to meet the needs
of a business with the capabilities of the system. As many companies have
found out, there is no simple answer for how disaster tolerant their Unix
servers should be.

“I think what 9/11 did was cause people to consider they might have a need
for [disaster tolerance],” says David Freund, a research analyst with
Illuminata, an IT consultancy. “The next step was exploring what that
actually means.”

Sketching a Disaster-Risk Profile

Industry analysts and Unix vendors agree that a company must look first to
its overall business continuity plan, which will dictate what kind of
protection it needs for its Unix servers.

“A disaster is when my business stops running,” says Dan Klein, a marketing
manager in HP’s business-critical systems group. “That’s the real disaster
and that’s the disaster you want to prevent.”

In that sense, sketching a disaster plan has not changed all that much, says
Klein. Businesses have been taking precautions against catastrophic events
for hundreds of years, but the importance of protecting data has steadily
grown since the 1970s, when it was merely important, to today, when data is,
in many ways, the business itself.

“The two things you’re really concerned about is data and ability to use the
data,” says Freund. “Not everyone needs the same disaster tolerance.”

The twin considerations are known as recovery-point objective (RPO) and
recovery-time objective (RTO). The RPO defines how much data a business can
afford to lose, measured as the gap between the last recoverable copy and the
moment of failure, while the RTO defines how quickly systems must be restored.
A matrix of these two objectives shows how business-critical each
application's data is.
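The matrix idea can be sketched in a few lines of code. The thresholds and tier names below are illustrative assumptions, not figures from any vendor or analyst quoted here:

```python
# Hypothetical sketch: ranking business criticality from RPO and RTO.
# Smaller objectives mean less tolerable loss and downtime, hence more critical.
# Thresholds (minutes) and tier names are assumptions for illustration only.

def criticality(rpo_minutes: float, rto_minutes: float) -> str:
    if rpo_minutes == 0 and rto_minutes <= 5:
        return "mission-critical"   # e.g. trading: no data loss, no downtime
    if rpo_minutes <= 60 or rto_minutes <= 60:
        return "business-critical"  # brief loss or downtime is survivable
    return "deferrable"             # back-office batch work: recover later

print(criticality(0, 1))        # -> mission-critical
print(criticality(30, 240))     # -> business-critical
print(criticality(1440, 2880))  # -> deferrable
```

An investment bank's transaction ledger would land in the first tier, a payroll batch job in the last.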

“You need to make it as simple as possible,” says Kevin Coyne, director of
business operations for Sun’s services unit. “What’s the recovery time? How
much downtime can you sustain? What’s recovery point? What’s the longest
amount of time you can sustain loss of data? It becomes much easier then to
develop a solution set.”

Not all businesses, or even parts of businesses, are equal. For example, an
investment bank has a very low tolerance for data loss, since it needs all
transactions preserved without any downtime. An air-traffic control system,
on the other hand, needs all systems back up as soon as possible, yet does
not have a critical need for historical information. Even within IT systems,
needs differ. The back-office operations of a company might require a high
degree of accuracy of the data, but can take some time getting back up and
running, while an online storefront would need to be up again immediately.

With a disaster-risk profile, a company can then make the choices on the
technical details to meet their needs.

“You need to focus on the business operation, not the technology,” says
Klein. “You’ll find different parts of an enterprise have different
requirements.”

Preparing For The Hundred-Year Flood or Flooded Basement?

After Sept. 11, Sun’s Coyne says the need for protecting a company’s data
quickly moved up the line in importance. Where before one IT person would be
charged with the task, some companies were forming business-continuity
offices.

“Companies recognize they do have control of cost, complexity and level of
availability,” Coyne says. “They’re pricing out the cost, then determining
how much they’re able to spend.”

The importance companies have placed on protecting their data is critical,
according to Illuminata’s Freund. He cites a University of Texas Research
Center on Information Systems study that found nearly half of companies that
lose their data in a disaster go out of business immediately, and 90 percent
close within two years.

With the stakes that high, Freund says CEOs and CIOs have good reason to
treat data, in many cases, as their company’s most valuable commodity.

“The cost,” he adds, “is the reality check.”

While companies might want to drive both their RTO and RPO toward zero, they
can quickly find the costs spiraling.

“There’s typically a case of sticker shock,” Freund says. “It’s not
inexpensive. It’s a lot like buying an expensive insurance policy. And it’s
not all-or-nothing.”

One of the first lines of defense is redundancy, in an attempt to eliminate
a single point of failure that could knock out a company’s data system.
However, the task remains as frustrating as a game of whack-a-mole: once one
single point of failure is eliminated, another pops up to take its place.

Disaster tolerance adds layers of redundancy, in order to ensure that no
single failure can cripple IT functions. For Unix servers, this begins with
clustering or mirroring. With so-called high-availability clustering, an
application runs on multiple servers, while disk mirroring creates exact
copies of data.
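The mirroring idea can be illustrated with a toy in-memory model. This is a minimal sketch of the RAID-1-style principle, not any vendor's actual implementation:

```python
# Toy sketch of disk mirroring: every write lands on both copies before it
# is acknowledged, so either copy alone can serve reads after a failure.

class MirroredStore:
    def __init__(self):
        self.primary = {}
        self.mirror = {}

    def write(self, key, value):
        # Synchronous mirroring: commit to both copies before returning.
        self.primary[key] = value
        self.mirror[key] = value

    def read(self, key):
        # Fall back to the surviving copy if the primary is gone.
        if key in self.primary:
            return self.primary[key]
        return self.mirror.get(key)

store = MirroredStore()
store.write("txn-1001", "BUY 500 ACME")
store.primary.clear()           # simulate losing the primary disk
print(store.read("txn-1001"))   # -> BUY 500 ACME
```

High-availability clustering applies the same principle one layer up: the application itself, not just its data, has a live second copy.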

The World Trade Center’s collapse reinforced the need for geographical
dispersion of a company, including its servers. Freund says this is where
the cost often comes into play.

“If you’re out to increase distance, that distance doesn’t come free,” he
points out. “The greater the distance, the longer it takes electrons to move
down the cable. If it’s up to the second, then it could slow down your
system.”
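The cost of distance can be estimated on the back of an envelope. Light in optical fiber travels at roughly 200,000 km per second (about two-thirds of its speed in a vacuum), and a synchronous replicated write must make a round trip before it can be acknowledged:

```python
# Back-of-the-envelope: minimum added latency of synchronous replication
# over distance. Assumes signal speed in fiber of ~200,000 km/s, i.e.
# about 200 km per millisecond, and ignores switching/processing delays.

SPEED_IN_FIBER_KM_PER_MS = 200.0

def round_trip_ms(distance_km: float) -> float:
    return 2 * distance_km / SPEED_IN_FIBER_KM_PER_MS

for km in (10, 100, 1000):
    print(f"{km:5d} km -> {round_trip_ms(km):.2f} ms minimum per write")
# ->    10 km -> 0.10 ms, 100 km -> 1.00 ms, 1000 km -> 10.00 ms
```

At metro distances the penalty is negligible; at continental distances, every acknowledged transaction pays ten milliseconds or more, which is why long-haul disaster tolerance often switches to asynchronous replication and accepts a nonzero RPO.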

HP’s Klein says protecting data will soon become simply a cost of doing
business, guarding against losing the most valuable part of a company.

“The basic concept has not changed,” he says. “It’s what we apply it to and
the time factor that’s changed.”
