Don’t Blame Disks For Every Storage Failure

While disk problems are a big culprit in storage subsystem failures, enterprises might want to begin eyeing physical interconnects, since they’re just as often to blame.

That’s according to researchers at the University of Illinois at Urbana-Champaign and Network Appliance. The researchers — Weihang Jiang, Chongfeng Hu and Yuanyuan Zhou of the university, and NetApp’s Arkady Kanevsky — concluded in a recent study that disks were responsible for 20 to 55 percent of failures.

But they also found that physical interconnects including shelf enclosures could claim even higher failure rates: 27 to 68 percent.

“Disks are not the only component in storage systems,” wrote the study’s authors. “To connect and access disks, modern storage systems also contain many other components, including shelf enclosures, cables and host adapters, and complex software protocol stacks … Failures in these components can lead to downtime and/or data loss of the storage system.”

“Hence, in complex storage systems, component failures are very common and critical to storage system reliability,” they said.

Their findings, available in PDF format, are slated to be presented at this week’s 6th USENIX Conference on File and Storage Technologies (FAST).

The study’s authors analyzed almost five years’ worth of storage logs from 39,000 systems deployed at NetApp customer sites. Those systems included approximately 1.8 million disks, housed across 155,000 high-end, mid-range, low-end and backup shelf enclosures.

In addition to new statistics on the role of physical interconnects in failures, the researchers also found that protocol stacks were responsible for 5 to 10 percent of failures.

Fortunately for IT admins, the report also suggested some ways to help beat the odds.

For instance, storage subsystems tied together with redundant interconnects experienced 30 to 40 percent lower failure rates than those with a single interconnect, it said.
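The intuition behind that finding can be sketched with a toy calculation: if two interconnect paths fail independently, both must go down before access to the shelf is lost. The failure probability below is hypothetical and the independence assumption is a simplification — the study’s 30 to 40 percent figure is an empirical measurement, not a product of this model.

```python
def path_failure_probability(p_single: float, redundant: bool) -> float:
    """Probability of losing access over the interconnect.

    Toy model: each interconnect path fails independently with
    probability p_single. With redundancy, access is lost only if
    BOTH paths fail at once.
    """
    if redundant:
        return p_single ** 2
    return p_single


# Hypothetical per-path failure probability for illustration only.
p = 0.05
print(f"single path: {path_failure_probability(p, redundant=False):.4f}")
print(f"dual path:   {path_failure_probability(p, redundant=True):.4f}")
```

In practice the measured benefit is far smaller than this idealized model predicts, because real interconnect failures are often correlated (shared shelves, cables and enclosures), which is consistent with the study reporting a 30 to 40 percent reduction rather than a near-elimination of failures.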

Additionally, spanning disks of a RAID group across multiple shelves in a system makes for a “more resilient” approach than using a single shelf, the study stated.

Other design considerations could play a role in further reducing problems.

“Storage system designers should also think about using smaller shelves, with fewer disks per shelf, but with more shelves in the system,” the report said.

The research takes a somewhat wider view of storage problems plaguing enterprise datacenters, as a good deal of recent, high-profile research about storage failures has focused primarily on disk problems.

For instance, last year at FAST ’07, Google presented its own study on failure rates (also available in PDF format) based on experiences with 100,000 of its own PATA and SATA disk drives.

The Google study found that drives one year old or less had an annual failure rate of 6 percent and were at greater risk at colder temperatures, while high temperatures led to excessive failures in older drives.

That study also focused on the drives’ Self-Monitoring, Analysis, and Reporting Technology (SMART) and concluded that the feature — found in most drives used today — may not be up to snuff in accurately predicting disk failure. The Google research found that in 36 percent of failed drives, SMART did not flag any problems.

The authors of this year’s joint Illinois-NetApp study warned that focusing on drive-related problems can encourage enterprises to undertake unnecessary disk replacements to combat crashes, when failures can just as often be caused by other factors.

Similarly, the study also noted that low disk failure rates do not necessarily translate to a more reliable system.
