RealTime IT News

Post Mortem: Skype Redux?

The dust has settled after Skype’s mid-August meltdown. Millions of users around the world lost service for up to two days, producing much wailing and gnashing of teeth. Everything appears to be back to normal now, barring the odd Skype-borne virus. But we wondered.

 

What really happened when the lights went out in Skype-world? Could it happen again? Could it last longer next time? Can we still trust Skype?

 

I talked to the company’s director of operations, Michael Jackson, and for a reality check polled Martin Geddes, the evangelist of Telepocalypse and chief analyst with UK-based STL Partners, the consulting and research firm behind the Telco2.0 initiative. Skype also took a crack at explaining everything on one of its blogs.

 

To its credit, the company takes full responsibility for the outage—despite early reports that tried to make Microsoft the villain—and it appears to appreciate the impact the incident had on users.

 

“Two days is a long time; we’re the first to admit that,” Jackson says. “Clearly there are businesses and people who depend on this service. They’re wondering, is it going to happen again? Can we rely on Skype in the future? So I guess we have to regain that trust point—the same as any company that lets its customers down. We’re going to do our absolute utmost to try and make sure it doesn’t happen again.”

 

The company did give paying customers (SkypeOut, SkypeIn, Skype Pro, and voicemail users) an extra week of service, though initially it appeared to be giving credit only for the period of the outage. Geddes felt this was the one false step in the company’s post-meltdown public relations effort.

 

“That was the finance department speaking,” he says of the initial offer. “It’s not anything from the heart, it’s not an apology. They should have said, ‘Here’s a week’s credit or a month’s.’ It would be better if they offered nothing. This is an insult, really.”

 

Skype users apparently didn’t feel that way. According to Jackson, usage numbers very quickly bounced back, with log-ons on the following Tuesday about the same as the previous week. The seasonal upswing with school starting in September also exactly mirrored the previous year, he says.

 

What caused the outage?

It’s much clearer now what happened and why.

 

Early on the morning of Thursday, August 16, Microsoft launched a mass online update of Windows computers to add security patches and other bug fixes. Soon after, Skype noticed an unusual number of users were having trouble logging in.

 

Skype users don’t really log in the way users in a conventional client-server network do—in a peer-to-peer network, there are no central servers—but they do have to validate their Skype clients and credentials against the network.

 

The problem in this case was a dearth of supernodes, the user computers the company commandeers to manage the peer-to-peer network and specifically the validation process. Without them users can’t log in.

 

The software agreement you sign when you install Skype client software gives the company permission to use some of your computer’s processing and bandwidth capacity. Each supernode handles about 300 nearby users. Skype configures five in each cell for redundancy. So with upwards of nine million users online, it takes something like 150,000 supernodes to make Skype work. 
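The arithmetic behind that estimate is easy to check. The short sketch below only restates the figures quoted above (about 300 users per supernode, five supernodes per cell, nine million users online). The function name and the assumption that every group of 300 users gets its own set of five supernodes are illustrative, not a description of Skype's actual software.

```python
import math

def supernodes_needed(users_online: int,
                      users_per_cell: int = 300,
                      supernodes_per_cell: int = 5) -> int:
    """Rough estimate: one cell per 300 users, five supernodes per cell.

    All three parameters are assumptions taken from the figures in the article.
    """
    cells = math.ceil(users_online / users_per_cell)
    return cells * supernodes_per_cell

print(supernodes_needed(9_000_000))  # 150000
```

Nine million users divided into cells of 300 gives 30,000 cells, and five supernodes per cell gives the 150,000 figure.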

 

The software automatically selects the most reliable computers with the fastest Internet connections to be supernodes. The trouble is, when a supernode goes away temporarily, as thousands did when Microsoft automatically rebooted them after the patch, it no longer qualifies to be a supernode, at least until it proves its reliability all over again.

 

So millions of Skype users’ computers were rebooting after the update and most were trying to reconnect to Skype. The few supernodes left standing couldn’t handle the traffic. Geddes compares it to a denial of service attack on a conventional network.

 

“There’s some truth in that,” Jackson says. “It’s a combination of a lack of availability of [super]nodes—they were all full—and the fact you can’t become a supernode until you log on to the network. And there aren’t enough clients available to become nodes because they can’t log on. So it’s more a catch 22 than a [denial of service].”

 

But why did this Microsoft update “catalyze,” as Jackson puts it, such a catastrophic reaction in the Skype network? Microsoft regularly updates and automatically reboots users’ computers.

 

“This patch caught a larger percentage of computers and it was a deeper reset,” Jackson says. “We hadn’t seen this before. We’d seen perturbations in the network [after other Microsoft updates], but put them down to just that, perturbations. We never thought it could be this kind of a domino effect.”

 

The internal gremlin

The other factor—the real culprit, Skype now says—was a resource allocation algorithm in the client software that could not adapt to such a set of circumstances. Instead of clients “backing off” on their attempts to validate on the network when supernodes weren’t immediately available and waiting for the ship to right itself, they kept hammering away, trying to log in.

 

“We just never thought that supernodes could ever not be available to this level,” Jackson says. “Once engineers could see that that’s what had happened, it took about eight minutes to repair [the offending piece of code].”

 

Could it happen again?

Fixing the code should prevent the same thing happening again under similar circumstances, but the network actually righted itself on its own, he points out.

 

Should Skype have known something like this could happen? Jackson says yes. The Microsoft update and reboot was a “legitimate action,” he says, and the “way of the world.” So Skype should have been prepared.

 

Some might dispute this. Why does Microsoft automatically shut down computers at all, given the risk—at the very least—of unsaved user data being lost? Why not perform the update and pop up a message that users would find when they came back to their computers, instructing them to reboot to complete the process?

 

But Jackson goes out of his way to absolve Microsoft and even praise the company on two counts. It initially took seriously the possibility of its own culpability, that something in the patch was preventing the Skype network from recovering, he says. And it was very responsive to Skype, including convening a “SWAT” team at 8 a.m. on the Thursday morning to help troubleshoot.

 

“It wasn’t anything they did,” Jackson says. “But they were hugely helpful. I was really impressed.”

 

Geddes has an interesting take. Compared to client-server networks, he points out, peer-to-peer networks give up some manageability, especially the ability to manage endpoints, in exchange for scalability. It’s the nature of the beast. He also notes that P2P is still “a new, immature” technology and that Skype is feeling its way forward, much as early Internet service providers had to do.

 

“The Skype folks couldn’t do a test of millions of users re-logging in,” Geddes says. “So Microsoft did the test for them, and it failed.”

 

Jackson insists that a simple fix to the resource allocation algorithm, which will force clients to wait when they encounter a similar situation and re-validate in an orderly fashion, will prevent the same thing happening. “[The network] wouldn’t break. The time period [outage] would be some minutes rather than hours.”
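Skype has not published the details of its fix, so what follows is only a sketch of one common way to make clients "wait and retry in an orderly fashion": capped exponential backoff with random jitter. Every name here (`backoff_delays`, `try_login`, the `validate` callback) is hypothetical, and the timing values are arbitrary assumptions rather than anything Skype has described.

```python
import random

def backoff_delays(max_retries=6, base=1.0, cap=60.0, rng=None):
    """Yield one wait time (in seconds) per retry.

    The ceiling doubles on each retry up to `cap`, and the actual wait is
    drawn at random below that ceiling so clients do not all retry at once.
    """
    rng = rng or random.Random()
    for attempt in range(max_retries):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng.uniform(0.0, ceiling)

def try_login(validate, sleep, max_retries=6, rng=None):
    """Try to validate once, then retry with backoff until success or give up.

    `validate` is a hypothetical stand-in for contacting a supernode;
    `sleep` is injected so the demo below does not really pause.
    """
    if validate():
        return True
    for delay in backoff_delays(max_retries, rng=rng):
        sleep(delay)
        if validate():
            return True
    return False

# Demo: a fake network that only accepts the third validation attempt.
attempts = {"count": 0}

def fake_validate():
    attempts["count"] += 1
    return attempts["count"] >= 3

waits = []  # records the delays instead of actually sleeping
logged_in = try_login(fake_validate, waits.append, rng=random.Random(0))
print(logged_in, len(waits))  # True 2
```

The point of the random jitter is that millions of clients rebooting at the same moment spread their retries out over time instead of hitting the surviving supernodes in lockstep, which is the behavior the article describes as "hammering away."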

 

The August meltdown was a wake-up call for Skype, though, he says. Following an in-depth post mortem of its own, the company has assigned engineers the task of anticipating other potential network-wrecking circumstances and figuring out ways to prevent them.

 

As for regaining users’ trust, Jackson is candid and realistic. “We screwed up,” he says. “Everybody gets a second chance. We just can’t abuse it.”