Skype offered a detailed explanation Wednesday for the "critical failure" that prevented millions of Skype users worldwide from placing or receiving calls last week. According to the VoIP operator, a bug in an older version of the Skype for Windows client software caused 25 to 30 percent of the service's supernodes to fail.
Like many peer-to-peer (P2P) networks, Skype relies on supernodes, which take on responsibilities beyond those of regular nodes: they act as directories, support other Skype clients and create local clusters, noted Skype CIO Lars Rabbe.
"Once a supernode has failed, even when restarted, it takes some time to become available as a resource to the P2P network again," Rabbe wrote in a blog. "As a result, the P2P network was left with 25 percent to 30 percent fewer supernodes than normal," which caused "a disproportionate load on the remaining available supernodes."
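The arithmetic behind that extra load can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration, not Skype's own model: it assumes traffic was spread evenly across supernodes and that the work of the failed ones shifted entirely onto the survivors. The function name `load_multiplier` is hypothetical.

```python
def load_multiplier(failed_fraction: float) -> float:
    """Return how much load each surviving supernode carries, relative to normal.

    Assumes (for illustration only) that load was evenly spread and that
    all of it is redistributed evenly across the surviving supernodes.
    """
    if not 0.0 <= failed_fraction < 1.0:
        raise ValueError("failed_fraction must be in the range [0, 1)")
    return 1.0 / (1.0 - failed_fraction)


# The 25 to 30 percent range reported by Skype.
for f in (0.25, 0.30):
    extra = (load_multiplier(f) - 1.0) * 100
    print(f"{f:.0%} of supernodes lost -> about {extra:.0f}% more load per survivor")
```

Under these simplifying assumptions, losing a quarter of the supernodes leaves each survivor with roughly a third more work, and the multiplier climbs faster as the failed fraction grows.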
Outages Are Inevitable
To recover core system functionality as quickly as possible, Skype took resources normally dedicated to supporting group video calling and used them to deploy supernodes, Rabbe explained. "Over the course of Thursday night and Friday morning, we returned these to their normal use and restored group video calling functionality in time for Christmas," he wrote.
Last week's outage underlined the risks involved with P2P-based communication services. "Many users felt lost, often also in corporations that rely on Skype for communication with remote clients and employees," noted Gartner Vice President Andrea Di Maio. "I guess that we have to learn," he wrote, that "nothing will ever be 100 percent reliable."
Online outages are inevitable -- as the service disruptions that occurred at Facebook, Twitter and Google's Gmail service have already demonstrated this year. Still, Skype's latest snafu came as an unpleasant pre-Christmas shock to many business users, a reaction Di Maio found surprising.
"For some mysterious reason, we seem to believe that the Internet and -- even more -- the cloud [will] shield us from painful experiences, but that's not the case," Di Maio noted.
Learning from Failures
Users need to ask themselves to what extent they can personally or professionally rely on online tools over which they have no control or oversight, Di Maio advised. "We all need to really understand that technology is fallible and we will always need to exercise our creativity to react to unexpected situations," he wrote.
Di Maio also advised online service providers to leverage their critical failures as learning experiences, which is precisely what Skype is promising to do. "Lessons will be learned and we will use this as an opportunity to identify and introduce areas of improvement to our software, further assess and invest in capacity and stability, and develop better processes for outage recovery and communications to our user base," Rabbe explained.
Given that Skype's outage was caused by a bug in an older version of its client software, the network operator will be providing users with further software updates this week as well as reviewing its automatic software update processes to ensure that all users are running the latest releases. "We believe these measures will reduce the possibility of this type of failure occurring again," Rabbe added.
Moreover, Skype intends to look for new ways in which the online service can detect potential problems more quickly as well as recover more rapidly following any system failure. Additionally, the VoIP provider will be reviewing its testing processes "to determine better ways of detecting and avoiding bugs which could affect the system," Rabbe wrote.