In a much-appreciated explanation, Skype CIO Lars Rabbe has discussed in detail the specific problems that hit Skype's service last week and caused massive outages for so many customers.
Rabbe explains that a cluster of support servers responsible for offline instant messaging became overloaded on Wednesday, December 22, which caused a number of Skype users to start receiving delayed responses.
A bug in Skype for Windows client version 5.0.0152, the most popular version of Skype (used by 50 percent of all Skype users globally), meant it couldn’t properly process the delayed responses, and around 40 percent of those clients failed. The failed clients included about one third of all the publicly available supernodes.
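To put those two percentages together, here is a rough back-of-the-envelope sketch in Python. The 50 percent and 40 percent figures come from the explanation above; the function name and the multiplication itself are our own illustration, not something taken from Skype's post.

```python
# Figures quoted in the article; combining them is our own illustration.
SHARE_ON_AFFECTED_VERSION = 0.50   # share of all Skype users on 5.0.0152
FAILURE_RATE_ON_THAT_VERSION = 0.40  # share of those clients that failed


def failed_share_of_all_clients(version_share, failure_rate):
    """Return the fraction of all clients that failed (hypothetical helper)."""
    return version_share * failure_rate


failed = failed_share_of_all_clients(
    SHARE_ON_AFFECTED_VERSION, FAILURE_RATE_ON_THAT_VERSION
)
print(f"Roughly {failed:.0%} of all Skype clients failed")
```

In other words, roughly one in five Skype clients worldwide would have gone down at this stage, before the restarts made things worse.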
The problems were then magnified as users started restarting the Skype for Windows client in an attempt to fix the issue, which in itself caused a huge increase in the load on Skype’s P2P cloud network, to around 100 times more than would normally be expected.
Excerpt from Skype blog post:
A supernode is important to the P2P network because it takes on additional responsibilities compared to regular nodes, acting like a directory, supporting other Skype clients and establishing connections between them by creating local clusters of several hundred peer nodes per each supernode.
Once a supernode has failed, even when restarted, it takes some time to become available as a resource to the P2P network again. As a result, the P2P network was left with 25–30% fewer supernodes than normal. This caused a disproportionate load on the remaining available supernodes.
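As a rough illustration of that last point (our own arithmetic, not part of Skype's post), the sketch below estimates how much extra work each surviving supernode picks up. It assumes the total load stays the same and is spread evenly across the remaining supernodes, which is a simplification; `load_multiplier` is a hypothetical helper name.

```python
def load_multiplier(lost_fraction):
    """Relative load per surviving supernode.

    Assumes total load is unchanged and spread evenly over the
    supernodes that remain (a simplifying assumption).
    """
    if not 0 <= lost_fraction < 1:
        raise ValueError("lost_fraction must be in [0, 1)")
    return 1 / (1 - lost_fraction)


# The 25-30% range is the figure quoted in the excerpt above.
for lost in (0.25, 0.30):
    print(f"{lost:.0%} fewer supernodes -> {load_multiplier(lost):.2f}x load each")
```

Under those assumptions each remaining supernode carries about a third to two fifths more load than usual. That is only a lower bound, since the wave of client restarts was also pushing total traffic far above normal at the same time.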