Monday, August 10, 2015

A Gossip-Style Failure Detector

Failure Detector is a valuable component for system management, replication, load balancing, group communication services, various consensus algorithms and many other distributed services which rely on reasonably accurate detection of failed members.

Algorithm:
  1. Each member maintains a list with each member's address and heartbeat counter. At each time period Tgossip it increments its own heartbeat counter and sends its list to one random member. Common communication delay (time for sending messages) is typically much smaller then interval between gossiping rounds (Tgossip).
  2. On receive of such gossip message, a member merges the list in the message with its own list by maximum heartbeat counter for each member.
  3. Each member occasionally broadcasts its list in order to be located initially and also recover from network partitions. Decides about broadcast in probabilistic way with increased probability after longer period of time from last received broadcast. Instead of broadcast can be used few gossip servers with well-known addresses. 
  4. Each member also maintains for each other member in the list, the last time that its corresponding heartbeat counter has increased.
  5. If the heartbeat counter wasn't increased for time Tfail the member is considered as failed. But failed members not removed from the list immediately.
  6. Member removed from membership lists after time period Tcleanup.
It is possible to analyse probabilities of gossiping and calculate parameters Tgossip and Tfail with given quality of service requirement using epidemic theory. Proposed algorithm may be optimized to utilize IP address structure in order to reduce an amount of traffics between sub networks.

References: