Failure detectors are a central component of fault-tolerant distributed systems running over unreliable asynchronous networks, e.g., group membership protocols, supercomputers, and computer clusters. The ability of the failure detector to detect process failures completely and efficiently, in the presence of unreliable messaging as well as arbitrary process crashes and recoveries, can have a major impact on the performance of these systems.

**Properties of a failure detector:**

- *{Strong/Weak} Completeness:* the guarantee that the failure of a group member will eventually be detected by {all/some} non-faulty members.
- *Strong Accuracy:* no non-faulty group member is ever declared failed by any other non-faulty member.

It has been proved that achieving both strong completeness and strong accuracy is impossible on fault-prone networks. So if strong completeness is required, one must accept weak accuracy (some probability of false-positive failure detections) while trying to keep that probability to a minimum.

**Requirements for a failure detector:**

1. *Completeness:* satisfies eventual strong completeness.
2. *Efficiency:*
   - *Speed:* quick (within some given time T) detection of a member failure by some (not necessarily all) non-faulty member.
   - *Accuracy:* low probability of failure-detection mistakes, below a given PM(T) that is much lower than the message-loss probability Pml.
3. *Scalability:* equal expected worst-case network load per member.

**Theorem.** Any distributed failure detector algorithm for a group of size n (>> 1) that deterministically satisfies the *Completeness, Speed and Accuracy* requirements above, for given values of T and PM(T) (<< Pml), imposes a minimal worst-case network load (messages per time unit, as defined above) of:

L* = n * log(PM(T)) / (log(Pml) * T)

Furthermore, there is a failure detector that achieves this minimal worst-case bound while satisfying the *Completeness, Speed, Accuracy* requirements. L* is thus the optimal worst-case network load required to satisfy them.

**Heartbeat-based** approaches provide completeness, but have shortcomings:

- Centralized: creates hot-spots and prevents scaling.
- Distributed: inherently not very efficient or scalable.

**Randomized Distributed Failure Detector Algorithm:**

The algorithm assumes that the membership list is identical and already known at all nodes. A member recovers from a failure in a distinguishable new incarnation, and each message carries the current incarnation number of its sender.
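The incarnation rule above can be sketched as follows; this is a minimal illustration, and the dictionary and function names are assumptions, not part of the original algorithm:

```python
# Sketch: distinguishing a recovered member from its pre-crash incarnation.
# A receiver remembers the highest incarnation number seen per member and
# ignores messages stamped with an older incarnation.

incarnations = {}  # member id -> highest incarnation number seen so far

def accept(sender, incarnation):
    """Return True if a message from this (sender, incarnation) is current."""
    seen = incarnations.get(sender, -1)
    if incarnation < seen:
        return False  # stale message from an old, pre-crash incarnation
    incarnations[sender] = incarnation
    return True
```

A message from incarnation 1 of a member is accepted, after which any leftover messages from its crashed incarnation 0 are discarded.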

At each member Mi:

    Integer pr; /* local period number */

Every T' time units at Mi:

    0. pr := pr + 1
    1. Select a random member Mj from the view.
       Send a ping(Mi, Mj, pr) message to Mj.
       Wait the worst-case message round-trip time for an ack(Mi, Mj, pr) message.
    2. If no ack received yet:
       Select k members randomly from the view.
       Send each of them a ping-req(Mi, Mj, pr) message.
       Wait for an ack(Mi, Mj, pr) until the end of period pr.
    3. If still no ack(Mi, Mj, pr) message received, declare Mj as failed.

Anytime at Mi:

    4. On receipt of a ping-req(Mm, Mj, pr) message (Mj != Mi):
       Send a ping(Mi, Mj, Mm, pr) message to Mj.
       On receipt of an ack(Mi, Mj, Mm, pr) message from Mj:
       Send an ack(Mm, Mj, pr) message to Mm.
    5. On receipt of a ping(Mm, Mi, Ml, pr) message from member Mm:
       Reply with an ack(Mm, Mi, Ml, pr) message to Mm.
    6. On receipt of a ping(Mm, Mi, pr) message from member Mm:
       Reply with an ack(Mm, Mi, pr) message to Mm.

Fig. 1 Example of a failure detection protocol period at Mi. This shows all the possible messages that a protocol period may initiate.
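A single protocol period can be sketched as a simplified synchronous simulation. This is only an illustration: the asynchronous network is collapsed into a `deliver()` helper that loses each message leg independently with probability `p_loss`, and all names and defaults here are assumptions, not from the original protocol description:

```python
import random

def deliver(target_alive, p_loss):
    """One message leg arrives iff the target is alive and the leg is not lost."""
    return target_alive and random.random() >= p_loss

def probe(mj_alive, p_loss=0.1, k=3):
    """One protocol period at Mi probing Mj; returns Mi's verdict on Mj."""
    # Step 1: direct ping(Mi, Mj, pr) and the returning ack(Mi, Mj, pr).
    if deliver(mj_alive, p_loss) and deliver(True, p_loss):
        return "alive"
    # Step 2: no ack in time -- relay the probe through k random members.
    for _ in range(k):
        ok = (deliver(True, p_loss)          # ping-req(Mi, Mj, pr) to Mk
              and deliver(mj_alive, p_loss)  # ping(Mk, Mj, Mi, pr) to Mj
              and deliver(True, p_loss)      # ack(Mk, Mj, Mi, pr) back to Mk
              and deliver(True, p_loss))     # ack(Mi, Mj, pr) back to Mi
        if ok:
            return "alive"
    # Step 3: period pr ends with no ack -> declare Mj failed.
    return "failed"
```

With `p_loss = 0` the sketch never produces a false positive: an alive Mj is always confirmed via the direct ping, and a dead Mj is always declared failed.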

The parameters T' and k of the algorithm can be formally calculated from the required quality-of-service parameters: the average speed of failure detection T and the accuracy PM(T).

This algorithm has uniform expected network load at all members. The worst-case network load occurs when, every T' time units, each member initiates steps (1-6) in the algorithm. Steps (1, 6) involve at most 2 messages, while steps (2-5) involve at most 4 messages per ping-req target member. Therefore, the worst-case network load imposed by this protocol (in messages/time unit) is:

L = n * [2 + 4 * k] * 1/T'

This means that running the algorithm on all nodes produces linear O(n) overall network load, while the network load at any one member is constant, independent of the group size n.
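Plugging in sample values (chosen here for illustration) makes the scaling concrete:

```python
def worst_case_load(n, k, t_prime):
    """Worst-case group-wide load L = n * (2 + 4k) / T' messages/time unit."""
    return n * (2 + 4 * k) / t_prime

# e.g. 100 members, k = 3 ping-req targets, one period per second:
total = worst_case_load(100, 3, 1.0)  # 1400 messages/s across the group
per_member = total / 100              # 14 messages/s at each member
```

Doubling the group to n = 200 doubles the total load but leaves the per-member load at 14 messages/s, which is the claimed constant per-member cost.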
