Anton Kharenko: SWIM Distributed Group Membership Protocol

Overview

Many distributed applications require reliable and scalable implementation of membership protocol. Membership protocol provides each process (“member”) of the group with a locally-maintained list of other non-faulty processes in the group. The protocol ensures that the membership list is updated with changes resulting from new members joining the group, or dropping out either voluntarily or through a failure. The SWIM (Scalable Weakly-consistent Infection-style Process Group Membership Protocol) is motivated by unscalability of traditional heart-beating protocols, which either impose network loads that grow quadratically with group size, or compromise response times or false positive frequency of failure detection. It is aiming to implement a membership protocol that provides stable failure detection time, stable rate of false positives and low message load per group member, thus allowing distributed applications that use it to scale well.

Applications which rely on reliability and scalability of the distributed membership maintenance protocol:

Reliable multicast
Epidemic-style information dissemination (gossip)
Distributed databases
Publish-subscribe systems
Large scale peer-to-peer systems
Large scale cooperative gaming
Other collaborative distributed applications

Requirements to protocol:

Low probability of false positive detection of process fail
Do not rely on central server
Low network and computational load
Low time to detect failure
Weakly consistent

Heartbeating-based protocol do not fit well under such requirements since it suffers from scalability limitations. The unscalability of the popular class of all-to-all heartbeating protocols arises from the implicit decision to join two principal functions of the membership problem:

Membership update dissemination
Failure detection

SWIM address non-scalability problems of traditional protocols for membership maintenance by:

Designing the failure detection and membership update dissemination components separately
Using a non-heartbeat based strategy for failure detection. It uses random-probing based failure detector protocol.

Properties of SWIM protocol:

Imposes a constant message load per group member;
Detects a process failure in an expected constant time at some non-faulty process in the group;
Provides a deterministic bound (as a function of group size) on the local time that a non-faulty process takes to detect failure of another process;
Propagates membership updates, including information about failures, in gossip-style; the dissemination latency in the group grows slowly (logarithmically) with the number of members;
Provides a mechanism to reduce the rate of false positives by “suspecting” a process before “declaring” it as failed within the group.

While 1-2 are properties of the used failure detection protocol, 3-5 represents properties introduced in membership protocol.

SWIM Protocol Design

SWIM protocol consists of two main components:

Failure Detector Component
Dissemination Component

Failure Detector Component

Quality of service properties of failure detection:

Strong completeness
Speed of failure detection
Accuracy
Network message load

It is proved that strong completeness and strong accuracy is impossible to achieve at the same time over asynchronous unreliable network. However, since a typical distributed application relies on Strong Completeness (in order to maintain up to date information in dynamic groups), most failure detectors, including heartbeating-based solutions, guarantee this property while attempting to maintain a low rate of false positives. SWIM takes the same approach.

SWIM uses Failure detection protocol described at [2] (see Fig. 1).

Fig. 1 SWIM failure detection: Example protocol period at Mi. This shows all the possible messages that a protocol period may initiate.

Dissemination Component

Basic implementation of Membership Dissemination component consist of:

On detecting failed member by failure detector process multicast it to the group. Other members delete failed member.
Joined or voluntarily leaving member multicast it in the same way. But on join need to know at least one contact member of the group.

In order for a new member to join the group it can use one of the following approaches:

If the group is associated with a well known server or IP multicast address, all joins could be directed to the associated address.
In the absence of such infrastructure, join messages could be broadcasted. Group members will reply with some probability to this broadcast request.

The basic implementation described above shows the general idea of the protocol, but in order to make protocol robust and fault tolerant following improvements may be applied:

Do not use multicast to disseminate membership updates, but spread membership information by piggybacking it on top of failure detection protocol (ping, ping-req and ack messages). This approach results in an infection-style of membership updates dissemination, with the associated benefits of robustness to packet losses, and of low latency.
Use suspicion mechanism. Process that unresponsive to ping marked as “suspected”. Suspected member treated as normal for ping selection, but if respond then spread alive information. Only after predefined timeout suspected member declared as faulty. This timeout effectively trades off an increase in failure detection time for a reduction in frequency of false failure detections. Suspicion mechanism is applicable to any system with the distinct Failure Detector and Membership Dissemination components.
Basic implementation of SWIM protocol guarantee only eventual detection of the failure. Round-Robin probe target selection will guarantee Time Bounded Completeness property; the time interval between the occurrence of a failure and its detection at member Mj is no more than two times the group size in number of protocol periods. This can be solved by the following modification to the protocol. The failure detection protocol at member Mi selecting ping targets not randomly from the list of members, but in a round robin fashion. A newly joining member is inserted in the membership list at a position that is chosen uniformly at random. On completing a traversal of the entire list, Mi rearranges the membership list to a random reordering.

In order to implement gossip-style membership update dissemination each group member Mi maintains a buffer of recent membership updates with count of each buffered element how much it was piggybacked (sent) so far by Mi and is used to choose which element to piggyback next. Each membership update is piggybacked at most some a*log(n) [e.g. 3 * |log(n + 1)|] in order to ensure reliable arrival of gossip to each member of the group.

SWIM can be extended to a Wide-area network (WAN) or a virtual private network (VPN), by weighing ping target choices with topological information, thus reducing bandwidth usage on core network elements inside the network.

SWIM Implementations

Consul

Consul is a distributed, highly available, and datacenter-aware service discovery and configuration management system. It uses SWIM as a basis for implementing Gossip Protocol and for managing membership information.

ScaleCube

ScaleCube (disclaimer: I am one of the authors) is an open source microservices framework for a rapid development of a distributed, resilient, reactive applications that scales. It uses a variant of SWIM protocol implementation in Java and uses it as a basis for managing the cluster of connected microservices. It uses suspicion mechanism over the failure detector events and also separate membership updates dissemination component. But it introduces separate Gossip Protocol component instead of piggybacking membership updates on top of failure detector messages. It is done in order to reuse gossip component for other platform events and have more fine grained control over time intervals used for gossiping and failure detection pinging. New members to the cluster joins via the configuration provided seed members addresses (it is an analogous of well known servers described above). And also it extends SWIM protocol with the introduction of periodic SYNC messages in order to improve recovery from network partitioning and message losses.

Reference:

Tuesday, September 29, 2015

SWIM Distributed Group Membership Protocol

Overview

SWIM Protocol Design

SWIM Implementations