Wednesday, July 8, 2015

Quality of Service of Failure Detectors

The basic Quality of Service (QoS) properties of a failure detector are:
  • How fast it detects a failure
  • How well it avoids false detections
This assumes that a process crash is permanent; in other words, a recovered process comes back under a new identity.

Primary QoS metrics of a failure detector:
  • Detection time (Td). How long it takes to detect a failure.
  • Mistake recurrence time (Tmr). Time between two consecutive mistakes.
  • Mistake duration (Tm). Time it takes the failure detector to correct a mistake.
Derived metrics (can be computed from primary metrics):
  • Average mistake rate.
  • Query accuracy probability. Probability that the failure detector's output is correct at a random moment in time.
  • Good period duration.
  • Forward good period duration.
These metrics do not depend on implementation-specific features of the failure detection algorithm, so they can be used to compare failure detectors.
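As a rough illustration of how the derived metrics follow from the primary ones, here is a minimal sketch that measures them on a recorded trace of a detector monitoring a process that never crashes. The function name, the trace format, and the relations used (mistake rate = 1/E(Tmr), good period = Tmr - Tm, query accuracy = E(good period)/E(Tmr)) are assumptions made for this example, not code from [1].

```python
from statistics import mean

def qos_from_trace(suspect_times, trust_times):
    """Estimate QoS metrics from a trace of a detector watching a live process.

    suspect_times[i] is when the i-th mistake starts (detector wrongly suspects),
    trust_times[i] is when that mistake is corrected. Hypothetical trace format.
    """
    # Use only complete cycles: mistake start -> correction -> next mistake start.
    cycles = list(zip(suspect_times, trust_times, suspect_times[1:]))
    tmr = [nxt - s for s, _, nxt in cycles]   # mistake recurrence time samples
    tm = [t - s for s, t, _ in cycles]        # mistake duration samples
    tg = [nxt - t for _, t, nxt in cycles]    # good period samples (Tmr - Tm)
    e_tmr, e_tm, e_tg = mean(tmr), mean(tm), mean(tg)
    return {
        "E(Tmr)": e_tmr,
        "E(Tm)": e_tm,
        "good_period": e_tg,
        "mistake_rate": 1 / e_tmr,          # mistakes per unit of time
        "query_accuracy": e_tg / e_tmr,     # fraction of time the output is correct
    }

metrics = qos_from_trace([10, 30, 60], [12, 33, 61])
```

Because only complete cycles are counted, each good period sample equals the matching Tmr sample minus the Tm sample, so the averages stay consistent with each other.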

QoS requirements for a failure detector can be expressed via the primary metrics:
  • Upper bound on the detection time (Tud)
  • Lower bound on the average mistake recurrence time (Tlmr)
  • Upper bound on the average mistake duration (Tum)
These are combined with the probabilistic parameters of the network:
  • Message loss probability [Ploss]
  • Average message delay [E(D)]
  • Variance of message delay [V(D)]. An exponential distribution is a very common model for message delays.
Paper [1] also proposes a modified heartbeat algorithm for failure detection with two input parameters:
  • n - delay between consecutive heartbeats
  • ro - time shift applied to each heartbeat; if no fresh heartbeat has arrived by the shifted time, the process is declared failed
The goal is to compute the optimal failure detector parameters n and ro for the proposed algorithm from (1) the QoS requirements (Tud, Tlmr and Tum) and (2) the probabilistic parameters of the network (Ploss, E(D) and V(D)).
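To make the roles of n and ro concrete, here is a minimal sketch of one plausible reading of such a heartbeat detector. It assumes synchronized clocks, that heartbeat number i is sent at time i * n, and that heartbeat i stops being "fresh" at time i * n + ro. The function name and the data layout are hypothetical; this is not the exact algorithm from [1].

```python
import math

def is_trusted(now, n, ro, arrivals):
    """Return True if the monitored process is trusted at time `now`.

    arrivals maps heartbeat sequence number -> arrival time (hypothetical layout).
    Assumption: heartbeat i (i >= 1) is sent at i * n, and its deadline is i * n + ro.
    """
    # Index of the latest deadline that has already passed: i * n + ro <= now.
    i = math.floor((now - ro) / n)
    if i < 1:
        return True  # the first deadline has not passed yet, nothing to suspect
    # Trust only if heartbeat i, or any later one, has already arrived.
    return any(seq >= i and t <= now for seq, t in arrivals.items())

# Heartbeat 3 is lost; heartbeat 4 arrives at 4.3.
arrivals = {1: 1.2, 2: 2.1, 4: 4.3}
print(is_trusted(2.6, 1.0, 0.5, arrivals))  # heartbeat 2 is still fresh
print(is_trusted(3.6, 1.0, 0.5, arrivals))  # heartbeat 3 missed its deadline
print(is_trusted(4.4, 1.0, 0.5, arrivals))  # heartbeat 4 corrects the mistake
```

Under these assumptions a crash is noticed at most n + ro after it happens, since the first heartbeat that is never sent misses its deadline by then. A smaller n or ro speeds up detection, but makes mistakes caused by lost or delayed heartbeats more likely.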

The link quality parameters can be estimated by analyzing the heartbeat messages themselves, so the failure detector parameters can be adjusted dynamically using a link quality estimator.
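A link quality estimator of this kind could look roughly like the sketch below. It assumes synchronized clocks and that heartbeat number seq is sent at time seq * n, so the delay of each heartbeat is its arrival time minus its send time. The function name and the input format are hypothetical and chosen only for illustration.

```python
from statistics import mean, pvariance

def estimate_link(received, n):
    """Estimate (Ploss, E(D), V(D)) from the heartbeats received so far.

    received maps sequence number -> arrival time (hypothetical layout).
    Assumption: heartbeat seq (seq >= 1) was sent at time seq * n.
    """
    highest = max(received)                   # heartbeats 1..highest were sent
    p_loss = 1 - len(received) / highest      # share of them that never arrived
    delays = [t - seq * n for seq, t in received.items()]
    return p_loss, mean(delays), pvariance(delays)

# Heartbeat 3 was lost; the others were delayed by 0.2, 0.1 and 0.3.
p_loss, e_d, v_d = estimate_link({1: 1.2, 2: 2.1, 4: 4.3}, n=1.0)
```

The resulting estimates can then be fed, together with the QoS requirements, into whatever procedure picks new values of n and ro.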
References:
  1. On the Quality of Service of Failure Detectors (2002)