Phi Accrual Failure Detector

Copyright (c) 2017 Tucker Barbour

Authors: Tucker Barbour ([email protected]).


An Erlang implementation of "The Phi Accrual Failure Detector" (Hayashibara et al., 2004). It is based on the implementations in Akka and Cassandra.

Quick Start

Add to a rebar3 project via rebar.config

{deps, [{phi_failure_detector, {git, "https://github.com/ctbarbour/phi_failure_detector.git", {branch, "master"}}}]}.

Add to an erlang.mk project via Makefile

DEPS = phi_failure_detector
dep_phi_failure_detector = git https://github.com/ctbarbour/phi_failure_detector.git master

To start detecting failures for a service endpoint, start the OTP application and start a new failure detector with a service label and identifier. In this case our service label is http and our identifier is {192,168,10,1}.

application:ensure_all_started(phi_failure_detector),
phi_failure_detector:new(http, {192,168,10,1}).

Start adding samples to the failure detector when you get a successful heartbeat from the service endpoint.

phi_failure_detector:heartbeat(http, {192,168,10,1}).

Check the φ value of the service.

phi_failure_detector:phi(http, {192,168,10,1}).

Get the φ of all endpoints with the same service label.

phi_failure_detector:phi(http).

Description

The Phi Accrual Failure Detector is a failure detection algorithm that outputs a level of suspicion, scaled dynamically to network conditions over time, rather than a binary Up or Down result. For more detail, I recommend reading the paper, or at least the abstract.

To scale the suspicion level of an endpoint dynamically, the Phi Accrual Failure Detector records successful heartbeats from a node and builds a distribution of their interarrival times. From this distribution we can calculate the probability that the next heartbeat will still arrive, given how long it has been since the last one. As network conditions change over time, so does the distribution of interarrival times. A node's suspicion level is therefore a continuous value rather than a binary one, and we can make decisions based on how likely it is that a node has failed rather than thinking only in terms of failed or not failed. An application using a Phi Accrual Failure Detector can take precautionary measures when the likelihood of failure reaches a certain threshold and more drastic measures when it reaches a higher threshold.
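
To make the idea concrete, here is a minimal sketch of how φ can be derived from interarrival times. It is not the code used by this library. It assumes the interarrival times follow a normal distribution, as in the paper, and computes φ = -log10(P), where P is the probability that a heartbeat arrives later than the time already elapsed. The module name `phi_sketch`, its functions, the minimum standard deviation, and the probability floor are all hypothetical and chosen only for illustration.

```erlang
-module(phi_sketch).
-export([intervals/1, phi/2, status/3]).

%% Assumed floor for the standard deviation (same unit as the samples),
%% so perfectly regular heartbeats do not cause a division by zero.
-define(MIN_STDDEV, 100.0).
%% Assumed floor for the probability, so math:log10/1 never sees 0.0.
-define(MIN_PROB, 1.0e-300).

%% Turn a list of ascending heartbeat timestamps into interarrival times.
intervals([T1, T2 | Rest]) -> [T2 - T1 | intervals([T2 | Rest])];
intervals(_) -> [].

%% Suspicion level after Elapsed time units without a heartbeat,
%% given a non-empty list of observed interarrival times.
phi(Elapsed, Intervals) ->
    N = length(Intervals),
    Mean = lists:sum(Intervals) / N,
    Var = lists:sum([(X - Mean) * (X - Mean) || X <- Intervals]) / N,
    StdDev = max(math:sqrt(Var), ?MIN_STDDEV),
    %% P(next heartbeat arrives later than Elapsed) under a normal model:
    %% 1 - CDF(Elapsed) = 0.5 * erfc((Elapsed - Mean) / (StdDev * sqrt(2))).
    PLater = 0.5 * math:erfc((Elapsed - Mean) / (StdDev * math:sqrt(2))),
    -math:log10(max(PLater, ?MIN_PROB)).

%% Map a phi value onto two application-defined thresholds.
status(Phi, _Warn, Fail) when Phi >= Fail -> failed;
status(Phi, Warn, _Fail) when Phi >= Warn -> suspect;
status(_Phi, _Warn, _Fail) -> ok.
```

With heartbeats exactly 1000 ms apart, `phi_sketch:phi(1000, [1000, 1000, 1000])` is about 0.30, because half of the modelled heartbeats would arrive later than the mean, and the value grows as more time passes without a heartbeat. The two thresholds passed to `status/3` correspond to the precautionary and drastic measures described above; suitable values depend on the application.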

Build and Test

$ rebar3 do xref, dialyzer
$ rebar3 eunit

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/ctbarbour/phi_failure_detector.

Modules

pfd_app
pfd_monitor
pfd_samples
pfd_service
pfd_service_sup
pfd_sup
phi_failure_detector