1/ Distributed Systems Evening:
Everyone has heard of the CAP theorem. The infamous "2 out of 3" (sic).
While the CAP result has profound consequences it's not all that surprising: "If a process cannot talk to others then it cannot know what is going on."
Everyone has heard of the CAP theorem. The infamous "2 out of 3" (sic).
While the CAP result has profound consequences it's not all that surprising: "If a process cannot talk to others then it cannot know what is going on."
2/ The slightly less known, but equally as consequential is the "FLP result". It roughly says this:
"There is no algorithm to guarantee a consensus in an asynchronous network if at least one node may fail."
"There is no algorithm to guarantee a consensus in an asynchronous network if at least one node may fail."
3/ Asynchronous network means messages can be delayed arbitrary, without any upper bound.
Interestingly enough the result does NOT assume lost messages. Just delayed. Yet you cannot have a consensus algorithm!
That's quite a strong result, isn't it?
Interestingly enough the result does NOT assume lost messages. Just delayed. Yet you cannot have a consensus algorithm!
That's quite a strong result, isn't it?
4/ There is something what used to bother me in the past:
-FLP: You cannot have a consensus in an async model with just 1 failing process.
-Angry me: What about Raft, what about Paxos and its variants? Explain yourself!
-FLP: You cannot have a consensus in an async model with just 1 failing process.
-Angry me: What about Raft, what about Paxos and its variants? Explain yourself!
5/ -Hand-waving me: Well, Raft/Paxos they all assume partial synchrony, they really on some assumptions regarding synchrony.
-Smart-ass me: Well, if you assume (partial) synchrony then the problem is trivial. But what if the assumptions are violated? What happens then?!
-Smart-ass me: Well, if you assume (partial) synchrony then the problem is trivial. But what if the assumptions are violated? What happens then?!
6/ So if you are like the old me here is a better explanation:
Consensus algorithms assume partial synchrony for progress.
They do not assume that for safety. So delayed messages cannot lead to safety violations.
Consensus algorithms assume partial synchrony for progress.
They do not assume that for safety. So delayed messages cannot lead to safety violations.
7/ No matter how much they are delayed if a consensus is reached then everyone see the same value.
In other words: The consensus algorithms get away from FLP by guaranteeing only this: "Nothing bad will ever happen".
In other words: The consensus algorithms get away from FLP by guaranteeing only this: "Nothing bad will ever happen".
8/ But to guarantee "Something good will eventually happen" (=reach the consensus) they require partial synchrony. If messages are still arbitrary delayed then consensus may not ever be reached.
9/ Further resources:
The original paper: https://groups.csail.mit.edu/tds/papers/Lynch/jacm85.pdf
FLP paper walk-through: https://www.the-paper-trail.org/post/2008-08-13-a-brief-tour-of-flp-impossibility/
Paper on partial synchrony: https://groups.csail.mit.edu/tds/papers/Lynch/jacm88.pdf
The original paper: https://groups.csail.mit.edu/tds/papers/Lynch/jacm85.pdf
FLP paper walk-through: https://www.the-paper-trail.org/post/2008-08-13-a-brief-tour-of-flp-impossibility/
Paper on partial synchrony: https://groups.csail.mit.edu/tds/papers/Lynch/jacm88.pdf