I've been doing a lot of reading on Paxos lately. There are many papers to read; some of them are very good, some less so. Paxos is not very complex - there are only four message types in total in the classic case (though most implementations will end up with at least six) - but it is quite subtle. Very few of the papers speak at all about why things need to be written to disk, or when, for example. The following are some notes of mine, which might be useful to others.
I'm not going to attempt to describe Paxos (though I may do so by accident). The most succinct description I've come across is the first paragraph of the Paxos Made Practical paper by David Mazières:
Paxos is a simple protocol that a group of machines in a distributed system can use to agree on a value proposed by a member of the group. If it terminates, the protocol reaches consensus even if the network was unreliable and multiple machines simultaneously tried to propose different values. The basic idea is that each proposal has a unique number. Higher numbered proposals override lower-numbered ones. However, a "proposer" machine must notify the group of its proposal number before proposing a particular value. If, after hearing from a majority of the group, the proposer learns one or more values from previous proposals, it must re-use the same value as the highest-numbered previous proposal. Otherwise, the proposer can select any value to propose.
Far too many papers choose different terminology for the different phases of the protocol, so in the following I'm just going to call them phase 1 and phase 2. Phase 1 is the one where: (1a) a proposer invents a ballot number and sends it to a majority of the acceptors; (1b) each acceptor replies iff the ballot number from (1a) is greater than any ballot number it's previously seen in a (1a) message, and that reply contains the ballot number and value of any (2b) message it has previously sent. Phase 2 is the one where: (2a) a proposer asks the majority of acceptors to accept a value with a ballot number; (2b) each acceptor replies (and accepts the value) iff the ballot number is greater than or equal to the maximum ballot number it has seen in a (1a) message.
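The acceptor's side of those two rules can be sketched as a little state plus two handlers. This is a minimal sketch; the class and method names are mine, not from any paper:

```python
# Sketch of a Paxos acceptor: the two rules from phases 1 and 2 above.
# Ballot numbers just need a total order; plain ints are used here.

class Acceptor:
    def __init__(self):
        self.max_ballot = None   # highest ballot seen in a 1a message
        self.accepted = None     # (ballot, value) of the last 2b sent, if any

    def handle_1a(self, ballot):
        # Reply (1b) iff this ballot is greater than any seen in a 1a;
        # the reply carries any previously accepted (ballot, value).
        if self.max_ballot is None or ballot > self.max_ballot:
            self.max_ballot = ballot
            return ("1b", ballot, self.accepted)
        return None  # silence (or, in other designs, a nack)

    def handle_2a(self, ballot, value):
        # Accept (2b) iff the ballot is >= the maximum seen in a 1a.
        if self.max_ballot is None or ballot >= self.max_ballot:
            self.max_ballot = ballot
            self.accepted = (ballot, value)
            return ("2b", ballot, value)
        return None
```

Note that a stale 1a or 2a simply gets no reply here, matching the original formulation; a nack-based variant would return the current `max_ballot` instead of `None`.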
These papers normally talk about being able to tolerate the failure of up to F machines (actually, acceptors). For this to be possible, Paxos still requires the availability of a majority of the original set of acceptors. So that means a total of 2F + 1 acceptors (if you have 2F + 1 machines, then a majority of them is F + 1 machines, hence F of them can still fail and you still have access to a majority of the original set of machines). The importance of a majority is that if you choose any two sets of F + 1 machines, there will be at least one machine in common (the intersection is never empty). The whole point of ballot numbers (and the way they're constructed such that they can never collide) is that at least one machine in any two majority sets of machines will be able to correctly order any two different ballot numbers. Thus the point of phase 1 is to figure out whether your ballot is currently the maximum known ballot number at each acceptor, and if it is, what value you're allowed to present in phase 2.
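One common way to get those collision-free, totally ordered ballot numbers (this encoding is an assumption of mine, not something the papers mandate) is to pair a per-proposer round counter with a unique proposer id and compare lexicographically:

```python
# Collision-free ballot numbers: (round counter, unique proposer id),
# compared lexicographically. No two proposers can ever produce the
# same ballot, and any acceptor can order any two distinct ballots.

def ballot(round_no, proposer_id):
    return (round_no, proposer_id)
```

A proposer retries by bumping its own round counter, so its next ballot is strictly greater than any ballot it has previously seen.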
In the original papers, acceptors only ever reply iff the ballot number from the proposer meets certain requirements. If it does not, the acceptors stay silent and the proposer is meant to determine through some sort of timeout that its message has failed. However, there's no need to implement it like this - several systems have the acceptors actively send back nack messages to the proposer. Paxos works perfectly well on unreliable communication channels, which means it's fine over UDP. However, UDP frequently doesn't work in the cloud due to cloud providers breaking PMTU discovery and not supporting fragmentation. In such an environment Paxos will not violate its properties, but you might find nothing makes any progress. If none of that applies to you and you do use UDP, then you may well need to implement the timeouts too, in case the nacks go missing (if you choose to use nacks at all) and you can't detect that loss. If you're using TCP then you might decide you can just rely on the nacks (application-layer nacks, not TCP), not bother with timeouts, and also watch for the TCP connection dropping. The argument against timeouts is that the machine with whom you're communicating might just be under heavy load. But then again, is that acceptable for the service you're trying to build?
Paxos ensures that once a majority of the acceptors have accepted a value (by accepted, I mean an acceptor has received a phase 2a message which meets the requirements such that it intends to reply with a 2b message), it is impossible to get a majority of the acceptors to accept a different value. This, in combination with the construction of ballot numbers, means that if any two acceptors have accepted a value for the same ballot, it is the same value. An entire instance of Paxos creates consensus in perpetuity on one value only. Normally you want to create a stream of values, so you need to run lots of instances. How you construct your stream is up to you, but simply name-spacing each instance by the event log ID works just fine.
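That name-spacing can be as simple as keying independent per-instance acceptor state by the log index. A sketch, with illustrative names, showing just the 1a handler:

```python
from collections import defaultdict

# One independent Paxos instance per event-log index: the per-instance
# state is just [max ballot seen in a 1a, (ballot, value) accepted].

class InstancedAcceptor:
    def __init__(self):
        self.state = defaultdict(lambda: [None, None])

    def handle_1a(self, index, ballot):
        st = self.state[index]
        if st[0] is None or ballot > st[0]:
            st[0] = ballot
            return ("1b", index, ballot, st[1])
        return None  # instances never interfere with one another
```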
If in phase 1 you established your ballot number is the greatest ballot number ever, and then you were just allowed to pick any old value, then it should be clear that anyone else could come along later, pick an even greater ballot number, and change the accepted value. This is why the 1b message includes the value and ballot number of the most recent 2b message the acceptor has sent. The proposer has to receive 1b messages from a majority of acceptors before it can make further progress. Now let's pretend that each acceptor actually sends back all the ballot numbers and values for all 2b messages it's ever sent as part of this instance. The proposer now has the complete history of all values accepted, from a majority of acceptors. These histories can be different for each acceptor, though as said above, where any two acceptors accepted a value for the same ballot, the value will be the same. You can therefore create a list of all the accepted values, with the ballot numbers as indices; there may well be gaps in this list. What should the proposer now do? All the proposer can do is cause the acceptors to append to this imaginary list - the acceptors will only act on a greater-than-ever-before ballot number, which means appending to the end of our imaginary list. We want future proposers to be forced to continue our work rather than diverge. If, for example, we forced the proposer to send a 2a message with the earliest accepted value, that would not cause more acceptors to agree on what their earliest accepted value is. So the only other sane choice is to force the proposer to send a 2a message with the value of the greatest ballot number it knows of that's been accepted. This can't reduce the spread of this particular value: even if the proposer dies right here, the acceptors haven't lost anything.
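The value-selection rule this paragraph arrives at can be sketched as a small function (the name is illustrative): among the 1b replies from a majority, take the value with the greatest accepted ballot, falling back to the proposer's own value only when no acceptor reported anything.

```python
# Proposer's choice of value for its 2a message, given the (ballot,
# value) pairs (or None) carried by the 1b replies from a majority.

def choose_value(replies_1b, own_value):
    accepted = [r for r in replies_1b if r is not None]
    if not accepted:
        return own_value  # free choice: nothing has been accepted
    # Must re-use the value with the greatest accepted ballot number.
    _, value = max(accepted, key=lambda bv: bv[0])
    return value
```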
It can, however, increase the spread of this value, by passing it to acceptors who haven't previously accepted any value (and because we got a 1b reply from each such acceptor, we know that our current ballot number is acceptable to it; from this, we can see that we're most likely to make progress if the majority we send our 2a message to is the same as the majority we sent our 1a message to).
Once a value has been accepted by a majority of acceptors, any further valid (by valid, I mean it does not get ignored, or cause a nack) 1a message from a proposer will guarantee the accepted value is returned in the 1b message and must be chosen again by the proposer in its next 2a message (by the definition of a majority, it is not possible for a different value to have been accepted at the same time by a different majority (even with a higher ballot number)). Once a value is accepted by a majority of acceptors, at least one member of that majority is guaranteed to be in any other majority, and will thus present its accepted value in any 1b messages it sends. Acceptors are considered single-threaded in their dealings with Paxos. So when an acceptor is dealing with a valid 2a message and sending a 2b response, it is not simultaneously processing 1a messages. Thus if an acceptor receives a valid 2a message and accepts that value, some other proposer may be sending 1a messages or even different 2a messages to this same acceptor, but they have to wait their turn. In both cases: if some 1a or 2a message arrives afterwards, it is ignored (or a nack sent) if its ballot number is too low; otherwise a 1a will elicit a response (1b) containing the newly accepted value. A 2a, however, cannot be valid: a proposer would only send a 2a if it got a 1b back, which implies its ballot number is the greatest, but this acceptor has just accepted a value, implying the accepted value's ballot number must be greater still. Thus in any interleaving involving multiple proposers and an acceptor, the only acceptable value will be the one with the ballot number of the most recently issued 1b message (or put another way, a 2a will only be accepted from proposer ρ if the previous valid message the acceptor received was a 1a from proposer ρ (other than the special case of the first ballot number, where phase 1 isn't necessary - covered below)).
All this talk of majorities is often confusing in combination with failures. The simplest formulation is that a proposer should send the 1a and 2a messages to all acceptors, and can make progress when it receives 1b and 2b message responses (respectively) from a majority of them. This is the simplest way of dealing with the possibility of failures. However, it's frequently a good idea to optimise for the common case, which is when failures don't occur. Thus you can pick your majority of acceptors at the start (perhaps randomly, perhaps not) and communicate with just them, sending your 1a and 2a messages just to them, and waiting for all of them to reply. But what then happens if any of them fail? You're now not talking to a majority. The simplest choice here is to abandon the current ballot, pick a fresh majority (this is a majority of the original 2F+1 acceptors - don't shrink your set of acceptors just because of a failure; you should only shrink it when you know a machine is not going to come back - covered below), and start again from phase 1a with an increased ballot number.
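The abandon-and-retry loop described above might look like the following. This is a sketch under some assumptions of mine: the transport is modelled as a list of callables (one per acceptor) that take a 1a ballot and return a 1b reply or raise `ConnectionError` on failure, and all the names are illustrative.

```python
import random

# Phase 1 against a chosen majority, abandoning the ballot and picking
# a fresh majority of the full acceptor set whenever anything fails.

def run_phase1(acceptors, proposer_id, max_tries=10):
    quorum_size = len(acceptors) // 2 + 1
    round_no = 0
    for _ in range(max_tries):
        round_no += 1
        ballot = (round_no, proposer_id)  # collision-free ballot number
        quorum = random.sample(acceptors, quorum_size)  # fresh majority
        try:
            replies = [send_1a(ballot) for send_1a in quorum]
        except ConnectionError:
            continue  # an acceptor failed: new majority, higher ballot
        if all(r is not None for r in replies):
            return ballot, replies  # phase 1 complete for this ballot
        # silence or nack from someone: retry with an increased ballot
    return None
```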
When a proposer receives 2b messages back from a majority of acceptors, it knows the value chosen is never going to change for this particular instance of Paxos. The proposer can then disseminate this information as it chooses (this is often called phase 3/learning phase). If you plan it so, you can have the acceptors send their 2b message to not-just-the-proposer: that way, several parties can learn at the same time that a value has been chosen, without needing the extra hop of going through the proposer. You'll need to deal with some learners dying, whilst others don't, and the need to re-synchronise what's been learnt. The re-synchronising will however be easy because you won't have any conflicts - Paxos guarantees us that. So it should just be adding everything that's been learnt together from all available sources.
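A learner along those lines just counts 2b messages per ballot and declares the value chosen once a majority (F+1 of the 2F+1 acceptors) has accepted it. A sketch, with illustrative names:

```python
# Sketch of a learner: a value is chosen once 2b messages for the same
# ballot have arrived from a majority of the acceptors.

class Learner:
    def __init__(self, n_acceptors):
        self.quorum = n_acceptors // 2 + 1  # F+1 out of 2F+1
        self.votes = {}                     # ballot -> set of acceptor ids
        self.chosen = None

    def handle_2b(self, acceptor_id, ballot, value):
        self.votes.setdefault(ballot, set()).add(acceptor_id)
        if len(self.votes[ballot]) >= self.quorum:
            self.chosen = value  # once chosen, this can never change
        return self.chosen
```

Re-synchronising learners is then just a union of what each has learnt, since Paxos guarantees there can be no conflicts.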
The proposer only gets to propose its own value in a 2a message when there is no information returned in the 1b messages from a majority of acceptors. Thus if a proposer is trying to get a particular value added to an event log, it may have to take part in several instances of Paxos before it finds one where it's able to get 2a messages to a majority of acceptors for the first ballot number of that instance. Note that in light of failures of acceptors it should not immediately skip to the next instance: it could be that it was able to get its value to some acceptors (albeit not a majority), and some other proposer just happened to pick some of those acceptors in its own majority, and succeeded in spreading that value to a majority. It should only move on to a new Paxos instance once it has learnt it has lost the current instance.
In this light, if the proposer fails after having sent some 2a messages, you have no knowledge as to what value is going to be agreed on by later ballots. If the proposer did manage to get 2a messages to the majority then yes, you have consensus. But if the 2a messages didn't reach a majority, then a different proposer can happen to choose a majority not including any of the previous acceptors, and get a totally different value accepted. Or it can choose a majority which has some acceptors in common with the original 2a messages, and thus complete the instance with the value as originally sent. So you can't assume anything about what will happen in an instance when a proposer dies.
If you happen to construct your system such that you know a particular instance will only ever be started by a particular proposer, then that proposer doesn't need to bother with phase 1 at all - it can start with a phase 2a message (you're guaranteeing there's no history of this instance of Paxos for the proposer to learn through phase 1). Of course, again in light of failures of acceptors it may need to revert to phase 1, but in the common case (no failures), this is a useful optimisation that can halve the number of messages.
The original Paxos papers talk about having the acceptors write their state to disk, though it's not really explained why. If you do have the acceptors write state to disk then it means they can be restarted and continue where they left off - they'll have to read their state off disk and send some more messages, perhaps redundantly, and so your proposers will have to make sure they handle the messages they receive (i.e. 1b and 2b messages) idempotently, but you've probably done that anyway. But for this to work also requires that the restart of an acceptor isn't detected as a failure by the proposers, which may mean you're using UDP rather than TCP, which means you've probably gone down the timeout route. All of this means performance is unlikely to be amazing: the extra fsyncs necessary are going to hurt, the timeouts may have to be fairly generous (and how on earth do you decide what amount of time is allowed for an acceptor to restart without considering it a failure?), and detecting actual failures is going to be more complex.
If you decide to use TCP instead, and you decided that a TCP connection dropping indicates a failure at the other end, then it means that an acceptor being restarted will be considered a failure of an acceptor. In which case, it doesn't matter if that acceptor loses its state. Consequently, the acceptors don't need to write anything to disk. To see this, consider the following: due to design, or some programming bug, you have always chosen the exact same F+1 as your majority of acceptors. They have never crashed, and so they contain all the information to date about every instance of Paxos you've run. The remaining F acceptors contain nothing. Suddenly there's a power failure, and F of those chosen F+1 acceptors die. But Paxos can handle the death of up to F acceptors, so you're still fine. But now you have just F+1 acceptors left, and those F+1 acceptors are your only choice, ongoing, for the majority of acceptors. Crucially, that includes the 1 surviving acceptor from the original majority that has all the state. So nothing has been lost, even without writing anything to disk.
Now yes, if you need to be able to power down the entire system and then resume where you left off then you're going to have to write to disk. But that may still be outside of Paxos rather than within. For example, if you're using Paxos to achieve consensus on some stream of commands then the client which is submitting the command doesn't return until: 1) Paxos has successfully added the command to the stream (i.e. the client, playing the role of proposer, has had 2b messages back from a majority of acceptors for an instance of Paxos in which the proposer was able to pick its own value (command) for the 2a message); 2) the command has been processed by some later step (command processor) and some result returned. Now if the acceptors are co-located with the command processor and you need to turn everything off then does it matter if the stream of commands is lost? The client that submitted the command is just going to get an error, even if its command is eventually processed some time later, so do we really care if that command is lost?
It depends. Presumably the state of the command processors is written to disk every time the state changes, and if you have several of these processors then they could run at different speeds. So it's then a matter of: how do you re-synchronise the state of these command processors? Some of the papers show that you could use Paxos for this, which is true, but then you may need Paxos to maintain quite a history and add other messages to the acceptors so that you can efficiently catch up. Or you could build this re-synchronisation through other means, outside of Paxos, and just keep Paxos for speedy fault-tolerant non-blocking consensus.
Update (13th March 2015): There's a case I missed here. Consider that Paxos isn't writing to disk. Acceptors send out their 2b messages to learners. Only one learner receives at least F+1 2b messages before all the acceptors die. That one learner acts on the 2bs, proceeding in some way (which does involve writing to disk) before it too dies. At this point, the system is blocked because more than F acceptors have died, and indeed the entire system is now turned off. Everyone now comes back up, except for the one node that received all the 2bs. Consequently, knowledge of this Paxos instance is lost, but the system is operational, as no more than F acceptors are dead. If that one node ever comes back, you have divergence. Even worse, that one node that acted on the 2bs may have done enough work that the result of that instance became visible to the outside world.
So how do you actually deal with intentional removal (or addition) of acceptors? One simple idea is that the set of identities of acceptors amounts to a topology, and this is versioned (ver n). So if you want to add or remove an acceptor then you calculate what the new topology is, give it an increased version (ver n+1), and now run a Paxos instance to achieve consensus on what ver n+1 of the topology really is (I think you want to run this Paxos instance with the new topology rather than the old). All instances of Paxos reference the topology version. In a Paxos instance, once a value has been accepted by the majority, you cannot change the topology for that instance. If you increased the number of acceptors then you could render the old majority so small that none of them are chosen in a later ballot, thus allowing the accepted value to be changed. If you decreased the number of acceptors then you could remove enough of the old majority such that a new majority from the new topology need not include any of the original majority, and again the accepted value could change. Thus the easiest approach is simply that any 1a message carries the topology version, and once an acceptor has received a valid 1a message, the topology version for that Paxos instance is fixed. If an acceptor then receives a 1a or 2a message for that instance with the wrong topology version, it will issue a nack indicating the correct topology version, hopefully forcing the proposer to discover the correct new topology. That does mean that if more than F acceptors fail, a Paxos instance can just get stuck. This you'll just have to deal with explicitly, having everyone detect this case and completely abort the instance.
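The acceptor's side of that topology-version check could be sketched as follows (a minimal sketch; the function and field names are mine, not from any paper):

```python
# An acceptor pins the topology version for an instance at the first
# valid 1a it sees, and nacks any later message carrying a different
# version, telling the proposer the version it should be using.

def check_topology(pinned_version, msg_version):
    if pinned_version is None:
        return ("ok", msg_version)       # first valid 1a pins the version
    if msg_version != pinned_version:
        return ("nack", pinned_version)  # proposer must rediscover topology
    return ("ok", pinned_version)
```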
There are further complications here too. Consider the earlier case where the same majority of F+1 kept getting chosen, and then F of them died, leaving just 1 acceptor with all the information. If a topology change now happens, removing all the failed acceptors then all the information held by this special acceptor is certainly not held by a majority any more, and is in danger of being lost. So historical instances of Paxos must be made read only, and before the topology change is agreed, it may be essential to duplicate or otherwise make safe such data.
Hopefully the above is useful to some people, not just myself. I've struggled to find much information regarding the various approaches and tradeoffs when implementing Paxos. There are various papers such as Paxos Made Live which are certainly worth reading. But they don't seem to cover much of what I've tried to cover above. Such papers tend to record their concrete implementations rather than considering various different uses of Paxos and what the consequences are. Comments and corrections are very welcome - please let me know.