Reactionary Visions

Fri, 13 Mar 2015

Paxos notes

I've been doing a lot of reading on Paxos lately. There are many papers to read: some of them are very good, some less so. Paxos is not very complex - there are only four message types in total in the classic case (though most implementations will end up with at least six) - but it is quite subtle. Very few of the papers say anything about why things need to be written to disk, or when, for example. The following are some notes of mine, which might be useful to others.

I'm not going to attempt to describe Paxos (though I may do so by accident). The most succinct description I've come across is the first paragraph of the Paxos Made Practical paper by David Mazières:

Paxos is a simple protocol that a group of machines in a distributed system can use to agree on a value proposed by a member of the group. If it terminates, the protocol reaches consensus even if the network was unreliable and multiple machines simultaneously tried to propose different values. The basic idea is that each proposal has a unique number. Higher numbered proposals override lower-numbered ones. However, a "proposer" machine must notify the group of its proposal number before proposing a particular value. If, after hearing from a majority of the group, the proposer learns one or more values from previous proposals, it must re-use the same value as the highest-numbered previous proposal. Otherwise, the proposer can select any value to propose.

Before going further, you probably need to have read the Paxos Made Simple paper and probably the Wikipedia page on Paxos too.

Far too many papers choose different terminology for the different phases of the protocol, so in the following I'm just going to call them phase 1 and phase 2. Phase 1 is the one where: (1a) a proposer invents a ballot number and sends that to a majority of the acceptors; (1b) each acceptor replies iff the ballot number from (1a) is greater than any ballot number it's previously seen in a (1a) message, and that reply contains the ballot number and value of the most recent (2b) message the acceptor has sent, if any. Phase 2 is the one where: (2a) a proposer asks the majority of acceptors to accept a value with a ballot number; (2b) each acceptor replies (and accepts the value) iff the ballot number is greater than or equal to the maximum ballot number it has seen in a (1a) message.
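To make the terminology concrete, here's a sketch of those message types in Go (the type and field names are entirely my own invention, not taken from any implementation; the Instance field anticipates the discussion below about running a stream of instances):

    package paxos

    // InstanceID namespaces each run of the protocol - see below on creating
    // a stream of values by running many instances.
    type InstanceID uint64

    // Ballot is a ballot number; how it's constructed is covered in the next sketch.
    type Ballot struct {
        Counter    uint64
        ProposerID uint64
    }

    // Msg1a (phase 1a): "promise me you'll ignore any lower ballot".
    type Msg1a struct {
        Instance InstanceID
        Ballot   Ballot
    }

    // Msg1b (phase 1b): the promise, carrying whatever this acceptor has
    // previously accepted (nil/empty if nothing).
    type Msg1b struct {
        Instance       InstanceID
        Ballot         Ballot
        AcceptedBallot *Ballot
        AcceptedValue  []byte
    }

    // Msg2a (phase 2a): "accept this value at this ballot".
    type Msg2a struct {
        Instance InstanceID
        Ballot   Ballot
        Value    []byte
    }

    // Msg2b (phase 2b): "accepted".
    type Msg2b struct {
        Instance InstanceID
        Ballot   Ballot
        Value    []byte
    }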

These papers normally talk about being able to tolerate the failure of up to F machines (acceptors, really). For this to be possible, Paxos still requires the availability of a majority of the original set of acceptors, so that means a total of 2F + 1 acceptors (if you have 2F + 1 machines then a majority of them is F+1 machines, hence F of them can fail and you still have access to a majority of the original set). The importance of a majority is that any two sets of F+1 machines drawn from the 2F + 1 will have at least one machine in common (the intersection is never empty). The whole point of ballot numbers (and the way they're constructed such that they can never collide) is so that at least one machine in any two majorities will be able to correctly order any two different ballot numbers. Thus the point of phase 1 is to figure out whether your ballot is currently the maximum ballot number known to each acceptor, and if it is, what value you're allowed to present in phase 2.
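Continuing the sketch, one common construction for collision-free, totally ordered ballot numbers (an assumption of mine - there are others) is to pair a per-proposer counter with the proposer's unique id and compare lexicographically:

    // Less orders ballots: counter first, proposer id as the tie-break. Two
    // distinct proposers can never mint equal ballots because their ids differ.
    func (b Ballot) Less(o Ballot) bool {
        if b.Counter != o.Counter {
            return b.Counter < o.Counter
        }
        return b.ProposerID < o.ProposerID
    }

    // NextBallot mints a ballot strictly greater than anything this proposer
    // has seen so far (e.g. when retrying after a nack or a failed ballot).
    func NextBallot(proposerID uint64, highestSeen Ballot) Ballot {
        return Ballot{Counter: highestSeen.Counter + 1, ProposerID: proposerID}
    }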

In the original papers, acceptors only ever reply if the ballot number from the proposer meets certain requirements. If it does not, the acceptors stay silent and the proposer is meant to determine through some sort of timeout that its message has failed. However, there's no need to implement it like this - several systems have the acceptors actively send back nack messages to the proposer. Paxos works perfectly well over unreliable communication channels, which means it's fine over UDP. However, UDP frequently doesn't work in the cloud due to cloud providers breaking PMTU discovery and not supporting fragmentation. In such an environment Paxos will not violate its properties, but you might find nothing makes any progress. If none of that applies to you and you use UDP, then you may well need to implement the timeouts too, in case the nacks (if you choose to use nacks) go missing and you can't detect that loss. If you're using TCP then you might decide you can just rely on the nacks (application-layer nacks, not TCP), not bother with timeouts, and also watch for the TCP connection dropping. The argument against timeouts is that the machine with whom you're communicating might just be under heavy load. But then again, is that acceptable for the service you're trying to build?

Paxos ensures that once a majority of the acceptors have accepted a value (by accepted, I mean an acceptor has received a phase 2a message which meets the requirements such that it intends to reply with a 2b message), it is impossible to get a majority of the acceptors to accept a different value. This, in combination with the construction of ballot numbers, means that if any two acceptors have accepted a value for the same ballot, it is the same value. An entire instance of Paxos creates consensus in perpetuity on one value only. Normally you want to create a stream of values, so you need to run lots of instances. How you construct your stream is up to you, but simply name-spacing each instance by the event log ID works just fine.

If in phase 1 you established your ballot number is the greatest ballot number ever and then you were just allowed to pick any old value, then it should be clear that anyone else could come along later, pick an even greater ballot number, and change the accepted value. So this is why phase 1b includes the value and ballot number for the most recent 2b message the acceptor has sent. The proposer has to receive 1b messages from a majority of acceptors before it can make further progress. Now let's pretend that each acceptor actually sends back all the ballot numbers and values for all 2b messages it's ever sent as part of this instance. The proposer now has the complete history of all values accepted, from a majority of acceptors. These histories can be different for each acceptor, though as said above, where any two acceptors accepted a value for the same ballot, the value will be the same. You can therefore create a list of all the accepted values, with the ballot numbers as indices; there may well be gaps in this list. What should the proposer now do? All the proposer can do is to cause the acceptors to append to this imaginary list - the acceptors will only act on a greater-than-ever-before ballot number, which means appending to the end of our imaginary list. We want future proposers to be forced to continue our work rather than diverge. If, for example, we forced the proposer to send a 2a message with the earliest accepted value, that would not cause more acceptors to agree on what their earliest accepted value is. So the only other sane choice is to force the proposer to send a 2a message with the value carried by the greatest accepted ballot number it knows of. This can't reduce the spread of this particular value: even if the proposer dies right here, the acceptors haven't lost anything. It can increase the spread of this value though, by passing it to acceptors who haven't previously accepted any value for this ballot number (and because we got a 1b reply from that acceptor, we know that our current ballot number is acceptable to that acceptor; from this point, we can see that we're most likely to make progress if the majority we send our 2a message to is the same as the majority we sent our 1a message to).
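In code, the rule the proposer has to follow after phase 1 might look like this (same sketch as above): take the value attached to the greatest accepted ballot reported in the 1b replies, and only fall back to a value of your own choosing if nobody reports anything:

    // ChooseValue implements the phase 1 -> phase 2 rule: if any acceptor in
    // the majority reported a previously accepted value, the proposer must
    // re-propose the value carried by the greatest such ballot; only if none
    // did may it propose its own value.
    func ChooseValue(own []byte, promises []Msg1b) []byte {
        var best *Ballot
        value := own
        for _, p := range promises {
            if p.AcceptedBallot == nil {
                continue
            }
            if best == nil || best.Less(*p.AcceptedBallot) {
                best = p.AcceptedBallot
                value = p.AcceptedValue
            }
        }
        return value
    }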

Once a value has been accepted by a majority of acceptors, any further valid 1a message from a proposer (by valid, I mean it does not get ignored or cause a nack) will guarantee the accepted value is returned in the 1b message and must be chosen again by the proposer in its next 2a message (by the definition of a majority, it is not possible for a different value to have been accepted at the same time by a different majority, even with a higher ballot number). Once a value is accepted by a majority of acceptors, at least one member of that majority is guaranteed to be in any other majority, and will thus present its accepted value in any 1b messages it sends. Acceptors are single-threaded in their dealings with Paxos: when an acceptor is dealing with a valid 2a message and sending a 2b response, it is not simultaneously processing 1a messages. Thus if an acceptor receives a valid 2a message and accepts that value, some other proposer may be sending 1a messages or even different 2a messages to this same acceptor, but they have to wait their turn. In both cases, if some 1a or 2a message arrives afterwards, it is ignored (or a nack sent) if its ballot number is too low; otherwise a 1a will elicit a response (1b) containing the newly accepted value. A 2a, however, cannot be valid: a proposer would only send a 2a if it got a 1b back, which implies its ballot number is the greatest, but this acceptor has just accepted a value, implying the accepted value's ballot number must be greater still. Thus in any interleaving involving multiple proposers and an acceptor, the only 2a that will be accepted is one bearing the ballot number of the most recently issued 1b message (or, put another way, a 2a will only be accepted from proposer ρ if the previous valid message the acceptor received was a 1a from proposer ρ - other than in the special case of the first ballot number, where phase 1 isn't necessary; covered below).
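The acceptor's side of those rules, continuing the sketch (these handlers are assumed to be called one message at a time from the acceptor's single processing loop, which is what gives the single-threaded behaviour just described; a real acceptor would keep one such record per instance):

    // Acceptor holds the per-instance state; a real acceptor would keep one
    // of these per InstanceID. Both handlers are only ever called from the
    // acceptor's single message-processing loop, one message at a time.
    type Acceptor struct {
        promised Ballot  // greatest ballot seen in a 1a
        accepted *Ballot // ballot of the last value accepted, nil if none
        value    []byte
    }

    // HandlePrepare replies (promises) iff the 1a ballot is strictly greater
    // than any 1a ballot seen so far; the reply carries whatever has already
    // been accepted. A false return means ignore, or send a nack.
    func (a *Acceptor) HandlePrepare(m Msg1a) (Msg1b, bool) {
        if !a.promised.Less(m.Ballot) {
            return Msg1b{}, false
        }
        a.promised = m.Ballot
        return Msg1b{Instance: m.Instance, Ballot: m.Ballot,
            AcceptedBallot: a.accepted, AcceptedValue: a.value}, true
    }

    // HandleAccept accepts iff the 2a ballot is greater than or equal to the
    // greatest 1a ballot seen so far.
    func (a *Acceptor) HandleAccept(m Msg2a) (Msg2b, bool) {
        if m.Ballot.Less(a.promised) {
            return Msg2b{}, false
        }
        a.promised = m.Ballot
        b := m.Ballot
        a.accepted, a.value = &b, m.Value
        return Msg2b{Instance: m.Instance, Ballot: m.Ballot, Value: m.Value}, true
    }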

All this talk of majorities is often confusing in combination with failures. The simplest formulation is that a proposer should send the 1a and 2a messages to all acceptors, and can make progress once it receives 1b and 2b responses (respectively) from a majority of them. This is the simplest way of dealing with the possibility of failures. However, it's frequently a good idea to optimise for the common case, which is when failures don't occur. Thus you can pick your majority of acceptors at the start (perhaps randomly, perhaps not) and communicate with just them, sending your 1a and 2a messages just to them and waiting for all of them to reply. But what happens if any of them fail? You're now not talking to a majority. The simplest choice here is to abandon the current ballot, pick a fresh majority (this is a majority of the original 2F+1 acceptors - don't shrink the set of acceptors just because of a failure; you should only shrink it when you know a machine is not going to come back - covered below), and start again from phase 1a with an increased ballot number.
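A sketch of that retry loop from the proposer's side; the transport and majority-selection functions are hypothetical stand-ins passed in by the caller, and each send function is assumed to report whether all of the chosen acceptors replied positively:

    // AcceptorID identifies an acceptor; how you address them is up to you.
    type AcceptorID uint64

    // Propose keeps trying ballots until one succeeds. pickMajority chooses
    // F+1 of the full 2F+1 acceptors; send1a and send2a are hypothetical
    // transport functions which report whether every chosen acceptor replied
    // positively. On any failure or nack the current ballot is abandoned, a
    // fresh majority is picked from the full set, and we go again with a
    // higher ballot number.
    func Propose(
        inst InstanceID, own []byte, all []AcceptorID, proposerID uint64,
        pickMajority func([]AcceptorID) []AcceptorID,
        send1a func([]AcceptorID, Msg1a) ([]Msg1b, bool),
        send2a func([]AcceptorID, Msg2a) bool,
    ) []byte {
        ballot := Ballot{ProposerID: proposerID}
        for {
            ballot = NextBallot(proposerID, ballot)
            quorum := pickMajority(all)
            promises, ok := send1a(quorum, Msg1a{Instance: inst, Ballot: ballot})
            if !ok {
                continue // abandon this ballot, retry with a fresh majority
            }
            // Note: value may not be own, if something was already accepted.
            value := ChooseValue(own, promises)
            if send2a(quorum, Msg2a{Instance: inst, Ballot: ballot, Value: value}) {
                return value // accepted by a majority: this instance is decided
            }
        }
    }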

When a proposer receives 2b messages back from a majority of acceptors, it knows the value chosen is never going to change for this particular instance of Paxos. The proposer can then disseminate this information as it chooses (this is often called phase 3, or the learning phase). If you plan it so, you can have the acceptors send their 2b messages to more than just the proposer: that way, several parties can learn at the same time that a value has been chosen, without needing the extra hop through the proposer. You'll need to deal with some learners dying whilst others don't, and the need to re-synchronise what's been learnt. The re-synchronising will however be easy because you won't have any conflicts - Paxos guarantees us that - so it's just a matter of combining everything that's been learnt from all available sources.
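Because Paxos guarantees there are no conflicts, re-synchronising two learners really is just a union of what they know (a sketch):

    // MergeLearnt combines what two learners know. Where both have learnt a
    // value for the same instance, Paxos guarantees the values are identical,
    // so merging is a plain union with no conflict resolution needed.
    func MergeLearnt(a, b map[InstanceID][]byte) map[InstanceID][]byte {
        out := make(map[InstanceID][]byte, len(a)+len(b))
        for k, v := range a {
            out[k] = v
        }
        for k, v := range b {
            out[k] = v
        }
        return out
    }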

The proposer only gets to propose its own value in a 2a message when there is no information returned in the 1b messages from a majority of acceptors. Thus if a proposer is trying to get a particular value added to an event log, it may have to take part in several instances of Paxos before it finds one where it's able to get 2a messages containing its own value to a majority of acceptors for the first ballot number of that instance. Note that in light of failures of acceptors it should not immediately skip to the next instance: it could be that it was able to get its value to some acceptors (albeit not a majority), and some other proposer just happened to pick some of those acceptors in its own majority and succeeded in spreading that value to a majority. It should only move on to a new Paxos instance once it has learnt that it has lost the current instance.

In this light, if the proposer fails after having sent some 2a messages, you have no knowledge as to what value is going to be agreed on by later ballots. If the proposer did manage to get 2a messages to the majority then yes, you have consensus. But if the 2a messages didn't reach a majority, then a different proposer can happen to choose a majority not including any of the previous acceptors, and get a totally different value accepted. Or it can choose a majority which has some acceptors in common with the original 2a messages, and thus complete the instance with the value as originally sent. So you can't assume anything about what will happen in an instance when a proposer dies.

If you happen to construct your system such that you know a particular instance will only ever be started by a particular proposer, then that proposer doesn't need to bother with phase 1 at all - it can start with a phase 2a message (you're guaranteeing there's no history of this instance of Paxos for the proposer to learn through phase 1). Of course, again in light of failures of acceptors it may need to revert to phase 1, but in the common case (no failures), this is a useful optimisation that can halve the number of messages.

The original Paxos papers talk about having the acceptors write their state to disk, though it's not really explained why. If you do have the acceptors write state to disk then it means they can be restarted and continue where they left off - they'll have to read their state off disk and send some more messages, perhaps redundantly, so your proposers will have to make sure they handle the messages they receive (i.e. 1b and 2b messages) idempotently, but you've probably done that anyway. But for this to work also requires that the restart of an acceptor isn't detected as a failure by the proposers, which may mean you're using UDP rather than TCP, which means you've probably gone down the timeout route. All of this means performance is unlikely to be amazing: the extra fsyncs necessary are going to hurt, the timeouts may have to be fairly generous (and how on earth do you decide how much time an acceptor is allowed to take to restart without it being considered a failure?), and detecting actual failures is going to be more complex.

If you decide to use TCP instead, and you decide that a TCP connection dropping indicates a failure at the other end, then an acceptor being restarted will be considered a failure of an acceptor. In which case, it doesn't matter if that acceptor loses its state. Consequently, the acceptors don't need to write anything to disk. To see this, consider the following: due to design, or some programming bug, you have always chosen the exact same F+1 acceptors as your majority. They have never crashed, and so they contain all the information to date about every instance of Paxos you've run. The remaining F acceptors contain nothing. Suddenly there's a power failure, and F of those chosen F+1 acceptors die. But Paxos can handle the death of up to F acceptors, so you're still fine. But now you have just F+1 acceptors left, and those F+1 acceptors are your only choice, ongoing, for the majority of acceptors. Crucially, that includes the 1 surviving acceptor from the original majority that has all the state. So nothing has been lost, even without writing anything to disk.

Now yes, if you need to be able to power down the entire system and then resume where you left off then you're going to have to write to disk. But that may still be outside of Paxos rather than within. For example, if you're using Paxos to achieve consensus on some stream of commands then the client which is submitting the command doesn't return until: 1) Paxos has successfully added the command to the stream (i.e. the client, playing the role of proposer, has had 2b messages back from a majority of acceptors for an instance of Paxos in which the proposer was able to pick its own value (command) for the 2a message); 2) the command has been processed by some later step (command processor) and some result returned. Now if the acceptors are co-located with the command processor and you need to turn everything off then does it matter if the stream of commands is lost? The client that submitted the command is just going to get an error, even if its command is eventually processed some time later, so do we really care if that command is lost?

It depends. Presumably the state of the command processors is written to disk every time the state changes, and if you have several of these processors then they could run at different speeds. So it's then a matter of: how do you re-synchronise the state of these command processors? Some of the papers show that you could use Paxos for this, which is true, but then you may need Paxos to maintain quite a history and add other messages to the acceptors so that you can efficiently catch up. Or you could build this re-synchronisation through other means, outside of Paxos, and just keep Paxos for speedy fault-tolerant non-blocking consensus.

Update (13th March 2015): There's a case I missed here. Consider that Paxos isn't writing to disk. Acceptors send out their 2b messages to learners. Only 1 learner receives at least F+1 2b messages before all the acceptors die. That 1 learner acts on the 2bs, proceeding in some way (which does involve writing to disk) before it too dies. At this point the system is blocked, because more than F acceptors have died; indeed the entire system is now turned off. Everyone now comes back up, except for the one node that received all the 2bs. Consequently, knowledge of this Paxos instance is lost, but the system is operational as no more than F acceptors are dead. If that one node ever comes back, you have divergence. Even worse, the one node that acted on the 2bs may have done enough work for the result of that instance to have become visible to the outside world.

So how do you actually deal with the intentional removal (or addition) of acceptors? One simple idea is that the set of identities of the acceptors amounts to a topology, and this is versioned (ver n). So if you want to add or remove an acceptor then you calculate what the new topology is, give it an increased version (ver n+1), and now run a Paxos instance to achieve consensus on what ver n+1 of the topology really is (I think you want to run this Paxos instance with the new topology rather than the old). All instances of Paxos reference the topology version. In a Paxos instance, once a value has been accepted by a majority, you cannot change the topology for that instance. If you increased the number of acceptors then you could render the old majority so small that none of its members are chosen in a later ballot, thus allowing the accepted value to be changed. If you decreased the number of acceptors then you could remove enough of the old majority that a new majority from the new topology need not include any of the original majority, and again the accepted value could change. Thus the easiest approach is simply that any 1a message carries the topology version, and once an acceptor has received a valid 1a message, the topology version for that Paxos instance is fixed. If an acceptor then receives a 1a or 2a message for that instance which has the wrong topology version, it will issue a nack indicating the correct topology version, hopefully forcing the proposer to discover the correct new topology. That does mean that if more than F acceptors fail, a Paxos instance can just get stuck. This you'll just have to deal with explicitly, having everyone detect this case and completely abort the instance.
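A sketch of the acceptor-side check (assuming the 1a and 2a messages from the earlier sketch gain a topology version field; the names here are invented):

    type TopologyVersion uint64

    // TopoNack tells a proposer that it used the wrong topology version, and
    // which version the acceptor believes is correct, so the proposer can go
    // and discover the new topology.
    type TopoNack struct {
        Instance        InstanceID
        CorrectTopology TopologyVersion
    }

    // TopoAcceptor extends the per-instance acceptor state with the pinned
    // topology version.
    type TopoAcceptor struct {
        Acceptor
        topology TopologyVersion
        pinned   bool
    }

    // checkTopology is called as part of validating a 1a or 2a: the first
    // valid 1a pins the version for this instance; thereafter any message
    // carrying a different version gets a nack back.
    func (a *TopoAcceptor) checkTopology(inst InstanceID, v TopologyVersion) (TopoNack, bool) {
        if !a.pinned {
            a.topology, a.pinned = v, true
            return TopoNack{}, true
        }
        if v != a.topology {
            return TopoNack{Instance: inst, CorrectTopology: a.topology}, false
        }
        return TopoNack{}, true
    }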

There are further complications here too. Consider the earlier case where the same majority of F+1 kept getting chosen, and then F of them died, leaving just 1 acceptor with all the information. If a topology change now happens, removing all the failed acceptors then all the information held by this special acceptor is certainly not held by a majority any more, and is in danger of being lost. So historical instances of Paxos must be made read only, and before the topology change is agreed, it may be essential to duplicate or otherwise make safe such data.

Hopefully the above is useful to some people, not just myself. I've struggled to find much information regarding the various approaches and tradeoffs when implementing Paxos. There are various papers such as Paxos Made Live which are certainly worth reading. But they don't seem to cover much of what I've tried to cover above. Such papers tend to record their concrete implementations rather than considering various different uses of Paxos and what the consequences are. Comments and corrections are very welcome - please let me know.

posted at: 21:30 | path: / | permanent link to this entry

Sun, 05 Oct 2014

Programming in the real world

One of the things that annoys me about Object Oriented Programming is how it's often suggested that it models the "real world". Frequently tutorials will start with creating an object modelling a chair, and through inheritance you'll be able to build up composable aspects of chairs: different numbers of legs, different colours, different designs. Sometimes they use tables rather than chairs. This is lovely, but it actually has everything to do with data modelling through inheritance, decomposition, abstraction and encapsulation, and almost nothing to do with Object Orientation: the key is that these chairs have no modifying methods on them. If they have any methods at all then they'll be for things like getting the number of legs or the colour, or volume or something - something that is basically fixed once the object is instantiated. At this point in such tutorials I'd probably claim this is not actually programming yet: all that's been achieved so far is that we've assigned some semantics to some numbers held in memory and we can write some numbers in memory. Programming is when we manipulate numbers: that involves reading and writing numbers.

The problem then is that Object Orientation immediately stops being about modelling the "real world" as soon as we can modify memory. If we think about how we actually would go about getting a chair made for us, it could go a bit like this:

  1. Go see your local carpenter,
  2. Have a discussion with them about the style and type of chair you'd like,
  3. They make the chair and give it to you in return for payment,
  4. You take the chair home,
  5. You decide you don't like the colour so you take it to your garage and repaint it yourself.

It should be clear that the inanimate object (the chair) is the odd one out here. Everything else is done by actors that have their own state, mainly act asynchronously, and can communicate with other actors through protocols - protocols that do not involve sharing mutable state (e.g. if I say something to you, that speech is immutable; you can't change what I've said (though you could choose to mishear me!)). At no point is any state of any actor actually exposed to another actor: I may share with you what I'm currently thinking, and you can try to influence me, but we don't exactly need a mutex around memory in my brain because YOU'RE NOT GETTING IN THERE!

If you tried modelling this sort of thing through Object Orientation without actors then you'd end up with your own thread doing all the work: it'd be you, it'd be the carpenter and it'd be the chair, maybe all at once. If your carpenter is in fact a growing business with a receptionist, a design team and a billing department, your thread would be playing those roles too and would probably have to use locks to avoid unexpected interactions with other threads doing the same commissioning-receptioning-designing-constructing-delivery-repainting dance. And all the time, whilst you're doing the carpentry yourself, you could easily have your own thoughts, feelings, aspirations and regrets all on the same stack for your carpenter-alias to mess with.

Thus Object Orientation causes multiple personality disorder.

So in my view, the way Object Orientation gets introduced tends to be more like "useful tools for modelling data". But the OO approach to manipulating that data goes wrong as soon as you try to model the animated real world. Firstly it has nothing to say about separating out threads to self-contained actors (but try this in a language or on a platform without green-threads, or without the ability to preempt threads and you can quickly hit pain), and secondly even if you do have actors, OO encourages the sharing of mutable data rather than passing around either immutable data or copies of data. Yes, good programming discipline can result in sane designs and a successful result, but it's not a core aspect of the OOP mantra.

So, OOP has nothing good to say on manipulating data at all - it either says nothing or it encourages silly ideas like using locks. The data modelling bits are fine, but I think they're a broader concept beyond the confines of OOP. What else does OOP get you? An arbitrary restriction on the receiver of any method. That's about it. It's thanks to this restriction that writing combinators like cons on a list library in an OO language is really painful.

This week Erik Meijer wrote an article called The Curse of the Excluded Middle: "Mostly functional" programming does not work. After an introduction, we get onto The Problem, which (paraphrasing) is that languages that are mainly imperative but offer some features from pure functional languages are not as safe as pure functional languages.

The first three examples, from C#, are certainly surprising to me (though I barely know any C# at all). The first two problems come from trying to compose side-effecting stuff with laziness. In the first case it's not clear whether the problem is with the IO operation (printing things out) or actually with the laziness, but more the odd behaviour of the Where operator (presumably the implementation of Where doesn't know that a Cartesian product isn't necessary, but surely any normal monadic/list-comprehension implementation wouldn't have this problem?). The second case is certainly the terrifying composition of laziness with throwing exceptions, and thus the exception having the potential to pop out anywhere the lazy expression gets forced. However, if you know the Select operator is lazy, it's not really that surprising. It's arguably piss-poor language design that there's nothing there to help you, but C# doesn't have checked exceptions; apparently programmers don't like having to deal with errors, so you reap what you sow. The third case is how C# has a nice using feature which binds a resource to a lexical scope. But if you construct a closure capturing the resource and then send it out of that lexical scope, then using goes wrong (it will still discard the resource even though there's a reference to it within the closure which remains in scope). This is certainly piss-poor language design: if the closure captures stuff from your lexical scope and you're not reference counting (or equivalent) your lexical scope then YOU'VE DONE IT WRONG. This is as bad as allocating stuff on your stack in C and then returning pointers to it.

Next he moves on somewhat tangentially to the point that if object creation is an observable action then you can't optimise it out. I'm not sure anyone outside a pure functional language specialist would ever want object creation to be optimised out, but the point is that if your constructor has side effects or can in any other way be observed then you can't have your language runtime do memoization of object creation. Doing side effects in object constructors has long been discouraged: I first read that back in the Effective Java book about a decade ago and I'm sure it wasn't exactly a ground-breaking piece of advice then.

So far, then, we have that untracked side effects have the potential to be bad: whether it's printing things out, or throwing exceptions, or discarding resources early, or preventing compiler optimisations. But next I feel the article goes a bit wrong. He first moves onto how channels in Cω can store state, so they're not pure either, thus bad. And then he goes onto how in Erlang you have the same problem, as you're just modelling mutable state in actors:

Note how this Erlang actor basically encodes an object with dynamic method dispatch using the pattern-matching, message-sending, and recursion primitives of the language, which you may happily leverage to implement mutable references, sabotaging the fact that the Erlang language does not natively expose mutable state.

This is wrong: you cannot implement mutable references in Erlang. Data is immutable in Erlang so if you send some value out of an actor, you are sending that value. Not a reference to a value or variable. Even if you create a closure and send that out of the actor, the closure is capturing those values as they exist at that point in time. If you have received a value sent to you from an actor, you may use it to create other values, but doing so does not affect the "original", and similarly, the actor itself can continue to modify its own state, but it does not affect the values it sent to you. Yes, you can use Erlang actors to model objects. But an actor's own modifications of its state cannot be observed as side effects on values you've previously retrieved from that actor, and vice versa.

The reference you have to an actor is a process identifier (also immutable) which does not present any information itself about the state of the actor. Through that, you can send messages to an actor and test whether or not the actor is still alive, but that is all. And in any case, where has the sudden objection to mutable state come from? State is just a catamorphism on prior inputs. State is not the problem: unconstrained side effects are the problem. Certainly sharing mutable state is a problem (and you could argue that mutating shared state is a side effect and that it should be tracked statically), but Erlang does not allow for that.

He may have been better off going for an example of opening a file, sending the file handle to another process, and then closing the file handle before it's been used (i.e. the same as the third C# example). Except:

  1. All file operations can return an error anyway so handling errors in such code is completely normal;
  2. In Erlang a file handle is normally an actor itself, so what you're doing is passing around a process identifier. Sending messages to a dead process (once the file is closed) is a normal activity and you can detect if the process has died in normal ways;
  3. If you bypass such normal file handling for performance reasons and open the file in "raw" mode then Erlang has a light form of object capabilities in which only the process that opened the file is allowed to use the file handle, so again the use of the file handle would error predictably;
  4. The language doesn't have the same C# feature for discarding resources once you return out of a lexical scope. Consequently closing a file is an explicit operation and given the asynchronous concurrent mindset one develops when working in Erlang, it's very likely you'll realise how odd it is to be closing a file handle whilst there's some closure out there which may not have been run yet.

Beyond this, he introduces the Haskell type system and explains that it captures side effects statically. As a result, by bowing to the demands of the type checker, it offers you a proof that if such effects occur, your program will handle them: exceptions will not go uncaught, IO operations are only permitted where the semantics lead to expected outcomes, resources are not used after they're discarded and the compiler can use all these proofs to do all manner of optimisations to your program.

These proofs can certainly be very valuable (though they are no substitute for disciplined, high quality design and careful implementation). Obviously, they don't capture everything though. Particularly relevant for concurrent and distributed programs, they don't capture sufficient side effects to allow for a proof of the absence of deadlocks. Haskell standard libraries contain channels and semaphores which can easily be used to sporadically end up with a deadlock between processes. A deadlock is definitely a side effect: the effect is the program probably stops working. The cause is an insufficient use of locks to control the scheduler (be it scheduling of OS threads or language runtime scheduling of green threads).
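To make that concrete - in Go rather than Haskell, since the shape of the bug isn't language-specific - here are two threads each waiting for the other before sending; in real systems this sort of thing tends to show up only sporadically, but the toy version deadlocks every time, and no type checker objects:

    package main

    // Two goroutines each wait to receive before they send: a deadlock that no
    // type checker has anything to say about. (The Go runtime happens to
    // detect this particular total deadlock at run time, but only once it has
    // already happened.)
    func main() {
        a := make(chan int)
        b := make(chan int)
        done := make(chan struct{})

        go func() {
            <-a // waits for the other goroutine...
            b <- 1
            done <- struct{}{}
        }()
        go func() {
            <-b // ...which is waiting for this one
            a <- 1
            done <- struct{}{}
        }()
        <-done
        <-done
    }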

More broadly, the proof a type checker offers is that the specification you've provided (type signatures) is not violated by its inferences about your code. Until the type checker allows "and makes progress" as part of a specification, Haskell itself is no safer than any other language that claims to be "mostly functional".

posted at: 10:22 | path: / | permanent link to this entry

Sat, 20 Sep 2014

Concurrency, Actors, Locks and mailboxes

Having been part of the original team that wrote RabbitMQ (indeed I wrote the very first prototype, back in the summer of 2006), and having worked full time with Erlang since 2009 until fairly recently, it's been interesting doing some work in Go recently.

Go's a language I currently have mixed feelings about. In some ways I like it - it's simple, it doesn't take too long to learn, the syntax is clean, the tool chain is superb (compilation is quick), performance is very good, and it's very easy to drop into C whenever you really need to. It also has one very nice high-end feature: first class channels - you can pass channels over channels, which is pretty powerful. But equally, it's statically typed and has a terrible type system (i.e. no generics. Personally, I don't feel like the proofs offered by type checkers are worth much to me, so I'd much rather have no static type checking than one that is this primitive and brain dead), it's not properly extensible (e.g. you can't create your own data structures which work with range), and worst of all, there's no pattern matching. The lack of pattern matching is particularly horrific given Go's "best practice" of returning tuples from functions (except they're not really tuples - another mistake in the language), the rightmost of which indicates an error or success. But you can't pattern match on the assignment, so you end up with endless if-statements that explicitly check the error for nil. There are other irritations which I've found, particularly related to its locks package (i.e. non-re-entrant; can't upgrade a read lock to a write lock; waiting for a write lock blocks all attempts to gain read locks. Yes, I know I'm free to implement my own locks package if I want to).
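To illustrate the error-handling complaint with a trivial, made-up example: every call returns a value together with an error, and each error has to be checked with its own if-statement, since there's nothing to pattern match on:

    package example

    import (
        "io"
        "os"
    )

    // copyFile is a made-up example of the shape described above: every call
    // returns (value, error) and each error needs its own explicit check.
    func copyFile(dst, src string) error {
        in, err := os.Open(src)
        if err != nil {
            return err
        }
        defer in.Close()

        out, err := os.Create(dst)
        if err != nil {
            return err
        }
        defer out.Close()

        _, err = io.Copy(out, in)
        return err
    }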

Go also doesn't push you to using actors - you have to re-implement all that yourself if you want to. In a recent project, I started off with some locks and within about three days found it utterly impossible to reason about which go-routines can hold which locks and whether or not there's any deadlock potential. Inevitably, there was. So I ripped all the locking code out and wrote my own actor loops.
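A minimal version of such an actor loop in Go might look like this (a sketch with invented names, not the code from that project): a goroutine owns the state, a channel is the mailbox, and everything else talks to it by sending requests:

    package actor

    // request is a message to the actor: a function to apply to the state,
    // and a channel which is closed once it has been handled.
    type request struct {
        apply func(state *int)
        done  chan struct{}
    }

    // Counter is a minimal actor: the goroutine running loop is the only
    // thing that ever touches state, and the mailbox is the only way in.
    type Counter struct {
        mailbox chan request
    }

    func NewCounter() *Counter {
        c := &Counter{mailbox: make(chan request, 16)}
        go c.loop()
        return c
    }

    func (c *Counter) loop() {
        state := 0
        for req := range c.mailbox {
            req.apply(&state)
            close(req.done)
        }
    }

    // Incr sends a message to the actor and waits for it to be processed.
    func (c *Counter) Incr() {
        req := request{apply: func(s *int) { *s++ }, done: make(chan struct{})}
        c.mailbox <- req
        <-req.done
    }

    // Get reads the state by asking the actor for it, synchronously.
    func (c *Counter) Get() int {
        var v int
        req := request{apply: func(s *int) { v = *s }, done: make(chan struct{})}
        c.mailbox <- req
        <-req.done
        return v
    }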

This was quite interesting, as here I could now be more flexible than in Erlang. I think most people take actors to mean "only one thread/process/routine can read and write its state" - there is a key concept of owning that state and regulating access to it. However, what I found is that I actually only needed to limit modifying the state to a single go-routine: each actor-receive-loop routine takes a write lock on its state whenever it needs to modify it, but it's perfectly reasonable to have anyone read the state, provided they take a read lock before doing so. The fact that we can share pointers in Go makes this possible, whereas it's impossible to do in Erlang (well, not quite - if you use ets then you can do it, which is exactly what we do in RabbitMQ in the rabbit_msg_store - but it's certainly not pretty!). So now we can have concurrent reads and no need to pass read-requests over a channel/mailbox. This seems pretty nice to me.
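A sketch of that arrangement (again, invented names): the actor's own goroutine is the only thing that ever takes the write lock, but any goroutine may read directly under a read lock instead of sending a read-request through the mailbox:

    package actor

    import "sync"

    // Cache: updates still go through the mailbox and are applied by the
    // actor's own goroutine (the only goroutine that ever takes the write
    // lock), but any goroutine may read the state directly under a read lock.
    type Cache struct {
        mailbox chan func(map[string]string)

        mu    sync.RWMutex
        state map[string]string
    }

    func NewCache() *Cache {
        c := &Cache{
            mailbox: make(chan func(map[string]string), 16),
            state:   make(map[string]string),
        }
        go c.loop()
        return c
    }

    func (c *Cache) loop() {
        for update := range c.mailbox {
            c.mu.Lock() // writes only ever happen here, one at a time
            update(c.state)
            c.mu.Unlock()
        }
    }

    // Set is asynchronous: it posts an update to the actor's mailbox.
    func (c *Cache) Set(k, v string) {
        c.mailbox <- func(m map[string]string) { m[k] = v }
    }

    // Get may be called from any goroutine; concurrent reads are fine.
    func (c *Cache) Get(k string) (string, bool) {
        c.mu.RLock()
        defer c.mu.RUnlock()
        v, ok := c.state[k]
        return v, ok
    }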

Recently I was reading a paper and it suggested that:

In message passing systems, processes interact exclusively by sending and receiving messages and they do not have access to shared memory.

Firstly, on a very technical note, they do have access to shared memory - the mailbox or queue is exactly that. The key reason why it leads to more composable systems is that when you hold the lock to write into a mailbox, you can never do anything other than write into that mailbox - you can never try to acquire multiple locks, so you can't deadlock in this way. And that's even assuming you're using locks for mailboxes - queues make lovely structures for lock-free concurrent access.

Secondly, as I suggest above, it appears to be safe to allow multiple concurrent readers of an actor's state, provided modifications to the state are done atomically by the actor thread - though more care has to be taken now to ensure updates are consistent: you have to make sure you update all the state you need to change in one go, under a write lock (the sort of transactional semantics you end up needing makes me think heavily of STM). Whilst I would probably still call such a system a "message passing system", I can certainly imagine others would disagree, and at a minimum it's some sort of hybrid (you could argue that the side effect of releasing the write lock when you've finished modifying the state is to publish an immutable copy of the state to any and all subscribers that want it - except without all that overhead. When viewed in these terms, it makes more intuitive sense that it's safe - provided of course that you don't do anything blocking whilst you're holding a state read-lock). This design also seems to get a fair bit trickier once you get to distributed systems and the need to have proxy objects representing the state and interface of a remote actor. By comparison, in Erlang a reference to an actor is an immutable process identifier, or Pid, which is easy to send around and reason about.

But mainly I was thinking about the pattern of data flow: a mailbox allows multiple writers to send data to a single reader (a gather operation, maybe). The actor loop allows the exact opposite: a single reader of the mailbox can then affect multiple things (a scatter) - either by sending out messages to many other actors (in essence, a push action), or by (as I suggest above) modifying state which can be concurrently read by many other actors (correspondingly, a pull action). In my mind's eye, I see a sort of concertina effect as all these messages are pushed into a mailbox, and then slowly the effects of each message spread out again. In some ways it seems slightly odd how powerful this is, but in other ways it makes perfect sense: if you consider a finite state machine, then your mailbox is just the stream of events coming into it, and you have your little automaton updating the state with some operation combining the current state with the current message. It is the very fact that the next state depends on the current state and the current message that requires mutual exclusion around modifying the state. And it is by ensuring that that mutual exclusion lock is (implicitly) held in the absence of any other locks that actor systems become so much easier to reason about and understand - any deadlocks that occur are at the protocol level and, if you model your protocols between actors properly, can be determined statically (though I'm not aware that anyone actually does this - false positives may abound).
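Stripped right down, that finite state machine view of the actor loop is just a fold of a transition function over the stream of messages (a sketch with a made-up message type):

    package fsm

    // Message is a stand-in for whatever your events are.
    type Message struct{ Delta int }

    // Run is the actor loop as a fold: nextState = transition(currentState, msg).
    // It is exactly this dependence of the next state on the current state
    // that requires the transition to run under (implicit) mutual exclusion.
    func Run(initial int, mailbox <-chan Message, transition func(int, Message) int) int {
        state := initial
        for msg := range mailbox {
            state = transition(state, msg)
        }
        return state
    }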

This then makes me think about how, once all the actors have done their initialisation and reached the core actor loop, the entire system is purely event driven. When looked at like this, are we really sure actors are enough? Are there not other forms of expression that capture the relation between events as inputs, state, and outputs more cleanly? In particular I'm thinking of things like the Join calculus and Functional Reactive Programming. Given that actors are apparently becoming rather more mainstream these days, I wonder if that really means they're only part of the solution: sure, I can write large distributed systems that scale, perform well, don't deadlock or livelock, and are exceedingly robust. But can I write them with less code and cleaner semantics?

posted at: 15:02 | path: / | permanent link to this entry

Welcome

Yet again, a new blog. This one is much simpler than the last. Whilst one always hopes, don't expect the frequency of posts to be greater than any of the previous blogs here... The styling is currently non-existent - seriously, there is no CSS right now. Sorry.

Gone is Serendipity; in its place is PyBlosxom. I'm currently trialling using muse to write posts. I'm not entirely sure it's worth it, but I need to get my emacs foo improved, so working with a new major mode is likely a good idea.

posted at: 13:32 | path: / | permanent link to this entry