Let's build! A distributed, concurrent editor: Part 4 - Writing to disk

In this series:

Last week’s article was not very exciting. I know, and I’m sorry. As I’ve said, I don’t enjoy programming browsers, I struggle to build up much enthusiasm for it, or for writing about it. This week is much more my bag. This week, and the next few articles in this series (I reckon I’m about half way through), are what I consider fun.

I need to write the server. There’s some normal stuff to do: running an HTTP server; accepting WebSockets; managing a bit of state associated with that. But there’s also managing the state of the document, and recording it and its history on disk. I need to figure out what to do when the server receives an undo or redo message: firstly what the resulting document is, but secondly what then to send out to the clients in order to update them. Only the server has the document history, not the clients. So the server has to interpret the undo and redo messages and update the clients with the consequence of those messages.

Deltas or Consequences?

It’s difficult to know where to start, in terms of figuring out what to write to disk. One decision I need to make is whether to try and store deltas, i.e. modifications to the document, or the consequences of those modifications, i.e. the full document.

  • A delta, or a change to the document, might look like: “change word with identifier 5ce63g from dog to cat
  • Whereas the consequence would be the resulting document: the best possible pet is a cat

The client is already sending the server deltas; in last week’s article I covered how the client models the document as a linked list of words. When the user types, the client marks the modified words as dirty. Only the dirty words get sent up to the server.

The client would be happy to receive either deltas or consequences: it expects the full document whenever it establishes a connection to the server; but as each word has its own version, and the version is pretty much a Lamport Clock, the client would behave the same way regardless of whether it receives a stream of deltas or consequences. This is quite a nice aspect of the design: the protocol and algorithms I’ve covered so far are all (I believe) fairly simple, but they still permit flexibility in the design and implementation of other parts of the system.

The client doesn’t really care what it receives, so it doesn’t help guide my decision as to what to write to disk. I could still store endless copies of the full document on disk, but keep track of what’s been sent to the clients, and calculate the difference between the two whenever I need to update the clients. Thus I could still be storing consequences but sending deltas.

If I choose to store deltas, and those deltas include the plain undo and redo messages that the clients send, then what’s on disk would be a list of these deltas, from the creation of the empty document onwards. Replaying them, and correctly interpreting them, should result in the correct current state of the document. It feels quite nice that I could receive deltas from clients, append them to some file on the server, and make no further writes to disk at all: that seems fairly simple and efficient to me. It’s not at all clear how the interpretation of those deltas would work: I’d need to figure out some algorithms there; but it’s appealing to me that the interpretation of the deltas is a distinct activity to writing to disk. Writing to disk this way (not much data to write; appending only; not modifying existing data on disk) feels like it could be fast and efficient, and that would help the server to scale to many documents and many users. One downside I can see is that it’s not great to have to replay and interpret the document’s entire editing history from the dawn of time in order to calculate the current state of the document. Some sort of check-point mechanism might be helpful if I go down this route.

But what about the alternative: writing consequences? I have to keep the history of the document on disk so that I can restart the server and still be able to cope with new undo messages arriving. In the simplest design, I would write out the full document after every modification. This seems distasteful, but I want to see if other things get easier if I follow through with this design.

A scenario: a document has been created, and a few modifications have happened. The document currently consists of four words, “The brown fox jumps”. This is shown in figure 1, which also shows the full document after some preceding modifications.

'The red fox', then 'The red fox jumps', and now 'The brown fox jumps'

Figure 1: A few modifications have led to this document.

So now a user presses undo, and the client sends the server an undo message. What should I write to disk? It’s tempting to append the previous version of the document; this seems simple and reasonably intuitive. On disk, this would look like:

First 'The red fox', then 'The red fox jumps', followed by 'The brown fox jumps'. The user presses undo so lastly back to 'The red fox jumps'

Figure 2: After the user presses undo

What if the user now presses redo? How do I figure out that the redo is firstly legal (i.e. something has been undone which can now be redone), and secondly find where it is and append the consequence of the redo? After thinking about multiple undos and redos, this starts to look very tricky to me. Maybe I shouldn’t write the consequences of undo and redo to disk, and instead maintain some sort of pointer so that I know where in the sequence of normal edits I am?

First 'The red fox', then 'The red fox jumps', followed by 'The brown fox jumps'. The user presses undo but this moves the pointer back to the previous state of the document

Figure 3: After the user presses undo, but now with a pointer

This would mean for a normal edit, I must write to disk the new document and make sure the pointer is on disk pointing at the latest state of the document. If I receive an undo, I must update the pointer on disk, moving it one to the left (assuming I can go left), and if I receive a redo, I must update the pointer on disk moving it one to the right (assuming I can go right). What happens if I receive an undo follow by a normal edit?

The user presses undo, moving the pointer back to 'The red fox jumps', and then edits to become 'The red fox leaps'

Figure 4: The user presses undo and then edits

In this case, the consequence of the new edit must replace everything to the right of the pointer. So the consequence of an edit is not simply appended to the list of consequences; first that list must be truncated. I explored these sorts of sequences of events back in the 2nd article in this series. Having to maintain on disk this pointer is not the end of the world, and nor is having to truncate the list or overwrite existing data. But certainly the use of disk is now more complex than if I just write deltas.

Will this do though? If I implement this will it work? I want to think more about what happens when the server receives that undo message. I know it can update the pointer on disk, which will reveal to the server the correct new state of the document, and that document state now needs sending to all the clients. Assuming that on disk I’ve stored the document as a linked list of words, using the protocol I’ve created and used in previous articles, can I just read that document from disk and send it to all the clients?

Sadly, no. The reason why not, affects me regardless of whether I’m writing out deltas or consequences, so this isn’t a nail-in-the-coffin of writing consequences. But it does show the apparent simplicity of storing consequences is erroneous.

Last week I worked through the algorithms in the client. The client will ignore an updated word it receives if the version of that word is not greater than the version it currently has. This is how the system brings some small amount of sanity to concurrent modification. If I draw out the consequences of the edits, but now add the versions of each word, in green, above the word, it would look like this:

Before the user presses undo, the words of 'The brown fox jumps' are at versions 3, 5, 3, 1, respectively.

Figure 5: Just before the user presses undo, with word versions shown

The server now receives undo. It wants to change the word brown back into the word red. The word brown is currently at version 5. If the server just reads the previous document and sends that out to all the clients, they’ll receive an update for the word, trying to change the word to red, but that update for the word will carry version 4. And so the clients will ignore it, because they already have that word at version 5. It needs to be sent out at version 6, or higher.

So, I have to make sure that the words in the new state of the document, that differ from what the server has previously sent to the clients, have higher version numbers than the client has previously seen. This is not as simple as bumping all the words in the previous state of the document; this is making sure the server knows, for each word, what is the highest version number the server has ever sent to the clients, and then bumping that. This is extra state that I could store on disk alongside the pointer: a mapping from WordId to Version to record the highest version of each word ever sent out.

This is the case for both undo and redo. Consider a sequence of events: undo, redo, undo, redo, undo, redo, then the server is restarted, and lastly another undo. How is the server meant to know which version numbers to use for this final undo? It should be at least 7 greater (because of the 6 previous events) than the version numbers recorded in the consequences on disk. So the consequences on disk, and the position of the pointer, are not sufficient on their own to allow me to derive the current version number for any word. Hence the extra state must be written to disk too. This now means for undo and redo events, the server will have to adjust the pointer, bump version numbers as appropriate, and then write all that data back to disk. For normal edits, it would have to do the same, plus truncating and appending the new document state to disk.

This design has become too ugly for me. There’s quite a lot of state on disk now, and the updates to that state are not trivial. I’m going to abandon this approach, and pick appending deltas (including plain undo and redo messages) to disk only. In other words, writing an event log to disk.

Interpreting deltas

The plan so far: receive messages from the clients, append them to disk in some way. For edit messages, the only other thing to do is to send the edit to everyone else. All the fun comes from undos and redos. This is why I gave Bob and Alice such death-stares back in episode 1 when they asked for this feature.

I shall define a term here: a frame. Each frame combines a single edit event with counts of the number of times that edit event has been undone, and redone. Every edit event is wrapped by exactly one frame. Assume I manage to transform my event log of deltas into a list of frames somehow. I could then walk through this list of frames to calculate the state of the document. If, for example, I see a frame where the undo count is greater than the redo count, then I know I shouldn’t apply the edit in that frame to the document state, at least at this point. But I’m getting ahead of myself; what exactly does this list of frames look like?

Each delta becomes a pointer to a frame. The same frame can appear several times.

Figure 6: Transforming a sequence of deltas into pointers to frames

The list of frames is the same length as the list of deltas. But, a single frame can appear several times in the list: the first time it appears (left most / oldest) will be when the frame’s edit message happened. The next time it appears will be the first time it was undone. The time after that is when it was redone, and so on.

I can now walk over this list of frames and calculate the document. An edit must be applied to the document at the correct point: if the edit was ultimately undone then it should never be applied. I.e. undoing an edit is the same as not applying that edit in the first place. If an edit was undone and redone then it should be applied to the document at the point where it was redone, and not earlier. I use the following logic to decide whether or not to apply an edit to the document as I loop through the list of frames:

  • If undoCount == 0 && redoCount == 0 && undoNext then apply the frame’s edit to the document. This frame will not occur later in the list.
  • Else, if undoNext then decrement undoCount by 1, set undoNext = false.
  • Else, if !undoNext then decrement redoCount by 1, set undoNext = true.

If I try out a few base-cases, this appears to be correct:

  • If an edit was never undone (or redone), then its frame will appear once, undoCount and redoCount will be 0, and undoNext will be true (the default value when a frame is constructed). So the first time the frame is encountered in the list, its edit will be applied.

  • If an edit was undone (and never redone), then its frame will appear twice. Initially undoCount will be 1. When the frame is first encountered, the edit will not be applied, undoCount will be decremented to 0, and undoNext will be flipped from true to false. When the frame is next encountered, undoCount and redoCount will both be 0, but undoNext will be false, so the edit will not be applied, and the frame is never seen again. Thus the edit has been undone.

  • If an edit was undone and redone once each, then its frame will appear three times. Initially undoCount and redoCount will both be 1. The last time the frame is encountered, both undoCount and redoCount will be 0, and undoNext will be true, so the edit will be applied to the document at that point, and not before.

Earlier, I talked about the fact that undos and redos need to bump versions of words; they cannot use the versions of the words that are in the corresponding edit. If they do, then the clients will ignore the messages they receive from the server, as the versions will be too low. I can sort this out here:

  • The first time I see a frame, I must make sure that every word edited by the frame, that exists in the document state, has a version at least as high as in the edit. Think about the sequence of events as they would have been received by the server: a client would have sent the edit to the server, and any of the words in that edit that had greater version numbers than the server knew about would have won and altered the document. So I have to have that logic here: if the version number is greater in the edit, then copy the version over to the document. I just don’t necessarily apply the edit itself unless undoCount == 0 && redoCount == 0 && undoNext, as covered previously.

  • All other times I see the frame will correspond to undo and redo events. For these, every word in the frame’s edit should be looked up in the document state, and whatever version is found there should be bumped by 1. If you think back to the design idea for writing to disk consequences and not deltas, this is the same as keeping track of the highest version number seen for each word. It’s just here, I can derive it from the event log, and not have to write it out to disk explicitly.

Imagine a client edits word 5ce63g, changing its version from 4 to 5, and its letters from dog to cat. It sends this edit to the server, which appends it to disk. The client then sends an undo message to the server. The server appends this to disk too, and then recalculates the document. The list of frames ends with two pointers to the same frame: firstly for this edit, and secondly for its undo.

The first time the frame is encountered, word 5ce63g will exist in the document state, with the letters dog. I copy the version from the frame’s edit, thus making sure word 5ce63g is at version 5 (or greater) in the document state. In the frame, undoCount == 1, so the edit is not applied. I decrement undoCount to 0, and set undoNext to false. The next (and final) item in the list of frames is this same frame again, corresponding to the undo event. I bump the version of word 5ce63g in the document state by 1. Although undoCount is now 0, undoNext is false, so the edit is never applied, which is what we want: the edit was undone!

The final document state contains word 5ce63g at version 6 (or greater), with the letters dog. I can send this to all the clients: the version is greater than any version I’ve received from the clients, so they will apply it, and the letters are dog, not cat. I.e. the edit has been correctly undone.

Building the list of frames

This list of frames seems to do the job, but how do I build it? If I walk through the list of deltas, then for each delta, I need to figure out what the current frame is, and append that current frame to the list of frames.

Assume that each frame has previous and next pointers to other frames. These will not relate to the order of frames in the list of frames. They simply help with determining the current frame.

In pseudo-code:

for delta in eventLog:
  if isEdit(delta):
    frame := newFrame(delta) // undoCount = 0, redoCount = 0, undoNext = true
    frame.previous = currentFrame
    currentFrame.next = frame
    currentFrame = frame

  if isUndo(delta): // it's an undo of the current frame
    currentFrame.undoCount += 1
    currentFrame = currentFrame.previous

  if isRedo(delta): // it's a redo of the next frame
    currentFrame = currentFrame.next
    currentFrame.redoCount += 1

  frames = append(frames, currentFrame)


It all seems to be falling into place: writing an event log is nice and simple, and hopefully fast and efficient. For edit events, I can send the event out to all the other clients and that’s it. For undo and redo events I can’t send those out to the clients because the clients don’t have the edit history of the document, and so the server has to calculate what the document has become. I’ve covered the algorithms necessary to do this, which rely on replaying the event log … from the dawn of time.

Yeah. That might not scale for a large document. I’m quite happy to believe using a little more CPU to calculate the document from the event stream is preferable to storing lots of extra data and state on disk. But it’s only a matter of time before the event log becomes so long that replaying it is slow. At least I only have to replay it on undo and redo events: I would expect these to be rarer than normal edits. Even so, this is something I’d like to fix.

Replaying the event log, and applying the calculated frames to the document state, starts with an empty document state. There’s no reason it has to be empty. From time to time, I could write out the complete document, creating a check-point. I could then load that, and only have to process events that happened after that check-point was created. This would put a bound on the maximum number of events that would need to be replayed, thus guaranteeing acceptable levels of performance.

Well, almost. Consider the server creates a check-point, and then the very next event it receives from a client is an undo. I do not want the undo to undo the check-point: that would result in an empty document! No, instead, I have to ignore the check-point, and look further back in the event log to identify the normal edit event which matches up with this new undo.

Creating the list of frames now changes: instead of walking forwards from the dawn of time, I walk backwards from the most recent event. If I reach a checkpoint, and none of the frames I’ve created so far are missing their edit event, then I can stop, and rely on that check-point: I can start with the document state as defined by the check-point, and interpret the frames in the normal forwards direction. But if there are some frames which are missing their edit event, then I must skip past this check-point and keep heading back in time. This explains why I’ve not been linking to the real code so far: it runs backwards! This also gives me some clues about when it’s appropriate to try to create a check-point: it’s only sensible just before writing out an edit event. Creating a check-point before an undo or redo event is appended to the event log is counter-productive because that check-point will always have to be ignored and skipped over.

If your cat falls asleep on your keyboard and is pressing Ctl-z, then your entire document will be undone. This design does mean that as more and more of your document is undone, each undo will take longer and longer because it has to replay ever more of the event log. But I think that’s a reasonable trade-off: when editing a document, I often find myself undoing and redoing changes I’ve made recently (and its those events that have now been optimised to some extent by the addition of check-points), but it’s rare that I (deliberately) undo hours and hours of work. So I believe this design should be acceptable.

Code and Technology

In terms of writing to disk, there are lots of options: my requirements are very light. I could use individual files and append directly using basic file operations; one file per document seems reasonable. If I put the check-points within the event log then I’d need a way of indicating that an event is a check-point or not; I could extend the bebop protocol to do this conveniently, or introduce some extra random byte to do this inconveniently. I would want to be careful about making sure fsync and friends get called correctly and the data really does get written to disk. I’d probably attempt a simple length-prefix encoding.

At the other end of the spectrum, I could reach straight for an SQL store. But I am not a fan. Every time you have to transcribe some data structure from one language to another, you hit problems: little inconsistencies that bite you hard later on. So far I’ve been very lucky that the modelling of the document in TypeScript, and Bebop, and (it turns out) in Go is all very uniform. But I don’t want to push my luck and attempt it in SQL. Modelling data structures and objects in general in SQL is always a disaster, in my opinion; ORMs group data together in tables in unnatural ways that do not reflect the structure of the data at all. Also, from a testing and deployment point of view, it massively complicates things: it’s a huge extra moving part which needs a lot of careful choreography.

Instead I’m picking a half-way house: bbolt. It’s an embedded key-value store, which supports transactions. Keys and values are simple byte arrays. Perfect for what I need: the values will be the bytes I receive from the client off the WebSocket, and bbolt supports auto-incrementing sequence numbers which will help with the keys. It also supports cursors which will allow me to easily walk backwards from the end of the event log. It’s just a Go library so it compiles in to the binary: no external moving parts to mess about with. Simple semantics, and easy to test with and deploy.

Aside: I did also try to use badger which may well have better performance than bbolt. However, I found the quality of the documentation and API design to be much poorer than bbolt, so I abandoned badger.

The code that loads the event log from disk should have some resemblance to the earlier pseudo-code. Some of the logic is a bit different because it’s working backwards over the event log, not forwards. Building the document from the frames looks fairly different. One of the reasons for this is that the server always keeps track of the current frame. This is very useful as it allows me to easily check whether an undo or redo message I’ve just received is legal or not, by looking at the previous and next fields. But it means that the logic about updating undoCount and redoCount etc is slightly different (though equivalent, hopefully!) than how I’ve introduced it in this prose, because object management is a little different.

Filtering edits for words that have greater version numbers should seem very similar from last week’s article, as should performing garbage collection. Because Go supports structural equality, I can have WordIds as keys in a map and not have to convert to strings like I had to in JavaScript. Although I’ve not discussed it, the server does filter edits it receives from clients before writing to disk or sending to other clients. This means that if the server receives an ancient edit from a slow client, it may well not need to write anything to disk or send any update to any client if all the edited words in the received message are already out of date. It’s an easy optimisation to make and makes sure that use of disk and network bandwidth is minimal.

I mentioned much earlier that even if I store consequences on disk, I could still send deltas to the clients by keeping track of what’s already been sent to them. Well, the server recalculates the document after an undo or redo event. Rather than send the entire document to the clients, I do keep track of what the server currently thinks it has sent to the clients (i.e. the current state of the document), and so I can calculate the difference and continue to send only deltas to the clients.

Creating a check-point is not much more than appending the current state of the document to disk. And when a new client connects, the server can send a check-point of the document state to the new client as its part of the initial synchronisation with the client. This is equivalent to the client marking all the words it knows about in the document as dirty and sending all of those up to the server, which I discussed last week.

There are a very large number of ways of skinning this particular cat and safely storing the necessary data on disk. I have no doubt there are many other solutions that are better under various metrics, from simplicity to performance. Hopefully not robustness though, as I have done some solid testing on this! That said, the server currently completely trusts the changes any client sends, which could cause problems: there’s nothing to stop the client sending changes that make the linked list of words turn into a cycle. That could cause significant problems, for example the current GC algorithm would never terminate if this happens. Ooops!

If you come up with your own design and want to share it with me please do; I’m curious about other designs. You can reach me by email at matthew at this website’s domain name, or on twitter (DMs are open).

For anyone having a gander at the server code, you’ll spot it seems to be written using some sort actors library. I think that’s going to be the subject of next week’s article: I’m a big fan of using actors for concurrent programming, and I’d like to spend some words extolling its virtues (and also write a tutorial for my actors library)! After that, it’s on to testing: I’ve already written some integration tests and some soak tests. Hopefully I’ll add some fuzz tests before this series comes to a close.