Let's build! A distributed, concurrent editor: Part 2 - Protocols

In this series:


In Part 1 I set the scene. Alice and Bob haven’t fully specified the system, but they’ve certainly given me some requirements:

  • The exact behaviour of concurrent edits is left for me to define, but I’m not allowed to solve it trivially (i.e. just do nothing), and the key requirement is that all users should end up seeing the exact same document fairly quickly, once they’re connected to the server.
  • They want the editing history to be stored so that undo/redo will work even if they disconnect and reconnect.
  • They want to be able to edit the document locally even when disconnected from the network. Some sort of synchronisation will need to happen when they reconnect to the network.

In this article I will discuss how to build a protocol around changes to the document, and how to model the document. I won’t yet talk about how I’m going to store those changes on disk: that’ll be the subject of a later article. There are many trade-offs here: I don’t think there’s a perfect solution to any of the challenges discussed. Where I’ve made trade-offs, I would like to think they either make the semantics of the system simpler, or the implementation simpler. I really don’t mind using systems that have limited functionality in particular situations, but I do find it very distressing to try and use systems which have complex and difficult to understand semantics. I feel that as an engineer, I need to design and build systems that can be easily understood by the user (along with being easily understood and maintainable for other engineers), even if that means limited functionality in somewhat unusual circumstances.

Aside: recently I came across some articles I’d written five years ago or so. In there, I put forth many opinions, which I now disagree with to a greater or lesser extent. There will be some opinions of mine in this series too. I’m going to try and limit them as much as possible to opinions that I’ve held for a long time, so that if I reread these in several years time, I hopefully won’t cringe too much. However, I’ve no doubt that some ephemeral opinions will get through. So if you find yourself disagreeing with me here, I’m quite sure I’ll be joining you in the fullness of time!

Concurrent changes

I need to understand the consequences of several users making concurrent changes to the document. I am going to assume all users have a copy of the entire document in their browser, and they are all free to make changes to it at the same time. Those changes need to get sent up to the server by some means. The problem to solve is then: “how does the server combine these changes in some sane way?” The server, after receiving and processing these messages, would send updates out to all the clients so that they are kept up to date as much as possible, and can see all the changes being made fairly promptly.

Original text: The fox. Alice edits to: The quick fox. Bob edits to: The fox jumps. What should the server do?

Example 1: How does the server achieve this merging?

In example 1, Alice and Bob start out with the same document. Alice adds the word quick in the middle. At the same time, Bob adds the word jumps at the end. At the time at which both Alice and Bob make their changes, they are unaware of the changes the other person is making. There are three acceptable outcomes here:

  1. Only Alice’s change survives, Bob’s change is lost, and the document reads "The quick fox"
  2. Only Bob’s change survives, Alice’s change is lost, and the document reads "The fox jumps"
  3. The server merges both changes together, and the document reads "The quick fox jumps"

It would be nice to achieve the final option here. But merging and preserving both edits is not always the right thing to do: looking beyond merely technical issues, a well written document will need to take into account all sorts of different aspects of the language being used: spelling, grammar, tense, voice and narrative, structure, introducing concepts before using them, and so forth. Even if I could make it so that every change is preserved and merged together, the document would still need careful proof-reading and editing: even perfectly safe-looking changes to different parts of the document can reduce the entire document to gibberish.

Original text: The fox. Alice edits to: The quick fox. Bob edits to: The brown fox. What should the server do?

Example 2: Insert words at the same position

In example 2, it’s much less clear what should be done. Both Alice and Bob have inserted new words at the same place. For the server to preserve both changes would require the server to decide which word comes first: is the final document "The quick brown fox" or "The brown quick fox"? English speakers will (I believe) have a natural preference for "The quick brown fox" because there is some rule somewhere about ordering different types of adjectives. But I don’t want to start attempting to build natural-language processing into the server. Whilst I know very little about English, I know much less about every other language.

Original text: The fox. Alice edits to: The foxes. Bob edits to: The fox's. What should the server do?

Example 3: Editing the same word

In example 3, both Alice and Bob make changes to the same word. It seems fairly clear that the server should not try to merge the edits together. Identifying which is the correct one to keep probably depends on future edits. About the best I can hope for is that they spot they’re both editing the same word at the same time, and one of them then takes their hands off the keyboard to see what the other does. But this outcome requires them both to be connected to the server at the same time so they can send and receive updates. If they’re both making substantial edits whilst disconnected, I’m in trouble!

Original text: The red fox. Alice edits to: The fox. Bob edits to: The brown fox. What should the server do?

Example 4: Editing a deleted word

Finally, example 4. Alice deletes a word which Bob edits. Again, it seems very difficult to come up with fool-proof rules as to what should happen here. You could even argue that both edits should be rejected and the document should remain unaltered.

In several of these cases, particularly where there’s any possibility of a user’s edits being discarded, an alternative approach would be to identify the conflicting edits, send them all back down to the users, and get them to resolve the conflict. That then raises the question of what happens if two users both concurrently vote to resolve the conflict in different ways? I don’t think this is so far fetched either: it seems to me quite likely that users are going to have a preference for their own edits! I’m going to steer clear of this approach: there’s more than enough complexity to deal with even in the simpler case of sometimes discarding a user’s edits. However, if you’re interested in this sort of approach, take a look at Peritext. I will not be using CRDTs in this set of articles.

I’ll come back to these examples later on when I eventually make some decisions about what an edit really looks like. But for now, I want to turn my attention to undo and redo - this requirement has the greatest number of unknowns and the greatest risk, so I want to develop an understanding of how it’s going to work.

Undo and Redo

Alice and Bob have specified they want undo and redo functionality, and they want to be able to connect to the server and still be able to undo (or redo) edits made by a previous connection. But what does undo/redo even mean for a distributed concurrent editor? In particular, when a user wants to undo or redo, what are they undoing, or redoing? Just their own changes? Each other’s changes? What happens if one of them presses undo whilst the other is making an edit?

Although I’ve not yet decided on the details of an edit event, either as communication between clients and the server, or as stored on disk, I want to think about whether the protocol needs to distinguish between edit events, and undo or redo events. Maybe I can have a single edit event type: maybe the clients could calculate the effect of the undo or redo key-press but then communicate the effect of that key-press as an edit?

The history of edits

In the following examples, I talk about edit events being adding entire words. I am aware that this is not necessarily the case - by far the majority of editors I know treat individual characters as edit events rather than complete words. Just for the sake of illustration, it seems easier to talk about entire words.

Consider the following sequence of events:

  1. Edit 3: Alice appends the word jumps. Document is "The red fox jumps"
  2. Edit 4: Alice appends the word over. Document is "The red fox jumps over"

At this point, if I draw out the list of edit events, I would expect something like this:

A list of edit events, showing edit 2, then edit 3, then edit 4

Example 5: All edit events applied

The arrow indicates the current state of the document: the document has all these edits applied, and any additional edits would add to the end of the list.

  1. Undo: Alice presses undo. Document is "The red fox jumps"
  2. Undo: Alice presses undo. Document is "The red fox"

With these two undos processed, I would like the list of edit events to look like this:

A list of edit events, showing edit 2, then edit 3, then edit 4, but now edits 4 and 3 have been undone.

Example 6: Edits 3 and 4 still exist, but are not applied any longer

Edit events 3 and 4 still exist. But they have been undone, and so the arrow now points to a position right after edit 2, indicating that edit 2 (and probably preceding edit events) are applied, but 3 and 4 are not. I have to remember what 3 and 4 are in case the user presses redo. But, if the user makes an edit at this point, that edit destroys any memory I have of edits 3 and 4:

  1. Edit 5: Alice appends the word trips. Document is "The red fox trips"
A list of edit events, showing edit 2, then edit 5, which has replaced both edits 3 and 4

Example 7: Edits 3 and 4 have been destroyed because of the new edit 5 event

With the edit 5 event added, edit events 3 and 4 have now been forgotten, and the arrow is right at the end, after edit 5.

  1. Redo: Alice presses redo. Document should not change as the redo is invalid.

There is nothing to redo. I don’t think there’s necessarily much need to signal an error to the user: most document editors don’t give the user an error for trying to undo or redo when there is nothing to undo or redo, though you frequently see in the menu that one or both of those functions are greyed-out and unselectable.

What can I learn from this? Well, what would happen if edit 5, rather than appending the word trips, instead appends the word jumps? If these edit events have no other identifying information, then edit 5 would be indistinguishable from edit 3. That would mean it would be impossible to tell whether this was a new edit (thus destroying the memory of edit 3 and edit 4), or whether it’s a redo of edit 3 (in which case we should advance the arrow past edit 3 and continue to remember edit 4). So although this hasn’t given a clear answer on whether I need to explicitly represent undos and redos as distinct from edits, it seems to be the case that if I do only have edit events, they’ll need to carry some sort of unique identifier, so that the history of edits can be maintained properly.

Concurrent edits and undos

Earlier I spent some time thinking about the effect of different types of concurrent edits. What happens when I add undos and redos into this mix?

Original text: The fox. Alice edits to: The red fox; and then to: The red fox jumps. Bob receives: The red fox; and then wants to undo. What should the server do?

Example 8: Undo concurrent with an edit

Imagine Alice makes an edit to the document. She changes "The fox" into "The red fox". Bob receives this update and so sees on his screen "The red fox". Bob now presses undo. At the same time, Alice is editing the document further and appends the word jumps. What should happen? What is being undone?

You could argue that Alice’s second edit, appending the word jumps depends on her first edit, which inserted the word red. So Bob’s undo, if it undoes that inserted red, creates a big problem: how can Alice’s appended jumps still exist without the red? From Alice’s point of view she finishes typing in jumps before she sees any effect of Bob’s undo. It could be surprising to Alice if Bob’s undo removes red but jumps survives, given that Alice would see red disappear after she’s finished typing jumps.

You could argue that Bob’s undo should undo everything that Alice did subsequently too, so that the document reverts back to just "The fox" for everyone. But you could equally argue that it would be weird for a single undo key-press to cause potentially many edit events to be undone in order to rewind the list of edits back to the view that Bob had at the time Bob pressed undo. If I adopted that approach, it would mean that the user with the worst network connection and the slowest computer, when pressing undo, would potentially undo a lot of changes, all-but-one of which they are not aware of. This seems dangerous.

This is the very nature of shared state in a distributed system: different people have different frames of reference and see events in different orders. What is apparently simple and unremarkable in a non-distributed single-user scenario often becomes really awkward and non-obvious in a distributed setting.

There is one user that I could bless with being the canonical source of truth for the state of the document: the server. I could choose that the protocol is going to have distinct undo, redo, and edit events, and I could choose that when a user presses undo, the client sends to the server a plain undo message, with no further identifying information at all. The undo (or redo) message acts purely on the document as the server sees it. This is a bit of a technical solution that satisfies no one but the software engineer proposing it: in the above example, the server could receive the undo message and the appended jumps edit message in either order:

  • The server receives Alice’s appended jumps message first, then Bob’s undo message. This means the server undoes Alice’s appended jumps. This might make sense to Alice, but to Bob, well he ends up seeing no change at all (or maybe the word jumps briefly flickers onto his screen and then disappears again as the server informs Bob that jumps got undone).

  • The server receives Bob’s undo message first, then Alice’s appended jumps message. In which case red does get undone, as Bob expects, but Alice’s jumps could survive, depending on the details of how I model edits: whether I model any sense of dependency. As already mentioned, this gives a potentially surprising order of events from Alice’s perspective. Thinking back to the history of edits, if Alice’s jumps comes after the undo message, then that jumps destroys the memory of red. So no one could press redo and get red back.

One advantage of this approach is that it means the undo and redo messages don’t have to make any claims about what they’re trying to undo and redo: it’s entirely left to the server to decide. This in turn means the clients don’t have to know and maintain the edit history of the document. When a client connects to the server, the server will have to send the client the entire document. That’s unlikely to be a huge amount of data. But the complete edit history could be a lot bigger. If undo and redo messages had to indicate which edit event they’re trying to undo or redo, then the client needs to know the edit history (potentially just a list of identifiers if the client is going to wait for the server to calculate and communicate what the updated document is going to be). Yes, I could make it load the edit history lazily in the background and all that stuff, but that would make the protocol a fair bit more complex which I’d like to avoid. The same is even more true(!) if the client itself calculates the effect of the undo (or redo): the client would have to know the edit history in order to be able to do those calculations (and would need the full edit history, not just identifiers of edits), but on the up-side it could send to the server only edit messages (as shown earlier, they’d need to have appropriate identifiers, but that’s no great problem).

So this is what I’ve chosen to go with: the server is the canonical source of truth for the state of the document. Clients send simple undo and redo messages to the server (and edit messages). Yes, this can potentially cause odd behaviour when you have concurrent editing and undo/redo-ing going on, but there doesn’t seem to be any obvious correct behaviour to me, and taking this approach keeps the client quite a bit simpler: it only needs to know about the current state of the document, and not its history. This therefore reduces the amount of shared distributed state, which is always advantageous: less distributed state means less chance for divergence and confusion.

What it does require is that I make sure a client’s stream of edit, undo and redo messages arrive at the server in the order in which the client sent them. It would be disastrous if a client sends an edit, followed by an undo, but somehow the server receives the undo first. This is why REST, and potentially a single HTTP/TCP connection per message is not appropriate: from a browser, if I attempt to make two HTTP calls to the same server in quick succession, I don’t believe I have any guarantees as to which will connect and complete first. This could allow messages to arrive at the server in a different order than the client intended. I could chain all of these calls one after another so that the next message doesn’t get sent until the previous one completes, but I find this approach highly distasteful:

  1. There’s nothing to be gained from waiting for a response from the server. The UI needs to be responsive so there’s no way I’m going to delay putting the new characters on the screen until after I’ve heard back from the server. I want to be able to send these updates to the server and safely assume they’re going to get there and be processed correctly. I want the server to send updates asynchronously to all connected users. For this, there’s no need to treat differently the user who just sent in some edits, so there’s no need for any sort of synchronous response to sending an edit to the server.

  2. TCP gives me all the properties I need: data is received in the order it was sent, lost packets get automatically retransmitted, and it’s perfectly acceptable to send messages and not expect any sort of application-level reply. I have no desire to reimplement all of that on top of HTTP to get around the fact that HTTP was not designed for this use case.

So I’m going to use a WebSocket. The client will connect to the server and open a WebSocket, and then it’ll use the WebSocket to send its stream of messages, and it will expect to receive from the server a stream of messages which are updates to the document that the client is displaying. I can safely use this WebSocket to send messages and know that they’ll be received in the order in which they were sent, I don’t need to worry about any sort of reply, and if it all goes wrong and the socket gets closed for some reason, I can be informed of it, and can then attempt to reconnect and synchronise with the server.

There is a wrinkle though: because the client doesn’t have the edit history, it must wait for the server whenever the user presses undo or redo. But this means that for Alice and Bob, when they’re working on documents whilst travelling, on aeroplanes and so forth, there is no undo or redo functionality when they’re disconnected from the network. This might be a problem. Maybe for version 2 I’ll come back and revisit this. The big challenge is that if you have two users, both doing work offline, and amassing locally a list of edit, undo and redo events, and then they manage to reconnect to the server, how do you safely and correctly merge those lists of events together?

The nub of it is this: our intuition about how editing works, and how undo and redo works, is all based on our mental model of a simple list of edits. But in a distributed setting, you naturally have a tree of edits (or maybe a lattice), not a list. So flattening that tree into a list in some sensible way becomes a major focus, and it seems to me there’s no easy and good way of doing so.

So far, I seem to have figured out more about what undo and redo look like than the edit events themselves. That’s the next task then: what is an edit event, and how does it work? At the start of this article I discussed a number of different types of concurrent changes; how for some scenarios it feels like certain behaviour is expected, and for other scenarios it’s much less clear if there is any particular expectation. The edit event, how it’s structured and what it contains, needs to be able to meet those expectations.

KISS: Send the whole document

Starting with a mantra of keep it simple, stupid, what goes wrong if the client sends the whole document to the server on every edit?

Well, in truth, quite a lot. Putting to one side arguments about inefficient use of the network, I come back to example 1:

Original text: The fox. Alice edits to: The quick fox. Bob edits to: The fox jumps. What should the server do?

Example 1: How does the server achieve this merging?

The server would receive a document from Bob, "The fox jumps", and a document from Alice, "The quick fox". How is it going to do the merge? The server can only hope to merge these different changes together if it knows what the changes are. It could be that the edit message contains some way to identify what the previous version of the document was that the client knew about, and so the server could calculate the difference. This seems really unpleasant though: the client knew what the difference was: it knew where the cursor was and it knew what the key-presses were. But it’s thrown away that information to calculate the new version of the document. It’s now sent the new document to the server, and is asking the server to reverse all those calculations in order to figure out what the change actually was. So I really don’t like this approach; it feels very ugly to me: throwing information away only to try to recover it again later on. Oh and it’s also a waste of network bandwidth.

Character coordinates

As I just mentioned, each client knows where its cursor is when a key is pressed. So how about an edit message is something like this:

type EditMessage struct {
	Position  uint
	Character rune
}

(I know: just 6000 words into this series and I’m already showing code snippets - too soon right?!)

Yep, this idea is rubbish too. In the previous idea I had the problem of having to know what is the previous version of the document so that I could then calculate the difference. With this idea I have pretty much the same problem: let’s say the server receives an edit message that says “at position 15 insert the letter j”. Which version of the document do I use to learn where position 15 is? Because if the server received this message from Alice, it could be she was editing an older version of the document and she’s not yet seen the changes from Bob that the server has just received and relayed on. So her position 15 could be quite different to the server’s position 15 and quite different from Bob’s position 15.

I think that any approach that relies on knowing some sort of distance from the start (or end) of the document is going to cause problems: it’s forcing me to treat the whole document as one atomic unit. I want something more fine-grained.

A list of words

Looking back at all of the early examples in this article, it seems to me that it’s defensible to ensure that if two people are editing different words then their edits should be merged. But, if they’re editing the same word then it’s not the end of the world if one of them just loses their edits and the other wins.

So how about I treat a document as a linked-list of words. I can have an edit message a bit like this:

type Word struct {
	WordId       WordId
	Letters      string
	PreviousWord *WordId
	NextWord     *WordId
}

type EditMessage []Word

I need to be able to express that an existing word has been edited, which means I need to be able to give every word a unique identifier so the clients and the server all understand what is being modified. And then it’s a linked list (actually a doubly-linked-list but it turns out not to matter in this case), so I need to indicate the word before and the word after. They are pointers so that the very first word can have a nil PreviousWord to indicate the start, and the very last word can have a nil NextWord to indicate the end. All the other words will have non-nil PreviousWord and NextWord fields.

To insert a new word between existing words, I need to provide the new word, but also provide updates for the word before and the word after: their content hasn’t changed, but their NextWord and PreviousWord fields will have changed to refer to the new inserted word. So that’s why EditMessage is a list of Words and not just a single Word. When it comes to deleting a word, the deleted word itself doesn’t change. But its neighbours get updated to point directly at each other; the deleted word is simply disconnected from the list and can be garbage collected somehow.

With this design, if two people are editing different words concurrently, it should be safe for their edit messages to be processed by the server in any order and there be no surprises: all those edits should survive. There could be some slight surprises if Alice is editing a word and Bob is inserting or deleting a neighbouring word, and there will definitely be some disappointment if they both edit the same word. But, as they both said to me in the previous article, they’re colleagues, they want to collaborate, and so if they spot that their edits are conflicting because they’re too close to each other, they’re going to step back from the keyboard for a moment to see what the other is thinking. At least, that’s the defence I’m desperately clinging too.

So what is this WordId and where does it come from? I could just use a GUID, that would be completely fine. For no particularly good reason, I’m going to use a tuple of a random client identifier and a counter. The idea is that the client, when it connects to the server, gets an identifier (some sort of GUID or just a random number) and then that forms a name-space under which the client can safely generate WordIds, and never have to worry about collisions.

type WordId struct {
	ClientId uint64
	Counter  uint64
}

The only possible advantage of this over just using GUIDs everywhere is that you could look back and see which clients created which words in your linked list. But in truth, that’s really unlikely to be an important detail.

Ordering edits

So far, there is no concept of a version on a word or a document. Alice could be editing a word. Her computer is very slow and her network connection is terrible. Bob is editing the same word. His connection is much faster and reliable. So Bob is sending off lots of edit messages to the server and the server is accepting and processing them, and relaying them through to Alice. But Alice is receiving them very slowly, and updates from Alice are getting through to the server very slowly. After a while, Bob has finished that sentence, paragraph, or chapter, and moves on. Eventually, an edit message from Alice gets through to the server. Now this message was sent by Alice a long time ago, possibly before any of Bob’s edits. But it’s only just got to the server. The server applies this ancient edit without fuss, applying it as if it’s the latest, most recent, edit. The semantics here are quite arguably wrong; what I’ve described is often called “last write wins”: literally the last edit (or write) that the server receives for any given word, wins and replaces all previous versions of that word. This is the case even if the order in which the server received the edits is provably different to the order in which these edits were sent, from different clients.

In my experience, a lot of building reliable distributed systems comes down to being able to prove whether or not one event happened before another. Importantly, for any two events X and Y, you may not be able to prove X happened before Y, or Y happened before X: all you can say is that they’re concurrent events. In that situation, sure, the server is free to do whatever it wants and solve the question of which concurrent edit wins by some arbitrary means. But if it can prove that the edit it received happened before other edits that it’s already received and processed, it would be preferable if it ignores the newly-received-but-ancient edit.

To achieve this, I’m going to use Lamport Clocks (or something very similar) to add the concept of a version to each individual word. Lamport clocks are pretty simple. As the Wikipedia page says:

  1. A process increments its counter before each local event (e.g., message sending event)
  2. When a process sends a message, it includes its counter value with the message after executing step 1
  3. On receiving a message, the counter of the recipient is updated, if necessary, to the greater of its current counter and the timestamp in the received message. The counter is then incremented by 1 before the message is considered received

My Word struct now gains an extra field:

type Word struct {
	WordId       WordId
	Version      Version
	Letters      string
	PreviousWord *WordId
	NextWord     *WordId
}
  • When the server receives an edit, for each word in the edit message, it asks the question: “Is the version of the word I’ve just received greater than the version I already know about?”
    • If the received word’s version is greater, then it processes the edited word, and remembers the new version for the word.
    • If the received word’s version is not greater, then it ignores the edited word.
  • The client, when it receives updates from the server, follows exactly the same logic.
  • Whenever the client edits a word, it increments the version.

Aside: in step 3 of the quoted section from Wikipedia above, it says that “The counter is then incremented by 1 before the message is considered received”. I am not doing that here. By incrementing when a message is received, you’re able to make arguments about the order in which messages have been received, even if those messages themselves carry the same version number. This is not important for me: all I want is to be able to make arguments about the order in which messages were sent. So that’s why I am only incrementing version numbers when sending. Additionally, Lamport Clocks are often used to give a version number to a local copy of some large piece of state, and messages being sent about carry this version along with instruction to mutate some smaller part of that state. But here, each word has its own independent version, and one word’s version never influences another word’s version.

So what is a Version and how do I order them?

type Version struct {
	Counter  uint64
	ClientId uint64
}

func (a Version) GreaterThan(b Version) bool {
	return a.Counter > b.Counter ||
		(a.Counter == b.Counter && a.ClientId > b.ClientId)
}

func (self *Version) Bump(clientId uint64) {
	self.Counter++
	self.ClientId = clientId
}

Whenever a client edits a word, it calls Bump() on the word’s Version field.

Version is rather similar to WordId, except the Counter and ClientId fields are swapped around: for the WordId struct, the ClientId is the more significant element, but for a Version, the Counter is more important. Recall that the ClientId is unique to every client. This means the server will never see edits carrying the same Version on the same Word from different clients: at a minimum, the ClientId field would differ.

The effect is that if Alice and Bob have both received the same version of the same word from the server, and they both edit it at the same time, they will both call Bump(), which would increment the Counter field to the same value. But it would also set the ClientId field to different values. When the server receives these edit messages from Alice and Bob, it will be able to order them; yes, an arbitrary order based on which of them randomly has the higher ClientId, but nevertheless, an unambiguous total order.

In the earlier scenario of Alice using a very slow computer with a terrible connection, the edit message she sends that very slowly makes its way to the server will likely have a lower Counter number than in the rapid stream of edits the server has already received from Bob. So the server will be able to prove that the edit message that it’s just received from Alice was based on an earlier, and now outdated, version of the word. And then the server will ignore this ancient edit message and preserve Bob’s work.

In general, I always prefer to use logical clocks and counters to wall-clock timestamps. I am always fearful about wall-clock timestamps: it seems to me to be very difficult to be sure you can trust a clock on a computer, and wall-clock timestamps give a total order, which may disguise truly concurrent events. Logical clocks and counters are often just as easy to work with, and I find it’s easier to understand when they’re giving you an ordering you can really rely on, and when they’re telling you events happened concurrently and no strict ordering can be proven.

Specifying the protocol

I’ve chosen to use bebop to specify the protocol for this project. In the past, for specifying protocols, I’ve used JSON, protobuf, Cap’n Proto, and others.

There are several reasons I don’t like JSON for this purpose:

  1. There is no schema from which you generate language bindings. This means if your project uses several different languages then you have to maintain and update all bindings manually and there’s no single source of truth. I’ve seen projects where different teams have updated notionally shared JSON objects in incompatible ways.
  2. About the only useful types that are supported are objects/maps, lists, and strings.
  3. On the wire it’s inefficient: it’s self describing and so includes all the field names all the time.

Cap’n Proto I’ve used before and seems OK, although last time I tried to use it, language bindings for JavaScript were either non-existent or problematic, I can’t quite remember. Protobuf always drives me up the wall: for some reason I find the APIs of the language bindings frustrating to work with.

Bebop has good bindings for TypeScript and for Go, which is what I need for this project. It allows me to define the protocol in a single file and generate language bindings. It has a useful range of supported types and is efficient on the wire. So far, the APIs that get generated haven’t caused me any anguish. Also, I just wanted to try something new.

The complete protocol is given by this single file:

struct WordId {
	uint64 ClientId;
	uint64 Counter;
}

struct Version {
	uint64 Counter;
	uint64 ClientId;
}

struct Word {
	WordId WordId;
	Version Version;
	string Letters;
	Links Links;
}

message Links {
	1 -> WordId Previous;
	2 -> WordId Next;
}

message Message {
	1 -> Word[] Updates;
	2 -> bool Undo;
	3 -> bool Redo;
}

The clients connect to the server, creating a WebSocket. They send Messages to the server. The server will send Messages to the clients too, but the server will only ever send Updates: it’ll never send Undo or Redo messages to the clients: I’m just using Message as the general purpose envelope for all messages between client and server, in either direction.

Boot-strapping

There is one final thing to discuss: how to boot-strap a new document. The client will connect to the server, and indicate in some way the document it wishes to edit. The server will send down the current version of the document: a list of all the words (which means we can use the Message Updates field for this). From that, the client can build the linked list and display all the words in order.

But what if the document is new and doesn’t exist? And what if the client wants to create a new document even when it’s disconnected?

If I hard-code the WordId of the very first word in each document, and if the algorithms ensure the very first word can never be deleted, and is always the first word (maybe I ensure the first word never has any content and is in fact never rendered or editable), then I can make this work. For both the client and server, I’ll fix the first word’s WordId to be ClientId = 0, Counter = 1. For the clients, they use an initial version of Counter = 1, ClientId = client.Id, and for the server it’ll use an initial version of Counter = 0, ClientId = 0. This means that the initial version of the first word on the client will always be greater, and thus win, over the server’s initial version. Then, if the client does try to create a new document whilst disconnected, when it reconnects, both client and server will have the same WordId for the first word, and the normal algorithms can proceed, checking the versions and accepting or ignoring edits to words as discussed previously.

If I didn’t do this, consider the scenario where two clients, both disconnected, create the same new document. Their documents are a list of Words, as normal, but none of those words have any WordId in common. Eventually they both connect to the server and send up their list of Words. The server is now faced with the following problem: it’s received two, valid, but completely disjoint lists of Words. How on earth does it decide which to adopt as the real document? By hard-coding the WordId of the first word, I ensure there’s always one word in common and so the server is never faced with completely disjoint lists of Words, and so can always start by comparing the version of that first word, and take it from there.


Next week, I think I’ll cover most of the client code. It’s all TypeScript; I’m not using any frameworks like React - I don’t think there’s any need for something this small. So next week should be a bit more code, and a bit simpler. After that I’ll move to the server and discuss how I’m going to store this edit history on disk, in such a way that I don’t get any slow memory leaks over time.