Complexity and software engineering

OK, it’s definitely not just the software industry. If you’ve seen the film The Big Short you may remember the seemingly endless secondary markets, adding complexity that led to no one understanding what risks they were exposed to. The motivation there seemed to be purely making money.

Look at the food on your plate at dinner, and try thinking about the complexity of where all the ingredients came from to make that meal. If you have meat on your plate it might have been grown in the same country as you, but maybe not for the food the animal ate. You’re probably also eating animal antibiotics (or the remains of them). Where were they made? How can you start to get a hold on the incredible complexity of the human food chain? The motivation here seems also to be to make money: if you can make the same product as your competitors, but cheaper, then you can undercut your competitors a little, have bigger margins, and make more money. Who cares if it requires enormous environmental damage, right? Products are sure as hell not priced to reflect the damage done to the environment to create, maintain, or dispose of them.

As an aside, have you ever marvelled at how incredible plants are? They literally convert dirt, sunlight, water, and a few minerals, into food. Ultimately we’re all just the result of dirt, sunlight, water, and a few minerals. Bonkers.

Software does seem a little different though. We seem to utterly fetishize complexity, mostly for bragging rights. I’ve certainly been guilty in the past (and I suspect in the future too), of building far more complicated things than necessary, because I can. In a number of cases I could concoct a benchmark which showed the new code was faster, thus justifying the increased complexity of the code, and the consequence of a more difficult code-base to maintain. I definitely get a buzz from making a complex thing work, and I suspect this is quite common. I’ve been told that at Amazon, promotion requires being able to demonstrate that you’ve built or maintained complex systems. Well I love hearing about unintended 2nd-order effects. The consequence here is pretty obvious: a whole bunch of systems get built in ludicrously complex ways just so that people can apply for promotion. I guess the motivation there is money too.

As I say, when building something complex, it can be rewarding when it works. Six months later I’ve often come to regret it, when the complexity is directly opposed to being able to fix a bug that’s surfaced. It can cause silos by creating “domain experts” (i.e. a ball and chain around your feet). I’ve had cases where I’ve had to build enormously complex bits of code in order to work around equally bonkers levels of tech-debt, which can’t be removed because of “business reasons”. The result is unnecessary complexity on top of unnecessary complexity. No single person can understand how the whole thing works (much less write down some semantics, or any invariants) because the code-base is now too large, too complex, and riddled with remote couplings, broken abstractions, poor design, and invalid assumptions. Certain areas of the code-base become feared, and more code gets added to avoid confronting the complexity. Developer velocity slows to an absolute crawl, developers get frustrated and head for the door. No one is left who understands much. With some time and space since that particular situation, it’s now easy for me to sit here and declare that sort of thing a red-flag, and that I should have run away from it sooner. Who knows what’ll happen next time?

I find it easy to convince myself that complexity I’ve built, or claim to understand, is acceptable, whilst complexity I don’t understand is abhorrent.

As an industry we seem to love to kid ourselves that we’re all solving the same problems as Google, Facebook, or Amazon etc. At no job I’ve ever worked do I believe the complexity that comes from use of Kubernetes is justified, and yet it seems to have become “best practice”, which I find baffling. On a recent project I decided to rent a single (virtual) server, and I deploy everything with a single nixos-rebuild --target-host myhost switch. Because everything on the server is running under systemd, and because of the way nixos restarts services, downtime is less than 2 seconds and is only visible to long-lived connections (WebSockets in this case): systemd can manage listening-sockets itself and pass them to the new executable, maintaining any backlog of pending connections.

To me, this “simplicity” is preferable than trying to achieve 100% service availability. I’m not going to lose any money because of 2 seconds of downtime, even if that happens several times a day. It’s much more valuable to me to be able to get code changes deployed quickly. Is this really simpler than using something like Kubernetes? Maybe: there are certainly fewer moving parts and all the nixos stuff is only involved when deploying updates. Nevertheless, it’s not exactly simple; but I believe I understand enough of it to be happy to build, use, and rely on it.

I was recently reading Nick Cameron’s blog post on 10 challenges for Rust. The 9th point made me think about the difficulty of maintaining the ability to make big changes to any large software project. We probably all know to say the right words about avoiding hidden or tight couplings, but evidence doesn’t seem to suggest that it’s possible in large sophisticated software projects.

We are taught to fear the “big rewrite” project, citing 2nd-system-syndrome, though the definition of that seems to be about erroneously replacing “small, elegant, and successful systems”. It’s not about replacing giant, bug riddled, badly understood and engineered systems (to be super clear, I’m talking about this in general, not about the Rust compiler which I know nothing about). I do think we are often mistaken to fear rebuilding systems: I suspect we look at the size of the existing system and estimate the difficulty of recreating it. But of course we don’t want to recreate all those problems. We’ve learnt from them and can carry that knowledge forwards (assuming you manage to stop an employee exodus). There’s no desire to recreate the mountains of code that stem from outdated assumptions, inevitable mistakes in code design, unnecessary and accidental complexity, tech-debt, and its workarounds.

I’ve been thinking about parallels in other industries. Given the current price of energy in the UK and how essential it is to improve the heating efficiency of our homes, it’s often cheaper to knock down existing awful housing and rebuild from scratch. No fear of 2nd-system-syndrome here: it’s pretty unarguable that a lot of housing in the UK is dreadful, both from the point of view of how we use rooms these days, and energy efficiency. Retrofitting can often end up being more expensive, less effective, slower, and addresses only a few problems. But incremental improvement doesn’t require as much up-front capital.

If you look at the creative arts, artists, authors, and composers all create a piece of work and then it’s pretty much done. Yes, there are plenty of examples of composers going back and revising works after they’ve been performed (Bruckner and Sibelius for example), sometimes for slightly odd reasons such as establishing or extending copyright (for example Stravinsky). But a piece of art is not built by a slowly changing team over a period of 10 years (film may be an interesting exception to this). When it’s time to start a new piece of art, well, it’s time. Knowledge, style, preferences, techniques: these are carried forwards. Shostakovich always sounds unmistakably like Shostakovich. But his fifth symphony is not his fourth with a few bug fixes.

At the other end of the spectrum, take the economic philosophy known as Georgism. From what I can gather, no serious economist on the left or right believes it would be a bad idea to implement, and it seems like it would have a great many benefits. But large landowners (people who own a lot of land, not people who own any amount of land and happen to be large) would probably have to pay more tax. Large landowners tend to currently have a lot of political power. Consequently Georgism never gets implemented in the West. So despite it being almost universally accepted as a good idea, because we can’t “start again”, we’re never going to have it. From what I can see, literally the only chance would be a successful violent uprising.

Finally, recently I came across “When Do Startups Scale? Large-scale Evidence from Job Postings” by Lee and Kim. Now this paper isn’t specifically looking at software, and they use the word “experiment” to mean changing the product the company is creating in order to find the best fit for their market – they’re not talking about experimenting with software. Nevertheless:

We find that startups that begin scaling within the first 12 months of their founding are 20 to 40% more likely to fail. Our results show that this positive correlation between scaling early and firm failure is negated for startups that engage in experimentation through A/B testing.

It’s definitely a big stretch, but in the case of software this could be evidence that delaying writing lots of code, for as long as possible, is beneficial. Avoid complexity; continue to experiment with prototypes and throw-away code and treat nothing as sacrosanct for as long as possible. Do not acquiesce to complexity: give it an inch and it’ll take a mile before you even realise what’s happened.

So what to do? I’ve sometimes thought that say, once a month, companies should run some git queries and identify the oldest code in their code-bases. This code hasn’t been touched for years. The authors have long since left. It may be that no one understands what it even does. What happens if we delete it? Now in many ways (probably all ways) this is a completely mad idea: if it ain’t broke, don’t fix it, and why waste engineering resources on recreating code and functionality that no one had a problem with?

But at the same time, if this was the culture across the entire company and was priced in, then it might enable some positive things:

  • There would be more eyes on, and understanding of, ancient code. Thus less ancient code, and more understanding in general.

  • This ancient code may well embody terribly outdated assumptions about what the product does. Can it be updated with the current ideas and assumptions?

  • This ancient code may also encode invariants about the product which are no longer true. There may be a way to change or relax them. By doing so you might be able to delete various workarounds and tech-debt that exists higher up.

Now because I would guess a lot of ancient code is quite foundational, changing it may very well be quite dangerous. One fix could very quickly demand another, and before you know it you’ve embarked upon rewriting the whole thing. Maybe that’s the goal: maybe you should aim to be able to rewrite huge sections of the product within a month if it is judged to be beneficial to the code-base. But of course this requires such ideas to be taken seriously and valued right across the company. For the engineering team to have a strong voice at the top table. And really is this so different from just keeping a list of areas of the code that no one likes and dedicating time to fixing those? I guess if nothing else, it might give a starting point for making such a list.

Unnecessary complexity in software seems endemic, and is frequently worshipped. This, and a fear of experiments to rewrite, blunts the drive to simplify. Yet the benefits of a smaller and simpler code-base are unarguable: with greater understanding of how the product works, a small team can move much faster.