Have you ever been to the Arctic Circle?
I’ve recently returned from a week in Ivalo, northern Finland, with my family. Ivalo is 160 miles north of the Arctic Circle.
We went snowmobiling, ice fishing, husky dog sledging and reindeer sledging. The huskies were amazing – small, very strong dogs whose only wish in life is to pull a sledge through the beautiful snow-capped forest at an impressive speed.
The scenery was stunning. And cold.
I took two jumpers with me. Seemed sensible. It occurred to me that if my jumper got wet somehow, I wouldn’t be able to use it until it dried out. You could call it “having a spare”, like the spare tyre we used to carry in our cars.
Or you could call it “being prepared”, à la Boy Scouts.
My point is that we should recognise that sometimes the stuff hits the fan and things don’t work out as we’d hoped.
It’s the same in life and in business. Especially when you bring technology into it.
In the 18 years we’ve been supporting Linux, we’ve seen businesses that are vulnerable in that way with their systems. They might have multiple servers, but there’s no resilience, no spare.
It’s tempting to think that a spare would be too expensive, that completely replicating a working system “just in case” is a poor use of resources.
I’d agree.
Forking out a bit under £100 for a spare jumper – which I can make use of anyway – is one thing.
Forking out a few tens of thousands of pounds – or more – to completely replicate your IT systems is something else altogether.
Eliminating Single Points Of Failure
The truth is that you don’t need to fully replicate your working systems to be able to eliminate the “SPOF”, the Single Point Of Failure.
Let’s consider a very simple example, one which some of our clients have running live today.
The key components of a simple, contemporary website will be some kind of web server software (typically Apache or NGINX) and a database (typically MySQL/MariaDB or PostgreSQL).
Those components can both run on a single server, provided the website isn’t too busy.
But, if that server fails, there’s no website.
If that server needs to be taken down for some kind of maintenance, there’s no website.
What’s the solution?
A simple fix is to run two servers together with some clustering software, such as Pacemaker with Corosync. The clustering software is configured with some rules:
1. Apache must be running on one server
2. MariaDB must be running on one server
3. Ideally, Apache and MariaDB should NOT run on the same server
Normal operation would have Apache running on, say, server A and MariaDB running on server B.
Let’s assume server A fails. The cluster software realises that rule 1 has been broken. Rule 3 says that server B should not be used for Apache, but it’s the only option left. Rule 3 isn’t mandatory, only a preference, so the cluster software brings up Apache on server B.
On a day-to-day basis, the workload is split across two servers. In the event of a hardware failure, the workload seamlessly shifts to the remaining server.
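To make the difference between a mandatory rule and a preference concrete, here is a minimal Python sketch of the placement decision described above. It is a toy model only, not how Pacemaker or Corosync actually work internally: the function name `place_services`, the server names "A" and "B", and the `preferred` mapping are all made up for illustration.

```python
def place_services(services, live_servers, preferred):
    """Toy placement model (hypothetical, not Pacemaker's real algorithm).

    Mandatory rule: every service runs on exactly one live server.
    Preference: avoid putting two services on the same server if possible.
    """
    if not live_servers:
        raise RuntimeError("no live servers: the mandatory rules cannot be met")

    placement = {}
    for service in services:
        used = set(placement.values())
        # Try the service's usual home first, then any other live server.
        candidates = sorted(live_servers, key=lambda s: s != preferred.get(service))
        # Preference: pick a server that isn't already hosting another service.
        free = [s for s in candidates if s not in used]
        # If no free server exists, the preference gives way to the mandatory rule.
        placement[service] = (free or candidates)[0]
    return placement


services = ["apache", "mariadb"]
preferred = {"apache": "A", "mariadb": "B"}  # assumed normal layout

# Normal operation: both servers are up, so the workload is split.
print(place_services(services, ["A", "B"], preferred))
# -> {'apache': 'A', 'mariadb': 'B'}

# Server A has failed: Apache moves to B, even though MariaDB is already there.
print(place_services(services, ["B"], preferred))
# -> {'apache': 'B', 'mariadb': 'B'}
```

The point of the sketch is the last line of the loop: the preference is honoured whenever a free server exists, and dropped when keeping the service running matters more.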
That’s a trivial example, but the same techniques may be used when there are three or more servers. In that case, you almost certainly don’t need to double the number of servers. What’s key is that there is some redundancy built in.
What about the cloud?
The situation is a little different with cloud systems. Whilst a failed cloud instance will typically restart – possibly on different hardware – automatically, there are other gotchas to be aware of. We’ll look at those another time.
What you should do next
If you’ve got – or suspect you might have – a single point of failure in your Linux infrastructure, get in touch and let’s talk about how that can be resolved.
If you’re going to Finland, just pack an extra jumper. I recommend cashmere.