Linux Server Resilience
When people talk of ‘resilience’, they mean the continued availability of a service, particularly under adverse conditions (e.g. hardware failure). A service might be email, a website, a code repository, and so on. For some businesses – for example, banks or airlines – there may be a requirement that services be available 24/7, where more than a few seconds’ downtime would be a major problem. For many of us, the requirements are less onerous, but downtime is still to be avoided where possible.
How do I know whether my systems are resilient?
If any of the following are true, you may have a resilience problem:
- A single system failure could adversely affect your business
- IT problems interrupt your work
- You don’t have effective system monitoring in place
- You regularly find yourself working around the same issues
- There are unplanned service outages
What do I need to do?
Your focus should be on the availability of the service itself rather than the availability of any one server or system that provides it. The options available to improve resilience have changed enormously over the past decade or so. To best understand where we are now, it’s useful to look back on how we got here.
History lesson
Prior to the mid-2000s, businesses would buy or lease hardware to run their applications. Those servers would be hosted either in a server room owned by the business or in a dedicated data centre, and each server would run multiple applications. Even relatively simple applications, such as a WordPress website, consist of multiple components: WordPress needs a web server of some kind (maybe Apache), a way of handling PHP code, and a database. So they’d build a server, set up Apache, PHP and a MySQL database, install WordPress, and the company website would be live.
By and large, that worked. It worked well enough that there are still a huge number of servers configured in exactly that way today. But it wasn’t perfect, and one of the bigger problems was resilience.
Lack of resilience meant that any significant issue on the server would result in a loss of service. Clearly a catastrophic hardware failure would mean no website, but there was also no room to carry out certain kinds of scheduled maintenance without impacting the website. Even installing a security update might necessitate restarting some services, such as a database, which in turn meant a (usually momentary) service outage.
The lack-of-resilience problem was largely solved by building ‘high-availability clusters’. The principle was to have two or more servers running the service – say a website – and configured in such a way that the failure of any one server didn’t result in the website being down. The service was resilient even if the server wasn’t.
Later, a better solution to the resilience problem arrived in the form of cloud computing. You would set up a server instance or two on AWS or Google Cloud, and if one of the instances failed for some reason, it could be restarted automatically.
The cloud solution is much more flexible than the traditional standalone server, but it isn’t the utopia it first appeared to be. Updating the running cloud instances is not straightforward. Developing for the cloud has challenges too: the laptop your developers are using may be similar to the cloud instance, but it is not the same. If you commit to AWS, migrating to Google Cloud is a non-trivial undertaking. And what if, for whatever reason, you simply don’t want to hand over your computing to Amazon, Google or Microsoft?
The best way forwards (for some)
The contemporary solution is much smarter. The components of your application are decomposed into microservices: discrete units that provide just one service of the overall application, such as a database or a web server. Each microservice is run in a container, which is a self-contained software unit that can run anywhere: on AWS, on your own hardware, on Google Cloud, on your developer’s laptop. If your developer is running OS X on her MacBook, you’re running Windows on your office system, and your production system is running Linux, it still works. The container neither knows nor cares what operating system the hardware is running.
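To make that portability concrete, here’s a minimal sketch – using the Docker SDK for Python purely for illustration; the image, container name and port mapping are made-up examples rather than a real deployment. It starts a single containerised service, and the same few lines behave identically whether the Docker daemon is on a developer’s laptop or a production server.

```python
# Minimal sketch: start one microservice as a container using the Docker SDK for Python.
# The image, name and port mapping below are illustrative assumptions, not a real deployment.
import docker

client = docker.from_env()           # connect to the local Docker daemon
web = client.containers.run(
    "nginx:1.25",                    # container image; runs the same wherever Docker runs
    name="demo-web",                 # hypothetical name for this example
    ports={"80/tcp": 8080},          # publish the service on localhost:8080
    detach=True,                     # run in the background
)
print(web.short_id, web.status)      # confirm the container is up
```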
There are tools to help us build resilient, scalable and maintainable infrastructures that are not tied to a specific environment. One component is Docker, a mechanism for running the containers themselves, again in any environment from your laptop to AWS. Kubernetes is a container orchestration tool that can build and manage an infrastructure of containers. It will ensure sufficient containers are running, manage connectivity between them, load balance, manage container upgrades, and much more.
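As a flavour of what ‘ensuring sufficient containers are running’ means in practice, here’s a short sketch using the official Kubernetes Python client (again an assumption for illustration; the deployment name, labels, image and namespace are invented). It asks the cluster for three replicas of a web container; if a container, or the machine underneath it, fails, Kubernetes starts a replacement to get back to three.

```python
# Sketch: declare a Deployment with three replicas via the Kubernetes Python client.
# Kubernetes then works to keep three copies running, restarting or rescheduling as needed.
from kubernetes import client, config

config.load_kube_config()  # use your local kubeconfig (e.g. the one kubectl uses)

deployment = client.V1Deployment(
    api_version="apps/v1",
    kind="Deployment",
    metadata=client.V1ObjectMeta(name="company-website"),   # hypothetical name
    spec=client.V1DeploymentSpec(
        replicas=3,  # the resilience knob: how many copies must be running
        selector=client.V1LabelSelector(match_labels={"app": "company-website"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "company-website"}),
            spec=client.V1PodSpec(containers=[
                client.V1Container(
                    name="web",
                    image="nginx:1.25",  # stand-in for your own application image
                    ports=[client.V1ContainerPort(container_port=80)],
                ),
            ]),
        ),
    ),
)

client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
```

In practice most people express the same declaration as a YAML manifest and apply it with kubectl; either way, the point is that you describe the desired state and Kubernetes maintains it.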
Kubernetes has its roots in Google: it grew out of Google’s experience running containers at enormous scale on its internal cluster manager, Borg, and it is often described as ‘Google’s Infrastructure For Everyone Else’ (GIFEE). Google has been wildly successful running huge numbers of servers in a well-coordinated, secure way worldwide, and Kubernetes makes that technology available to everyone. Without doubt, Kubernetes is changing the face of large-scale application computing.
That’s too complex for me!
You don’t need to be the next Google to benefit from Kubernetes: plenty of everyday businesses run it, so it’s worth a little investigation before you decide it isn’t right for you. That said, Kubernetes won’t be right for everyone.
The approaches outlined above range from multiple servers configured as a high-availability cluster, through a traditional cloud deployment, to Kubernetes. Somewhere along that path lies the sweet spot that’s right for you.
If you feel that you need to stick with a single traditional hardware server, there are still things you can do to improve resilience. You should ensure that your server:
- has multiple, redundant power supplies (PSUs)
- has each power supply connected to an independent feed
- is connected to an Uninterruptible Power Supply (UPS)
- has redundant disks (RAID) configured, with notification if a disk fails (a monitoring sketch follows this list)
- uses enterprise-class disks
- is no more than five years old
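On the RAID point above, the notification matters as much as the redundancy: a mirror with one silently failed disk is a single failure away from data loss. As a rough sketch – assuming Linux software RAID (md); hardware controllers need their vendor’s own tools – the script below parses /proc/mdstat and flags any degraded array. In production you’d hook something like this into your monitoring or alerting system rather than run it by hand.

```python
# Rough sketch: flag degraded Linux software-RAID (md) arrays by parsing /proc/mdstat.
# Assumes md RAID; hardware RAID controllers report health through vendor utilities instead.
import re
from pathlib import Path

def degraded_arrays(mdstat_text: str) -> list[str]:
    """Return the names of md arrays whose status line shows a missing member."""
    degraded, current = [], None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)   # e.g. "md0 : active raid1 sda1[0] sdb1[1]"
        if header:
            current = header.group(1)
            continue
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)  # e.g. "[2/1] [U_]" when degraded
        if status and current:
            total, active, flags = status.groups()
            if active != total or "_" in flags:
                degraded.append(current)
    return degraded

if __name__ == "__main__":
    bad = degraded_arrays(Path("/proc/mdstat").read_text())
    if bad:
        print("WARNING: degraded RAID arrays:", ", ".join(bad))  # wire this into email/alerting
    else:
        print("All md arrays look healthy")
```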
Such a server isn’t immune from hardware failure, of course, but it does reduce the likelihood of such a failure impacting your business.
What next?
Why not arrange a free 30-minute call to discuss your situation, and we’ll advise you on the best way forwards?