How to back up your Linux system
Linux backups don’t just mean copying data. To get backups that are appropriate for your business, you need to do some planning. That’s the part many businesses skimp on – and they only find out when it’s too late.
Here’s what we’ll cover in this article:
- Why you need to classify your data
- Determine your Recovery Time Objective (RTO)
- Determine your Recovery Point Objective (RPO)
- Combining data classifications, RPO and RTO
- Techniques to reduce recovery time (RTA)
- Replication versus backups
- Local versus off-site backups
- Single versus multiple backups
- The psychology of backups
- How to monitor backups
- Testing backups
- How to do it yourself
- How to have it done for you
Why you need to classify your data
Before you consider how to implement backups, you need to classify your data. It’s easy to say, ‘All data must be backed up remotely every night’, but that’s neither practical nor necessary. Not all data is equal, and it may be categorised under one of the following headings:
Business-critical data
This is data that, if unavailable, would threaten the survival of the business. This will be a very small subset of the overall data, and often comprises the business’s intellectual property. Examples include:
- The contents of a source code repository
- The database and code comprising customer websites
- The detailed methodologies used by a consulting company
Business-operational data
This data is used to manage the business day to day, and consists of client lists, calendars, accounts, proposals, and so on. If lost, the business would undoubtedly lose efficiency, and it may well lose money or even a client or two. However, the business itself would probably survive.
Static data
Data that never changes. Some examples:
- Reference data, often from an external source
- Executable programs
- Files from the project that was completed last August
Users need access to this data, but they will never change it.
Ephemeral data
This data is transient in nature. It has a short lifespan and is easily, cheaply and often automatically regenerated if needed. Examples tend to be somewhat technical in nature, and include:
- Cached data for a website
- The local copy of a central code repository
- The intermediate files generated when compiling code, such as logs and object files
Normal data
This is all other data. This is data that, if unavailable, would be irritating, annoying or inconvenient, but would not have a major impact upon the day-to-day operation of the business.
Identifying the category that each element of data falls into is necessary because our requirements for the availability of each category will be different. If you ask your staff what data is business-critical, a common answer would be ‘all of it’. That’s an understandable perspective: if some data that they had identified as not being business-critical is lost and there’s no backup, they may well feel that they will be blamed.
So what are the backup requirements for each type of data?
To answer that, we need to consider how quickly we need to recover failed services, and how much data we can afford to lose.
For the sake of example, let’s imagine a somewhat old-school tape backup is run every evening at 22:00, and takes 90 minutes to run. Later, we’ll consider more contemporary backup techniques.
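Purely to make that example concrete, the crontab entry below would run a tar backup to a SCSI tape drive at 22:00 each night. The tape device and the directories backed up are assumptions for illustration, not recommendations.

    # Illustrative crontab entry: nightly tape backup at 22:00.
    # /dev/st0 and the directory list are placeholders.
    # min hour dom mon dow  command
    0 22 * * * /bin/tar -cpf /dev/st0 /etc /home /srv/data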
Determine your Recovery Time Objective (RTO)
The Recovery Time Actual (RTA) is a measure of how quickly a business service is restored after failure. In our example, imagine that a catastrophic disk failure occurs at 12:00. We’ll assume that the spare hardware is available an hour after failure, and recovery begins at that point. The restore takes 90 minutes and thus the service is back online at 14:30. The RTA is thus 2h30m.
RTA is always a real-life measurement and will vary from one instance to the next. It’s helpful to define a target recovery time, and that is our Recovery Time Objective or RTO.
The RTO is the maximum duration that we can tolerate a service outage. A log of our RTAs, drawn from both actual and simulated application failures, shows how well we are meeting our RTO.
Determine your Recovery Point Objective (RPO)
It may be that we have set a Recovery Time Objective of four hours. In the example above, our RTA was 2.5 hours, so we were well within our target. There is, however, another consideration.
In the example above, the failure occurred at 12:00, but the last backup was carried out at 22:00 the previous day, and thus fourteen hours of operational data was lost. We refer to this time as the Recovery Point Actual (RPA).
As before, it’s helpful to define a target, and that’s our Recovery Point Objective or RPO.
So, our Recovery Time Objective is the maximum time we can tolerate a service being unavailable, and our Recovery Point Objective is the maximum time period of data we can tolerate losing.
Combining data classifications, RPO and RTO
Broadly speaking, the shorter the RPO and RTO times, the more expensive the infrastructure needed.
For static data, we need a backup but we don’t need to refresh that backup. One or two copies are fine. In some cases – for example, reference data from a manufacturer – we can obtain another copy relatively easily, so we may not need a backup at all. In other cases, we may have an archive of the data and keep a local copy for reference. We may decide that an RPO of a week is acceptable.
If this static data is merely a local copy of data that is readily available elsewhere, such as manufacturer’s specification sheets, we may decide that an RTO of one or even two days is acceptable.
Business-operational data tends to put the focus on recovery time (the RTO): we need the application available again quickly, but it may be acceptable to have to re-enter today’s invoices.
Conversely, business-critical data tends to put the focus on the recovery point (the RPO): we can accept that the code repository for the new chip takes half a day to recover, but we don’t want to lose more than an hour’s worth of work.
Ephemeral data doesn’t need to be backed up at all. It is regenerated as required, often without any intervention.
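In practice, this often just means excluding ephemeral data from the backup job. A minimal sketch, assuming a tar-based backup and made-up paths and exclude patterns for caches and build artefacts:

    # Hedged sketch: exclude ephemeral data (caches, object files) from a backup.
    # The archive destination and exclude patterns are assumptions.
    tar -cpzf /backups/daily.tar.gz \
        --exclude='*/cache/*' \
        --exclude='*.o' \
        /home /srv/data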
Techniques to reduce recovery time (RTA)
RTA comprises two parts: making a platform available and carrying out the data restore.
If your platform is a VPS or cloud instance, making a new platform available should be straightforward.
If you’re running on conventional hardware, you can speed things up by having a rapid response hardware support contract, holding spare parts on site or even having an entire spare server.
In the case of a major disaster, it may be necessary to start by installing an operating system along with utility and application software, all of which adds to the recovery time.
Techniques to reduce the platform installation and configuration time include:
Configuration Management System (CMS)
A CMS will automatically install and configure software according to a predetermined template. There are many such systems to choose from, with Puppet and Ansible being among the most common. The advantages of using a Configuration Management System include (a brief sketch of applying such a template follows this list):
- speed: the system does not need to wait for human interaction
- consistency: the system will be set up exactly as it was before, assuming that it was originally built using the same CMS
- completeness: all steps will be carried out as defined by the template
- accuracy: no typos or similar human errors
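As a minimal sketch, assuming Ansible, a Debian-based control machine and made-up inventory and playbook names, rebuilding a replacement server from a predefined template might look like this; the playbook itself (packages, configuration files, services) would normally live in version control:

    # Hedged sketch: rebuild a replacement server from a predefined Ansible template.
    # inventory.ini, site.yml and the host name are placeholders.
    sudo apt-get install -y ansible                                                # on the control machine
    ansible-playbook -i inventory.ini site.yml --limit replacement-host --check    # dry run
    ansible-playbook -i inventory.ini site.yml --limit replacement-host            # apply the template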
Infrastructure as code
For cloud-based systems, every element of the solution – server instances, load balancers, database services and so on – is defined in code such that the entire infrastructure may be automatically rebuilt without intervention. This is similar to a Configuration Management System, but more comprehensive and more powerful for an initial cloud infrastructure build.
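As an illustration only (no specific tool is implied here), a rebuild with an infrastructure-as-code tool such as Terraform reduces to a couple of commands run against definitions held in version control:

    # Hedged sketch: recreate a cloud environment from its code definition.
    # Assumes the *.tf files describing servers, load balancers and databases
    # already exist in the current directory.
    terraform init     # fetch providers and set up state
    terraform plan     # preview what will be created or changed
    terraform apply    # build the infrastructure as defined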
Highly available systems
For more conventional hardware, Highly Available (HA) systems protect against hardware failure. In essence, multiple systems are configured to provision the service such that any one (or possibly several) of the hardware systems can fail and the service remains available. With suitable connectivity, it’s possible to have the various components hosted in different physical locations.
Replication versus backups
Replication is a valuable tool for mitigating hardware failure, but it sits alongside backups rather than replacing them. Consider, for example, the simple case of a pair of mirrored disk drives. All data written to one drive is replicated to the second, hence creating a “mirror”. If one disk fails, the data persists on the other. It’s important to detect and repair the now-broken mirror, but at this point no data has been lost.
If data is deleted, that action will be propagated across both drives to preserve the integrity of the mirror. If that data was deleted in error, it is gone from both drives. Backups are needed to recover from such a situation.
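For the mirrored-drive example above, a sketch of how such a mirror might be created and monitored on Linux follows. The device names are placeholders, and the create command would destroy any existing data on those disks.

    # Illustrative only: build and monitor a two-disk RAID 1 mirror with mdadm.
    # /dev/sdb and /dev/sdc are placeholder device names.
    sudo mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
    cat /proc/mdstat               # overall health: [UU] is healthy, [U_] is degraded
    sudo mdadm --detail /dev/md0   # per-device state, useful for spotting a failed disk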
Local versus off-site backups
Off-site backups are invariably slower and less convenient than local backups, but if you have to choose between them, choose remote. If the location hosting the application suffers a major catastrophe (fire, flood, etc), it’s likely that whatever fate befalls the application servers will also impact any local backups. You need remote backups.
In an ideal world, you’d have both local and remote backups. The local backups will provide rapid access for both backup and restore, and it’s often practical to run very frequent backups for critical data. The vast majority of restores can be serviced from the local backups – but the remote backups are the life insurance of the business and must take priority.
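A minimal sketch of an off-site copy, assuming rsync over SSH and placeholder user, host and path names:

    # Hedged sketch: push a local backup tree to an off-site host over SSH.
    # The user, host and paths are placeholders, not recommendations.
    rsync -aAX --delete /backups/local/ backup@offsite.example.com:/backups/$(hostname)/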
Single versus multiple backups
How many backups do you need? Ultimately the backups are just another data store, and as such they are vulnerable to all the same risks as the original data. Pragmatism should prevail, however: use data classification to determine how many backups are appropriate. One scheme might be:
- Normal data: local backup only, or local plus one remote backup.
- Business Operational Data: a minimum of one remote backup.
- Business Critical Data: a minimum of two independent remote backups.
“Independent” in this context means separate locations that are not related (for example, they should not be provided by the same hosting company).
The psychology of backups
The biggest problem by far with backups is how we think about them. Other than finger trouble whereby we inadvertently delete a file, we don’t expect to need backups.
In too many organisations, there are known deficiencies with backups. The danger signs include:
- some data that should be backed up is not
- junior staff are solely responsible for backups
- backups are not monitored
- backups are never tested
Common sense tells us that it can happen to you, and if it does you’ll need solid, reliable, off-site backups. At that point it will be too late to create them: if you don’t have them, it’s quite possible that your business won’t survive.
Hoping that it doesn’t happen to you is not a good business strategy. It’s far better to invest the time, money and resources to get solid backups than to have to explain to clients, board members, investors and potentially the Information Commissioner’s Office why data has been lost.
How to monitor backups
Ideally your backups will be fully automated, but whether they are automated, manual or a mixture of both, you should monitor them. There are all sorts of things you could monitor, but the following three are a good starting point (a small monitoring sketch follows this list):
- success: perhaps obviously, any failure in backups that is detected must be investigated. There likely will be failures from time to time – network glitches, power issues and so on – but you need to satisfy yourself that you understand why each backup failure occurred, and you need to do what you can to prevent similar failures occurring again.
- elapsed time: backups must finish in a reasonable time. What is reasonable will depend upon your defined RPO, but you should determine the maximum time backups should take and monitor for backups that exceed – or even get close to – that time.
- age: the last backup may have been successful, but if it completed seven months ago, you have a problem.
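As a starting point for the age check, a small script along the following lines could run from cron or from a monitoring system. The backup directory and the threshold are assumptions for illustration; the exit codes follow the common Nagios convention.

    #!/bin/sh
    # Hedged sketch: alert if the newest file in a backup directory is too old.
    # BACKUP_DIR and MAX_AGE_HOURS are placeholders.
    BACKUP_DIR=/backups/daily
    MAX_AGE_HOURS=26    # nightly backup plus a little slack

    newest=$(find "$BACKUP_DIR" -type f -printf '%T@\n' 2>/dev/null | sort -n | tail -n 1)
    if [ -z "$newest" ]; then
        echo "CRITICAL: no backup files found in $BACKUP_DIR" >&2
        exit 2
    fi
    age_hours=$(( ( $(date +%s) - ${newest%.*} ) / 3600 ))
    if [ "$age_hours" -gt "$MAX_AGE_HOURS" ]; then
        echo "CRITICAL: last backup is ${age_hours}h old (limit ${MAX_AGE_HOURS}h)" >&2
        exit 2
    fi
    echo "OK: last backup is ${age_hours}h old"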
When your monitoring reports a problem, it must be rigorously investigated. A backup failure is an urgent, high priority issue, and the support team must find a resolution as soon as is reasonably possible.
Testing backups
Testing backups is the only way to know in advance whether they are fit for purpose. The test restores should not be performed by the “go to” person: assume that they are unavailable.
The person performing the restore should work from a written set of instructions. You are checking:
- are the instructions available when systems are down?
- are the instructions clear and complete?
- is all additional information (passwords, IP addresses and so on) available?
- was the restore carried out effectively?
- was all data restored?
- does the restored data match the source data?
- was the RTO met?
It is very likely that the first few test restores will highlight problems, and test restores are the best time to find such problems.
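One straightforward way to carry out the ‘does the restored data match the source data’ check, assuming both trees are accessible from the same machine and using placeholder paths, is a recursive comparison:

    # Hedged sketch: compare restored data against the source tree.
    # Both paths are placeholders for this example.
    diff -r /srv/data /mnt/restore-test/srv/data \
        && echo "Restored data matches the source" \
        || echo "Differences found - investigate before signing off the test"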
How to do it yourself
In summary, this is what you need to do:
- Classify your data
- Determine the RPO and RTO for each category of data
- Select techniques for backups, disaster recovery and high availability that will enable you to achieve the Recovery Point and Recovery Time Objectives
- Devise a test schedule that will measure and record RPA and RTA
- Monitor backups for success, age and elapsed time
- Have a “restore” manual with a copy stored off-site
- Review test restores at least annually
How to have it done for you
Tiger Computing has been providing Linux support and consultancy services to businesses since 2002. Get in touch to discuss how you can get effective, automated, monitored backups in place.