Key to the very best availability of your servers is effective system monitoring. This is the first in a series of articles where we’ll look at exactly what that means. We won’t get too technical, but if you believe you already have effective system monitoring in place, you may want to check you’ve incorporated all these suggestions.
Disk Failures
System monitoring is about more than just reducing downtime. Take disk drives, for example. How likely is it that a disk will fail? Research by Google suggests that, for disks aged 2-5 years, an annual failure rate between 6% and 9% is realistic. If you have 15 drives installed across your servers – a very modest number of disks – you should expect around one failure per year. So you are going to experience disk failure, and soon.
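The arithmetic behind that estimate can be sketched in a few lines of Python. The function names are ours, and the second calculation assumes that drives fail independently of one another, which is a simplification rather than anything the research claims.

```python
def expected_failures(drives: int, annual_failure_rate: float) -> float:
    """Expected number of drive failures in one year."""
    return drives * annual_failure_rate


def prob_at_least_one_failure(drives: int, annual_failure_rate: float) -> float:
    """Chance that at least one drive fails in a year.

    Assumes failures are independent, which is a simplification.
    """
    return 1.0 - (1.0 - annual_failure_rate) ** drives


if __name__ == "__main__":
    # The 6% and 9% rates and the 15 drives come from the text above.
    for afr in (0.06, 0.09):
        print(
            f"AFR {afr:.0%}: "
            f"expected failures = {expected_failures(15, afr):.2f}, "
            f"P(at least one) = {prob_at_least_one_failure(15, afr):.0%}"
        )
```

With 15 drives, the expected number of failures per year works out at between 0.9 and 1.35, and under the independence assumption the chance of seeing at least one failure in a year is roughly 60% to 76%.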
The impact of failure of a disk is high: it’s likely that some, maybe all, data on that disk will be lost; however, that’s no problem if you have RAID implemented in such a way as to have redundant disks. The data on the failed disk has gone, but that data is replicated on other disks.
Part of the point of RAID is to make disk failures transparent. So long as you are monitoring the health of the RAID arrays, you’ll be aware of, and can react to, the disk failure. If you’re not monitoring them, you’ll only find out that you have a problem when sufficient disks have failed to give you data loss.
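As an illustration only, here is a minimal sketch of a RAID health check. It assumes Linux software RAID (md), where the kernel reports array state in /proc/mdstat and marks a missing member with an underscore, for example [U_]. Hardware RAID controllers need their vendor's own tools instead. The function name and the sample text are ours.

```python
import re

# Illustrative /proc/mdstat content: md0 is healthy, md1 has lost a disk.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      1048512 blocks [2/2] [UU]

md1 : active raid1 sda2[0]
      2096064 blocks [2/1] [U_]

unused devices: <none>
"""


def degraded_arrays(mdstat_text: str) -> list[str]:
    """Return the names of md arrays that report a missing member disk."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # The member-status string looks like [UU] (healthy) or [U_] (degraded).
        status = re.search(r"\[([U_]+)\]", line)
        if current and status:
            if "_" in status.group(1):
                degraded.append(current)
            current = None
    return degraded


if __name__ == "__main__":
    print(degraded_arrays(SAMPLE_MDSTAT))
```

On a real server the text would be read from /proc/mdstat rather than from a sample string, and a non-empty result would raise an alert.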
Even if you detect the failed disk, you could be unlucky. You wouldn’t be the first person to replace a RAID disk and have another disk fail during the RAID array rebuild. It’s not uncommon for two reasons. Firstly, it’s likely that all the disks in the RAID array are the same age from the same batch from the same manufacturer, and are thus likely to have a similar lifetime. Secondly, rebuilding a RAID array will stress the remaining disks, possibly enough to cause one more to fail.
That’s OK, because you have backups – but they need to be monitored too. Was the last backup successful? If it was successful, how old is it? Monitoring the success and age of the most recent backup is a sensible policy.
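A backup check along these lines might look like the sketch below. How you learn whether the last backup succeeded and when it finished depends entirely on your backup software, so both are passed in as plain values here. The function name and the 26-hour default limit (a daily backup plus some slack) are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def backup_status(
    succeeded: bool,
    finished_at: datetime,
    now: datetime,
    max_age: timedelta = timedelta(hours=26),  # assumed: daily backup plus slack
) -> str:
    """Classify the most recent backup by its outcome and its age."""
    if not succeeded:
        return "CRITICAL: last backup failed"
    age = now - finished_at
    if age > max_age:
        hours = age.total_seconds() / 3600
        return f"WARNING: last backup is {hours:.0f} hours old"
    return "OK"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(backup_status(True, now - timedelta(hours=3), now))
    print(backup_status(True, now - timedelta(days=3), now))
    print(backup_status(False, now - timedelta(hours=3), now))
```

Checking the age as well as the outcome catches the case where the backup job has quietly stopped running, so the last recorded result is a success that is now weeks old.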
So we can see that monitoring is not simply about making sure services are available. It’s also about checking that our safety measures – RAID disks, backups, etc. – are working as intended.
Types of Monitoring
We will consider three types of monitoring:
- Status monitoring
- Trend monitoring
- Log monitoring
Status Monitoring
The principle of status monitoring is simple: periodically, the monitoring system connects to each server in turn and runs some checks, usually with the aid of a locally installed agent. This takes place every five minutes or so. The results of those checks are passed back to the monitoring server, which will typically:
- record the data for later analysis if required
- present the results, perhaps via a web page or other program
- instigate notifications as required
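One pass of that cycle can be sketched as follows. This is a toy model and not how any particular monitoring product works: the check functions stand in for whatever the agent on each server would run, and every name here (CheckResult, run_cycle, notify) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class CheckResult:
    host: str
    check: str
    ok: bool
    detail: str


# A check takes a host name and returns (ok, detail).
Check = Callable[[str], Tuple[bool, str]]


def run_cycle(
    hosts: List[str],
    checks: Dict[str, Check],
    history: List[CheckResult],
    notify: Callable[[CheckResult], None],
) -> List[CheckResult]:
    """Run every check against every host once.

    Each result is recorded in history for later analysis, returned for
    presentation, and passed to notify if the check failed.
    """
    results = []
    for host in hosts:
        for name, check in checks.items():
            ok, detail = check(host)
            result = CheckResult(host, name, ok, detail)
            history.append(result)
            results.append(result)
            if not ok:
                notify(result)
    return results
```

A scheduler would call run_cycle every five minutes or so. A real system would also avoid sending the same notification on every cycle while a problem persists.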
The monitoring server will usually also facilitate:
- running single checks, or perhaps all checks for a specific client, on demand
- creation of reports showing host or service availability statistics, periods of downtime, notifications issued, and so on
- scheduling downtime for a host or service such that checks (and notifications) are suspended
- the acknowledgement of issues reported
- the recording of performance data – for example, CPU load
An example of status monitoring would be monitoring how full a disk partition is. If we are alerted once a partition is 80% full, we can investigate. Crucially, at that point there is nothing wrong with the system: all that has happened is that the partition has gone from below 80% full to above 80% full. It may be that someone has saved a number of large temporary files or that user data in general has increased, or even that there is a system problem that is causing free disk space to be used, but the warning should allow us to resolve the problem before it causes an unscheduled outage.
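A partition check of this kind could be sketched with Python's standard library as follows. The function names are ours, the 80% threshold comes from the example above, and the percentage is computed simply as used space over total space, so it may differ slightly from what tools such as df report.

```python
import shutil


def percent_used(path: str) -> float:
    """Return how full the filesystem containing path is, as a percentage."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.used / usage.total * 100


def partition_status(percent: float, warn_percent: float = 80.0) -> str:
    """Return WARNING once usage reaches the threshold, otherwise OK."""
    return "WARNING" if percent >= warn_percent else "OK"


if __name__ == "__main__":
    used = percent_used("/")
    print(f"/ is {used:.1f}% full: {partition_status(used)}")
```

Keeping the threshold as a parameter makes it easy to warn earlier on partitions that tend to fill quickly.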
Summary
The concept of system monitoring has been around for a long time, but to use it effectively we need to think carefully about exactly what should be monitored.
In the next article, we’ll look at trend and log monitoring.