Figuring out what to monitor

Modern servers have a lot of active hardware and software that can stop working at any time. A failure can start with the operating system, storage, database, network connectivity, heat, or a number of other sources.

So, which elements do we rank highest to ensure system availability? Which hardware needs the closest monitoring? What kind of tests should we use to ensure that the software is operating as expected?

When dedicating monitoring resources to check hardware and software, we must answer several questions to distribute effort efficiently. Every test takes time, uses network resources, and must save its results to a status file. If our system checks are too frequent or numerous, we could end up overwhelming our monitor server. Failing to prioritize the alerting criteria can actually be more dangerous; if we become too accustomed to ignoring irrelevant alerts, legitimate system issues can propagate unchecked.

Thus, the first step in building a monitoring infrastructure is to decide what it will monitor and why.

Getting ready

We're going to be building a spreadsheet. This spreadsheet will rank all of our hardware and software so that we know which systems deserve the most focus. Have a spreadsheet program available before starting.

How to do it...

Follow these steps to rank the priority and frequency of monitoring hardware and software:

  1. Create a spreadsheet with six columns labeled Monitor, Importance, Frequency, Warning Level, Critical Level, and Action.
  2. Under the Monitor column, list every piece of hardware and software on the server.
  3. Under the Importance column, rank every monitor at one of these three levels: minor, major, or critical.
  4. Under the Frequency column, choose a monitoring interval. We suggest that you use one of these choices: 10 seconds, 30 seconds, 1 minute, 1 hour, 12 hours, or 1 day.
  5. Under the Warning Level column, choose a threshold where the status of this resource should be considered a warning and might require further examination.
  6. Under the Critical Level column, choose a threshold where the status of this resource should be considered critical and in need of immediate attention.
  7. Under the Action column, pick an appropriate action that the monitor should take when a check triggers an alert. We suggest one of these choices: ignore, email support, email DBAs, and panic.

How it works...

The spreadsheet we're making requires only six columns to fit this recipe. Feel free to include any other relevant information when making your own spreadsheet. In fact, we suggest that you retain this document in source control for reference purposes and revisions. Its mere existence can prove beneficial as a necessary compliance document.

When we say to list every piece of hardware or software under the Monitor column, we expect a few to be forgotten. Part of this step is a mental filter; if we can't think of the resource, it probably isn't important enough to watch. There are limits to this, and we strongly suggest that you have at least two other objectives for people to verify that the list is complete.

For Importance and Frequency, we're really deciding how active this resource is and its likelihood to fail or require intervention. For example, consider a disk space monitor. Usable disk space is a major concern, but it's not likely to grow quickly. We can safely check disk space every hour or even every day and remain completely covered.

The Warning Level and Critical Level columns are essential to route the triggered alerts. A level of warning means a resource may need someone to double-check its status or acknowledge a problem for later review. If a resource reaches a critical status, every person interested in the server should be alerted immediately.

Finally, the monitoring software needs to know what action to take if an alert is triggered. If we ever choose ignore, we should simply disable that particular alert entirely. On the other hand, the support staff can usually solve simple resource problems or forward the alert to a DBA. At other times, we want the DBA to know immediately due to the importance or complexity of the hardware or software being monitored. As a last resort, the alert can merely panic and alert everyone in every contact list in the hope that at least one person is available to address the issue.

In the end, the first few lines of our spreadsheet may look something like this:

How it works...

There's more...

If we have access to a collaborative spreadsheet tool such as Google Docs or an internal Wiki, we should maintain this information there. Not only does this act as a central resource, but it ensures that all monitors have a logical reason to exist and have a predetermined escalation path. When problems arise, any time spent on deciding what to do or who to inform only serves to increase the overall amount of risk.

In the rare instance that management or business interests question our system monitoring policies, we have an immediate answer. As DBAs, we want our company to know that the database is in good hands, and a strict monitoring policy helps accomplish this.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.217.107.229