CHAPTER 14. Measuring and Improving CMDB Accuracy

We’ve touched on the issue of data accuracy many times already. More than anything else, the accuracy of your data will determine the amount of trust your organization places in the Configuration Management Database (CMDB). A trusted CMDB opens the door to a world of benefits, whereas a mistrusted CMDB will simply become another expensive information technology (IT) effort without significant payback.

But exactly how do you make sure that the data in the CMDB is accurate? Clearly we begin by defining what we mean by accuracy. It may seem strange, but data accuracy is often misunderstood. Once we’ve established a working definition, we need to measure accuracy in a way that doesn’t negatively impact the overall configuration management service. After baselines have been established through a solid measurement system, we can track the sources of errors, and eventually reduce the number of errors to improve the overall accuracy.

These four steps—define, measure, track, and improve—are described in this chapter, and depicted in the visual outline shown in Figure 14.1.

Figure 14.1 There are four steps to improving the accuracy of the CMDB.


Defining Accurate Configuration Data

The definition you choose to use for the accuracy of your CMDB is quite important. If the definition is too stringent, you will spend significant resources trying to achieve accuracy in unimportant details. If the definition you choose is too lax, you will be declaring your database very accurate when people depending on your data know it isn’t. This section helps you find a balance that works for your organization by weighing validity, correctness, and timeliness.

Accuracy of Attributes

When people think of CMDB accuracy, their first thought is usually that the attributes of each configuration item (CI) must be correct. This is a good start, but we’ll see that it is not nearly complete enough. Suppose that for every server, you are tracking the server name, number of CPUs, physical memory installed, and the status of the server. You find that one of the machines has the wrong number of CPUs, so there is an error in one of the specific attributes of that server.

You may find out, however, that a change was executed on the server just yesterday, so what looks like an error may actually be an issue of timing. The normal cycle for updating the CMDB in your process might be several days, and during those days there will be a discrepancy between what is recorded and reality. These timing errors are different from errors where data is inaccurate, and your policy on measuring accuracy must account for them. In most cases, you’ll want to exclude timing errors from any measures of CMDB accuracy.

Your accuracy policy will also need to further distinguish between errors and invalid entries. Using the previous example, the status of the server is probably one of a specific set of values. The tools and database constraints will ensure that one of those values is always assigned, so in some sense you could say that the value in the status attribute will always be valid. However, if the status is “production” but the server is really disconnected from the network and in a storage location, the status is not correct even though it is valid. So, you need to make a distinction between attributes that are invalid and those that are inaccurate.

You should also think about your position on empty attribute values. In some cases, an attribute is clearly in error if it is empty, such as in number of CPUs. In other cases, such as the date of the most recent audit, an empty attribute might be a perfectly valid value. There should be a difference between saying that no value is possible or available and saying that you simply forgot to put a value into an attribute. Perhaps you want to define “n/a” as a valid value in some places, and ensure that your procedures help everyone understand exactly when to use it. Whether or not an attribute can be empty is most likely defined in your database, but you should also think about it in terms of errors in the database. Empty values may or may not be errors, depending on your definitions.

Accuracy of Configuration Items

As stated in the preceding section, simply knowing whether attributes are accurate is not nearly enough to define the accuracy of your CMDB. It is also quite possible that a CI itself can be inaccurate.

One obvious error in CIs is incorrect classification. Although it would be relatively rare for someone to mistake a router for a server, classification errors will spring up with less tangible kinds of CIs, such as business processes, documents, or even compliance audits. Each CI can have only one classification, but it’s sometimes difficult to determine which classification applies. Your definition of database accuracy should define what is meant by an error in classification.


CLASSIFICATION ERRORS

One client I worked with wanted to define network file shares as CIs. Unfortunately, they also wanted to define permissions to network file shares as a separate configuration item. Although I told them this much detail might be expensive to maintain, they persisted and we went into production with this schema.

We soon learned that it was extremely difficult for even the IT people to tell the difference between a file share and the permission to use a file share. We were constantly getting errors in classifying the CIs, and people were confused by any search or report that showed both groups.

If you put too many categories, or categories that are too closely related, into your scope definition, you’re just inviting classification errors in the future.


Another common problem with CIs is that they aren’t in the CMDB at all. This is the most difficult of all errors to find because you never know what might be missing. Anyone who has spent any time in the IT field knows that as hard as you try, equipment will often get moved around without any record. We’ve all heard the stories of opening the door to a closet to find a whole room full of equipment that we never knew about.

The positive side of discovering new CIs is that at the very moment you make the discovery, you can collect data, and thus the error is corrected. Part of your policy around defining database accuracy should define how this is handled. At the very minimum, you should keep a count of the number of newly discovered CIs so that you can discover trends and form an educated opinion about how many more CIs you may be missing.

Accuracy of Relationships

The third part of the CMDB that must be considered for accuracy is the relationships between CIs. Just like attributes and CIs, there are several ways that relationships can be inaccurate.

By far the most common inaccuracy in relationships is that they grow stale. By stale, we mean that the relationship was broken in the real world, but the CMDB was not updated to reflect that fact. This happens because many organizations believe that “simple” changes do not need to be controlled as tightly as more complex changes, and this policy allows for many kinds of modifications to the environment that are not tracked.

As one example, consider a desktop PC with four software packages installed on it. A workstation discovery tool will find this PC, and faithfully record the PC with relationships to each software package. In most organizations, however, users are allowed to remove any of these software packages whenever they like. When a user removes a package, the relationships recorded in the CMDB linger as stale entries until the discovery tool runs again, finds that the software is no longer installed on the workstation, and removes them.

So, it is quite possible that relationships will be stale as a matter of timing, but it is also possible that someone will simply forget to update the CMDB. Timing issues probably will not be classified as an error, but forgotten updates are obviously errors. This is a distinction that should be made in your accuracy policy for the CMDB.

The opposite of stale relationships can happen as well—that is, it is possible that you will connect two things in the real world but not indicate the newly created relationship in the CMDB. This also can be a timing issue because a user has downloaded and installed a new software product on a desktop, but the discovery tool hasn’t run yet to update the CMDB. Undocumented relationships can also happen simply because the organization is not accustomed to thinking through all the implications of the changes being made.

Consider a more complex situation where a new database server is being installed to support databases for several different applications. The server CI is captured, and a separate CI is captured for each database. It is quite easy to forget to define the relationships between the server and the databases, or perhaps between the databases and the business applications that use those databases. IT professionals need to learn to think in configuration terms to remember to record all the different relationships being formed. While this learning is taking place, you will need significant help from the configuration manager to examine each of the changes being made and ensure that configuration information is being updated appropriately.

Finally, relationships can be in error because they are recorded as the wrong type. Just as CIs have a category, so do relationships. If you are implementing a SAN device and want to indicate the physical connection between a server and the SAN, it would be inaccurate to use a relationship type of “logical connection” rather than “physical connection.” Just as with CIs, the way to avoid this mistake is to have a smaller number of relationship categories, and clearly differentiate those categories so everyone understands when to use each one.

Figure 14.2 displays the different areas where errors can occur in the CMDB.

Figure 14.2 Errors can occur in attributes, configuration items, or relationships.


Ultimately, you need to create a policy that defines what is meant by accuracy of the CMDB. This policy is the foundation for being able to measure and report on accuracy, as described in the next section.

Ways of Measuring Accuracy

Now that you’re armed with a well-considered way to define accuracy, you can move on to measuring the accuracy of the CMDB.

Defining accuracy was about policy and documentation, whereas measurement is about practical action. This section describes in detail how to go about measuring accuracy in your environment.

Counting Numbers of Errors

The first part of measuring the accuracy of your CMDB is to count the number of errors.

You know what an error is, but how do you find errors? The most reliable way is by comparing the data in the CMDB with other data to see whether there are differences. You can think of these as data audits or spot checks.

The sample size you use to conduct these spot checks will shrink as you find higher levels of accuracy and grow if you start to see less accuracy. A good rule of thumb is to spot-check 5 percent of your database. If you are finding very few errors with this sample, you should assume your overall CMDB accuracy is good and reduce the sample size to 2 percent or so. On the other hand, if your original sample of 5 percent shows numerous errors, grow the sample size to 10 percent for the next spot check and continue to grow until you’ve identified the major sources of errors.
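
As a rough illustration, here is a minimal Python sketch of that rule of thumb, adjusting the sample fraction for the next audit cycle based on the error rate observed in the current one. The specific thresholds, and the halving and doubling steps, are my assumptions and should be tuned to your environment.

def next_sample_fraction(current_fraction, error_rate):
    """Adjust the spot-check sample size for the next audit cycle based on
    the error rate observed in the current one. The 1 and 5 percent error
    thresholds and the 2 and 10 percent floor and ceiling are illustrative."""
    if error_rate <= 0.01:
        # Very few errors found: shrink toward a 2 percent sample
        return max(0.02, current_fraction / 2)
    if error_rate >= 0.05:
        # Numerous errors found: grow toward a 10 percent sample
        return min(0.10, current_fraction * 2)
    return current_fraction

# Starting from the 5 percent rule of thumb
print(next_sample_fraction(0.05, error_rate=0.002))  # 0.025, so shrink the sample
print(next_sample_fraction(0.05, error_rate=0.08))   # 0.10, so grow the sample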

There are several good sources to use for comparison. If you originally got the data from a discovery tool, you can compare the data in the CMDB to the same data in the discovery tool. If you are reconciling discovery data with your production CMDB daily, you should find very few errors.

Another possible source is to do a physical comparison back to the actual item being compared. For software, this could be as easy as selecting “Help” and “About” to determine the version of the software and comparing that to the version listed in the CMDB. For hardware, this might mean going to where the hardware is located and finding characteristics such as the serial number and model number printed on the case.

When doing this kind of comparison, it is clearly impossible to verify every single CI and relationship monthly. You will need to select a representative sample and extrapolate to determine the number of errors in the whole database. You’ll want to choose a large enough sample so that a single error doesn’t skew your results too far. If you compare only ten items and find that one is incorrect, you have a 10 percent error rate. Adding another 90 items to the comparison without finding another error brings your error rate down to only 1 percent. The larger the sample size, the longer it will take to do the comparison and investigate the differences. A larger sample will get you closer to the true accuracy of your database, however. Finding the proper balance of sample size and effort is a matter of maturity and may take some time.

In addition to finding errors by actively looking for them, you will be able to discover some errors by using the configuration data in your operational processes. Each time someone raises a request for a change, they should be looking at configuration data to assess the impact of the proposed change. Many times the person raising the request for change (RFC) will be very knowledgeable about the environment and will be able to spot some errors just by looking at the data. Train your staff not to swallow their frustration in silence when they notice these discrepancies. Instead, people finding errors while looking at the data should be rewarded and encouraged to report the error. It wouldn’t be out of the question to even put a “bounty” of some kind on each error, rewarding people for paying attention to details and improving the accuracy of the data.

The incident management process often results in someone digging deeply into a particular part of the environment. These people should be encouraged to take a few extra seconds to compare what they find with the CMDB. If they do this comparison before resetting a router or rebooting a server, you will get many of the benefits of a physical inventory without having to incur the cost. By helping improve the accuracy of the CMDB in this way, you also improve the IT staff’s confidence in the data. The more people who feel responsible for accuracy, the higher your accuracy is likely to be. Figure 14.3 summarizes these three ways to detect CMDB errors.

Figure 14.3 There are three ways to actively find errors in the CMDB.


Your operational processes are also the most likely place to find undiscovered CIs and stale relationship kinds of errors. As people become more experienced in using configuration data, they will begin to sense when pieces of the environment are not described completely. You’ll start to hear people say things like “I know there should be a router here, but it’s not described in the CMDB.” At this point, you know that the organization has really embraced configuration management and is maturing in their understanding of the interrelationships of all the processes.

Spot checks and operational processes should work together to give you the best possible coverage of your entire database. You should specifically select your physical audit data to include things that haven’t been involved recently in one of the operational processes. Likewise, if you have selected something to participate in a spot check and realize that a technician had just checked that CI recently, you can remove it from the list. Seeing both sides as part of the same effort helps to reduce your costs and increase coverage.
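
A minimal sketch of that coordination follows, assuming you can export the full list of CI identifiers from the CMDB and a list of CIs that incidents or changes have caused someone to verify recently; the function and its inputs are hypothetical, not features of any particular tool.

import random

def select_audit_sample(all_ci_ids, recently_verified_ids, sample_fraction=0.05, seed=None):
    """Pick a random spot-check sample of CIs, skipping any that an incident
    or change has already caused someone to verify recently."""
    recently_verified = set(recently_verified_ids)
    candidates = [ci for ci in all_ci_ids if ci not in recently_verified]
    sample_size = max(1, int(len(all_ci_ids) * sample_fraction))
    rng = random.Random(seed)
    return rng.sample(candidates, min(sample_size, len(candidates)))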

The important part of finding errors is having a good means of logging them. Some organizations like to create a special type of incident to track errors they find with configuration data. This allows you to use the incident tracking system you probably already have, but it requires that you adjust reports and dashboards to exclude these tickets, because they don’t represent a degradation of service to any of your IT consumers.

Another possibility, if your tool is flexible enough, is to create a table in the CMDB to track the errors. This has the benefit of easily relating the errors to the CI or relationship involved, but requires that you create input screens to enter the data. Of course, if neither the incident system nor the CMDB is a suitable place for you to store error information, you can always use a spreadsheet to do your tracking. The tool you choose is not nearly as important as the information you track.

For every error, you should capture the person who found it, the way it was found, the CI or relationship identifier where the error is found, and the date and time the error was uncovered. This information will help you determine the source of the error and find ways to prevent it from happening in the future.
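
If you end up tracking errors in your own table or spreadsheet, a simple record structure along these lines captures the fields just described; the names and status values here are assumptions rather than a prescribed schema.

from dataclasses import dataclass
from datetime import datetime
from enum import Enum

class DetectionMethod(Enum):
    SPOT_CHECK = "spot check"
    DISCOVERY_COMPARISON = "discovery tool comparison"
    OPERATIONAL_PROCESS = "operational process"

@dataclass
class CmdbErrorRecord:
    found_by: str                  # person who found the error
    method: DetectionMethod        # how the error was found
    ci_or_relationship_id: str     # CI or relationship identifier involved
    found_at: datetime             # date and time the error was uncovered
    description: str = ""          # brief note on what appears to be wrong
    status: str = "unconfirmed"    # later set to "error" or "explained"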

Investigate and Sort Errors

After you have found and logged all errors, the next step is to investigate them to determine whether they truly are errors. This is where you take into consideration the timing situations described earlier. Each of your various operational processes is likely to have a different cycle, so you can’t simply check whether the error occurred some number of days after the last incident or change. Instead you need to consider each discovered error, understand when the data should have been changed by your standard processes, and then eliminate each one that has a valid explanation for the data inconsistency.

Investigating errors is a time-consuming process. You should actively look for opportunities to automate it as much as possible. For example, if you can create a clever query that cross-checks data between the CMDB, the incident ticketing system, and the change management tool, you could produce a report that shows the most recent change and incident for each CI that has an error. Such a report can help to eliminate many timing errors very quickly. As you consider the policy for defining accuracy, you’re likely to see other opportunities to build reports or queries that can sort the false errors from the real ones more quickly. It will probably not be possible to completely automate the error investigation, but you should get as close as you can.
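
Here is a sketch of that kind of cross-check. It assumes the logged errors look like the error-record sketch earlier in the chapter, that you can extract each CI’s most recent completed change or incident date from those tools, and that your normal update cycle is a few days; all of those are assumptions about your environment rather than features of any product.

from datetime import timedelta

def screen_timing_errors(error_records, last_activity_dates, update_cycle_days=3):
    """Separate likely timing discrepancies from candidates for real errors.
    last_activity_dates maps a CI or relationship identifier to the date of
    its most recent completed change or incident."""
    cutoff = timedelta(days=update_cycle_days)
    timing_issues, real_candidates = [], []
    for err in error_records:
        last_activity = last_activity_dates.get(err.ci_or_relationship_id)
        if last_activity is not None and err.found_at - last_activity <= cutoff:
            timing_issues.append(err)    # the CMDB may simply not have caught up yet
        else:
            real_candidates.append(err)  # investigate as a potential genuine error
    return timing_issues, real_candidates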

When you have investigated all errors, you should indicate in your tracking system which ones remain as errors and which ones can be explained. Usually this can be accomplished with a simple status field for each error.

Reporting Overall CMDB Accuracy

Most senior IT managers want to think of CMDB accuracy in terms of a single percentage. In my experience, people want to talk about the CMDB as being somewhere between 90 and 100 percent accurate. That’s a great measurement, but getting there from a simple list of discovered errors is not straightforward.

The formula starts with the error rate: the number of errors discovered divided by the number of opportunities for error. Accuracy is then simply 100 percent minus that rate. It’s that second number, the opportunities for error, that takes some consideration. Let’s think about how to define the denominator of the formula shown in Figure 14.4.

Figure 14.4 There is a simple formula describing the accuracy of the CMDB.


Most organizations start out with the idea that they will simply divide the number of errors by the number of CIs in the database. This is simple to understand, but leads directly to a quandary. If a CI has 12 attributes associated with it and two of them are incorrect, is that 2 errors or 1? If you are simply dividing by number of CIs but counting each attribute or relationship as a possibility of an error, you can actually have more errors than CIs.

You could simplify your definition of accuracy to say that if any attribute or relationship of a CI is incorrect, that whole CI is wrong and counts as a single error. That is a case of making the measurement easier at the expense of losing the ability to really understand what is going wrong. If you don’t track the individual errors, you can’t be sure to fix them all.

So, the next possibility is that each attribute and each relationship is an opportunity for error. Even in a very small database, the denominator will quickly grow quite large, which will result in a much higher percentage of accuracy than just counting the CIs.

Note, however, that even adding all the existing relationships and attributes does not represent all the opportunities for errors. You could still have errors that are introduced as new CIs are discovered and missing relationships are added. When these are discovered, you will add to the numerator (number of errors) and the denominator (opportunities for error) at the same time, making these the most significant kinds of errors.

Like so many other things in IT operations, you should decide how to measure CMDB accuracy based on practicality. Initially, it might make sense to track each error individually, but calculate accuracy as the number of CIs with any error divided by the number of total CIs. As you grow in maturity, you can move on to dividing total errors on any aspect by the total number of attributes and relationships. When your configuration management service has been operating for a while, you can even use historical trends to predict how many errors will be found as new CIs or relationships are discovered, and create a formula to factor those into the overall error calculation.
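
The two maturity levels can be expressed as small calculations. The sketch below is purely illustrative, and the sample numbers are invented; note how much the choice of denominator changes the reported percentage.

def ci_level_accuracy(total_cis, cis_with_any_error):
    """Starting point: a CI with any incorrect attribute or relationship
    counts as a single error against the total number of CIs."""
    return 1.0 - cis_with_any_error / total_cis

def detailed_accuracy(total_attributes, total_relationships, total_errors):
    """More mature measure: every attribute and relationship recorded in
    the CMDB is an opportunity for error."""
    opportunities = total_attributes + total_relationships
    return 1.0 - total_errors / opportunities

# Invented numbers, purely for illustration
print(f"{ci_level_accuracy(5000, 150):.1%}")           # 97.0%
print(f"{detailed_accuracy(60000, 12000, 300):.1%}")   # 99.6%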

The important thing is that you create a measurement that your stakeholders will understand and appreciate. Don’t simply publish the number, but be sure to explain exactly how you calculate the number. Unfortunately, the industry hasn’t yet selected a single method to calculate accuracy, so you will need to make people aware that a comparison with other organizations may not be as simple as comparing their percentage to yours.

Improving Accuracy

So now that you have a way to measure how accurate your CMDB is, you can calculate the accuracy to several decimal places. Whatever your number turns out to be, you will immediately be challenged by someone to make it better. After all, operations people are never satisfied with less than perfection!

The next step is to understand how the errors came to be in your database. This involves a bit of detective work and a lot of understanding of your processes. With some experience, you’ll realize that the type of the error often determines what caused the error. In this section, we look at the various causes and work backward to what types of errors are likely to be seen with these causes. The results are summarized in Figure 14.5.

Figure 14.5 Different classes of error require different handling.


Clerical Errors

Clerical errors are the most obvious. It is very easy to see most typographical or transposition errors. Clerical errors can also occur whenever a person uses a direct interface to create or update configuration data. This can involve choosing the wrong value from a drop-down list, selecting the wrong radio button, or any other such mistake made on an entry screen.

Errors can be introduced any time manual entry is needed, but the error may not actually be introduced in data entry. Many times we ask technicians to gather configuration information when they are in a data center server room or at a user desktop. Because most technicians don’t carry a laptop or other entry device with them, they pull out a pencil and write down the needed information; or worse yet, try to remember it until they get back to their desk. When they try to re-create the information, they have either lost the paper, can’t read their own handwriting, or have simply forgotten what they tried to remember.

Clerical errors are the most difficult kind of errors to detect, but perversely they are the simplest to correct. These errors are usually subtle, like the transposing of two digits in the serial number of a piece of equipment, or having the wrong location specified. These errors normally cannot be identified through clever queries or searches in the CMDB, but only through careful comparison or spot checks.

The best way to prevent clerical errors is to automate more of the process. This can take many different creative forms. You can implement a scanning tool on each piece of equipment so technicians do not have to write down information. You can even issue hand-held devices so that technicians can electronically scan external characteristics like asset bar codes and device serial numbers. You can implement electronic links from procurement or software licensing systems to create new CIs without having to manually enter them. Any way that you can avoid a person writing down or typing in configuration data will be one less way that clerical errors can be introduced into the database.

Because it is probably not possible to eliminate all manual activities from the configuration management process, you should also work to reduce clerical errors by training the key people who will be handling or entering data manually. Teach people not to try to remember information, give them at least a clipboard and a paper form to capture data accurately, and train them to double-check data as they enter it. It is surprising how even these low-tech safeguards can improve the accuracy of the CMDB.

Process Errors

A second class of errors to be aware of is process errors. These typically occur when a process has failed or hasn’t been followed well. For example, some organizations fail to properly link the change management process with the incident management process. The service desk gets a call that a desktop PC won’t power on. An incident is opened and sent to the desk-side support queue. A technician goes to the desk and finds that the system board on the computer is bad, so a replacement computer is needed.

The correct step would be to branch to the change management (or Install/Move/Add/Change) process at this point and follow that until the appropriate configuration update is made before coming back to incident management to indicate resolution. In organizations that don’t have well-integrated processes, the new PC may be acquired from a storage or redeployment locker without consideration of the configuration database update. The result is an incorrect state for both the broken machine and the machine that has been newly called into service—errors introduced because of a failed process.

The most common process errors are introduced because of unauthorized changes. People who decide that the “needs of the business” supersede the need for a change record will rarely update the CMDB to reflect the environment after they’ve gone ahead with their change. When these changes are made, errors are introduced to the CMDB, and the rest of the IT organization suffers with inaccurate data. There are also cases where people are honestly trying to follow the operational processes, but overlook steps that provide correct configuration data. All these are cases of process errors.

Some process errors can be found through standard operational reporting or special queries designed to look for certain types of errors. Other process errors, such as unauthorized changes, are found through scanning technology or spot audits. The difference is that processes should yield predictable results—you can often use this fact to find errors by finding where the results are not predictable. Several times already we’ve considered a report that shows recently completed changes and whether the CMDB has been updated after the change. This same report can often be used to find process errors.

The primary way to avoid process errors is to create more control points in the processes and to train all of your people to follow the processes. Each control point is another opportunity to have a report that might find whether process errors have been made. Of course, more control points and measurements will also cause more process overhead, driving up the cost and slowing down the process. Use these control points judiciously to minimize your errors, but then relax them as your people are more accustomed to the discipline of configuration management.

Programming Errors

The final kind of error we’ll consider is the programming error. Hopefully these will be the rarest form of error in your CMDB, and the easiest to spot. Programming errors potentially could be introduced by your CMDB tool itself or by the underlying database management middleware, but those would be very rare situations. In most cases, programming errors are the result of some integration software that is trying to pull data from other sources into the CMDB or federate data and reflect that data into the CMDB. These integration programs can be written in a variety of ways, and often aren’t built with the full rigor your business applications use.

Programming errors can be spotted with reporting. Normally they show up as impossible inconsistencies in the data. For example, you might see all workstations added on a single day as having two gigabytes of memory, when you know your environment is more heterogeneous than that. Keeping an adequate report of what CIs are added or changed by each execution of your integration tools is extremely important to protect against the possibility of programming errors.
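
As one sketch of such a report, assuming you can group newly added CIs by integration run and compare a single attribute such as installed memory, the following flags loads that look suspiciously uniform; the batch-size and dominance thresholds are illustrative.

from collections import Counter

def flag_uniform_loads(memory_by_load_date, min_batch=20, dominance=0.95):
    """Flag integration runs whose newly added CIs look suspiciously uniform,
    such as every workstation loaded that day reporting identical memory.
    memory_by_load_date maps a load date to the list of memory values added
    on that date."""
    suspicious = []
    for load_date, memory_values in memory_by_load_date.items():
        if len(memory_values) < min_batch:
            continue
        most_common_value, count = Counter(memory_values).most_common(1)[0]
        if count / len(memory_values) >= dominance:
            suspicious.append((load_date, most_common_value))
    return suspicious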

Programming errors are eliminated through stronger testing of your integration work. If the programs are tested adequately, preferably in a development or testing environment first, there should be very few cases where programming causes errors in the CMDB.

Finding errors that exist, and preventing errors from creeping into the CMDB in the first place, should be a significant investment area for you for the first several years of your configuration management effort. Only after your accuracy is consistently better than 97 percent should you start directing your investments toward higher business value. After all, the value of all other processes and projects depending on configuration data will be diluted if your accuracy is not high enough.
