What is troubleshooting?

Troubleshooting (TRBL) is a complete process in which you, in the role of VMware administrator, identify an issue, try to find the origin of the problem, and define the way to resolve it.

The main steps involved during the troubleshooting process are therefore the following:

Defining the problem
Identifying the cause of the problem
Resolving the problem

VMware environments are complex because different layers are involved, and a problem can affect any of them for different reasons:

Hardware failures
Software problems
Network problems
Resource contention
Mistakes in configuration

A common mistake is to consider TRBL only when your environment has totally failed, for example, with a Purple Screen of Death (PSOD) error. No! TRBL is about all problems, and you should start troubleshooting whenever there is a problem or when users report problems in terms of performance, reliability, or usability. The first step of every TRBL process is collecting all the symptoms. Here, you must be careful, because the symptoms and the origin of the problem can be totally different. This stage is very important for gathering the additional information needed to define the problem.

Typical questions include: Can the problem be reproduced? What is the scope? Was the system changed before the problem was reported? Is the problem documented in the VMware Knowledge Base (KB)?

There is good news for you: if the problem can be reproduced, you can concentrate directly on the issue. For example, if only hosts with a Serial-Attached SCSI Host Bus Adapter (SAS HBA) don’t work, applying new firmware for the SAS HBA can be the solution.

When you have all the information, you can start TRBL in one of the following three ways:

You start at the VM OS level and continue down to hardware
You start at the hardware level and continue up to the VM OS level
You can start in the middle, at the VMkernel level, and continue up or down
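
The three starting points above can be sketched as orderings over the layers. This is only an illustration, assuming a simplified stack of the three layers named in the text; the function names are hypothetical, and a real environment has more layers (storage, network, and so on).

```python
# Layers named in the text, ordered from top (VM OS) to bottom (hardware).
# Assumption: a simplified three-layer stack, for illustration only.
LAYERS = ["VM OS", "VMkernel", "hardware"]


def top_down(layers=LAYERS):
    """Start at the VM OS level and continue down to hardware."""
    return list(layers)


def bottom_up(layers=LAYERS):
    """Start at the hardware level and continue up to the VM OS level."""
    return list(reversed(layers))


def middle_out(start="VMkernel", go_up_first=True, layers=LAYERS):
    """Start in the middle, go up or down first, then cover the other side."""
    i = layers.index(start)
    above = list(reversed(layers[:i]))  # layers above start, nearest first
    below = layers[i + 1:]              # layers below start, nearest first
    first, second = (above, below) if go_up_first else (below, above)
    return [layers[i]] + first + second


if __name__ == "__main__":
    print(top_down())
    print(bottom_up())
    print(middle_out(go_up_first=False))
```

Which order to pick depends on the symptoms: a problem reported by a single VM suggests starting at the top, while a problem shared by every VM on a host suggests starting lower down.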

After identifying the cause, you must decide how urgently the problem needs to be fixed in your production environment by assigning a priority:

High: Resolve as fast as possible
Medium: Resolve during the first possible window
Low: You can wait for the next maintenance window
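
As a minimal sketch, the priority levels above could be recorded alongside each identified problem and used to order the work. The `Priority` enum and the `triage` helper are hypothetical names chosen for this example, and the sample issues are invented.

```python
from enum import Enum


class Priority(Enum):
    """The three priority levels from the text, with their resolution windows."""
    HIGH = "resolve as fast as possible"
    MEDIUM = "resolve during the first possible window"
    LOW = "wait for the next maintenance window"


def triage(issues):
    """Sort (description, Priority) pairs so that HIGH comes first."""
    order = {Priority.HIGH: 0, Priority.MEDIUM: 1, Priority.LOW: 2}
    return sorted(issues, key=lambda item: order[item[1]])


if __name__ == "__main__":
    # Invented sample issues, for illustration only.
    issues = [
        ("Outdated firmware on one host", Priority.LOW),
        ("Host down with a PSOD", Priority.HIGH),
        ("Slow VM reported by users", Priority.MEDIUM),
    ]
    for description, priority in triage(issues):
        print(f"{priority.name}: {description} ({priority.value})")
```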

Solution levels can be classified as follows:

Short term: Typical workaround
Long term: Reconfigure or change the advanced configuration, and so on
Impact: Apply available patches from VMware or other vendors

Resolving a problem may require combining different solutions. But I think the theory is done, and we can start with real examples of how to troubleshoot your production environment.
