What is troubleshooting?

Troubleshooting (TRBL) is a complete process in which you, as the VMware administrator, identify an issue, find the origin of the problem, and define how to resolve it.

The main steps involved during the troubleshooting process are therefore the following:

  1. Defining the problem
  2. Identifying the cause of the problem
  3. Resolving the problem

VMware environments are complex because several layers are involved, and a problem can originate in any of them for different reasons:

  • Hardware failures
  • Software problems
  • Network problems
  • Resource contention
  • Mistakes in configuration

A common mistake is to consider TRBL only when your environment has totally failed, for example, with a Purple Screen of Death (PSOD) error. No! TRBL covers all problems, and you should start troubleshooting as soon as a problem appears or users report issues with performance, reliability, or usability. The first step of every TRBL process is collecting all the symptoms. Be careful here, because the symptoms and the origin of the problem can be totally different. This stage is very important for gathering the additional information needed to define the problem.

Typical questions include: Can the problem be reproduced? What is the scope? Was the system changed before the problem was reported? Is the problem documented in the VMware Knowledge Base (KB)?

There is good news: if the problem can be reproduced, you can concentrate directly on the issue. For example, if only hosts with a Serial-Attached SCSI Host Bus Adapter (SAS HBA) don't work, applying new firmware to that HBA may be the solution.

When you have all the information, you can approach TRBL in one of the following three ways:

  • Start at the VM OS level and continue down to the hardware
  • Start at the hardware level and continue up to the VM OS level
  • Start in the middle, at the VMkernel level, and continue up or down
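The three starting points can be sketched as a simple ordering over the layers. This is only an illustration: the layer list contains just the three levels named above, and the function name `inspection_order` and its parameters are hypothetical, not part of any VMware tool.

```python
# Layers named in the text, ordered from top (guest) to bottom (physical).
# A real environment has more layers; this list is an assumption for the sketch.
LAYERS = ["VM OS", "VMkernel", "hardware"]


def inspection_order(start, direction="down"):
    """Return the layers to inspect, beginning at `start`.

    direction="down" walks toward the hardware first,
    direction="up" walks toward the VM OS first.
    Layers on the other side of `start` are appended afterwards
    so that no layer is skipped.
    """
    i = LAYERS.index(start)
    below = LAYERS[i + 1:]        # layers toward the hardware
    above = LAYERS[:i][::-1]      # layers toward the VM OS, nearest first
    if direction == "down":
        return [start] + below + above
    if direction == "up":
        return [start] + above + below
    raise ValueError("direction must be 'up' or 'down'")


if __name__ == "__main__":
    print(inspection_order("VM OS", "down"))
    print(inspection_order("hardware", "up"))
    print(inspection_order("VMkernel", "up"))
```

Starting in the middle is useful when the symptoms do not clearly point at the guest or at the hardware: you check the VMkernel first and then move in whichever direction the evidence suggests.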

After identifying the cause, you must decide how urgently the problem needs to be fixed in your production environment by assigning a priority:

  • High: Resolve as fast as possible
  • Medium: Resolve during the first possible window
  • Low: You can wait for the next maintenance window
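As a minimal sketch, the three priorities above could be modeled so that a list of open issues is always worked in order of urgency. The class names, the `triage` function, and the sample issues are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    """Lower value means more urgent."""
    HIGH = 1
    MEDIUM = 2
    LOW = 3


# Resolution targets taken from the priority list in the text.
RESOLUTION_TARGET = {
    Priority.HIGH: "as fast as possible",
    Priority.MEDIUM: "first possible window",
    Priority.LOW: "next maintenance window",
}


@dataclass
class Issue:
    summary: str
    priority: Priority


def triage(issues):
    """Return the issues sorted so the most urgent come first.

    sorted() is stable, so issues with equal priority keep their
    original (reporting) order.
    """
    return sorted(issues, key=lambda issue: issue.priority)


if __name__ == "__main__":
    # Hypothetical issues, invented for this example.
    backlog = [
        Issue("Slow datastore browsing", Priority.LOW),
        Issue("Host not responding", Priority.HIGH),
        Issue("Intermittent VM latency", Priority.MEDIUM),
    ]
    for issue in triage(backlog):
        print(f"{issue.priority.name}: {issue.summary} "
              f"-> resolve {RESOLUTION_TARGET[issue.priority]}")
```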

Solution levels can be classified as follows:

  • Short: Typical workaround 
  • Long: Reconfigure or change the advanced configuration and so on
  • Impact: Apply available patches from VMware or other vendors

Resolving a problem may require combining several of these solutions. But I think the theory is done, and we can start with real examples of how to troubleshoot your production environment.
