Chapter 11. Under the Hood with the ESXi Tech Support Mode

As you migrate your environment from VMware ESX to ESXi, what was your stable ESX environment will become your stable ESXi environment. Your management methods may change, but you’ll slowly grow accustomed to the nuances of managing ESXi.

If you have to perform low-level troubleshooting of your ESXi hosts, the differences in architecture will become quite obvious and you’ll be in a situation where some aspects are familiar, but others quite different. In this chapter, you explore the following aspects:

  • Accessing Tech Support Mode

  • Auditing access to Tech Support Mode

  • Exploring the boot process and filesystem for ESXi

  • Understanding system backups and repairs

  • Using Tech Support Mode to troubleshoot your hosts

Accessing Tech Support Mode

The Direct Console User Interface (DCUI) provides a local user interface to the console of an ESXi host. As discussed in Chapter 3, “Management Tools,” the DCUI is a simple, menu-driven interface that you can use to configure and manage components, including performing the following actions:

  • Configuring the password for the root account

  • Configuring the Internet Protocol (IP) settings and network interface cards (NICs) used for management access

  • Viewing system logs

  • Restarting management services

With the DCUI, you can control access to Tech Support Mode (TSM), whether that be for local access or remote access via Secure Shell (SSH). To enable TSM with the DCUI, follow these steps:

  1. Access the DCUI for the host and press F2 to open the System Customization menu.

  2. Select Troubleshooting Options and press Enter.

  3. Select either the Enable Local Tech Support Mode or the Enable Remote Tech Support Mode (SSH) option. If the option starts instead with the word “Disable,” it has already been enabled.

  4. Press Enter to enable the service.

  5. Select the Modify Tech Support Timeout option and press Enter.

  6. Enter a value between 0 and 1440 to set the timeout value for TSM in minutes. Setting the timeout value to 0 disables the timeout option.

Once you have enabled TSM, you can press Alt+F1 to access Local TSM or use an SSH client to access Remote TSM. You must log in with an account that has been granted the Administrator role on the host or with a local user that is a member of the root group. If Local TSM is not enabled when you press Alt+F1, you receive the error message Tech Support Mode Has Been Disabled by the Administrator. For Remote TSM, the connection is refused because ESXi does not open the SSH port unless the Remote TSM service is running. If Lockdown Mode is enabled on the ESXi host, you are prevented from using TSM. In that case, the error message is Login Incorrect for Local TSM and Access Denied for Remote TSM.
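
If you're connecting to Remote TSM, any standard SSH client will do. For example, from a Linux workstation or the vSphere Management Assistant you might run something similar to the following; the host name is only an example:

ssh root@esx06.mishchenko.net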

Note

The timeout value sets the number of minutes that can elapse before you must log in to TSM. If the timeout expires, you must enable TSM again before you can log in. If you have an existing session open to TSM when the timeout value expires, that session is not closed.

TSM may also be enabled with the vSphere client, both manually and using Host Profiles. To enable TSM manually, use the following process:

  1. Log in to your ESXi host or vCenter Server with the vSphere client.

  2. Select the Configuration tab for the host and click Security Profile.

  3. Click the Properties link to access the Services Properties screen.

  4. Select either Local Tech Support or Remote Tech Support (SSH) and click Options.

  5. On the Options screen, click Start. You can optionally configure the startup policy for the service.

  6. After the service has started and is showing a Status of Running, click OK to close the Options screen.

  7. Click OK to close the Services Properties screen.

You can also enable the TSM timeout value in the vSphere client. On the Configuration tab, select Advanced Settings in the Software section. Find the UserVars.TSMTimeOut parameter and set it to a value between 0 and 86400 seconds. A value of 0 disables the TSM timeout.
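
If you prefer the vCLI, the same parameter can be read and set remotely with vicfg-advcfg. The following is a sketch; the host name is an example and you would substitute your own timeout value:

vicfg-advcfg --server esx06.mishchenko.net --get UserVars.TSMTimeOut
vicfg-advcfg --server esx06.mishchenko.net --set 120 UserVars.TSMTimeOut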

Tip

Setting a short timeout value of one to two minutes gives you ample time to connect to TSM while ensuring that the service is not left running after you have connected.

If you're using Host Profiles, you can configure the settings for TSM as shown in Figure 11.1. In this example, Remote TSM is set to have a startup policy of Off and the Advanced Setting UserVars.TSMTimeOut has been set to 120 seconds. Note that Host Profiles does not check the status of the TSM services to verify whether they are stopped or running.

Figure 11.1. Configuring Tech Support Mode with vSphere Host Profiles.

You can also configure the TSM services with PowerCLI. The next example connects to each ESXi host and stops the Local or Remote TSM services if they are running. The startup policy for the services is set to Start and Stop Manually, and the timeout value is set to 2 minutes.

$VMhosts = Get-VMHost
ForEach ($VMhost in $VMhosts)
{
    # Set the startup policy for Remote and Local TSM to Start and Stop Manually
    Set-VMHostService -HostService (Get-VMHostService -VMHost $VMhost |
        Where {$_.Key -eq "TSM-SSH"}) -Policy "Off"
    Set-VMHostService -HostService (Get-VMHostService -VMHost $VMhost |
        Where {$_.Key -eq "TSM"}) -Policy "Off"
    # Stop Remote and Local TSM if they are currently running
    $status = Get-VMHostService -VMHost $VMhost | Where {$_.Key -eq "TSM-SSH"}
    If ($status.Running) {Stop-VMHostService -HostService $status}
    $status = Get-VMHostService -VMHost $VMhost | Where {$_.Key -eq "TSM"}
    If ($status.Running) {Stop-VMHostService -HostService $status}
    # Set the TSM timeout to 120 seconds (2 minutes) if it is not already set
    If ((Get-VMHostAdvancedConfiguration -VMHost $VMhost `
        -Name UserVars.TSMTimeOut).Values -ne 120) {
        Set-VMHostAdvancedConfiguration -VMHost $VMhost `
            -Name UserVars.TSMTimeOut -Value 120 -Confirm:$False
    }
}

Note

TSM can be used when testing and debugging the pre-boot, post-boot, or first boot portions of the automated installation scripts for ESXi. This is not recommended for production environments. To enable Remote TSM during your installation script, you can add the following lines to the appropriate section for when you want to enable Remote TSM:

vim-cmd hostsvc/enable_remote_tsm
vim-cmd hostsvc/start_remote_tsm

To enable Local TSM with your installation script, you can add the following lines:

vim-cmd hostsvc/enable_local_tsm
vim-cmd hostsvc/start_local_tsm

You can also set the TSM timeout value within your installation script. The following command sets the timeout value to 300 seconds:

vim-cmd hostsvc/advopt/update UserVars.TSMTimeOut long 300
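
Putting these commands together, a %firstboot section for an ESXi 4.1 kickstart file might look like the following sketch. The --unsupported and --interpreter=busybox options shown here are typical for this release, but verify them against your own installation script:

%firstboot --unsupported --interpreter=busybox
# Enable Remote TSM for post-install troubleshooting and set a 5-minute timeout
vim-cmd hostsvc/enable_remote_tsm
vim-cmd hostsvc/start_remote_tsm
vim-cmd hostsvc/advopt/update UserVars.TSMTimeOut long 300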

Auditing Tech Support Mode

If you’re managing a large environment, you may wish to take steps to ensure that TSM is not improperly used. Although TSM provides an important tool for troubleshooting, problems can result if it is used for other management tasks that can be performed with the vSphere client or automated with the vCLI or PowerCLI.

One line of defense to ensure that TSM is not casually accessed is to enable Lockdown Mode. As discussed in Chapter 7, “Securing ESXi,” Lockdown Mode prevents all user access to both Local and Remote TSM. However, if users have access to log in to the DCUI or can access the Configuration tab for the host, they can disable Lockdown Mode, then access TSM. To monitor for a change in Lockdown Mode, you can use the following process to enable a vCenter Server alert:

  1. Start the vSphere client and connect to vCenter Server.

  2. At the appropriate level, select the Alarms tab and click Definitions.

  3. Select File > New > Alarm.

  4. Enter an Alarm Name and Description for the new alarm.

  5. Select an Alarm Type of Host and check the option Monitor for Specific Events Occurring on This Object.

  6. Select the Triggers tab.

  7. Click Add to create a new trigger.

  8. Set the Event to Host Administrator Access Enabled.

  9. On the Action tab, configure an appropriate Action for your alarm.

  10. Click OK to create the new alarm.

Caution

When you disable Lockdown Mode in the DCUI, a corresponding event is not currently logged with vCenter Server. If you are sending syslog data to a management server, you can look for a rapid succession of log entries containing vim.AuthorizationManager.setEntityPermissions. You can also monitor for a number of login events by the dcui account as shown in Figure 11.2. When you enable Lockdown Mode, you should be aware that it does not terminate existing TSM sessions to the host.

Figure 11.2. Disabling Lockdown Mode with the DCUI does not generate a specific event within vCenter Server.

If you prefer using PowerCLI rather than vCenter Server alarms, you can use Get-VIEvent to monitor for the event that indicates that Lockdown Mode has been disabled. The following script checks all hosts connected to your vCenter Server for this event. When Lockdown Mode is enabled, the event HostAdminDisableEvent is generated.

Get-VMHost | ForEach-Object {
    Get-VIEvent -Entity $_.Name |
        Where { $_.GetType().Name -eq "HostAdminEnableEvent" } |
        Select CreatedTime, UserName, FullFormattedMessage
}

PowerCLI may also be used to query for the events related to enabling Local and Remote TSM. The following script queries vCenter Server for events of the type LocalTSMEnabledEvent and RemoteTSMEnabledEvent:

Get-VMHost | ForEach-Object {
    Get-VIEvent -Entity $_.Name |
        Where { ($_.GetType().Name -eq "LocalTSMEnabledEvent") -Or
                ($_.GetType().Name -eq "RemoteTSMEnabledEvent") } |
        Select CreatedTime, UserName, FullFormattedMessage
}

Caution

If you enable TSM in the DCUI or when connected directly to the host with the vSphere client, vCenter Server records events of the type LocalTSMEnabledEvent and RemoteTSMEnabledEvent. If you start the TSM services via vCenter Server, a Service Start task is recorded instead of the TSM events. You will also note some differences in how the user is recorded when TSM is enabled. When TSM is enabled in the DCUI, the event is recorded by ESXi and vCenter Server as having been initiated by the DCUI user. When the change is made by a user connected directly to the ESXi host, both ESXi and vCenter Server record the actual user that initiated the change. Lastly, if the change is made when connected to vCenter Server, vCenter Server records the actual user, but ESXi records the task as initiated by the vpxuser account.

The last option for auditing TSM and the actions taken in TSM is to use syslog. Setup of syslog is discussed in Chapter 6, “System Monitoring and Management.” The following events correspond with TSM being enabled. You could create an alert that would be triggered based on the event text VMware Tech Support Mode available:

Oct 1 23:13:17 Hostd: [2010-10-01 23:13:17.147 23506B90 verbose 'ServiceSystem'
    opID=E3CD7415-0000012C] Invoking command /bin/ash /etc/init.d/TSM start
Oct 1 23:13:17 root: TSM Displaying TSM login: runlevel =
Oct 1 23:13:17 init: init: process '/sbin/initterm.sh TTY1 /sbin/techsupport.sh
    ++min=0,swap' (pid 579733) exited. Scheduling it for restart.
Oct 1 23:13:17 init: init: starting pid 584732, tty '/dev/tty1': '/bin/sh'
Oct 1 23:13:17 root: techsupport VMware Tech Support Mode available

The following events record a user’s session with TSM. The initial dropbear events record a remote SSH session initiated from a remote host. This corresponds with the warning message that the user receives when accessing TSM shown in Figure 11.3. Note that the commands executed by the user in TSM are logged from a source of shell. The last event records the end of the user’s TSM session.

Figure 11.3. When TSM is accessed, a warning message is issued and the event is logged to the VMkernel log file.

Oct 1 22:46:44 dropbear[582158]: Child connection from 192.168.1.225:62951
Oct 1 22:46:51 dropbear[582158]: pam_per_user: create_subrequest_handle():
    doing map lookup for user "root"
Oct 1 22:46:51 dropbear[582158]: pam_per_user: create_subrequest_handle():
    creating new subrequest (user="root", service="system-auth-generic")
Oct 1 22:46:51 dropbear[582158]: PAM password auth succeeded for 'root' from
    192.168.1.225:62951
Oct 1 22:46:51 shell[582159]: Interactive shell session started
Oct 1 22:50:13 shell[582378]: esxcfg-nics -l
Oct 1 22:50:21 shell[582378]: vdf -h
Oct 1 23:47:35 dropbear[582158]: exit after auth (root): Exited normally

Exploring the File System

ESXi was designed to be easily deployed to thousands of nodes and at the same time to enable the deployment of very small turnkey installations. As you’ve seen in prior chapters, ESXi can be preinstalled on a flash device or installed to a small local or remote hard drive. In the future, you can expect to see ESXi booting on completely diskless systems. These design goals required a change in the way that the boot files are stored, and ESXi differs significantly from how a traditional operating system (OS) is installed or even how ESX is installed and boots.

Note

VMware Auto Deploy is an experimental product from VMware Labs that enables automatic Preboot Execution Environment (PXE) boot and customization for VMware ESXi. With Auto Deploy, you can use completely diskless and stateless ESXi hosts. Each time an ESXi host boots from the PXE server, it is automatically configured by Auto Deploy using Host Profiles and other information stored within vCenter Server. The host can then join a cluster and begin to handle your virtual machine workload. VMware Auto Deploy is not supported in production environments, but provides a preview of how you might deploy your ESXi hosts in the future. It is available for download from http://labs.vmware.com.

The system partitions for ESXi are summarized in the following list. The partition layout is the same whether you’re using ESXi Embedded or Installable. The system partitions include the following:

  • Bootloader partition. This 4MB partition contains SYSLinux, which is used as a bootloader to start ESXi.

  • Boot bank partition. This 250MB partition stores the required files to boot VMware ESXi. The partition is also referred to as Hypervisor1.

  • Alt boot bank partition. This 250MB partition is initially empty. The first time you patch ESXi, the new system image is stored here. The partition is also referred to as Hypervisor2.

  • Core dump partition. This 110MB partition is normally empty, but the VMkernel will store a memory dump image if a system failure occurs. The partition can be managed with vicfg-dumppart from the vCLI.

  • Store partition. This 286MB partition is used to store system utilities such as the ISO images for VMware Tools and floppy disk images for virtual device drivers. The partition is also referred to as Hypervisor3.

When you are installing ESXi Installable on a disk with at least 5GB of storage, ESXi also creates a 4GB scratch partition. This partition is mounted as /scratch and is used to store the output from vm-support and to store upgrade files, and is set as the default location for the advanced parameter Syslog.Local.DatastorePath. If you are installing ESXi to boot from a storage area network (SAN), you can allocate a 5GB logical unit number (LUN) for the boot disk. Note that if you’re using a scripted installation, the installer requires an additional 1GB of space as it attempts to create a datastore.
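
If you want to confirm the core dump partition and the scratch-related setting without opening TSM, the vCLI provides commands for both. The following is a sketch; the host name is an example:

vicfg-dumppart --server esx06.mishchenko.net --list
vicfg-advcfg --server esx06.mishchenko.net --get Syslog.Local.DatastorePath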

When your ESXi host first starts, SYSLinux is loaded. SYSLinux looks at the file boot.cfg, which is located both on Hypervisor1 and Hypervisor2. SYSLinux uses the parameters build, updated, and bootstate to determine which partition to use to boot ESXi. The following is a typical boot.cfg file:

kernel=b.z
kernelopt=
modules=k.z --- s.z --- c.z --- oem.tgz --- license.tgz --- m.z --- vpxa.vgz
   --- state.tgz --- aam.vgz
build=4.1.0-235786
updated=2
bootstate=0

After a new installation, ESXi boots from Hypervisor1, which is mounted by ESXi as /bootbank. Hypervisor2 is mounted as /altbootbank and is initially empty. When you patch or upgrade your ESXi host, a completely new firmware image for ESXi is loaded on the host and stored in /altbootbank. The boot parameters are updated so that when ESXi next starts, it will recognize Hypervisor2 as containing the version of ESXi that should be booted. That partition is mounted as /bootbank, whereas Hypervisor1 becomes the new /altbootbank. The partition roles are reversed the next time you patch your ESXi host. With this design, your host boot partitions always contain two complete images that can be used to boot ESXi. If you patch a host and a problem is detected when ESXi starts to boot, the system automatically reverts to the previously installed version of ESXi.
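
A quick way to see which image the host will boot next is to compare the bootstate values in the two boot.cfg files from TSM. The following sketch uses the BusyBox version of grep; as described in the revert process that follows, a bootstate of 0 marks a bootable image, whereas 3 marks an image that is no longer considered valid:

~ # grep bootstate /bootbank/boot.cfg /altbootbank/boot.cfg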

Tip

Update activities on ESXi generate a log file called esxupdate.log. This file can be found in /store/db.

If you experience problems with ESXi after installing a patch, you can manually revert to the prior version with the following process. Begin by rebooting the ESXi host. At the initial Loading VMware Hypervisor screen, press Shift+R. The warning message shown in Figure 11.4 is displayed, and you can press Shift+Y to revert to the prior version. The screen then displays an option to view the log for the event. Press Esc to view the log or press Enter to continue booting. If the operation was successful, the log displays the message Fallback hypervisor restored successfully. You can then press Esc to exit the log screen or press Enter to continue the boot process. The version of ESXi that you experienced problems with becomes /altbootbank and will be overwritten the next time you apply a patch. If you attempt to repeat the process to revert to that version, you will receive the error No valid fallback hypervisor found because the bootstate parameter is now set to 3. If you need to boot that version of ESXi again, you can reapply the ESXi patch or change the bootstate parameter back to 0. If you're applying customizations from a third party, each patch causes a change in the boot bank used, so if you apply two or more patches, you won't be able to revert to the prepatch state using this method. You instead need to run a repair installation and restore a system configuration backup.

Figure 11.4. The Loading VMware Hypervisor screen displaying a warning message.

After SYSLinux determines which system image to boot, boot.cfg is read to determine the files that are used to boot the VMkernel. The files that ESXi uses are loaded into memory and are not accessed again on storage until the host is rebooted. It is possible, although not recommended, to remove the boot device from an ESXi host that has completed the boot process. For the most part, the host will continue to function properly, with problems limited to system processes that access the boot media, such as system backups and access to the VMware Tools ISO images. Likewise, changes made to the ESXi memory filesystem are lost when a host reboots. As you will see in the following section, the ESXi system backup process backs up the necessary system state files, but if the random access memory (RAM) disk fills up due to a technical issue, that problem will not persist after a reboot.

There are three significant file types that ESXi uses to boot. First are the Executive files tboot.gz (Trusted Platform Module files), vmkboot.gz (small core), and vmkernel.gz, which make up the VMkernel and are loaded into memory as executables. These files have no presence within the ESXi memory filesystem. Second, a series of Archive files with the extension vgz, called tardisks, are mounted and extracted to form the filesystem. Those files include system.vgz, vpxa.vgz, aam.vgz, extmod.tgz, and oem.tgz. These packages use the vSphere Installation Bundle (VIB) format. system.vgz contains the core system files, vpxa.vgz contains the files for the vCenter Agent, and aam.vgz contains the High Availability system files. VIB updates from third parties may also be listed. The files in these VIBs are extracted in a progressive manner. If a duplicate file is found in both system.vgz and oem.tgz, the file in oem.tgz is extracted later and is therefore the version of the file that ESXi uses. The last file type is the State archive file. This file is called state.tgz with ESXi Installable and local.tgz with ESXi Embedded. The contents of both versions are the same, and the archive contains a backup of the files necessary for the configuration of your ESXi host to persist between reboots. This file is discussed further in the following section.
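
Returning to the Archive files for a moment: because oem.tgz is a standard gzipped tar archive, you can list its contents from TSM to see which files an OEM customization would overlay. This is a quick sketch and assumes an oem.tgz is actually present in /bootbank:

~ # tar -tzf /bootbank/oem.tgz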

When the ESXi filesystem has been extracted into the RAM disk, the end result is similar to the directory listing of the root folder shown in Figure 11.5. The root of the filesystem and most folders—such as bin, etc, and sbin—are stored in memory. Note that ESXi does mount the disk partitions that correspond to bootbank, altbootbank, scratch, and store. As you browse the filesystem, it will appear similar to what you would experience with ESX. /sbin contains a number of esxcfg executables that can be used to configure and manage the host if you are unable to do so via vSphere client or other vSphere application programming interface (API) client.

Figure 11.5. The VMware ESXi filesystem.

If you run df -h, you get another view of the filesystem showing the disks that ESXi has mounted, as shown in Figure 11.6. Listed first is visorfs, which is the RAM disk that ESXi has created. The four vfat partitions are bootbank, altbootbank, scratch, and store. In this case, the boot disk for ESXi also contains a Virtual Machine File System (VMFS) datastore.

Figure 11.6. The filesystems mounted for a typical ESXi installation.

The command vdf is new to ESXi 4.1. It provides some valuable data about the RAM disk. The listing that follows first shows the tardisks that ESXi has extracted to create the filesystem. These entries correspond to the Archive and State file types in boot.cfg, discussed earlier. The Space value listed represents the extracted size of the tardisk, not the compressed size within /bootbank. The first tardisk listed is using 199MB of memory on the host. If the root filesystem is running low on space, you can use vdf to check whether one of the tardisks is using too much memory. The command also displays information about the mounts that are available on the RAM disk. The following output shows four mounts: MAINSYS, tmp, updatestg, and hoststats. MAINSYS is the root folder, whereas tmp is /tmp. hoststats is used to store real-time performance data on the host, and updatestg is used as storage space for staging patches and updates. These four mounts and tardisk mounts correspond to the 1.3GB size of visorfs, shown in Figure 11.6.

~ # vdf -h
tardisk          Space    Used
SYS1              199M    199M
SYS2               55M     55M
SYS3               12K     12K
SYS4               12K     12K
SYS5                4K      4K
SYS6               42M     42M
SYS7               20K     20K
SYS8               12M     12M
- - - - -
Ramdisk           Size      Used Available Use% Mounted on
MAINSYS            32M      1M      30M     3%- -
tmp               192M      0B     192M     0%- -
updatestg         750M      8K     749M     0%- -
hoststats          53M      1M      51M     3%- -

In the output from vdf, you can observe the size of memory that has been allocated to the mounts MAINSYS, tmp, updatestg, and hoststats. updatestg has been allocated 750MB, but is currently using only 8KB of actual memory. You can also view the resource allocation to the visorfs components and other system processes for ESXi using the System Resource Allocation screen as shown in Figure 11.7.

Figure 11.7. You can view resource allocation for the ESXi RAM disk filesystem with the System Resource Allocation screen.

One last command that is useful to explore is vdu. The following example shows the output of that command for /etc. The command summarizes the source of files within a folder structure.

~ # vdu -hs /etc
For '/etc':
                 tardisk SYS1:   4M       (  221 inodes)
                         heap:  84K       (   43 inodes)
              ramdisk MAINSYS:  60K       (    6 inodes)
                 tardisk SYS6:   6K       (    5 inodes)
                 tardisk SYS8:   4K       (    2 inodes)
                 tardisk SYS2:  10K       (   22 inodes)

As you navigate the ESXi filesystem, similarities to the ESX service console will be evident, but you will note that some of the Linux commands that you may have used, such as nano, are missing. The command interface to the VMkernel is based on BusyBox. BusyBox, a single executable designed for use with a Linux kernel, provides many of the standard tools that you would find in Linux, including cp (copy), kill (kill process), tar, and tail. Given the small size of BusyBox, it is typically used with embedded devices. BusyBox uses the ash shell. If you are monitoring your ESXi host with resxtop, you can observe a single process for BusyBox and one ash process for each Local or Remote TSM session that is in progress.

It is important to note that TSM is intended to be the last method of access to your ESXi host. The first level of management should be via the vSphere API and tools such as the vSphere client, the vCLI, and PowerCLI. The DCUI provides the next level of support; with the DCUI, you can restart the management agents, troubleshoot the management network, and reset the configuration of your ESXi host. The TSM provides the last resort for access, and misconfiguration at this level can have significant consequences on the host. Ideally, TSM access should be made under the guidance of VMware Support. In practice, you may find that you need to perform configuration with TSM, as the vSphere API tools may not include specific capabilities, or problems such as those related to the management network may be difficult to troubleshoot without TSM access. Some of the commands that you may use in TSM are discussed later in the section “Troubleshooting with Tech Support Mode.”

Tip

If you’re developing a complex installation script that employs the %firstboot section, you can use TSM to work through your script in a test environment. The system tardisk that is loaded for the installation process is the same used to boot ESXi normally. The installation merely adds a specific tardisk to include the necessary files for the installation process. The esxcfg and similar commands that you’ll find in TSM also exist when the ESXi installer is booted.

Understanding System Backups and Restores

As discussed in the previous section, ESXi employs a State tardisk to ensure that configuration changes made to the ESXi host persist across a reboot. For ESXi Installable, that tardisk is called state.tgz, whereas for Embedded it is local.tgz. The State tardisk consists of any files in /etc that have been marked with the sticky bit. The initial copy of state.tgz is empty, but files extracted from the other tardisks have the sticky bit enabled and these files are subsequently backed up into state.tgz. For example, the vCenter Agent tardisk vpxa.vgz contains a number of system files that are extracted to /opt/vmware/vpxa, but also the configuration files dasConfig.xml and vpxa.cfg found in /etc/opt/vmware/vpxa. Both files have the sticky bit enabled and are thus backed up into state.tgz. On a subsequent boot of ESXi, these files are extracted from both vpxa.vgz and state.tgz, but as state.tgz is extracted second, those versions of the files overwrite the copies from vpxa.vgz. If you make any changes to files in /etc that do not have the sticky bit enabled, those files are changed only in the RAM disk filesystem and are gone when ESXi is rebooted. The same applies to any files changed, added, or deleted outside of /etc, with the exception of mounts that are made to physical partitions such as /bootbank.
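
To see exactly which files will be captured in the next State backup, you can list the files in /etc that have the sticky bit set. The following sketch uses the BusyBox version of find available in TSM:

~ # find /etc -type f -perm -1000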

To view the contents of state.tgz, you can issue the following commands in TSM or copy the file to a management server for extraction:

~ # cd tmp
/tmp # mkdir state
/tmp # cd state
/tmp/state # cp /bootbank/state.tgz state.tgz
/tmp/state # gzip -d state.tgz
/tmp/state # tar -xvf state.tar
local.tgz
/tmp/state # gzip -d local.tgz
/tmp/state # tar -xvf local.tar

At one minute past every hour, a backup job defined in /var/spool/cron/crontabs/root is executed. The backup job runs the script /sbin/autobackup.sh. This script creates a new state.tgz file and copies it to the appropriate location. The script checks the parameters in both boot.cfg files to see whether a reboot is pending from a patch or update installation. If that is the case, the new state.tgz file is copied to both /bootbank and /altbootbank. autobackup.sh calls another script, /sbin/backup.sh. In part, the function of that script is to make sure that the backups are made in a consistent manner and to ensure file integrity.

Given that there is a period of time between any configuration changes and the scheduled backup, there is a risk of loss of changes should the host experience an unexpected failure. If this occurs, this loss typically affects the registration of virtual machines, as host configuration tends to be more static. The time period between backups originated from the need to minimize write operations to flash devices. Excessive write operations can limit the life of those devices. If you restart or shut down an ESXi host, part of the shutdown process includes updating state.tgz. Configuration changes should not be lost when you restart or shut down a host.
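
If you have just made a change that you want to persist immediately, you can invoke the backup script by hand from TSM rather than waiting for the next scheduled run, using the script name given above:

~ # /sbin/autobackup.sh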

Repairing ESXi and Restoring from Backups

The vSphere API allows you to make a configuration backup of your ESXi host. In Chapter 6 the process was shown with the vCLI command vicfg-cfgbackup. You can also back up your host’s configuration with the PowerCLI cmdlet Set-VMHostFirmware, as shown in the following example:

Get-VMHost esx06.mishchenko.net | Set-VMHostFirmware -BackupConfiguration `
       -DestinationPath c:\backups\esx06

The backup file generated has a file name similar to configBundle-esx06.mishchenko.net.tgz. With either method, the vSphere API merely transfers the state.tgz file from the host to the management server. If you extract the generated backup file, you can view the same files that ESXi bundles into state.tgz to preserve the host's configuration. To restore the configuration backup with PowerCLI, you would issue the following command:

Get-VMHost esx06.mishchenko.net | Set-VMHostFirmware -Restore `
      -SourcePath c:\backups\esx06\configBundle-esx06.mishchenko.net.tgz

The host has to be in maintenance mode before you start the restore and it is rebooted after the restore process has completed.

In some cases, you may not be able to boot your ESXi host due to a corruption of the system partitions, or you may have to revert to a version of ESXi that does not exist in /altbootbank. In such cases, you want to perform a repair installation followed by the restore of the configuration backup. As state.tgz contains the entire configuration of your host, you do not need any further configuration files to restore your ESXi host. To repair your ESXi installation, use the following process:

  1. Insert the ESXi 4.1 installation CD into the host’s CD-ROM drive.

  2. Restart the host and select to boot from the CD.

  3. At the installation Welcome screen, shown in Figure 11.8, press R to begin the repair process.

    Figure 11.8. The ESXi Repair option is available on the Welcome screen during an ESXi installation.

  4. Read the VMware end-user license agreement and press F11 to continue.

  5. Select the disk that contains the original installation of ESXi and press Enter.

    Caution

    The ESXi repair process preserves any VMFS datastores that exist on the installation disk as long as those partitions do not exist within the first 900MB of storage. If you do not select the original installation disk, then a new system image is installed and any existing partitions on that disk will be lost.

  6. If the disk contains an existing partition, you are prompted to confirm your choice of disk. Press Enter to confirm your choice.

  7. Press F11 to begin the repair process.

  8. When the repair process has completed, eject the installation CD and press Enter to reboot.

  9. After the host has rebooted, make a note of the IP address assigned to the host. You may need to assign an IP address manually if you do not have a Dynamic Host Configuration Protocol (DHCP) server for that network subnet.

  10. Restore the host configuration file with the following commands:

    Get-VMHost esx06.mishchenko.net | Set-VMHost -State "Maintenance"
    Get-VMHost esx06.mishchenko.net | Set-VMHostFirmware -Restore `
        -SourcePath c:\backups\esx06\configBundle-esx06.mishchenko.net.tgz

  11. Start the vSphere client and connect to your vCenter Server. The host will have a state of Not Responding, as the vCenter Agent software does not exist on the host at this point.

  12. Right-click on the host and select Connect. This initiates the installation of the vCenter Agent on the host. After the process is complete, the host should show a normal status and any registered virtual machines should no longer have a status of Orphaned.

Troubleshooting with Tech Support Mode

If you're connecting via Remote TSM, you need a client that supports SSH, and if you want to transfer files, your client should support Secure Copy (SCP). Most Linux distributions include a client with these capabilities. For a Windows computer, you can download PuTTY from http://www.putty.org/ to use for your SSH sessions and WinSCP from http://winscp.net/ to transfer and edit files. ESXi includes only vi for editing files, so it is worthwhile to get a client that includes an easy-to-use file editor.

If you start in /bin, you find a number of useful utilities that you can use to manage files and processes within TSM. If you run ls -l, you will note that many of the program names are symbolic links to other programs. kill, gzip, and tail link back to the busybox executable. BusyBox also includes wget, which you can use to download files from a Web server. This command can prove useful to download patch files directly to your host. If you use wget, make sure that you're in a location, such as /scratch or a datastore, with sufficient space to store your downloads.
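
For example, to pull a patch bundle from an internal Web server directly to a datastore, you might run something similar to the following from TSM. The server name, path, and bundle name are placeholders:

/vmfs/volumes/datastore1 # wget http://fileserver.mishchenko.net/patches/ESXi410-201011001.zip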

scp, which links to /sbin/dropbearmulti, is a basic SCP client that you can use to transfer files to and from an SCP-capable server. dropbearmulti is also used to provide SSH connectivity to the Remote TSM. Both ping and ping6 link to /sbin/vmkping. If you’re troubleshooting vMotion network issues, you can use either of the commands to check for network connectivity. There is no separate networking stack as there is with the ESX Service Console, so any network traffic is sent by the VMkernel and thus both ping and vmkping work the same way on ESXi.
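
For example, to verify connectivity to the vMotion interface of another host, you could run the following from TSM; the address shown is only an example:

~ # vmkping 192.168.20.12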

There have been a number of cautions about working with TSM. You should further note that not all commands operate in a consistent manner. For the most part, if you don't know the command options, you can simply issue the command without any options to see a list of available options. With /sbin/reboot, which links to BusyBox, issuing the command without any options will restart your host. To see the options, you have to run reboot --help. Likewise, if you execute /sbin/techsupport.sh, your screen clears, you see the message Tech Support Mode Has Been Disabled by the Administrator, and your TSM session is effectively over.

Given that ESXi can be hosted in a virtual machine even on ESXi, there’s no significant reason not to set up a test host to use for exploring TSM. Instructions for running ESXi in a virtual machine on VMware Workstation were provided in Chapter 4, “Installation Options.” A similar process can be used to host an ESXi virtual machine within your current vSphere environment. Hosting your training environment for ESXi in this manner will provide a method to learn and explore ESXi with no impact on your production environment. With a change to the virtual machine configuration file, documented at the following URL http://communities.vmware.com/docs/DOC-8970, it is possible to run a nested virtual machine on your virtual ESXi host.

This caution isn’t intended to totally dissuade use of TSM. TSM is supported for troubleshooting, remediation, and in some cases configuration purposes. The official support policy for TSM can be found at http://kb.vmware.com/kb/1017910.

There are a few other commands in /bin that are noteworthy. Both vdf and vdu link to /sbin/vmkvsitools. The vsi of vmkvsitools stands for VMware Sysinfo Interface. As shown in the following output, a number of commands link to vmkvsitools. This command can be used to provide a wealth of information about the host's hardware, running processes, and memory. As shown in the Knowledge Base article at http://kb.vmware.com/kb/1024632, the command may be used to gain information that you would have obtained from /proc on an ESX host.

/bin # vmkvsitools
Usage: 'vmkvsitools [-c/--cache vsicache] cmd args' where cmd is one of
    amldump, bootOption, hwclock, hwinfo, lsof, lspci, pci-info, pidof,
    ps, vdf, vdu, vmksystemswap, vmware
A symlink to vmkvsitools with an above command name can also be used.

Lastly in /bin is vim-cmd, which can be used to control and configure a wide range of aspects of ESXi. vim-cmd links to /sbin/hostd. When you execute the command without options, the following is displayed:

~ # vim-cmd
Commands available under /:
hostsvc/        proxysvc/      supportsvc_cmds/       vmsvc/
internalsvc/    solo/          vimsvc/                help

You can add the commands listed to vim-cmd to see additional commands. vmsvc/ deals with the management of virtual machines. That command displays the following options:

~ # vim-cmd vmsvc
Commands available under vmsvc/:
acquiremksticket         get.configoption          power.on
acquireticket            get.datastores            power.reboot
connect                  get.disabledmethods       power.reset
convert.toTemplate       get.environment           power.shutdown
convert.toVm             get.filelayout            power.suspend
createdummyvm            get.guest                 power.suspendResume
destroy                  get.guestheartbeatStatus  queryftcompat
device.connection        get.managedentitystatus   reload
device.connusbdev        get.networks              setscreenres
device.disconnusbdev     get.runtime               snapshot.create
device.diskadd           get.snapshotinfo          snapshot.dumpoption
device.diskaddexisting   get.summary               snapshot.get
device.diskremove        get.tasklist              snapshot.remove
device.getdevices        getallvms                 snapshot.removeall
device.toolsSyncSet      gethostconstraints        snapshot.revert
device.vmiadd            login                     snapshot.setoption
device.vmiremove         logout                    tools.cancelinstall
devices.createnic        message                   tools.install
get.capability           power.getstate            tools.upgrade
get.config               power.hibernate           unregister
get.config.cpuidmask     power.off                 upgrade

As you can see, vim-cmd provides the access to manage all aspects of your virtual machines. The options to manage and configure your hosts are equally numerous. Although this is not the tool to use for day-to-day management, you can leverage vim-cmd in your scripted installs to configure the networking and storage aspects of your hosts.
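
As a simple illustration of the vmsvc/ commands listed above, the following TSM session finds a virtual machine's ID with getallvms and then powers that virtual machine on. The ID of 16 is an example taken from the getallvms output on your host:

~ # vim-cmd vmsvc/getallvms
~ # vim-cmd vmsvc/power.on 16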

Within /etc, you find the configuration files for your host. Because the system state backup contains files only from /etc, any configuration changes that you make outside of this folder are temporary. Changes to your host’s configuration files within this folder are permanent if the file has the sticky bit enabled. Some of the configuration files within this folder structure are listed in Table 11.1. As the permissions on those files allow you to overwrite the files with vifs, you could also edit these files in TSM. In either case, you would want to ensure that you have made a system backup first. If you use either TSM or vifs to update these configuration files, changes are not verified by the vSphere API and a misconfiguration could render the host unusable.
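
As a sketch of the vifs approach, the following vCLI commands list the host configuration files that vifs exposes and download a copy of one before you make any changes. The host name and file name are examples; run the --dir command first to see which files are actually available on your host:

vifs --server esx06.mishchenko.net --dir /host
vifs --server esx06.mishchenko.net --get /host/syslog.conf syslog.conf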

In the /opt folder structure, you'll find the binaries for any agents you install on the host, such as for the vCenter Agent or for High Availability. Configuration files for these agents are found in /etc. If you change the IP address of your vCenter Server host, you can update the file /etc/opt/vmware/vpxa/vpxa.cfg on your hosts and then issue the command /sbin/services.sh restart to restart the management services on your host. If you need to uninstall any agents from your host manually, you may find an uninstall script in /opt/vmware/uninstallers.

Within /var/log, you'll find the log files for your host. /var/log/messages is the main VMkernel log file. It includes events from the VMkernel, from any agents running on the host, from the hostd daemon, and from commands issued in TSM. ESXi maintains a rotation of 10 copies of the messages file and rotates the files when the current file reaches 1MB. The prior copies are compressed to save space within the ESXi RAM disk. If the ESXi installation created a scratch partition on disk, this log file is mirrored to /scratch/log/ and the log files in that folder persist across a reboot. However, if you require long-term storage of your host's log files, you should enable the syslog service or use vi-logger from the vSphere Management Assistant (vMA) to capture this log file.
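
If you decide to forward the logs to a syslog server, the vCLI command vicfg-syslog can configure the remote target without a trip into TSM. The following is a sketch; the host name, syslog server address, and port are examples:

vicfg-syslog --server esx06.mishchenko.net --setserver 192.168.1.50 --setport 514
vicfg-syslog --server esx06.mishchenko.net --show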

Also in /var/log is sysboot.log. This file captures the boot process from the time the VMkernel initializes to the completion of the boot process. This file is useful to troubleshoot any problems you experience when your host boots up. In /var/log/vmware, you'll find the hostd log files. Also within subfolders are the logs for agents such as the vCenter Agent service.

There are numerous other folders that you can examine and that provide valuable insight into the inner workings of ESXi. The last folder that this section examines is /sbin. Within /sbin, you’ll find the commands that will be the most useful to your troubleshooting sessions with TSM. To begin with, a number of esxcfg binaries are included. These closely match the functionality you would find with the vCLI or in the Service Console for ESX. A version of esxcli is also included that tends to be more feature-rich than the version found in the vCLI. For this reason, some Knowledge Base articles will direct you to TSM to perform advanced configuration changes. esxupdate is found in /sbin and can be used to manage patches and updates on the host.

If you’re having problems with performance, you can use esxtop. This functions the same as resxtop from the vCLI, but also adds replay mode. With replay mode, you can record and then replay esxtop statistics for a specific period of time. This can be helpful if you need to send the performance data to VMware Support. To record data, you can use the command vm-support -S -i 5 -d 120. The -i parameter sets the query interval, and -d sets the duration of the capture. This generates a support bundle within /var/tmp that contains the performance data and the log files from the host. The output from vm-support includes instructions to extract the file, as shown in this example:

To see the files collected, run: tar -tzf
   '/var/tmp/esx-2010-10-02--02.02.13217.tgz'

After you have extracted the tar file, you can issue the following command to replay the data that was captured:

esxtop -R vm-support-esx-2010-10-02--02.02.13217

The vm-support command can also be used to generate support bundles. While the log files messages, hostd.log, and vpxa.log are accessible via a number of methods, the log files for High Availability and other services are not accessible through the vSphere API. All log files for ESXi are available within the support bundle, as well as core dumps, configuration files, and log files for virtual machines. You can generate a support bundle with the vSphere client by selecting File > Export > Export System Logs to display the screen shown in Figure 11.9. Both methods create a support bundle within /var/tmp. With vm-support, you then need to copy that bundle to another computer. If you need to provide VMware Support with a core dump file, such a file can be found in /var/core.

Figure 11.9. Generating a support bundle with the vSphere client.
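
To copy a vm-support bundle off the host, you can use the scp client discussed earlier. The following sketch copies the bundle generated above to a management workstation; the destination address and path are examples:

~ # scp /var/tmp/esx-2010-10-02--02.02.13217.tgz admin@192.168.1.225:/tmp/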

Tip

Generally, the ESXi RAM disk mounts should not exceed 90% usage and thus should not run out of space. However, a software bug in a driver, for example, could cause one of the mounts to fill, in which case an event similar to the following would be generated. If disk space is a concern on your host, you can set up your syslog receiver to monitor for similar events.

Oct 1 01:22:08 vmkernel: 0:00:55:09.839 cpu0:4149)BC: 3837: Failed to flush 52
  buffers of size 8192 each for object '1.z' 1 4 1167 0 0 0 0 0 0 0 0 0 0 0:
  No free memory for file data

Conclusion

From the Spider-Man movies comes the quote, “With great power comes great responsibility.” This applies not only to your “spidey senses,” but also to your use of ESXi's TSM. When you access TSM, you have complete and unrestricted access to the VMkernel. The commands you issue through TSM are not necessarily checked for problems or mistakes the way the same operations would be if they were performed through the vSphere API. Properly used, TSM is a great tool and it can certainly save the day. However, improperly used, it can have significant negative consequences.

Getting to know TSM means spending time in it, exploring the various aspects of the ESXi filesystem and the commands it contains. Given the ease of setting up a virtual ESXi host, that is the best way to get into TSM without needing to worry about any mistakes. For production use, TSM access should be made only under appropriate circumstances and should be audited.
