14 Troubleshooting

Chapter 14 Troubleshooting

CERTIFICATION OBJECTIVES

14.01 Troubleshooting Tools

14.02 Documentation and Analysis

14.03 Troubleshooting Methodology

Two-Minute Drill

Q&A Self Test

Service and maintenance availability must be a priority when choosing a cloud provider. Having the ability to test and troubleshoot the cloud environment is a critical step in providing the service availability an organization requires. This chapter introduces you to troubleshooting tools, discusses documentation and its importance to company and cloud operations, and presents a troubleshooting methodology with various sample scenarios and issues that you might face in your career and on the CompTIA Cloud+ exam.

CERTIFICATION OBJECTIVE 14.01

Troubleshooting Tools

An organization needs to be able to troubleshoot the cloud environment when there are issues or connectivity problems. A variety of tools are available to troubleshoot the cloud environment. Understanding how to use those tools makes it easier for a company to maintain its service level agreements. This section explains the common usage of those tools.

There are many tools to choose from when troubleshooting a cloud environment. Sometimes a single tool is all that is required to troubleshoot the issue; other times a combination of tools might be needed. Knowing when to use a particular tool makes the troubleshooting process easier and faster. As with anything, the more you use a particular troubleshooting tool, the more familiar you become with the tool and its capabilities and limitations.

Connectivity Tools

Connectivity tools are used to verify whether devices can talk to one another on a network. These include ping, traceroute, and nslookup. Ping verifies that a node is reachable on the network, traceroute displays the hops between source and destination, and nslookup performs DNS queries to resolve names to IP addresses.

Ping

One of the most common troubleshooting tools is the ping utility. Ping is used to test whether a host on an IP network is reachable. Ping sends an Internet Control Message Protocol (ICMP) echo request packet to a specified IP address or host and waits for an ICMP echo reply.

Ping can also be used to measure the round-trip time for messages sent from the originating workstation to the destination and to record packet loss. Ping generates a summary of the information it has gathered, including packets sent, packets received and lost, and the amount of time taken to receive the responses. Starting with Microsoft Windows XP Service Pack 2, Windows Firewall is enabled by default and blocks incoming ICMP echo requests, so a Windows host that does not answer a ping is not necessarily offline. Figure 14-1 shows an example of the output received when you use the ping utility to ping comptia.org.

Ping allows an administrator to test the availability of a single host.
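
The summary that ping prints can be reproduced with a few lines of arithmetic. The following is a minimal sketch in Python, not the ping utility itself; the function name summarize_ping and the sample round-trip times are hypothetical and chosen only for illustration.

```python
from statistics import mean


def summarize_ping(sent, rtts_ms):
    """Summarize a ping run: packets sent, received, lost, and RTT stats.

    rtts_ms holds one round-trip time (in milliseconds) per reply received.
    """
    received = len(rtts_ms)
    lost = sent - received
    summary = {
        "sent": sent,
        "received": received,
        "lost": lost,
        "loss_percent": 100.0 * lost / sent if sent else 0.0,
    }
    if rtts_ms:
        # Round-trip statistics only exist when at least one reply came back.
        summary.update(min_ms=min(rtts_ms), max_ms=max(rtts_ms), avg_ms=mean(rtts_ms))
    return summary


# Hypothetical run: four echo requests sent, three replies received.
print(summarize_ping(4, [23.0, 25.0, 24.0]))
```

A loss percentage above zero, or round-trip times that swing widely between replies, is usually the cue to move on to traceroute and look at the individual hops.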

FIGURE 14-1 Screenshot of ping data

Traceroute

Traceroute is a troubleshooting tool that is used to determine the path that an IP packet has to take to reach a destination. Unlike the ping utility, traceroute displays the path and measures the transit delays of packets across the network to reach a target host.

The command in Microsoft Windows is written as tracert. Issuing the tracert command followed by an FQDN or IP address will print the list of hops from source to destination. Some switches that can be used with tracert include the following:

–d Disables hostname resolution

–h Specifies the maximum number of hops to trace

–j Specifies loose source routing along a list of hosts that the packets should pass through on the way to the destination (IPv4 only)

–w Specifies the timeout to use for each reply

Traceroute sends packets with gradually increasing time-to-live (TTL) values, starting with a TTL value of 1. The first router receives the packet, decrements the TTL value, and drops the packet because the TTL is now zero. The router then sends an ICMP “time exceeded” message back to the source. The next set of packets is given a TTL value of 2, which means the first router forwards the packets and the second router drops them and replies with its own ICMP “time exceeded” message. Traceroute uses the source IP address of each returned “time exceeded” message to build a list of routers along the path. The process repeats until the packets reach the destination device, which returns an ICMP echo reply.
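
The TTL mechanism is easier to follow when it is simulated. The sketch below is an illustration only: it sends no packets, the function name simulate_traceroute is hypothetical, and the addresses are documentation examples rather than real routers.

```python
def simulate_traceroute(path, max_hops=30):
    """Simulate TTL-based path discovery over a hypothetical list of hops.

    path lists the routers in order, ending with the destination host.
    Returns the (ttl, responder, reply_type) tuples a tracer would record.
    """
    results = []
    for ttl in range(1, max_hops + 1):
        remaining = ttl
        for index, node in enumerate(path):
            is_destination = index == len(path) - 1
            if is_destination:
                # The probe arrived with TTL to spare, so the target answers.
                results.append((ttl, node, "echo reply"))
                return results
            remaining -= 1  # each router decrements the TTL
            if remaining == 0:
                # TTL expired here: the router drops the probe and reports back.
                results.append((ttl, node, "time exceeded"))
                break
    return results


# Two hypothetical routers in front of a hypothetical destination.
for hop in simulate_traceroute(["10.0.0.1", "203.0.113.1", "198.51.100.7"]):
    print(hop)
```

Each pass through the outer loop corresponds to one line of traceroute output. The max_hops argument plays the same role as the –h switch.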

Most modern operating systems support some form of the traceroute tool: as mentioned, on a Microsoft Windows operating system it is named tracert; on Linux it is named traceroute; for Internet Protocol version 6 (IPv6), the tool is called traceroute6. Figure 14-2 displays an example of the tracert command being used to trace the path to comptia.org.

FIGURE 14-2 Screenshot of data using the tracert command

Nslookup and Dig

Another tool that can be used to troubleshoot network connection issues is the nslookup command. With nslookup, it is possible to obtain domain name or IP address mappings for a specified DNS record. Nslookup uses the computer’s configured DNS server to perform the queries. Using the nslookup command requires at least one valid DNS server, which can be verified by using the ipconfig /all command.

The domain information groper (dig) command can also be used to query DNS name servers on Linux-based systems. It can be run from the command line for a single query or in batch mode to read a list of queries from a file. The host utility can also be used to perform DNS lookups. Figure 14-3 shows an example of the output using nslookup to query comptia.org.
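
The same kinds of lookups can be scripted. The sketch below uses Python's standard library as a stand-in for nslookup, under two stated assumptions: resolve_addresses relies on the operating system's resolver (which may consult the hosts file before DNS), and reverse_lookup_name only builds the PTR record name a reverse query would ask for, without sending a query. Both function names are hypothetical.

```python
import ipaddress
import socket


def resolve_addresses(name):
    """Return the sorted, de-duplicated IP addresses the resolver gives for name."""
    infos = socket.getaddrinfo(name, None)
    # Each entry ends with a socket address tuple whose first item is the IP.
    return sorted({info[4][0] for info in infos})


def reverse_lookup_name(ip):
    """Build the PTR record name that a reverse DNS query for ip would ask for."""
    return ipaddress.ip_address(ip).reverse_pointer


# 192.0.2.10 is a documentation address used purely as an example.
print(reverse_lookup_name("192.0.2.10"))  # 10.2.0.192.in-addr.arpa
```

Calling resolve_addresses("comptia.org") on a machine with working DNS should return the same addresses that nslookup reports, which makes it a quick way to confirm that name resolution works from inside a script.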

FIGURE 14-3 Screenshot of nslookup addresses

Configuration Tools

Configuration tools are used to modify the configuration of network settings such as the IP address, DHCP, DNS, gateway, or routing settings. Three important configuration tools you should know are ifconfig, ipconfig, and route.

Ifconfig

Ifconfig is a Linux command used to configure the TCP/IP network interface from the command line, which allows for setting the interface’s IP address and netmask or even disabling the interface. Ifconfig displays the current TCP/IP network configuration settings for a network interface.

Figure 14-4 shows the ifconfig command standard output, which contains information on the network interfaces on the system. The system this command was executed on has an Ethernet adapter called enp2s0 and a wireless adapter called wlp3s0. The item labeled “lo” is the loopback interface. The loopback interface is used to test networking functions and does not rely on physical hardware.

FIGURE 14-4 Screenshot of interfaces using ifconfig

Ifconfig lacks some of the command-line switches that ipconfig has. Those switches go beyond displaying TCP/IP configuration information and let you perform more advanced tasks, like clearing the DNS cache and obtaining a new IP address from DHCP.

Ipconfig

Ipconfig is a Microsoft Windows command used to configure a network interface from the command line. Ipconfig can display the network interface configuration, release or renew IP version 4 and 6 addresses from DHCP, flush the cache of DNS queries, display DNS queries, register a DHCP address in DNS, and display class IDs for IP versions 4 and 6. Figure 14-5 shows the command-line switch options available with the ipconfig command.

FIGURE 14-5 Screenshot of ipconfig options

Route

The route command can be used to view and manipulate the TCP/IP routing table on Windows operating systems and on earlier versions of Linux. The routes displayed show how to get from one network to another. A computer connects to another over a series of devices, and each step from source to destination is called a hop. The route command can display the routing table so that you can troubleshoot connectivity issues between devices or configure routing on a device that is serving that function.

Modifying a route means modifying a routing table. A routing table is a data table, held in memory, on a system that connects two networks together. The system uses it to determine where to send the network packets it is responsible for routing. The table contains information about the network topology adjacent to the router that hosts it.
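
How a system picks a route from its table can be sketched with longest-prefix matching, the rule by which the most specific matching network wins. The routing table below is hypothetical, and real systems also weigh metrics and interface state, so treat this as an illustration of the lookup only.

```python
import ipaddress

# Hypothetical routing table: (destination network, next hop).
ROUTES = [
    ("0.0.0.0/0", "192.168.1.1"),      # default route
    ("10.0.0.0/8", "192.168.1.254"),
    ("10.20.0.0/16", "192.168.1.253"),
]


def next_hop(destination, routes=ROUTES):
    """Pick the next hop for destination using longest-prefix match."""
    address = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(network), hop)
        for network, hop in routes
        if address in ipaddress.ip_network(network)
    ]
    if not matches:
        return None  # no route to host
    # The most specific route is the one with the longest prefix.
    best_network, hop = max(matches, key=lambda item: item[0].prefixlen)
    return hop


print(next_hop("10.20.5.9"))  # matches all three routes; the /16 wins
```

When a packet leaves through an unexpected gateway, comparing the destination against each entry this way often shows that a more specific route is overriding the one you expected.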

When using earlier versions of Linux, the route command and the ifconfig command can be used together to connect a computer to a network and define the routes between the networks; later versions of Linux have replaced the ifconfig and route commands with the ip command from the iproute2 package, which also provides functionality such as traffic shaping. Figure 14-6 shows the route command using the print switch to display the current IP versions 4 and 6 routing tables.

FIGURE 14-6 Screenshot of the route command displaying current routing tables

Query Tools

Query tools are used to view the status of network services. The two commands you should be familiar with are netstat and arp. Netstat displays network connections, routing tables, and network protocol statistics. Arp displays the MAC addresses that a computer or network devices know about.

Netstat

If you want to display all active network connections, routing tables, and network protocol statistics, you can use the netstat command. Available in most operating systems, the netstat command can be used to detect problems with the network and determine how much network traffic there is. It can also display protocol and Ethernet statistics and all the currently active TCP/IP network connections. Figure 14-7 shows the options available with the netstat command.

FIGURE 14-7 Screenshot of active connections using netstat

Recently while troubleshooting a network connection, we were having issues determining what DNS mapping an IP address had. We used the nslookup tool and entered the IP address that we were trying to map to a DNS name; nslookup returned the result of the DNS registration for the particular IP address.

Arp Command

Another helpful troubleshooting tool is the arp command. The arp command uses the Address Resolution Protocol (ARP) to resolve an IP address to a physical address, also known as a media access control (MAC) address. The arp command makes it possible to display the current entries in the ARP table and to add a static entry. Figure 14-8 uses the arp –a command to view the ARP cache of a computer.

FIGURE 14-8 Screenshot of the ARP cache showing both the Internet address and the physical address

Remote Administration Tools

Remote administration tools allow connectivity to systems or network devices. The two tools you should know about for troubleshooting are used to connect to network devices such as switches and routers. They include Telnet and Secure Shell (SSH).

Telnet

If a user wants to connect their computer to another computer or server running the Telnet service over the network, they can enter commands via the Telnet program, and the commands are executed as if they were being entered directly on the server console. Telnet enables the user to control a server and communicate with other servers over the network.

Telnet and SSH both allow an administrator to connect to a server remotely, the primary difference being that SSH offers security mechanisms to protect against malicious intent.

A valid username and password are required to activate a Telnet session; nonetheless, Telnet has security risks when it is used over any network because credentials and data are exchanged in plaintext. Figure 14-9 shows an example of a Telnet session established with a remote server.

FIGURE 14-9 Screenshot of a Telnet session

SSH

SSH is another protocol that enables the user to securely control a server and communicate with other servers over the network. Secure Shell and its most recent version Secure Shell version 2 (SSHv2) have become a more popular option for providing a secure remote command-line interface than Telnet because they encrypt credentials and data.

Figure 14-10 shows an example of an SSH session established with a remote server 192.168.254.254. In this screenshot, a connection has been established and the remote server is asking for a username to log in. After a username is provided, the system will ask for a password.

FIGURE 14-10 Screenshot of an SSH session

CERTIFICATION OBJECTIVE 14.02

Documentation and Analysis

Being able to use the proper tools is a good start when troubleshooting cloud computing issues. Creating and maintaining accurate documentation makes the troubleshooting process quicker and easier. It is important for the cloud administrator to document every aspect of the cloud environment, including its setup and configuration and which applications are running on which host computer or virtual machine. Also, the cloud administrator should assign responsibility for each application and its server platform to a specific support person who can respond quickly if an issue should arise that impacts the application.

When issues come up, cloud professionals need to know where to look to find the data they need to solve the problem. The primary place they look is in log files. Operating systems, services, and applications create log files that track certain events as they occur on the computer. Log files can store a variety of information, including device changes, device driver loading and unloading, system changes, events, and much more.

EXAM AT WORK

A Real-World Look at Documentation

We were recently tasked with creating documentation for an application that was going to be monitored in a distributed application diagram within Microsoft SharePoint. To have a satisfactory diagram to display inside of Microsoft SharePoint for the entire organization to view, we needed to collect as much information as possible. The group wanted to monitor the application from end to end, so we needed to know which server the application used for the web server, which server it used for the database server, which network devices and switches the servers connected to, the location of the end users who used the application, and so on.

The information-gathering process took us from the developer who created the application to the database administrator who could explain the back-end infrastructure to the server administrator and then the network administrator and so on. As you can see, to truly document and monitor an application, you need to talk to everyone who is involved in keeping that application operational.

From our documentation, the organization now has a clear picture of exactly what systems are involved with keeping that application operational and functioning at peak performance. It makes it easier to troubleshoot and monitor the application and set performance metrics. It also allows for a true diagram of the application with true alerting and reporting of any disruptions. As new administrators join the organization, they can use the documentation to understand better how the application and the environment work together and which systems support each other.

Documentation

Documentation needs to be clear and easy to understand for anyone who may need to use it and should be regularly reviewed to ensure that it is up to date and accurate. Documenting the person responsible for creating and maintaining the application and where it is hosted is a good process that saves valuable time when troubleshooting any potential issues with the cloud environment.

In addition to documenting the person responsible for the application and hosting computer, an organization also needs to record device configurations. This provides a quick and easy way to recover a device in the case of failure. With the documented configuration in hand, the company can swap in a replacement device and reproduce the failed device's configuration quickly.

When documenting device configuration, it is imperative that the document is updated every time a significant change is made to that device. Otherwise, coworkers, auditors, or other employees might operate off out-of-date information. For example, let’s say you are asked to make changes to a firewall that has been in place and running for the last three years. After making the required changes, you then update or re-create the documentation so that there is a current document listing all the device settings and configurations for that firewall. This makes it easier to manage the device if there are problems later on, and it gives you a hard copy of the settings that can be stored and used for future changes.

Also, the firewall administrator would likely rely on your documented configuration to design new configuration changes. If you failed to update the documentation after making a change, the firewall administrator would be operating off old information and wouldn’t factor in the changes that you made to the configuration.

Configuration management tools are available that can automatically log changes to rule sets. These, along with orchestration tools and runbooks, can be used to update documentation programmatically following an approved change. See more on orchestration and automation in Chapter 13.

Log Files

Log files are extremely important in troubleshooting problems. Operating systems, services, and applications create log files that track certain events as they occur on the computer. Log files can store a variety of information, including device changes, device drivers, system changes, events, and much more.

Log files allow for closer examination of events that have occurred on the system over a more extended period. Some logs keep information for months at a time, allowing a cloud administrator to go back and see when an issue started and if any issues seem to coincide with a software installation or a hardware configuration change.

Figure 14-11 shows the event viewer application for a Microsoft Windows system. The event viewer application in this screenshot is displaying the application log and an error is highlighted for the Microsoft Photos application that failed due to a problem with a .NET module.

FIGURE 14-11 Screenshot of an error in the application event log

There are a variety of software applications that can be used to gather the system logs from a group of machines and send those records to a central administration console, making it possible for the administrator to view the logs of multiple servers from a single console.

Logs can take up a lot of space on servers, but you will want to keep logs around for a long time in case they are needed to investigate a problem or a security issue. Logs are set by default to grow to a specific size and then roll over. Rolling over overwrites the oldest log entries with new ones. This can cause considerable problems when you need to research what has been happening to a server because rollover can cause valuable log data to be overwritten.
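
The effect of rollover is easy to demonstrate. The sketch below uses Python's RotatingFileHandler with deliberately tiny limits, which are assumptions chosen so that the loss shows up after only a few dozen events. This handler rotates whole files and discards the oldest backup instead of overwriting individual entries, but the outcome is the same: the earliest events are gone. The function name demo_rollover and the logger name are hypothetical.

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler


def demo_rollover(event_count=50, max_bytes=100, backup_count=2):
    """Write events through a size-limited log and report what survives."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "app.log")
        handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backup_count)
        handler.setFormatter(logging.Formatter("%(message)s"))
        logger = logging.getLogger("rollover_demo")
        logger.setLevel(logging.INFO)
        logger.propagate = False
        logger.addHandler(handler)
        try:
            for number in range(event_count):
                logger.info("event %03d", number)
        finally:
            logger.removeHandler(handler)
            handler.close()
        # Collect whatever is still on disk before the folder is cleaned up.
        files = sorted(os.listdir(folder))
        retained = []
        for name in files:
            with open(os.path.join(folder, name)) as log_file:
                retained.extend(log_file.read().splitlines())
    return files, retained


files, retained = demo_rollover()
print(files)
print(len(retained), "of 50 events retained")
```

Only the current log and two backups remain, and the first events written are no longer in any of them. Sizing the limits for the retention period you need, or shipping logs elsewhere before they roll over, avoids that gap.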

For this reason, you might want to archive logs to a cloud logging and archiving service such as Logentries, OpenStack, Sumo Logic, syslog-ng, Amazon S3, Amazon CloudWatch, or Papertrail. These services will allow you to free up space on your cloud servers, while still retaining access to the log files when needed. Some of these services also allow for event correlation, or they can tie into event correlation services.

Event correlation can combine authentication requests from domain controllers, incoming packets from perimeter devices, and data requests from application servers to gain a more complete picture of what is going on. Software exists to automate much of the process. Such software, called security information and event management (SIEM), archives logs and reviews the logs in real time against correlation rules to identify possible threats or problems. See Chapter 10 for a review on SIEM.
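
A correlation rule can be as simple as matching events from two sources on a shared attribute within a time window. The sketch below is a hypothetical rule, not how any particular SIEM product works: it pairs a failed login seen by a domain controller with data requests from the same source IP address during the following five minutes. The function name, the tuple layout, and the window length are all assumptions.

```python
from datetime import datetime, timedelta


def correlate(failed_logins, data_requests, window=timedelta(minutes=5)):
    """Pair each failed login with data requests from the same source IP
    that occur within the window after it.

    Both inputs are lists of (timestamp, source_ip) tuples.
    """
    alerts = []
    for login_time, login_ip in failed_logins:
        for request_time, request_ip in data_requests:
            delay = request_time - login_time
            if login_ip == request_ip and timedelta(0) <= delay <= window:
                alerts.append((login_ip, login_time, request_time))
    return alerts


# Hypothetical events using documentation IP addresses.
start = datetime(2024, 1, 1, 12, 0, 0)
logins = [(start, "203.0.113.5")]
requests = [
    (start + timedelta(minutes=2), "203.0.113.5"),   # same IP, inside the window
    (start + timedelta(minutes=30), "203.0.113.5"),  # same IP, too late
]
print(correlate(logins, requests))
```

Neither log is suspicious on its own. It is the combination, lined up on a common timeline, that produces the alert, which is why accurate and synchronized timestamps matter so much for centralized logging.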

Network Device and IoT Logs

Log files are not restricted to servers. Network devices, even IoT devices, have log files that can contain useful information in troubleshooting issues. These devices usually have a smaller set of logs. These logs may have settings for how much data is logged.

If the standard log settings do not seem to provide enough information when troubleshooting an issue, you can enable verbose logging. Verbose logging records more detailed information than standard logging but is recommended only for troubleshooting a particular problem since it tends to fill up the limited space on network and IoT devices. To conserve space and prevent essential events from being overwritten, verbose logging should be disabled after the issue is resolved so that it does not impact the performance of the application or the computer.
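
One way to make sure verbose logging is switched off again is to scope it to the troubleshooting session. The sketch below shows the idea with Python's logging module, where DEBUG stands in for a device's verbose setting; the context manager name verbose_logging and the logger name are hypothetical.

```python
import logging
from contextlib import contextmanager


@contextmanager
def verbose_logging(logger):
    """Temporarily lower the logger's level to DEBUG, then restore it."""
    previous_level = logger.level
    logger.setLevel(logging.DEBUG)
    try:
        yield logger
    finally:
        # Restore the original level even if troubleshooting raised an error.
        logger.setLevel(previous_level)


log = logging.getLogger("device_demo")
log.setLevel(logging.WARNING)

with verbose_logging(log):
    log.debug("detailed state captured while reproducing the issue")

# Outside the block, debug messages are suppressed again.
print(log.isEnabledFor(logging.DEBUG))
```

On a network or IoT device the equivalent is a documented step in the runbook: record the original logging level before raising it, and restore it as part of closing the ticket.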

Syslog

Network devices can generate events in different formats, and they often have very little space to store logs. Most devices only have space for system and configuration data. This means that if you want to store the logs somewhere, you will need to use syslog. The syslog protocol is supported by a wide range of devices and can log different types of events. A syslog server receives and collects the messages sent by various devices. Each device is configured with the location of the syslog collector, and each sends its logs to that server for collection. Syslog data can be analyzed in place, or it can be archived to a cloud logging and archiving service, just like other logs. For more information on syslog, see Chapter 7.
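
Applications can forward events to a syslog collector in the same way devices do. The sketch below uses Python's SysLogHandler over UDP; the function name send_to_collector and the collector address in the usage comment are hypothetical, and a production setup would more likely attach the handler once at startup than per message.

```python
import logging
import logging.handlers
import socket


def send_to_collector(host, port, message):
    """Send one WARNING-level message to a syslog collector over UDP."""
    handler = logging.handlers.SysLogHandler(
        address=(host, port),
        socktype=socket.SOCK_DGRAM,  # UDP, the traditional syslog transport
    )
    logger = logging.getLogger("syslog_demo")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.addHandler(handler)
    try:
        logger.warning(message)
    finally:
        logger.removeHandler(handler)
        handler.close()


# Example call, assuming a collector listens at this hypothetical address:
# send_to_collector("192.0.2.50", 514, "link down on port 3")
```

Each message is prefixed with a priority value that encodes the facility and severity, which is what lets the collector filter and route events from many different devices.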

CERTIFICATION OBJECTIVE 14.03

Troubleshooting Methodology

CompTIA has established a troubleshooting methodology consisting of six steps:

Step 1: Identify the problem.

Determine when the problem first occurred and what has changed since then. Determining the first occurrence is typically accomplished by interviewing the user. Identify the scope of the problem, including which machines and network devices, subnets, sites, or domains are affected. Identifying the scope may involve interviewing others in the department or company. Before moving to further steps, ensure that a backup is taken of the system so that you can revert if changes make the problem worse as you are troubleshooting.

Since both of these tasks require asking the user or others questions, it is important to be courteous and respectful. Show the user that you are concerned about the problem by actively listening to what they have to say. All too often, IT professionals want to run off as soon as a problem is mentioned so that they can begin fixing it. However, in the rush to fix the problem, they may not truly understand the problem and may give the user the impression that their problem is not important.

A critical step in identifying the problem is to ask the user to demonstrate what is not working for them. If they say the Internet is not working on their cloud virtual desktop, ask them to demonstrate. They may show you that a certain site does not come up, but when you ask them to go to another site, it loads properly. In this way, both you and the user are able to better understand the scope of the problem and the user might be able to do more while you troubleshoot.

Step 2: Establish a theory of probable causes.

The information gathered in Step 1 should help in generating possible causes. Ask yourself what is common between devices that are experiencing the issue. Do they share a common connection or resource? Be sure to question the obvious (the simple things). Many times problems can be solved by something rather simple such as plugging in an Ethernet cable, verifying that users are typing a URL or UNC correctly, or verifying that target systems are turned on. Don’t spend your time working out potential complex solutions until you have eliminated the simple solutions.

Documentation was mentioned earlier in this chapter and it comes in very handy when troubleshooting issues. Be sure to review documentation on how systems should be configured. You may need to review vendor documentation as well, so know where to find this documentation. Make sure you identify the product version number that the user is running so that you can refer to the correct vendor document.

Step 3: Test the theory to determine the cause.

At this point, you will likely have multiple theories on what might be the problem. Only one of those theories will be correct so you will need to test the theories to identify which one it is. Start by testing the simplest theories before thinking through the complex ones. As mentioned in the previous step, most issues are caused by simple things, and simple things are easier to test. Ensure that the system you use for testing is similar enough to the one that is experiencing issues and ensure that you can replicate the issue on the test system before attempting a fix. Some IT professionals have worked hard to deploy a solution to a system that was not experiencing the issue and they then falsely believe they fixed the issue when they later test.

If you need to have the user test some things, be sure to ask politely. Phrase your request in such a way as to not cause the user to think that you are blaming them for the problem, even if you believe it is a user error. If you are incorrect and it is not a user error, you will look foolish and the user might be offended.

Step 4: Establish a plan of action to resolve the problem and implement the solution.

Document the steps that you want to take to resolve the problem. Ensure that you can demonstrate that your tests confirmed a non-working condition and then a working condition following implementation of the proposed actions. Review the plan with others and ensure that change controls are followed. The change management process was discussed in Chapter 13. Lastly, after approval has been given to implement the change, perform the outlined steps to fix the problem.

Step 5: Verify full system functionality and, if applicable, implement preventative measures.

Check with all users who were experiencing the issue to ensure that they are no longer experiencing the issue. Also check with others around them to ensure that you have not created other issues by implementing the fix. Lastly, implement restrictions or additional controls to prevent the problem from occurring in the future. This could involve retraining the user or placing technical controls on the system to prevent such actions from taking place again. In some cases, permissions may need to be changed or system configurations updated. System changes should follow the same change control process as the troubleshooting change did.

Step 6: Document findings, actions, and outcomes.

This last step is important to ensure that you or others at your company do not continue to solve the same problems from scratch over and over. If you are anything like us, you will need to write down what you did so that you can remember it again later. IT professionals lead busy lives, and there never seems to be time to document. However, if you do not document, you will find that you spend time performing the same research when you could have simply consulted your documentation.

The CompTIA troubleshooting steps provided here will be demonstrated in the scenarios that follow to help you understand better how the troubleshooting methodology is applied to real-world problems.

In the course of your career, you will run into a wide variety of issues that you will need to troubleshoot. No book could be comprehensive enough to cover all of them, so we have selected a few issues that you are likely to see. The issues are also ones that you are likely to see on the CompTIA Cloud+ exam.

Deployment Issues

Application deployment issues are relatively commonplace. Most applications will deploy without issue, but you will deploy so many of them over time that you will still see deployment issues quite often.

Incompatible or Missing Dependencies

One problem you might see is incompatible or missing dependencies. When deploying a web application, ensure that programming libraries are installed first. Windows applications written in a .NET programming language such as C# will require a certain version of .NET on the machine. Other applications may require PHP or Java to be installed. Read through deployment documentation carefully to ensure that you meet all the requirements. Of course, you will also need Internet Information Services (IIS) and any other operating system roles and features. Ensure that all this is in place before application installation.

The Java Runtime Environment (JRE) can be particularly troublesome when running multiple Java-based applications on the same machine because they might not all support the same version. For example, three applications are installed on the server, and you upgrade the first one. You read through the documentation before upgrading and find that you need to update the Java version first. The Java upgrade completes successfully, and then you deploy the new version of the application. Testing confirms that the new app works fine, but a short time later, users report that the other two applications are no longer working. Upon troubleshooting, you find that they do not support the new version of Java that was deployed.
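
A conflict like this can be caught before the upgrade if the supported versions for each application are recorded. The sketch below assumes a hypothetical inventory of three applications and the Java major versions each one supports; the application names, version numbers, and function names are all illustrative.

```python
# Hypothetical inventory: the Java major versions each application supports.
SUPPORTED_JAVA = {
    "billing": {8, 11},
    "reporting": {8},
    "inventory": {8},
}


def broken_by_upgrade(new_version, supported=SUPPORTED_JAVA):
    """List the applications that do not support the proposed Java version."""
    return sorted(app for app, versions in supported.items() if new_version not in versions)


def shared_versions(supported=SUPPORTED_JAVA):
    """Return the Java versions that every application supports."""
    return set.intersection(*supported.values()) if supported else set()


print(broken_by_upgrade(11))  # applications that would stop working
print(shared_versions())      # versions that are safe for all applications
```

If the list of broken applications is not empty and no shared version meets the new requirement, the applications cannot keep sharing one runtime on the same machine.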

The most straightforward fix to this issue is to deploy dedicated virtual machines for each application. You can also use application containers to host each application so that dependencies can be handled individually for each container. Containers are more lightweight than virtual machines. They use less disk space because the operating system is generally not in the container, which makes them quicker to deploy, and they start up much more quickly.

Now that you understand the potential problem, let’s try a scenario: You have been asked to set up a new website for your company. You purchase a hosted cloud solution and create a host record in your company’s hosted DNS server to point to the IP address of the hosted cloud server. You test the URL and see the default setup page. You then use the cloud marketplace to install some website applications and themes. However, when you navigate to your website, you now receive the following error message:

Step 1: Identify the problem. New applications and themes were installed since the site last came up correctly, so the error is most likely related to the new software and themes.

Step 2: Establish a theory of probable causes. You research the error online and see issues relating to missing PHP files. You theorize that PHP is installed incorrectly or that the PHP dependency is missing.

Step 3: Test the theory to determine the cause. To test these theories, you can reinstall PHP on the server or install it if it is missing. You first identify the required level of PHP from the software that you installed earlier. Then you log onto the cloud server and check the PHP version. You find that PHP is not installed so it seems like installing the required PHP version will solve the problem.

Step 4: Establish a plan of action to resolve the problem and implement the solution. You log into the cloud management portal and go to the marketplace. After locating the PHP version required by the software, you review the release notes for it to determine if it is compatible with your other software and system. You find that your cloud vendor maintains a database of compatible applications and software and it has already queried your systems and noted that this version of PHP is compatible with your cloud installation.

You place a change request to install PHP and include relevant documentation on why the software is needed. Once the change request is approved, you proceed to install the software from the marketplace and then verify that the software installs correctly.

Note that because this is a new installation, no users are accessing the site. If this were a production site, installation of a major dependency like PHP would take the site down, so you would need to perform the installation during a scheduled downtime window.

Step 5: Verify full system functionality and, if applicable, implement preventative measures. You open a web browser and navigate to the company website URL and verify that you can access the site. The installation of the PHP dependency solved the problem. Additionally, you find that you can enable the system to automatically install dependencies in the future so that you can avoid such a problem. You create another change request to enable this feature and wait for approval. Once approval is provided, you enable the feature.

Step 6: Document findings, actions, and outcomes. You update both change request tickets to indicate that the work was completed successfully and that no other changes were required. Additionally, you send a memo to the other team members noting the issue and what was done to resolve it and also that dependencies will be installed automatically moving forward.
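The dependency check from Step 3 can be sketched in a few lines of Python. This is only an illustration: the function names and the version numbers are hypothetical, and a real check would read the installed version from the server rather than take it as an argument.

```python
def parse_version(text, width=3):
    """Turn a dotted version string such as '7.4' into a tuple of ints.

    Short versions are padded with zeros so '7.4' and '7.4.0' compare equal.
    """
    parts = [int(part) for part in text.split(".")]
    parts += [0] * (width - len(parts))
    return tuple(parts)


def dependency_satisfied(installed, required):
    """Return True when the installed version meets the required version.

    installed is None when the dependency is missing entirely.
    """
    if installed is None:
        return False
    return parse_version(installed) >= parse_version(required)


# Hypothetical values: the application requires 7.4.
print(dependency_satisfied(None, "7.4"))     # False, dependency missing
print(dependency_satisfied("7.2.1", "7.4"))  # False, too old
print(dependency_satisfied("8.1.0", "7.4"))  # True
```

Treating a missing dependency and an outdated one as the same failure mirrors the two theories in Step 2: either way, the required version must be installed.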

Incorrect Configuration

Computer programs need to be configured perfectly for them to run. There is really no margin for error. An extra character in a UNC path or a mistyped password is all that is required for the program to crash and burn. It is important to double-check all configuration values to ensure that they are correct. If you run into issues, go back to the configuration and recheck it, maybe with another person who can offer some objectivity. Compare configuration values to software documentation and ensure that the required services on each server supporting the system are running.
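As a rough illustration of how a stray character breaks a configuration value, the following Python sketch validates the shape of a UNC path. The pattern is a simplified assumption (a server name followed by one or more folder segments), and the server and share names are hypothetical. It does not confirm that the share actually exists.

```python
import re

# Simplified, assumed shape: \\server\share with optional subfolders.
UNC_PATTERN = re.compile(r"\\\\[A-Za-z0-9._-]+(\\[^\\/]+)+")


def looks_like_unc(path):
    """Return True when path has the general form \\\\server\\share[\\folder...]."""
    return UNC_PATTERN.fullmatch(path) is not None


print(looks_like_unc(r"\\fileserver01\appdata"))           # True
print(looks_like_unc(r"\\fileserver01\appdata\\reports"))  # False, extra backslash
print(looks_like_unc(r"\\fileserver01"))                   # False, no share name
```

A format check like this only catches typing mistakes. A path can be well formed and still point to a share that no longer exists, which is why the configuration should also be compared against the current environment.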

Let’s look at this in a scenario and consider how the CompTIA troubleshooting methodology would help in solving a configuration issue: Your company is consolidating servers from two cloud environments into one for easier manageability. The transition team is responsible for moving the servers and the shares. The transition team successfully moves the servers to the new location and consolidates the shares onto a single server. A web application retrieves files from one of the shares, but users of the site report that they can no longer access files within the system. You are part of the troubleshooting team and you are assigned the trouble ticket.

Step 1: Identify the problem. The problem is that users cannot access files in the application. You send a message to the user base informing them of the problem and that you are actively working to resolve it.

Step 2: Establish a theory of probable causes. A number of changes were made when the servers were moved over from one cloud to another. The servers were exported into files and then imported into the new system. Each server was tested, and they worked following the migration. You check the testing notes and verify that the website was working correctly following the migration. The shares were consolidated after that. However, you do not see testing validation following the share consolidation. It is possible that the application is pointing to a share that no longer exists.

Step 3: Test the theory to determine the cause. You log into the server hosting the application and review the configuration. The configuration for the files points to a UNC path. You attempt to contact the UNC path but receive an error. You then message the transition team asking them if the UNC referenced in the application still exists or if it changed. They send you a message stating that the UNC path has changed and they provide you with the new path.

Step 4: Establish a plan of action to resolve the problem and implement the solution. You plan to change the application configuration to point to the new path. You put in a change request to modify the application configuration, and the change request is approved. You then adjust the application settings, replacing the old UNC path with the new one.

Step 5: Verify full system functionality and, if applicable, implement preventative measures. You log into the site and verify that files are accessible through the application. You then reach out to several users and request they test as well. Each user reports that they can access the files successfully. Finally, you message the users and let them know that the issue has been resolved.

Step 6: Document findings, actions, and outcomes. You update the change request ticket to indicate that the work was completed successfully and that no other changes were required. Additionally, you send a memo to the transition team members noting the issue and what was done to resolve it. Management then creates a checklist for application transitions that includes a line item for updating the UNC path in the application if the backend share path changes.

Integration Issues with Different Cloud Platforms

Cloud applications typically do not reside on their own. They are often integrated with other cloud systems with APIs. A vendor will create an API for its application and then release documentation so that developers and integrators know how to utilize that API. For example, Office 365, a cloud-based productivity suite that includes an e-mail application, has an API for importing and exporting contacts. Salesforce, a cloud-based customer relationship management (CRM) application, could integrate with Office 365 through that API so that contacts could be updated based on interactions in the CRM tool. However, APIs must be implemented correctly, or the integration will not work.

Let’s try the CompTIA troubleshooting methodology with a scenario: You receive an e-mail from Microsoft informing you of a new API that works with Salesforce. You log into Salesforce and configure Salesforce to talk to Office 365. You educate users on the new integration and that contacts created in Salesforce will be added to Office 365 and that tasks in Salesforce will be synchronized with Office 365 tasks. However, users report that contacts are not being updated and that tasks are not being created. You also find when opening your tasks that there are hundreds of new tasks that should belong to other users.

For this scenario, consider the troubleshooting methodology and walk through it on your own. Think through each step. You may need to make some assumptions as you move through the process since this is a sample scenario.

Step 1: Identify the problem.

Step 2: Establish a theory of probable causes.

Step 3: Test the theory to determine the cause.

Step 4: Establish a plan of action to resolve the problem and implement the solution.

Step 5: Verify full system functionality and, if applicable, implement preventative measures.

Step 6: Document findings, actions, and outcomes.

Template Misconfiguration

When an organization is migrating its environment to the cloud, it requires a standardized installation policy or profile for its virtual servers. The virtual machines need to have a very similar base installation of the operating system, so all the devices have the same security patches, service packs, and base applications installed.

Virtual machine templates provide a streamlined approach to deploying a fully configured base server image or even an entirely configured application server. Virtual machine templates help decrease the installation and configuration costs when deploying virtual machines and lower ongoing maintenance costs, allowing for faster deploy times and lower operational costs. However, incorrectly configuring templates can result in a large number of computers that all have the same flaw.

Now that you understand the potential problems, let’s try a scenario: Karen is creating virtual machine templates for common server roles including a web server with network load balancing (NLB), a database server, an application server, and a terminal server. Each server will be running Windows Server 2016 Standard. She installs the operating system on a virtual machine, assigns the machine a license key, and then installs updates to the device in offline mode.

Karen applies the standard security configuration to the machine, including locking down the local administrator account, adding local certificates to the trusted store, and configuring default firewall rules for remote administration. She then shuts down the virtual machine and makes three copies of it using built-in tools in her cloud portal. She renames the machines and starts each up.

She then installs the server roles for web services and NLB on the web server, SQL Server 2016 on the database server along with Microsoft Message Queuing (MSMQ), SharePoint on the application server, and Remote Desktop Session Host services on the terminal server. She applies application updates to each machine and then saves the virtual hard disks to be used as a template.

A month later, Karen is asked to set up an environment consisting of a database server and a web server. She uses the built-in tools in her cloud portal to make copies of her database and web server templates. She gives the new machines new names and starts them up. She then assigns IP addresses to them. Both are joined to the company domain under their assigned names. However, server administrators report that the servers are receiving a large number of authentication errors.

Step 1: Identify the problem. The servers are receiving a large number of authentication errors.

Step 2: Establish a theory of probable causes. Karen theorizes that the authentication errors could be caused by incorrect licensing on the machines or by duplicate security identifiers.

Step 3: Test the theory to determine the cause. Karen issues unique license keys to both machines and activates them. However, the authentication errors still continue. She then clones another web server and runs Sysprep on it. She adds it to the domain and observes its behavior. The new machine does not exhibit the authentication errors.

Step 4: Establish a plan of action to resolve the problem and implement the solution. Karen proposes to remove faulty machines from the domain, run Sysprep on the defective machines to regenerate their security identifiers, and then add them back in. She puts change requests in for each activity and waits for approval. Upon receiving authorization, Karen implements the proposed changes.

Step 5: Verify full system functionality and, if applicable, implement preventative measures. Server administrators confirm that the authentication errors have ceased after the changes were made.

Step 6: Document findings, actions, and outcomes. Karen updates the change management requests and creates a process document outlining how to create templates with the Sysprep step included.
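Karen's second theory, duplicate security identifiers, lends itself to a quick inventory check. The Python sketch below groups machines by identifier and reports any identifier shared by more than one machine. The machine names and the shortened identifier strings are hypothetical placeholders, and gathering the real identifiers from each server is outside the scope of the sketch.

```python
from collections import defaultdict


def find_duplicate_ids(machine_ids):
    """Return {identifier: [machines]} for identifiers used by more than one machine."""
    by_id = defaultdict(list)
    for machine, identifier in machine_ids.items():
        by_id[identifier].append(machine)
    return {
        identifier: sorted(machines)
        for identifier, machines in by_id.items()
        if len(machines) > 1
    }


# Hypothetical inventory: two clones kept the identifier of their template.
inventory = {
    "web-template": "S-1-5-21-1111",
    "web02": "S-1-5-21-1111",
    "db-template": "S-1-5-21-2222",
    "db02": "S-1-5-21-2222",
    "web03-sysprepped": "S-1-5-21-3333",
}
print(find_duplicate_ids(inventory))
```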

System Clock Differences

Networked computer systems rely on time synchronization in order to communicate. When computers have different times, some may not be able to authenticate to network resources, they may not trust one another, or they may reject data sent to them. Administrators can ensure that system clocks are kept in sync on virtual machines by installing hypervisor guest tools and then enabling time synchronization between host and guest. Administrators can also synchronize time on virtual, physical, and cloud servers by configuring them to point to an external Network Time Protocol (NTP) server. Each computer will set its time to the time specified by the NTP server. The servers will poll the NTP server periodically to verify that their clocks are still in sync and avoid time synchronization issues.
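To make the idea of clock drift concrete, here is a minimal Python sketch that compares the time each server reports against a reference clock and flags the ones that have drifted too far. The server names and timestamps are made up, and the five-minute tolerance is an assumption chosen because it matches the default Kerberos clock skew limit. Your environment may require a tighter value.

```python
from datetime import datetime, timedelta

# Assumed tolerance for illustration only.
MAX_SKEW = timedelta(minutes=5)


def find_skewed(reference, reported, max_skew=MAX_SKEW):
    """Return {server: offset} for servers whose clock is off by more than max_skew."""
    skewed = {}
    for server, clock in reported.items():
        offset = clock - reference
        if abs(offset) > max_skew:
            skewed[server] = offset
    return skewed


# Hypothetical readings collected at the same moment.
reference_time = datetime(2024, 1, 1, 12, 0, 0)
readings = {
    "web01": datetime(2024, 1, 1, 12, 0, 20),
    "dc01": datetime(2024, 1, 1, 12, 7, 30),
    "db01": datetime(2024, 1, 1, 11, 58, 0),
}
print(sorted(find_skewed(reference_time, readings)))  # ['dc01']
```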

Using the CompTIA troubleshooting methodology, let’s consider a scenario: Eddie is a cloud administrator managing over 40 servers in a hosted cloud. His monitoring system frequently sends out alerts that servers are unavailable. He restarts the machines and the problem goes away, but the problem comes back a few days later. He scripts reboots for each of the servers but realizes that this is a short-term fix at best.

Step 1: Identify the problem. Servers lose connectivity periodically.

Step 2: Establish a theory of probable causes. Eddie theorizes that there could be connectivity issues on the cloud backend. There also could be an issue with the template that each of the machines was produced from. Lastly, the machines could be losing time synchronization.

Step 3: Test the theory to determine the cause. Eddie creates a support ticket with the cloud provider and provides the necessary details. The cloud provider runs several tests and reports no issues. Eddie creates another machine from the template and finds that it also exhibits the same problems. However, he is not sure where the problem might lie in the template. Lastly, he configures a scheduled job to run three times a day that sends him the system time for each of the servers.

As he reviews the output from the scheduled job, it becomes clear that the domain controller is getting out of sync with most of the network every few hours. Upon analyzing the configuration of the servers that go out of sync and the others, he finds that some are configured to obtain their time from the cloud provider NTP server while others are set to obtain their time from a different server.

Step 4: Establish a plan of action to resolve the problem and implement the solution. Eddie proposes to set all servers to the same time server. He creates a change request documenting the proposed change and receives approval to move forward with the change during a scheduled downtime. He makes the change.

Step 5: Verify full system functionality and, if applicable, implement preventative measures. Eddie monitors the output from the scheduled task and confirms that each server remains in sync.

Step 6: Document findings, actions, and outcomes. Eddie documents the NTP server settings on a standard setup configuration document. He mentions the issue in a standup IT meeting the following Friday, and the document is circulated around and placed on the company intranet for reference.

Capacity Issues

Capacity issues can be found with compute, storage, networking, and licensing. Considerable attention needs to be paid to the design of compute, storage, and networking systems. The design phase must ensure that all service levels are understood and that the capacity to fulfill them is incorporated into its configurations. Once those configurations have been adequately designed and documented, operations can establish a baseline, as discussed in Chapter 7. This baseline is a measuring stick against which capacity can be monitored to understand both the current demand and trend for future needs.

Capacity issues can result in system or application slowdowns or complete unavailability of systems. Alerts should be configured on devices to inform cloud administrators when capacity reaches thresholds (often 80 percent or so). Define thresholds low enough that you will be able to correct the capacity issue before available capacity is fully consumed.
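A threshold alert boils down to a simple comparison. The following Python sketch, using hypothetical resource names and figures, returns the resources that have reached the 80 percent threshold mentioned above.

```python
def capacity_alerts(usage, threshold=0.80):
    """Return the names of resources at or above the threshold.

    usage maps a resource name to a (used, total) pair in the same unit.
    """
    return sorted(
        name for name, (used, total) in usage.items()
        if used / total >= threshold
    )


# Hypothetical readings from a monitoring system.
current_usage = {
    "cpu_cores": (54, 64),
    "storage_tb": (41, 50),
    "bandwidth_gbps": (3, 10),
}
print(capacity_alerts(current_usage))  # ['cpu_cores', 'storage_tb']
```

Lowering the threshold argument produces alerts earlier, which buys more time to add capacity at the cost of more frequent notifications.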

Compute

Appropriately distributing compute resources is an integral part of managing a cloud environment. Planning for future growth and the ability to adjust compute resources on demand is key to avoiding compute capacity issues. One potential capacity issue is overconsumption by customers. Because compute resources are limited, cloud providers must protect them and make certain that their customers only have access to the amount that they are contracted to provide. Two methods that are used to deliver no more than the contracted amount of resources are quotas and limits.
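One way to picture how a provider holds consumers to their contracted amount is the check sketched below in Python. It assumes a soft quota that triggers a warning and a hard limit that blocks the request. Those two tiers, the function name, and the numbers are assumptions for illustration, and actual provider behavior varies.

```python
def evaluate_request(current, requested, soft_quota, hard_limit):
    """Decide whether a consumer's request for more vCPUs can be granted."""
    projected = current + requested
    if projected > hard_limit:
        return "deny"
    if projected > soft_quota:
        return "allow_with_warning"
    return "allow"


# Hypothetical consumer contracted for 32 vCPUs, warned at 28.
print(evaluate_request(current=20, requested=4, soft_quota=28, hard_limit=32))   # allow
print(evaluate_request(current=26, requested=4, soft_quota=28, hard_limit=32))   # allow_with_warning
print(evaluate_request(current=30, requested=4, soft_quota=28, hard_limit=32))   # deny
```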

Now that you understand the potential problem, try a scenario: Tim manages the cloud infrastructure for hundreds of cloud consumers. He notices that some of the consumers are utilizing far more resources than they should be allocated.

Step 1: Identify the problem.

Step 2: Establish a theory of probable causes.

Step 3: Test the theory to determine the cause.

Step 4: Establish a plan of action to resolve the problem and implement the solution.

Step 5: Verify full system functionality and, if applicable, implement preventative measures.

Step 6: Document findings, actions, and outcomes.

For more information on resource allocation and performance best practices, see Chapter 8.

Storage

Companies are producing data at a rate never seen before. Keeping up with data growth can be quite a challenge. It is best to set thresholds and alerts on storage volumes so that when they reach a threshold (often 80 percent or so), you can proactively expand the storage. Set more aggressive thresholds and alerts on physical storage because physical storage cannot be extended as easily on the fly. Physical storage expansion requires the purchase of additional hardware, approval and other red tape associated with the purchase, shipping time, and installation. You want to make sure that you have enough of a buffer so that you do not run out of space while additional storage is on order.
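The reasoning behind a more aggressive threshold for physical storage can be worked through with some simple arithmetic. The Python sketch below uses assumed figures (a 100 TB array growing by 0.5 TB per day and a 45-day procurement lead time) to show how many days remain at a given usage level and where the alert would need to sit to cover the lead time.

```python
def days_until_full(capacity_tb, used_tb, growth_tb_per_day):
    """Estimate the days remaining before the volume is full at a steady growth rate."""
    return (capacity_tb - used_tb) / growth_tb_per_day


def alert_threshold(capacity_tb, growth_tb_per_day, lead_time_days):
    """Return the usage fraction at which to alert so the lead time is covered."""
    buffer_tb = growth_tb_per_day * lead_time_days
    return max(0.0, 1 - buffer_tb / capacity_tb)


# Assumed figures for illustration.
print(days_until_full(100, 80, 0.5))            # 40.0
print(round(alert_threshold(100, 0.5, 45), 3))  # 0.775
```

With these numbers, an 80 percent alert leaves only 40 days of headroom, which is less than the 45 days needed to procure and install new hardware. The alert would need to fire at roughly 77.5 percent instead.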

Let’s demonstrate the CompTIA troubleshooting methodology with a scenario: Sharon, a cloud administrator, receives reports that users are experiencing sluggish performance and slow response times when accessing the company ERP systems that reside in their hybrid cloud.

Step 1: Identify the problem. Sharon identifies the problem as unacceptable application performance.

Step 2: Establish a theory of probable causes. Sharon collects metrics while users experience the issues. She then compares the metrics to the baseline to see if performance is within normal tolerances. The anomalies not only confirm that there is a problem but also indicate where the problem might lie. The baseline comparison indicates that disk input/output operations per second (IOPS) are well below the baseline for several LUNs.

Step 3: Test the theory to determine the cause. Sharon isolates the LUNs that are outside of their normal IOPS range. Each of the LUNs was created from the same RAID group, and an analysis of the disk IOPS shows that a RAID group is rebuilding, causing the performance issues.

Step 4: Establish a plan of action to resolve the problem and implement the solution. Sharon discusses the risks and the performance hit the rebuild is causing, and her manager agrees that the rebuild can be paused for the two hours that remain in the work day and that it should resume at 5:00 p.m. Sharon pauses the rebuild.

Step 5: Verify full system functionality and, if applicable, implement preventative measures. Sharon confirms that application performance has returned to normal. At 5:00 p.m., she resumes the rebuild and performance drops for the next few hours until the rebuild completes.

Step 6: Document findings, actions, and outcomes. Sharon documents the experience in the company’s knowledge management system.
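The baseline comparison Sharon performs in Step 2 can be sketched as follows in Python. The metric names, the sample values, and the 25 percent tolerance are all hypothetical, and a real comparison would pull both sets of numbers from the monitoring system.

```python
def find_anomalies(baseline, current, tolerance=0.25):
    """Return {metric: deviation} where the current value strays from the baseline.

    Deviation is a signed fraction of the baseline value, rounded to two places.
    """
    anomalies = {}
    for name, expected in baseline.items():
        observed = current.get(name)
        if observed is None or expected == 0:
            continue  # nothing to compare against
        deviation = (observed - expected) / expected
        if abs(deviation) > tolerance:
            anomalies[name] = round(deviation, 2)
    return anomalies


# Hypothetical baseline and current readings.
baseline_metrics = {"lun01_iops": 5000, "lun02_iops": 4800, "lun03_iops": 5200, "cpu_pct": 40}
current_metrics = {"lun01_iops": 1900, "lun02_iops": 4700, "lun03_iops": 2100, "cpu_pct": 43}
print(find_anomalies(baseline_metrics, current_metrics))
# {'lun01_iops': -0.62, 'lun03_iops': -0.6}
```

Only the two LUNs that fell far below their baseline are reported, which narrows the investigation in the same way the scenario describes.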

Networking

Each device that is attached to the network is capable of generating traffic. A single user used to have only one or two devices attached to the network, but now many users have a desktop, laptop, multiple tablets, phones, and other devices that may connect through wired or wireless connections. Many of these devices connect to cloud services and request data from them. Some cloud services may be used to keep such systems in sync. The rapid growth of devices and increasing use of cloud services can result in contention for valuable network resources.

Now that you understand some potential network contention problems, let’s try a scenario: You work in the operations center of the company. Metrics show that nodes on a particular network segment are consuming a high amount of network bandwidth. You also receive alerts for a high number of network collisions on the segment, and the network switch for the segment is showing spanning tree errors.

Step 1: Identify the problem.

Step 2: Establish a theory of probable causes.

Step 3: Test the theory to determine the cause.

Step 4: Establish a plan of action to resolve the problem and implement the solution.

Step 5: Verify full system functionality and, if applicable, implement preventative measures.

Step 6: Document findings, actions, and outcomes.

Licensing

Purchased software and cloud services operate based on a license. The license grants specific uses of the software for a period. Software typically checks for compliance with licensing and may revoke access to service when the software vendor or cloud provider deems that compliance has not been met. Additionally, groups such as BSA | The Software Alliance (www.bsa.org) can perform license investigations and assess fines for companies that are not in compliance, so companies need to ensure that they are adhering to license requirements.

Software licenses may be per user of the software, or they could be based on the physical or virtual resources that are allocated to the software. For example, some products are licensed based on the number of CPU cores or vCPUs. It is important to know how many CPU cores you are licensed for when assigning resources so that you do not violate your license or cause a program to fail activation checks.
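For core-based licensing, the compliance check is a matter of adding up what has been assigned. This Python sketch uses hypothetical virtual machine names and counts. Actual license terms differ by vendor and product, so treat the calculation as an illustration rather than a compliance tool.

```python
def license_shortfall(vm_vcpus, licensed_cores):
    """Return how many assigned vCPUs exceed the licensed core count (0 if compliant)."""
    assigned = sum(vm_vcpus.values())
    return max(0, assigned - licensed_cores)


# Hypothetical database servers licensed for 16 cores in total.
assignments = {"sql01": 8, "sql02": 8, "sql03": 4}
print(license_shortfall(assignments, licensed_cores=16))  # 4
```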

Let’s demonstrate the CompTIA troubleshooting methodology with a scenario: Your organization has a self-service portal where administrators can create new virtual machines based on virtual machine templates. The portal has been very popular, but now over 500 virtual machines have been deployed to the environment and the machines deployed over the last 30 days are unable to activate Windows.

Step 1: Identify the problem. Systems are unable to activate, and the organization may have exceeded its available licenses. The company has 10 hypervisors in a cluster and 10 Server 2016 Datacenter edition licenses as well as 100 Server 2016 Standard edition licenses. An assessment of the virtual machines shows that there are 200 CentOS Linux servers and 312 Server 2016 Standard edition servers.

Step 2: Establish a theory of probable causes.

Step 3: Test the theory to determine the cause.

Step 4: Establish a plan of action to resolve the problem and implement the solution.

Step 5: Verify full system functionality and, if applicable, implement preventative measures.

Step 6: Document findings, actions, and outcomes.

API Request Limits

To guard against abuse or malicious use of application programming interfaces (APIs), companies set request limits, usually per IP address or subnet. This prevents a single client from consuming too much of the service. Misuse of API calls might be an attempt to scrape resources, input malformed content to trigger buffer overflows, or disrupt API availability.
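A request limit of this kind can be sketched as a sliding-window counter kept per client address. The Python example below is one possible approach, not how any particular vendor implements its limits. The class name, the limit of three requests per 60 seconds, and the addresses are hypothetical. Time is passed in explicitly so that the behavior is easy to follow.

```python
from collections import defaultdict, deque


class RequestLimiter:
    """Allow at most max_requests per client within a sliding time window."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._history = defaultdict(deque)  # client -> timestamps of allowed requests

    def allow(self, client_ip, now):
        history = self._history[client_ip]
        # Forget requests that have aged out of the window.
        while history and now - history[0] >= self.window_seconds:
            history.popleft()
        if len(history) >= self.max_requests:
            return False
        history.append(now)
        return True


limiter = RequestLimiter(max_requests=3, window_seconds=60)
print([limiter.allow("203.0.113.10", t) for t in (0, 1, 2, 3)])  # [True, True, True, False]
print(limiter.allow("203.0.113.20", 3))   # True, a different client has its own count
print(limiter.allow("203.0.113.10", 60))  # True, the first request has aged out
```

From the client's side, a rejected request typically appears as an error response, so an application that makes many calls in a short period may work in light testing and then fail under production load.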

Let’s look at this in a scenario and consider how the CompTIA troubleshooting methodology would help in solving an API request limit issue: You work for a consulting company as a developer on the DevOps team. Senior leadership have expressed concerns that some consultants may not be billing out all their time. They assume that this is just due to forgetfulness, since many do not enter their time each day, but wait till the weekend to enter most of it. You have developed an application that tracks e-mail usage against calendar entries and time entries to confirm that employees are not forgetting to put in billable time. The application ties into several APIs designed by the e-mail and time entry software companies. Your application works fine in testing. However, when you deploy it to production, the application works for about 30 minutes and then it ceases functioning. You need to figure out why the application ceases functioning and correct it.

Step 1: Identify the problem.

Step 2: Establish a theory of probable causes.

Step 3: Test the theory to determine the cause.

Step 4: Establish a plan of action to resolve the problem and implement the solution.

Step 5: Verify full system functionality and, if applicable, implement preventative measures.

Step 6: Document findings, actions, and outcomes.

Connectivity Issues

Connectivity issues can create a broad range of problems since most systems do not operate in isolation. There are myriad interdependencies on the modern network, and connectivity issues can be a digital monkey wrench that breaks a plethora of systems. The first indicator that there is a connectivity problem will be the scope of the issue. Because everything is connected, connectivity issues usually impact a large number of devices. Ensure that affected devices have an IP address using the ipconfig (Windows) or ifconfig (Linux) command described earlier in this chapter. If they do not have an IP address, the problem could lie with DHCP or with DHCP forwarders. Other common causes of connectivity problems include cloud-based virtual network IP address ranges, firewall ACLs, and routing tables.
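When you review the addresses that devices report, it helps to separate three cases: no address at all, a self-assigned address, and a properly assigned one. The Python sketch below relies on the fact that IPv4 hosts commonly self-assign an address in the 169.254.0.0/16 link-local range when no DHCP server answers. The function name and the sample addresses are hypothetical.

```python
import ipaddress


def address_state(address):
    """Classify the address a device reports.

    address is a dotted IPv4 string, or None when the interface has no address.
    """
    if address is None:
        return "no address"
    if ipaddress.ip_address(address).is_link_local:
        # 169.254.0.0/16 is typically self-assigned when DHCP cannot be reached.
        return "self-assigned"
    return "assigned"


print(address_state(None))             # no address
print(address_state("169.254.10.20"))  # self-assigned
print(address_state("10.0.1.25"))      # assigned
```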

For example, the DHCP scope could be full so the administrator might need to expand the scope or reduce the lease interval so that computers do not keep their addresses for as much time. A user may reside in a different subnet from the DHCP server and no forwarder exists on the subnet to direct DHCP requests to the DHCP server. The connection may be on a different cloud-based virtual network IP address range from the servers it wishes to contact, and there are no rules defined to allow traffic between these ranges. There could be firewall ACLs that need to be defined to allow traffic between two nodes that are not communicating. Lastly, the default gateway or VPN concentrator, if the issue is with a VPN connection, may not have the correct information on the destination network in its routing table.

When identifying the problem, determine the scope by using the ping command described earlier in this chapter. Ping devices on the network starting with your default gateway. If the default gateway pings, try another hop closer to the Internet or to where others are experiencing issues. Try to connect to other devices that report problems as well. If the default gateway will not ping, attempt to ping something else on the same network. If neither will ping, it is likely an issue with the switch that connects both devices. If you can ping the other machine but not the gateway, it might be a problem with the gateway.
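The ping sequence described above is essentially a small decision tree. The Python sketch below captures that logic with two yes-or-no inputs. It does not run ping itself, and the function name and return strings are made up for illustration.

```python
def likely_fault(gateway_replies, neighbor_replies):
    """Suggest where to look next based on two ping results.

    gateway_replies:  True if the default gateway answered a ping.
    neighbor_replies: True if another device on the same network answered.
    """
    if gateway_replies:
        # Local segment and gateway are fine, so keep testing farther out.
        return "test the next hop beyond the gateway"
    if neighbor_replies:
        return "suspect the default gateway"
    return "suspect the switch connecting the devices"


print(likely_fault(gateway_replies=True, neighbor_replies=True))
print(likely_fault(gateway_replies=False, neighbor_replies=True))
print(likely_fault(gateway_replies=False, neighbor_replies=False))
```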

VLAN or VXLAN Misconfiguration

VLANs and VXLANs were discussed back in Chapter 4. Both VLANs and VXLANs partition a network to create logical separation between subnetworks. Connectivity problems can appear when VLANs or VXLANs are configured incorrectly. For example, machines must be on the same VLAN or have inter-VLAN routing configured for the two machines to be able to communicate. It is common to configure virtual networks with specific VLANs or to add VLAN tagging to virtual networks. Incorrectly setting these values could allow machines to talk to machines they are not supposed to reach while leaving them unable to talk to the machines they need. Subnets are usually assigned per VLAN, so if the IP address is configured manually for one subnet on the machine and it is placed on the wrong VLAN, it will not be able to communicate with any of its neighbors.
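Because subnets are usually assigned per VLAN, a mismatch between a manually configured address and its VLAN can be detected by comparing the two. Here is a minimal Python sketch using the standard ipaddress module. The subnet plan, machine names, and addresses are hypothetical.

```python
import ipaddress

# Hypothetical subnet plan: one subnet per VLAN.
VLAN_SUBNETS = {
    "VLAN1": ipaddress.ip_network("10.0.1.0/24"),
    "VLAN2": ipaddress.ip_network("10.0.2.0/24"),
    "VLAN3": ipaddress.ip_network("10.0.3.0/24"),
}


def find_mismatches(machines):
    """Return the names of machines whose address is outside their VLAN's subnet.

    machines is a list of (name, ip_address, vlan) tuples.
    """
    return [
        name for name, address, vlan in machines
        if ipaddress.ip_address(address) not in VLAN_SUBNETS[vlan]
    ]


servers = [
    ("app01", "10.0.1.15", "VLAN1"),
    ("app01-clone", "10.0.1.15", "VLAN2"),  # clone kept the original static address
    ("db01", "10.0.3.20", "VLAN3"),
]
print(find_mismatches(servers))  # ['app01-clone']
```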

Let’s demonstrate the CompTIA troubleshooting methodology with a scenario: Geoff configures three VLANs named VLAN1, VLAN2, and VLAN3. He has four servers that are running on a virtual network, and he plans on cloning those servers several times and then assigning the servers to each of the VLANs for use. He performs the clones and then assigns the machines to the appropriate VLANs, but finds that they are unable to communicate with one another.

Step 1: Identify the problem. The cloned servers cannot communicate with each other.

Step 2: Establish a theory of probable causes. Geoff determines that the VLANs could be misconfigured, the tagging could be incorrectly set, the virtual switches could be misconfigured, or the IP addresses could be incorrectly assigned.

Step 3: Test the theory to determine the cause. Geoff tries to ping a single server called VM-DC1 from each of the other machines. None of the computers can communicate with the server. Geoff then creates a testing strategy where he will rotate VLANs and test. He explains the strategy to his manager and receives approval to proceed. Geoff then rotates the VLAN that is assigned to VM-DC1 and tries the tests again. He is unable to connect to the machine on any of the three VLANs. Geoff then removes VLAN tagging from the virtual switch configuration on VM-DC1 and receives an IP address conflict on the main VM-DC1 computer. Geoff suddenly realizes that the IP addresses are hard-coded into each of the machines and that they do not correspond to their assigned VLAN.

Step 4: Establish a plan of action to resolve the problem and implement the solution. Geoff documents IP addresses to assign to each of the machines in each VLAN. He then creates a change request to modify the IP addresses for each of the machines and explains why the change needs to be made. Once approval is given, Geoff modifies the IP addresses on each machine as planned.

Step 5: Verify full system functionality and, if applicable, implement preventative measures. Geoff verifies that each machine can talk to other machines on the same VLAN and that computers cannot talk to those on other VLANs.

Step 6: Document findings, actions, and outcomes. Geoff notifies his manager that the machines are now functioning. He also updates the change control ticket to note that the change corrected the issue.

Incorrect Routing and Misconfigured Proxies

Internetwork traffic, traffic that moves from one network to another, requires routers to direct the traffic to the next leg of its journey toward its destination. Routers do this because they have an understanding of where different networks reside and the possible paths to reach those networks. Incorrect routing can result in a loss of connectivity between one or more devices. In some cases, a proxy will be used to manage communications between nodes on behalf of one or more members.
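To see why a single missing route can cut off an entire class of traffic, consider how a router picks a path: it chooses the most specific route that contains the destination. The Python sketch below models that lookup with the standard ipaddress module. The route table contents and target names are hypothetical and do not represent any provider's API.

```python
import ipaddress


def lookup(route_table, destination):
    """Return the target of the most specific matching route, or None if no route matches."""
    dest = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(prefix), target)
        for prefix, target in route_table.items()
        if dest in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None
    # The longest prefix is the most specific route.
    return max(matches, key=lambda match: match[0].prefixlen)[1]


# Hypothetical table that only knows about the internal network.
routes = {"10.0.0.0/16": "local"}
print(lookup(routes, "10.0.1.5"))  # local
print(lookup(routes, "8.8.8.8"))   # None, no route to external networks

# Adding a default route gives external traffic somewhere to go.
routes["0.0.0.0/0"] = "gateway"
print(lookup(routes, "8.8.8.8"))   # gateway
print(lookup(routes, "10.0.1.5"))  # local, the more specific route still wins
```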

Let’s now consider a routing/proxy issue and how it can be resolved using the CompTIA troubleshooting methodology: Pam is responsible for the network infrastructure, but her company recently moved many of the company servers to Amazon Web Services (AWS). A consultant configured VLANs and routing, but cloud administrators report that machines cannot communicate with devices on the Internet. Pam is asked to troubleshoot AWS routing for the VLANs. Pam confirms that devices can communicate with other devices on the same VLAN and that devices cannot communicate with the Internet.

Step 1: Identify the problem. Traffic from VLANs is not being routed externally to the Internet.

Step 2: Establish a theory of probable causes. Pam considers the possible causes and comes up with several theories. The problem could be that routing is not configured for the VLANs. It might also be possible that the default route was removed. Pam also theorizes that access lists could be preventing inside traffic from exiting the network.

Step 3: Test the theory to determine the cause. Pam uses the traceroute command from one of the machines exhibiting the problem to test the path from that machine to google.com, as shown in this example:

Pam issues the nslookup command on google.com to see if she can resolve the name to an IP address. She receives a non-authoritative answer with an IP address, shown here:

[Figure: nslookup output showing a non-authoritative answer for google.com]

She then issues the traceroute command again with the IP address instead of the name. The traceroute command shows a hop to the local proxy called box.local and then a connection to the default gateway 192.168.1.1, but the connection times out:

[Figure: traceroute output showing a hop to box.local, then 192.168.1.1, then a timeout]

Pam disables the proxy to test whether that is the issue and runs tracert again, but the request times out immediately after hitting the default gateway. Pam then logs into the AWS Virtual Private Cloud (VPC) console and observes the Route Tables page. She finds that the main route table was modified to include routes between the subnets, but the route to the virtual private gateway was replaced when these changes were made.

Step 4: Establish a plan of action to resolve the problem and implement the solution. Pam believes the problem lies with the missing route to the virtual private gateway, so she submits a change request to add this route.

Step 5: Verify full system functionality and, if applicable, implement preventative measures. Pam’s change request is approved, so she makes the change and then runs tracert with the -d switch, which skips resolving addresses to hostnames so that the trace runs faster. She issues the command from the same machine she was using to test and receives this output:

[Image: trace output from the test machine after the route was added]

Step 6: Document findings, actions, and outcomes. Pam notifies her manager that the machines are now functional. She then updates the change control ticket to note that the change corrected the issue. See Chapter 4 for more information on routing.
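The failure Pam found can be illustrated with a small longest-prefix-match lookup. This is only a sketch: the CIDR blocks and the target names (local, vgw-example) are hypothetical and are not taken from Pam's environment. It shows that a route table holding only subnet routes has no match for an Internet address until a default route (0.0.0.0/0) is present.

```python
import ipaddress

def lookup(route_table, destination):
    """Return the target of the most specific matching route, or None."""
    dest = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in route_table.items()
        if dest in ipaddress.ip_network(cidr)
    ]
    if not matches:
        return None  # no route: traffic to this destination cannot be forwarded
    # The route with the longest prefix (most specific match) wins.
    return max(matches, key=lambda match: match[0].prefixlen)[1]

# Hypothetical route table that holds only subnet-to-subnet routes.
broken = {"10.0.1.0/24": "local", "10.0.2.0/24": "local"}

# The same table with a default route restored (target name is illustrative).
fixed = dict(broken)
fixed["0.0.0.0/0"] = "vgw-example"

print(lookup(broken, "10.0.2.15"))  # local
print(lookup(broken, "8.8.8.8"))    # None
print(lookup(fixed, "8.8.8.8"))     # vgw-example
```

Traffic between subnets still matches the more specific /24 routes in both tables, which is consistent with Pam's observation that devices on the internal network could reach each other while Internet traffic failed.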

QoS Issues

Quality of service (QoS) is a set of technologies that can identify the type of data in data packets and divide those packets into specific traffic classes that can be prioritized according to defined service levels. Introduced in Chapter 8, QoS technologies enable administrators to meet the service requirements of a workload or an application by measuring network bandwidth, detecting changing network conditions, and prioritizing network traffic accordingly. QoS can be applied to a network interface, to the traffic of a given server or router, or to specific applications. Incorrectly configured QoS can result in performance degradation for certain services and, consequently, irate users.

Let’s demonstrate how the CompTIA troubleshooting methodology can help resolve QoS issues: Marco is a cloud administrator for Big Top Training, a company that produces fireworks safety videos that are streamed by subscribers from the company’s cloud. Marco has been reading about QoS, and he thinks it can significantly improve performance on the cloud network. He discusses it with his boss and receives approval to test QoS settings in a lab environment that is set up on another cloud segment. He configures QoS priorities and tests several types of content, including streaming video, data transfers, Active Directory replication, and DNS resolution. He shows the results of his tests to Dominick, his manager, and they agree to roll the changes out to the rest of the network. A couple of weeks later, the backup administrator, Teresa, mentions that some backup jobs have been failing because they cannot complete in their scheduled time window and are terminated. She suggests that QoS might be the problem because the timeouts started happening the day after the QoS changes were put in place. Dominick tells Marco to look into the problem.

Step 1: Identify the problem. Marco identifies the problem: backups are unable to complete in their scheduled time windows.

Step 2: Establish a theory of probable causes. Marco theorizes that the backup issues could be caused by the lack of a QoS rule for backup traffic, since the lab environment he worked in did not have any backups scheduled for it.

Step 3: Test the theory to determine the cause. Marco walks Dominick through his theory. Dominick suggests that he collect baseline data on traffic from the production network and then use that to build additional QoS rules. Marco collects the data for the baseline and then reviews the data with Teresa and Dominick.

Step 4: Establish a plan of action to resolve the problem and implement the solution. Marco, Teresa, and Dominick find that backup traffic communicates over a port that does not have a QoS rule, as Marco theorized. They also identify five other services that have no QoS rules defined, so they map out priorities for those items as well. The planned changes are put into the change management system, and Dominick schedules a downtime in the evening for the changes to be made. Dominick informs stakeholders of the downtime, and Marco implements the new QoS rules during the planned downtime.

Step 5: Verify full system functionality and, if applicable, implement preventative measures. Marco notifies Teresa when the work has been completed and Teresa manually executes the failing backup jobs to confirm that they do run within the normal time allotted. Marco and Teresa inform Dominick that the jobs now work and the downtime is concluded.

Step 6: Document findings, actions, and outcomes. Marco creates a QoS document outlining each of the priorities and the traffic that fits into each priority. He also schedules a time to collect a more intensive baseline to confirm that all critical services have been accounted for.

For more information on baselines, see Chapter 7, and for more information on QoS, see Chapter 8.
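The baseline review in Step 4 amounts to comparing observed traffic against the configured rules. The following sketch assumes a baseline expressed as bytes observed per destination port and a rule set keyed by port. The port numbers, byte counts, and priority names are hypothetical and stand in for whatever Marco's baseline actually contained.

```python
# Hypothetical QoS rules: destination port -> priority class.
QOS_RULES = {
    554: "high",    # streaming video
    53: "high",     # DNS resolution
    445: "medium",  # data transfers
}

def ports_without_rules(baseline, rules):
    """Return ports seen in the baseline that have no QoS rule, busiest first."""
    unmatched = {port: nbytes for port, nbytes in baseline.items() if port not in rules}
    return sorted(unmatched, key=unmatched.get, reverse=True)

# Hypothetical baseline: destination port -> bytes observed during collection.
baseline = {554: 9_000_000, 53: 40_000, 10000: 7_500_000, 445: 1_200_000, 123: 5_000}

print(ports_without_rules(baseline, QOS_RULES))  # [10000, 123]
```

Sorting by volume puts the heaviest unclassified traffic first, which is where a missing rule is most likely to cause the kind of slowdown Teresa reported.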

Latency

Network latency, when it is excessive, can create bottlenecks that prevent data from using the maximum capacity of the network bandwidth, resulting in slower cloud application performance. Latency metrics are essential for ensuring responsive services and applications and in avoiding performance or availability problems.

Let’s consider a scenario: You recently configured synchronous replication of a key ERP database to another site 2,000 miles away. However, the ERP system is now running extremely slowly. Performance metrics on the servers that make up the ERP system show plenty of capacity and very low utilization of system resources. Management is upset and demands a resolution ASAP.

Step 1: Identify the problem.

Step 2: Establish a theory of probable causes.

Step 3: Test the theory to determine the cause.

Step 4: Establish a plan of action to resolve the problem and implement the solution.

Step 5: Verify full system functionality and, if applicable, implement preventative measures.

Step 6: Document findings, actions, and outcomes.

See Chapter 4 for more information on latency.
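When working through the latency scenario, it helps to estimate the delay that distance alone imposes. The sketch below assumes signals travel through fiber at roughly 200,000 km per second (about two-thirds of the speed of light in a vacuum) and ignores routing, queuing, and equipment delays, so the real figure will be higher.

```python
MILES_TO_KM = 1.609344
FIBER_KM_PER_SEC = 200_000  # assumed propagation speed in fiber

def min_round_trip_ms(distance_miles):
    """Best-case round-trip time in milliseconds over the given distance."""
    distance_km = distance_miles * MILES_TO_KM
    one_way_seconds = distance_km / FIBER_KM_PER_SEC
    return 2 * one_way_seconds * 1000

rtt = min_round_trip_ms(2000)
print(f"Minimum round trip: {rtt:.1f} ms")
print(f"Upper bound on serialized synchronous writes per second: {1000 / rtt:.0f}")
```

Under these assumptions the round trip is about 32 ms. Because synchronous replication waits for the remote site to acknowledge each write, a workload that issues writes one after another is capped at roughly 31 writes per second regardless of how idle the servers are.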

Misconfigured MTU/MSS

The maximum transmission unit (MTU) is the largest packet or frame that can be sent over the network. Frames operate at the data link layer, while packets operate at the network layer. Segments also have a maximum size. Segments operate at the transport layer, and their maximum size is specified as the maximum segment size (MSS). MTU and MSS are typically measured in bytes.

Higher-level protocols may create packets larger than a particular link supports, so IP divides the packets into several pieces in a process known as fragmentation. Each fragment carries an identifier and an offset so that the fragments can be pieced back together in the correct order. However, not all applications and networks handle fragmentation well. When fragmentation routinely happens, the solution is to adjust the MSS so that packets are not fragmented. MSS is adjusted because it operates at a higher layer than frames and packets, so the data that is provided to the lower-level protocols ends up an appropriate size and does not need to be fragmented.
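The relationship between MTU and MSS is simple arithmetic: the MSS is what remains of the MTU after the IP and TCP headers are subtracted. The sketch below assumes IPv4 and TCP headers without options (20 bytes each). The tunnel_overhead parameter is a hypothetical placeholder for encapsulation overhead, which varies with the tunneling and encryption settings in use.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
TCP_HEADER_BYTES = 20   # TCP header without options

def mss_for_mtu(mtu, tunnel_overhead=0):
    """Largest TCP payload that fits in one unfragmented IPv4 packet."""
    mss = mtu - tunnel_overhead - IPV4_HEADER_BYTES - TCP_HEADER_BYTES
    if mss <= 0:
        raise ValueError("MTU is too small for the given overhead")
    return mss

print(mss_for_mtu(1500))                      # 1460
print(mss_for_mtu(1500, tunnel_overhead=80))  # 1380 (80 is an assumed value)
```

If a tunnel adds overhead but endpoints still use an MSS of 1460, full-size segments no longer fit in the path MTU and must be fragmented or dropped, which is one reason throughput over a tunnel can fall well below expectations.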

Let’s look at this in a scenario and use the CompTIA troubleshooting methodology to resolve the situation: You configure a new VPN for your company using L2TP over IPSec. However, performance over the VPN is much slower than expected. You run a packet capture on the data over the network link using the tcpdump tool. You capture packets smaller than 64 bytes with the tcpdump 'len < 64' command, and then you capture packets larger than 60,000 bytes (the maximum packet size is 65,535 bytes) with the tcpdump 'len > 60000' command.

Step 1: Identify the problem.

Step 2: Establish a theory of probable causes.

Step 3: Test the theory to determine the cause.

Step 4: Establish a plan of action to resolve the problem and implement the solution.

Step 5: Verify full system functionality and, if applicable, implement preventative measures.

Step 6: Document findings, actions, and outcomes.

Automation/Orchestration Issues

Automation and orchestration can be incredibly complex. The advantage of automation and orchestration over manual processes is that automation performs the task the exact same way every time. Unfortunately, things do not stay the same, and processes will need to be updated from time to time. When automation or orchestration fails, evaluate the process and run a change management report to identify all changes made recently to the resources the automation depends on. Usually, something has changed in the environment that is not reflected in the automation workflow.

Some other issues that can arise in automation and orchestration include server name changes, domain changes, and incompatibility with automation tools.

Batch Job Scheduling Issues

Batch jobs can often encounter scheduling issues as data volumes grow, as utilization increases, or as new jobs are added. These issues can be avoided through proper capacity planning and forecasting.

Now that you understand the potential problems, let us try a scenario to see how the CompTIA troubleshooting methodology can help: John is working for a small company that heavily uses cloud services. He has been working for the company for about a month, since the previous IT administrator left. The previous administrator automated a number of tasks. John has been receiving an email each morning stating that space has been added to several virtual machines based on their usage. However, this morning, he received a message stating that the job had failed. When he checked the orchestration, he found that no error trapping was present.

Step 1: Identify the problem.

John investigates the hypervisor cluster that the virtual machines reside on and finds that the virtual disks have been expanded using a large pool on a shared storage device. The logs on the device show expansions corresponding to the emails he has been receiving. He also finds alerts in the logs showing that the storage pool is full. John finds that the machines that were expanded each have 25 percent free space. He also finds that there is an additional 2.3TB available on the SAN that hosts the shared storage.

Step 2: Establish a theory of probable causes.

Step 3: Test the theory to determine the cause.

Step 4: Establish a plan of action to resolve the problem and implement the solution.

Step 5: Verify full system functionality and, if applicable, implement preventative measures.

Step 6: Document findings, actions, and outcomes.
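The scenario notes that the orchestration had no error trapping, which is why John received only a generic failure message. The sketch below shows one way such a job could check pool capacity before each expansion and report a specific error. The function, the exception, and the VM names are hypothetical and do not represent any particular orchestration tool.

```python
class PoolFullError(Exception):
    """Raised when the storage pool cannot satisfy an expansion request."""

def expand_disks(pool_free_gb, requests_gb):
    """Apply expansions in order and stop with a clear error if the pool runs out.

    requests_gb maps a VM name to the additional space requested in GB.
    Returns a tuple of (completed VM names, remaining pool space in GB).
    """
    completed = []
    for vm, size_gb in requests_gb.items():
        if size_gb > pool_free_gb:
            raise PoolFullError(
                f"{vm} needs {size_gb} GB but only {pool_free_gb} GB is free; "
                f"completed so far: {completed}"
            )
        pool_free_gb -= size_gb
        completed.append(vm)
    return completed, pool_free_gb

try:
    expand_disks(50, {"vm-a": 20, "vm-b": 40})
except PoolFullError as err:
    print(f"Expansion job failed: {err}")
```

An error message that names the resource that ran out and lists what had already completed would have pointed John to the full storage pool without his having to search the storage logs.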

Security Issues

Security issues can cause significant problems for system availability and data confidentiality or integrity. Some security issues you should be aware of are as follows:

Federations, domain trusts, and single sign-on: Federations, domain trusts, and single sign-on (SSO) each are technologies that extend authentication and authorization functions across multiple, interdependent systems.

External attacks: External attacks can be minimized using firewalls, intrusion detection systems, hardening, and other concepts discussed in Chapters 10 and 11.

Internal attacks: Separation of duties and least privilege (discussed in Chapter 11) can help reduce the likelihood of internal attacks.

Privilege escalation: System vulnerabilities, incorrectly configured roles, or software bugs can result in situations where malicious code or an attacker can escalate their privileges to gain access to resources that they are unauthorized to access.

External role change: Role change policies should extend out to procedures and practices employed to change authorizations for users to match changes in job roles for employees.

Incorrect hardening settings: Hardening, discussed in Chapter 11, reduces the risk to devices.

Weak or obsolete security technologies: Security technologies age quickly. Those technologies that are out of support may not receive vendor patches and will be unsafe to use in protection of corporate assets.

Insufficient security controls and processes: Insufficient security controls, such as antivirus and firewalls, and processes can increase the likelihood of successful attacks.

This section explores three security topics in more detail along with scenarios that utilize CompTIA’s troubleshooting methodology. These topics include authorization and authentication issues, malware, and certificate issues. Authorization and authentication issues include scenarios such as systems that are deployed without proper service accounts, account lockouts, employees who change positions, and changes made to permissions on a system that affect authorized user access. Malware is also a problem that is frequently encountered. Malware impact can range from low, such as malware that slows a machine, to high-risk malware that results in a data breach. Lastly, certificates are used to secure communication between devices and verify the identity of communication partners. When certificates or the systems around them fail, communication failures are sure to follow, and this can affect business operations significantly.

Authorization and Authentication Issues

Authorization governs which resources a user or service account can access and what the user or service can do with that resource.

Authentication issues can be as simple as users locking their accounts by entering their credentials incorrectly several times consecutively. The user’s account will need to be unlocked before they can access network resources. If many users report permission problems, check services like DNS and Active Directory, or LDAP on Linux servers, to verify that they are functioning. Problems with these services can prevent users from authenticating to domain services. Service accounts can also cause issues. Service accounts are created for very specific uses, and their permissions are usually very granularly defined. However, as needs change, so must the permissions change.

Let’s demonstrate the CompTIA troubleshooting methodology with a scenario: A service account is used to log into a database server. It issues queries to three databases. The service can add data to the tables of one database, but cannot modify the table structure. This account works fine for operating the application, but upgrading the application results in an error stating that tables could not be updated.

Step 1: Identify the problem. The application upgrade fails when updating tables.

Step 2: Establish a theory of probable causes. You theorize that this could be due to a permissions issue with the person running the upgrade or with the service account. You run a trace on the database as the application is upgraded and you identify the account that is being used to perform the upgrade and the queries that fail. The queries are related to adding new fields.

Step 3: Test the theory to determine the cause. You review the permissions for the account and find that it does not have permission to modify the table structure, and adding new fields is a change to the structure.

Step 4: Establish a plan of action to resolve the problem and implement the solution. You recommend that an account with permission to modify the table structure should be used to install the application. Management agrees, and you put in a service ticket to have an account created with the appropriate permissions and roles. Once the account is created, you provide the credentials to the application team.

Step 5: Verify full system functionality and, if applicable, implement preventative measures. The application team reports that the application installs correctly with the new credentials. You confirm that the application upgrade is complete and then submit a ticket to have the account disabled until the application team needs it again.

Step 6: Document findings, actions, and outcomes. You document the account that needs to be used for application updates and the process that must be followed to enable the account.

In this example, you could add the permissions to the account that runs the application, but this would not be the best approach. The application does not need that permission regularly, and something that exploited the application or service could use that to modify the table structure and do more harm to the application. It is best to exercise the principle of least privilege in both user and service accounts.
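The comparison made in Step 3 can be expressed as a set difference between the permissions a task requires and the permissions an account holds. The permission names and account contents below are hypothetical stand-ins chosen to mirror the scenario and are not the syntax of any particular database product.

```python
# Hypothetical permission sets that mirror the scenario.
APP_SERVICE_ACCOUNT = {"SELECT", "INSERT"}         # runs the application day to day
UPGRADE_ACCOUNT = {"SELECT", "INSERT", "ALTER"}    # enabled only during upgrades

def missing_permissions(granted, required):
    """Return the required permissions that the account does not hold."""
    return sorted(set(required) - set(granted))

upgrade_needs = {"SELECT", "INSERT", "ALTER"}

print(missing_permissions(APP_SERVICE_ACCOUNT, upgrade_needs))  # ['ALTER']
print(missing_permissions(UPGRADE_ACCOUNT, upgrade_needs))      # []
```

Keeping the structure-changing permission on a separate account that is disabled between upgrades leaves the everyday service account with only what it needs.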

Malware

Another security issue you might face is the presence of malware. Malware infects machines through infected media that is plugged into a computer or other device, through website downloads or drive-by malware that executes from infected websites, or malicious ads known as malvertising. Malware is also distributed through e-mail, instant messages, and social networking sites.

Computers infected with malware might run slowly or encounter regular problems. Ransomware, a particularly troublesome form of malware, encrypts data on the user’s machine and on network drives the machine has access to.

Let’s demonstrate the CompTIA troubleshooting methodology with a scenario: Aimee, a cloud security engineer, receives reports that user files are being encrypted on the network.

Step 1: Identify the problem. Files are being encrypted on the company NAS. Access logs from the NAS around the time of the encryption show connections from a computer called LAB1014. LAB1014 has a number of encrypted files on its local drive. No other users report encrypted files on their machines, and a spot check by another administrator confirms no encrypted files on a sample of other machines.

Step 2: Establish a theory of probable causes. This could be due to a rogue script or ransomware running on LAB1014.

Step 3: Test the theory to determine the cause. Both theories have the same response. LAB1014 needs to be quarantined immediately so that the problem does not spread and continue. If the cause is a rogue script, the activity will cease after LAB1014 is quarantined. If the cause is ransomware, LAB1014 will continue encrypting files on its local drive, but uninfected machines on the network and the NAS will continue operating normally.

Step 4: Establish a plan of action to resolve the problem and implement the solution. The first step is to isolate LAB1014 from the network so that it cannot infect any other machines. Next, check other computers, starting with devices that were connected to the infected machine, LAB1014, such as file servers or departmental servers and surrounding workstations. Isolate all machines that have malware on them.

Next, make a forensic copy of LAB1014 in case an investigation is required. Once the forensic image is verified, you can begin the process of identifying the malware through virus scanning and removing the malware using virus scanning tools or specific malware removal tools. It is best to scan LAB1014 both with installed antivirus tools and with bootable media that can scan the machine from outside the context of the installed operating system. Sometimes malware tricks the operating system into thinking parts of its code are legitimate. It might even tell the operating system that its files do not exist. Virus scanning tools installed on the operating system rely on the operating system to provide them with accurate information, but this is not always the case. Bootable antivirus tools work independently from the operating system, so they do not suffer from these potential limitations.

Step 5: Verify full system functionality and, if applicable, implement preventative measures. Verify that the ransomware has been removed from LAB1014 and any other machines that may have been identified as containing ransomware in the course of troubleshooting and that new machines are not being infected. Next, restore data to the machines where data was encrypted.

Step 6: Document findings, actions, and outcomes. Create a report of the impact and actions taken.

Certificate Issues

Certificates are used to encrypt and decrypt data, as well as to digitally sign and verify the integrity of data. Each certificate contains a public key that is mathematically related to a private key held by the certificate owner. During the standard process of authentication to a website, a client is presented with a certificate from a website. It then verifies that the certificate chains to a certificate authority in its trusted root store, thus trusting that the certificate was signed by a trusted certificate authority. Afterward, the client verifies that the certificate is coming from the correct web server.

When the certificate is issued, it has an expiration date; certificates must be renewed before the expiration date. Otherwise, they are not usable. Expired certificates or certificates that are misconfigured can make sites unavailable or available with errors for end users.

Misconfigured certificates include sites that have a different name from their certificate, such as a site with the URL www.example.com configured with a certificate for example.com. The missing “www” in the certificate name would result in certificate errors for site visitors.

Consider a scenario with a certificate issue and how the CompTIA troubleshooting methodology could be applied to resolve the issue: Users report that the company website is showing security errors and customers are afraid to go to the website. Some customers on Twitter are saying that the company site has been hacked.

Step 1: Identify the problem. You open the site and see that the site is displaying a certificate error.

Step 2: Establish a theory of probable causes. The certificate either is expired or has been revoked.

Step 3: Test the theory to determine the cause. View the certificate on the web server to see if it is expired. If it is not expired, check the certificate revocation list (CRL) to see if it has been revoked. In this case, the certificate expired.

Step 4: Establish a plan of action to resolve the problem and implement the solution. Discuss renewal of the certificate and receive approval to perform the renewal and a purchase order to purchase the certificate renewal. Complete the renewal of the server certificate.

Step 5: Verify full system functionality and, if applicable, implement preventative measures. Log onto the site to confirm that certificate errors are no longer displayed.

Step 6: Document findings, actions, and outcomes. Identify all certificates in use at the company and when they expire. Discuss which ones are still required and establish a process to review certificates needed at least annually. Next, create a schedule with alerts so that certificates are renewed before they expire. Share the schedule with management so that they can budget for the certificate renewal cost.
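The renewal schedule described in Step 6 can be supported by a simple expiry check. The sketch below uses the Python standard library function ssl.cert_time_to_seconds to parse an expiration date in the format used by the notAfter field that ssl.SSLSocket.getpeercert() returns. The 30-day warning threshold, the sample dates, and the function names are assumptions for illustration.

```python
import ssl
from datetime import datetime, timezone

def days_until_expiry(not_after, now=None):
    """Return whole days until a certificate expires (negative if already expired).

    not_after uses the certificate notAfter format, e.g. 'Jan 31 00:00:00 2030 GMT'.
    """
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (expires - now).days

def needs_renewal(not_after, warn_days=30, now=None):
    """Return True if the certificate expires within warn_days or has expired."""
    return days_until_expiry(not_after, now) <= warn_days

# A fixed check time keeps the example repeatable.
check_time = datetime(2030, 1, 1, tzinfo=timezone.utc)
print(days_until_expiry("Jan 31 00:00:00 2030 GMT", now=check_time))  # 30
print(needs_renewal("Jan 31 00:00:00 2030 GMT", now=check_time))      # True
print(needs_renewal("Jun 30 00:00:00 2030 GMT", now=check_time))      # False
```

Running a check like this on a schedule against every certificate in the inventory gives the alerting that the scenario calls for, with enough lead time to obtain approval and a purchase order.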

CERTIFICATION SUMMARY

This chapter introduced you to troubleshooting tools, described documentation and its importance to company and cloud operations, and explained CompTIA’s troubleshooting methodology. Troubleshooting tools can be used to help identify issues, validate troubleshooting theories, and refine theories. Some tools include ping, traceroute, nslookup, ifconfig, ipconfig, route, netstat, arp, Telnet, and SSH. Understanding which tools are best suited to troubleshoot different issues as they arise with a cloud deployment model saves an administrator time and helps maintain service level agreements set forth by the organization.

Documentation is another important concept for cloud and systems administrators. Documentation needs to be clear and easy to understand for anyone who may need to use it and should be regularly reviewed to ensure that it is up to date and accurate. Documenting the person responsible for creating and maintaining the application and where it is hosted is a good process that saves valuable time when troubleshooting any potential issues with the cloud environment.

Lastly, the CompTIA troubleshooting methodology provides an effective means for evaluating problems, identifying potential solutions, testing those solutions, and putting them into practice. The methodology is broken down into six steps as follows: Step 1: Identify the problem. Step 2: Establish a theory of probable causes. Step 3: Test the theory to determine the cause. Step 4: Establish a plan of action to resolve the problem and implement the solution. Step 5: Verify full system functionality and, if applicable, implement preventative measures. Step 6: Document findings, actions, and outcomes.

KEY TERMS

Use the following list to review the key terms discussed in this chapter. The definitions also can be found in the glossary.

Address Resolution Protocol (ARP) Protocol used to resolve IP addresses to media access control (MAC) addresses.

arp command A command prompt tool that resolves an IP address to a physical (MAC) address.

compute resources The resources that are required for the delivery of virtual machines: disk, processor, memory, and networking.

domain information groper (dig) Command-line tool for querying DNS servers that can run individual queries from the command line or process a file of queries in batch mode.

hop count The total number of devices a packet passes through to reach its intended network target.

ifconfig Interface configuration utility to configure and query TCP/IP network interface settings from a Unix or Linux command line.

Internet Control Message Protocol (ICMP) A protocol that is part of the Internet Protocol suite used primarily for diagnostic purposes.

ipconfig Command-line tool to display and configure TCP/IP network settings and troubleshoot DHCP and DNS settings.

limit A floor or ceiling on the amount of resources that can be utilized for a given entity.

load balancing Networking solution that distributes incoming traffic among multiple resources.

maximum segment size (MSS) The largest segment that can be sent over the network.

maximum transmission unit (MTU) The largest packet or frame that can be sent over the network.

netstat Command-line tool that displays network statistics, including current connections and routing tables.

network latency Any delays typically incurred during the processing of any network data.

nslookup Command-line tool used to query DNS mappings for resource records.

ping Command-line utility used to test the reachability of a destination host on an IP network.

quality of service (QoS) A set of technologies that provides the ability to manage network traffic and prioritize workloads to accommodate defined service levels as part of a cost-effective solution.

quota The total amount of resources that can be utilized for a system.

route command A command prompt tool that can be used to view and manipulate the TCP/IP routing tables of Windows operating systems.

Secure Shell (SSH) A protocol used to secure logins, file transfers, and port forwarding.

Telnet A terminal emulation program for TCP/IP networks that connects the user’s computer to another computer on the network.

time-to-live (TTL) The length of time that a router or caching name server stores a record.

traceroute Linux command-line utility to record the route and measure the delay of packets across an IP network.

tracert Microsoft Windows command-line utility that tracks a packet from your computer to a destination host and displays how many hops the packet takes to reach the destination host.

troubleshooting Techniques that aim to solve a problem that has been realized.

TWO-MINUTE DRILL

Troubleshooting Tools

The ping command is used to troubleshoot the reachability of a host over a network.

The traceroute (or tracert) command can be used to determine the path that an IP packet has to take to reach a destination.

To query a DNS server to obtain domain name or IP address mappings for a specific DNS record, either the nslookup or dig command-line tool can be used.

Ipconfig and ifconfig are command-line utilities that can be used to display the TCP/IP configuration settings of the network interface.

The route command can be used to view and modify routing tables.

The netstat command allows for the display of all active network connections, routing tables, and network protocol statistics.

The arp command resolves an IP address to a physical (MAC) address.

Telnet and SSH allow for execution of commands on a remote server.

Documentation and Analysis

It is important for the cloud administrator to document every aspect of the cloud environment, including its setup and configuration and which applications are running on which host computer or virtual machine.

Documentation needs to be clear and easy to understand for anyone who may need to use it and should be regularly reviewed to ensure that it is up to date and accurate.

When issues come up, cloud professionals need to know where to look to find the data they need to solve the problem. The primary place they look is in log files.

Operating systems, services, and applications create log files that track certain events as they occur on the computer. Log files can store a variety of information, including device changes, device drivers, system changes, events, and much more.

Troubleshooting Methodology

CompTIA has established a troubleshooting methodology consisting of six steps:

Step 1: Identify the problem.

Step 2: Establish a theory of probable causes.

Step 3: Test the theory to determine the cause.

Step 4: Establish a plan of action to resolve the problem and implement the solution.

Step 5: Verify full system functionality and, if applicable, implement preventative measures.

Step 6: Document findings, actions, and outcomes.

SELF TEST

The following questions will help you measure your understanding of the material presented in this chapter. As indicated, some questions may have more than one correct answer, so be sure to read all the answer choices carefully.

Troubleshooting Tools

1. Which of the following command-line tools allows for the display of all active network connections and network protocol statistics?

A. Netstat

B. Ping

C. Traceroute

D. Ipconfig and ifconfig

2. You need to verify the TCP/IP configuration settings of a network adapter on a virtual machine running Microsoft Windows. Which of the following tools should you use?

A. Ping

B. ARP

C. Tracert

D. Ipconfig

3. Which of the following tools can be used to verify if a host is available on the network?

A. Ping

B. ARP

C. Ipconfig

D. Ipconfig and ifconfig

4. Which tool allows you to query DNS to obtain domain name or IP address mappings for a specified DNS record?

A. Ping

B. Ipconfig

C. Nslookup

D. Route

5. You need a way to remotely execute commands against a server that is located on the internal network. Which tool can be used to accomplish this objective?

A. Ping

B. Dig

C. Traceroute

D. Telnet

6. You need to modify a routing table and create a static route. Which command-line tool can you use to accomplish this task?

A. Ping

B. Traceroute

C. Route

D. Host

Documentation and Analysis

7. Users are complaining that an application is taking longer than normal to load. You need to troubleshoot why the application is experiencing startup issues. You want to gather detailed information while the application is loading. What should you enable?

A. System logs

B. Verbose logging

C. Telnet

D. ARP

8. How often should documentation be updated?

A. Annually.

B. Quarterly.

C. It depends on how many people are working on the project.

D. Whenever significant changes are made.

9. Fred manages around 50 cloud servers in Amazon Web Services. Each cloud server is thin provisioned and Fred pays for the amount of space his servers consume. He finds that the logs on the servers are rolling over and that each server has only about six days of logs. He would like to retain 18 months of logs. What should Fred do to retain the logs while conserving space on local hard disks?

A. Compress the log files.

B. Request that the cloud provider deduplicate his cloud data.

C. Purchase and configure a cloud log archiving service.

D. Log into the server every five days and copy the log files to his desktop.

Troubleshooting Methodology

10. Which is not the name of a step in the CompTIA troubleshooting methodology?

A. Seek approval for the requested change.

B. Identify the problem.

C. Document findings, actions, and outcomes.

D. Verify full system functionality.

11. Which step in the CompTIA troubleshooting methodology implements the solution?

A. Step 1

B. Step 2

C. Step 3

D. Step 4

E. Step 5

F. Step 6

SELF TEST ANSWERS

Troubleshooting Tools

1. A. The netstat command can be used to display protocol statistics and all of the currently active TCP/IP network connections, along with Ethernet statistics.

B, C, and D are incorrect. The ping utility is used to troubleshoot the reachability of a host on an IP network. Traceroute is a network troubleshooting tool that is used to determine the path that an IP packet has to take to reach a destination. Ipconfig (Windows) and ifconfig (Linux and Unix) are used to configure the TCP/IP network interface from the command line.

2. D. Ipconfig is a Microsoft Windows command that displays the current TCP/IP network configuration settings for a network interface.

A, B, and C are incorrect. The ping utility is used to troubleshoot the reachability of a host on an IP network. ARP resolves an IP address to a physical address or MAC address. Tracert is a Microsoft Windows network troubleshooting tool that is used to determine the path that an IP packet has to take to reach a destination.

3. A. The ping utility is used to troubleshoot the reachability of a host on an IP network. Ping sends an ICMP echo request packet to a specified IP address or host and waits for an ICMP reply.

B, C, and D are incorrect. ARP resolves an IP address to a physical address or MAC address. Ipconfig and ifconfig display the current TCP/IP network configuration settings for a network interface.
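To make the ICMP echo request in answer 3 concrete, the following Python sketch builds the bytes of one such packet. It is an illustration only, not how any particular ping implementation is written: the function names, the identifier and sequence values, and the `b"ping"` payload are all assumptions chosen for the example. Actually sending the packet would require a raw socket and elevated privileges, which the sketch leaves out.

```python
import struct

ICMP_ECHO_REQUEST = 8  # ICMP type number for an echo request


def internet_checksum(data: bytes) -> int:
    """One's-complement checksum over 16-bit words, as used by ICMP."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input to a whole 16-bit word
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    while total >> 16:  # fold any carry bits back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_echo_request(identifier: int, sequence: int, payload: bytes = b"ping") -> bytes:
    """Return the ICMP header plus payload for an echo request (hypothetical helper)."""
    # The checksum is first computed with the checksum field set to zero.
    header = struct.pack("!BBHHH", ICMP_ECHO_REQUEST, 0, 0, identifier, sequence)
    checksum = internet_checksum(header + payload)
    header = struct.pack("!BBHHH", ICMP_ECHO_REQUEST, 0, checksum, identifier, sequence)
    return header + payload
```

A receiver can validate the packet by running the same checksum over all of its bytes; a correctly built packet yields zero.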

4. C. Using the nslookup command, it is possible to query the Domain Name System to obtain domain name or IP address mappings for a specified DNS record.

A, B, and D are incorrect. The ping utility is used to troubleshoot the reachability of a host on an IP network. The ipconfig command displays the current TCP/IP network configuration settings for a network interface. The route command can view and manipulate the TCP/IP routing tables of operating systems.
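The name-to-address lookup that nslookup performs can be approximated from a script. The sketch below is an assumption-laden stand-in rather than a reimplementation of nslookup: it asks the operating system's resolver through Python's standard `socket.getaddrinfo` call, so results may also come from the local hosts file, and the `resolve` helper name is hypothetical.

```python
import socket


def resolve(name: str) -> list[str]:
    """Return the unique IP addresses the system resolver reports for a name."""
    addresses: list[str] = []
    for _family, _socktype, _proto, _canonname, sockaddr in socket.getaddrinfo(name, None):
        ip = sockaddr[0]  # first element of the socket address is the IP string
        if ip not in addresses:  # getaddrinfo repeats addresses per socket type
            addresses.append(ip)
    return addresses
```

Calling `resolve` with a host name returns whatever addresses the resolver finds for it; passing an IP address literal simply returns that address without a DNS query.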

5. D. Telnet allows you to connect to another computer and enter commands via the Telnet program. The commands will be executed as if you were entering them directly on the server console.

A, B, and C are incorrect. The ping utility is used to troubleshoot the reachability of a host on an IP network. The dig command can be used to query domain name servers and can operate in interactive command-line mode or batch query mode. Traceroute is a network troubleshooting tool that is used to determine the path that an IP packet has to take to reach a destination.

6. C. You can use the route command to view and manipulate the TCP/IP routing tables and create static routes.

A, B, and D are incorrect. The ping utility is used to troubleshoot the reachability of a host on an IP network. Traceroute is a network troubleshooting tool that is used to determine the path that an IP packet has to take to reach a destination. The host utility can be used to perform DNS lookups.
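As a rough illustration of creating a static route with the route command, the sketch below assembles the command-line arguments without running them. The networks and gateway shown are made-up example values, the `static_route_command` helper is hypothetical, and the argument layouts assume the classic Linux net-tools `route` syntax and the Windows `route ADD` syntax; check the documentation for your platform before relying on either.

```python
import ipaddress


def static_route_command(network: str, gateway: str, platform: str = "linux") -> list[str]:
    """Build (but do not execute) a route command that adds a static route."""
    net = ipaddress.ip_network(network)  # raises ValueError for a malformed network
    gw = ipaddress.ip_address(gateway)   # raises ValueError for a malformed gateway
    if platform == "windows":
        # Assumed Windows form: route ADD <network> MASK <netmask> <gateway>
        return ["route", "ADD", str(net.network_address), "MASK", str(net.netmask), str(gw)]
    # Assumed Linux net-tools form: route add -net <network> netmask <netmask> gw <gateway>
    return ["route", "add", "-net", str(net.network_address),
            "netmask", str(net.netmask), "gw", str(gw)]
```

Validating the addresses before building the command catches typing mistakes early, which matters because a wrong static route can cut off access to the very host being configured.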

Documentation and Analysis

7. B. Verbose logging records more detailed information than standard logging and is recommended to troubleshoot a specific problem.

A, C, and D are incorrect. System log files can store a variety of information, including device changes, device drivers, system changes, and events, but would not provide detailed information on a particular application. ARP resolves an IP address to a physical address or MAC address. Telnet allows a user to connect to another computer and enter commands, and the commands are executed as if they were entered directly on the server console.
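How verbose logging is enabled depends on the application. As one possible illustration, assuming an application written in Python that uses the standard `logging` module, the sketch below lowers a logger's threshold to DEBUG so that detailed startup messages are recorded. The `enable_verbose_logging` helper and the `app.startup` logger name are hypothetical.

```python
import logging


def enable_verbose_logging(logger_name: str) -> logging.Logger:
    """Switch the named logger to DEBUG so detailed messages are emitted."""
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.DEBUG)  # DEBUG is the most detailed standard level
    if not logger.handlers:  # avoid attaching duplicate handlers on repeat calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger


startup_log = enable_verbose_logging("app.startup")
startup_log.debug("loading configuration")  # recorded only at the verbose level
```

Because verbose logging produces far more output than standard logging, it is usually switched back off once the specific problem has been diagnosed.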

8. D. Each time a significant change is made, the documentation should be updated to reflect the change. Otherwise, coworkers, auditors, or other employees might work from out-of-date information.

A, B, and C are incorrect. Updating documentation annually or quarterly might do very little if nothing has changed in that interval. It is not the interval that matters, but the changes that are made to the device. Answer C also does not address what has changed. The number of people on a project does not matter. Even if there is a single person responsible for the system, it should still be documented.

9. C. A cloud log archiving service would allow Fred to retain the logs on the archiving service while freeing up space on the local disks. The log archival space would likely be cheaper than the production space, and log archiving services offer additional analytical and searching tools to make reviewing the logs easier.

A, B, and D are incorrect. Compressing the log files would do little to extend how long the logs could be retained. Fred wants to keep 18 months of logs, and compression might get him only a few more days. Deduplication of cloud data would operate on the entire drive. This could save space overall, but it does not specifically address the issue of log sizes. The logs would benefit slightly, as would the rest of the system, but not enough to store logs long-term on primary storage. D is incorrect because it is not practical for Fred to perform a manual process every five days. Fred will likely forget some days, and the copies will be harder to search and subject to data loss on his desktop.
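A quick calculation shows why compression alone cannot close the gap in Fred's scenario. The sketch below assumes a steady daily log volume and treats a month as 30 days; both are simplifications made for the example, and the function name is hypothetical.

```python
def required_space_multiplier(days_retained_now: float, days_wanted: float) -> float:
    """How many times more log capacity is needed, assuming steady daily log volume."""
    return days_wanted / days_retained_now


days_wanted = 18 * 30  # 18 months, approximated as 30-day months
multiplier = required_space_multiplier(6, days_wanted)
print(f"Fred needs about {multiplier:.0f} times the current log capacity")
```

Under these assumptions, the local disks would have to hold roughly 90 times as much log data as they do now, which is why the answer points to a separate archiving service rather than local space savings.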

Troubleshooting Methodology

10. A. Change requests were discussed in the previous chapter, and it is important to seek approval for changes before making them. However, this is not the name of a step in the CompTIA troubleshooting methodology.

B, C, and D are incorrect. Each of these is a step in the CompTIA troubleshooting methodology. The steps are as follows: Step 1: Identify the problem. Step 2: Establish a theory of probable causes. Step 3: Test the theory to determine the cause. Step 4: Establish a plan of action to resolve the problem and implement the solution. Step 5: Verify full system functionality and, if applicable, implement preventative measures. Step 6: Document findings, actions, and outcomes.

11. D. Step 4 implements the solution.

A, B, C, E, and F are incorrect. Only Step 4 implements the solution. The other steps either lead up to the solution or validate and document following the solution. The steps in the CompTIA troubleshooting methodology are as follows: Step 1: Identify the problem. Step 2: Establish a theory of probable causes. Step 3: Test the theory to determine the cause. Step 4: Establish a plan of action to resolve the problem and implement the solution. Step 5: Verify full system functionality and, if applicable, implement preventative measures. Step 6: Document findings, actions, and outcomes.
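The six steps listed above can be captured as a small ordered type, which some teams find handy for ticket templates or runbooks. This is only a sketch: the member names are shortened paraphrases of the step names, and the `next_step` helper is hypothetical.

```python
from enum import IntEnum
from typing import Optional


class TroubleshootingStep(IntEnum):
    """The six steps of the methodology, numbered as in the answer above."""
    IDENTIFY_PROBLEM = 1
    ESTABLISH_THEORY = 2
    TEST_THEORY = 3
    PLAN_AND_IMPLEMENT = 4  # the step that implements the solution
    VERIFY_FUNCTIONALITY = 5
    DOCUMENT_FINDINGS = 6


def next_step(current: TroubleshootingStep) -> Optional[TroubleshootingStep]:
    """Return the step that follows `current`, or None after the final step."""
    if current is TroubleshootingStep.DOCUMENT_FINDINGS:
        return None
    return TroubleshootingStep(current + 1)
```

For example, `next_step(TroubleshootingStep.TEST_THEORY)` returns `TroubleshootingStep.PLAN_AND_IMPLEMENT`, matching the order in which the steps are performed.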
