CHAPTER 5

Monitor and troubleshoot Windows Server environments

Windows Servers should be monitored and maintained over their operational lifespans to ensure that they continue to function as well as possible whether they are deployed on-premises or in Azure. Monitoring and maintenance not only involve regularly checking performance and event logs but also troubleshooting problems when computers won’t boot. You also need to know what techniques to employ to diagnose what the issue might be when hybrid workloads in different locations cannot communicate due to either network problems or problems related to hybrid identity.

Skills covered in this chapter:

Skill 5.1: Monitor Windows Server by using Windows Server tools and Azure services

This objective deals with monitoring how Windows Server is running and how you can centralize the collection of performance and event telemetry into a Log Analytics workspace. To master this objective, you’ll need to understand the different technologies that you can use to monitor Windows Server and the Azure hybrid tools that are available to collect and analyze that information.

This section covers how to:

  • Monitor Windows Server by using Performance Monitor
  • Create and configure data collector sets
  • Monitor servers using Windows Admin Center
  • Monitor by using System Insights
  • Manage event logs
  • Deploy Azure Monitor and Log Analytics agents
  • Collect performance counters to Azure
  • Create alerts
  • Monitor Azure Virtual Machines by using Azure Diagnostics extension
  • Monitor Azure Virtual Machines performance by using VM Insights

Monitor Windows Server by using Performance Monitor

Every server administrator has heard users complain that a server is running slow, but determining whether it’s just a matter of perception on the part of a user or an actual slowdown requires you to monitor the server’s performance. You can do this by collecting and analyzing performance telemetry.

Performance Monitor

Performance Monitor allows you to view real-time telemetry on various performance counters. When adding a counter, you can choose whether you wish to display a single instance or a total of all instances. For example, if a server has more than one CPU, you can choose to monitor each CPU individually or add a counter that monitors total CPU usage, as shown in Figure 5-1.

You add counters to Performance Monitor that measure the following:

  • Memory
  • Processor
  • System
  • Physical disk
  • Network interface

This screenshot shows the Performance Monitor with %Processor Time counters displayed.

FIGURE 5-1 Performance Monitor

Performance counter alerts

Performance counter alerts enable you to configure a task to run when a performance counter, such as available disk space or memory, falls under or exceeds a specific value. To configure a performance counter alert, you create a new data collector set, choose the Create Manually option, and select the Performance Counter Alert option.

You specify the performance counter, the threshold value, and whether the alert should be triggered when the value exceeds or falls below that threshold. By default, a triggered alert only adds an event to the event log. This is useful if you have tools that can extract this information from the event log and use it to track when alerts occur. You can also configure an alert to run a scheduled task when it is triggered by editing the properties of the alert and specifying the name of the scheduled task on the Task tab.
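
The threshold logic described above can be sketched in a few lines of Python. This is only an illustration of the comparison an alert performs, not how Performance Monitor implements it; the CounterAlert class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CounterAlert:
    """Hypothetical model of a performance counter alert."""
    counter: str        # counter path, e.g. r"\Memory\Available MBytes"
    threshold: float    # value the sample is compared against
    trigger_when: str   # "above" or "below"

    def is_triggered(self, sample: float) -> bool:
        # An alert fires when the sample crosses the threshold in the
        # configured direction; a sample equal to the threshold does not fire.
        if self.trigger_when == "above":
            return sample > self.threshold
        if self.trigger_when == "below":
            return sample < self.threshold
        raise ValueError("trigger_when must be 'above' or 'below'")


# Example: alert when available memory falls below 512 MB.
low_memory = CounterAlert(r"\Memory\Available MBytes", 512, "below")
```

A sample of 300 MB would trigger low_memory, while a sample of 2,048 MB would not.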

NEED MORE REVIEW? PERFORMANCE MONITOR

You can learn more about Performance Monitor at https://docs.microsoft.com/en-us/troubleshoot/windows-server/performance/performance-overview.

Create and configure data collector sets

Data collector sets enable you to gather performance data, system configuration information, and statistics into a single file. You can use Performance Monitor or other third-party tools to analyze this information to determine how well a server is functioning against an assigned workload.

You can configure data collector sets to include the following:

  • Performance counter data The data collector set includes not only specific performance counters, but also the data generated by those counters.
  • Event trace data Enables you to track events and system activities. Event trace data can be useful when you need to troubleshoot misbehaving applications or services.
  • System configuration information Enables you to track the state of registry keys and record any modifications made to those keys.

Windows Server includes the following built-in data collector sets:

  • Active Directory Diagnostics Available if you have installed the computer as a domain controller; it provides data on Active Directory health and reliability.
  • System Diagnostics Enables you to troubleshoot problems with hardware, drivers, and STOP errors.
  • System Performance Enables you to diagnose problems with sluggish system performance. You can determine which processes, services, or hardware components might be causing performance bottlenecks.

To create a data collector set, perform the following steps:

  1. Open Performance Monitor from the Tools menu of the Server Manager console.
  2. Expand Data Collector Sets.
  3. Select User Defined. On the Action menu, select New and then select Data Collector Set.
  4. You are given the option to create the data collector set from a template, which enables you to choose from an existing data collector set, or to create a data collector set manually. If you choose to create a data collector set manually, you have the option to create a data log, which can include a performance counter, event trace data, and system configuration information, or to create a performance counter alert.
  5. If you select to create data logs and select Performance Counter, next choose which performance counters to add to the data collector set. You can also specify how often Windows should collect data from the performance counters.
  6. If you choose to include event trace data, you need to enable event trace providers. A large number of event trace providers are available with Windows Server 2019. You use event trace providers when troubleshooting a specific problem. For example, the Microsoft-Windows-AppLocker event trace provider helps you diagnose and troubleshoot issues related to AppLocker.
  7. If you choose to monitor system configuration information, you can select registry keys to monitor. Selecting a parent key enables you to monitor all registry changes that occur under that key while the data collector set is running.
  8. Next, specify where you want data collected by the data collector set to be stored. The default location is the %systemdrive%\PerfLogs\Admin folder. If you intend to run the data collector set for an extended period of time, you should store the data on a volume that is separate from the one hosting the operating system because this ensures that you don’t fill up your system drive by mistake.
  9. The final step in setting up a data collector set is to specify the account under which the data collector set runs. The default is Local System, but you can configure the data collector set to use any account that you have the credentials for.

You can schedule when a data collector set runs by configuring the Schedule tab of a data collector set’s properties.
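
Data collector sets can also be created from the command line with the logman utility. The sketch below only assembles a logman create counter command as an argument list and does not run it. The function name, the set name, and the D:\PerfLogs output path are hypothetical, and you should confirm the options against logman /? on your own server before relying on them.

```python
from typing import List, Sequence


def build_logman_command(name: str,
                         counters: Sequence[str],
                         interval_seconds: int,
                         output_path: str) -> List[str]:
    """Assemble (but do not execute) a 'logman create counter' command."""
    if not counters:
        raise ValueError("at least one performance counter is required")
    return [
        "logman", "create", "counter", name,
        "-si", str(interval_seconds),   # sample interval in seconds
        "-o", output_path,              # where the log file is written
        "-c", *counters,                # one or more counter paths
    ]


# Hypothetical example: sample CPU and memory every 15 seconds and store
# the log on a volume other than the system drive.
command = build_logman_command(
    "ServerBaseline",
    [r"\Processor(_Total)\% Processor Time", r"\Memory\Available MBytes"],
    15,
    r"D:\PerfLogs\ServerBaseline",
)
```

On a Windows Server computer, the resulting list could be passed to subprocess.run from an elevated session.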

NEED MORE REVIEW? DATA COLLECTOR SETS

You can learn more about data collector sets at https://docs.microsoft.com/en-us/previous-versions/windows/it-pro/windows-server-2008-r2-and-2008/cc749337(v=ws.11).

Monitor servers using Windows Admin Center

Windows Admin Center includes performance monitoring functionality that is similar to the traditional Performance Monitor console. Both versions of Performance Monitor can use the same performance counters, but the Windows Admin Center version is designed to be able to connect to remote servers to simplify the process of remote performance monitoring. Performance Monitor in Windows Admin Center can show data in line format, report, min-max, and heatmap (shown in Figure 5-2).

This screenshot shows the Windows Admin Center Performance Monitor with %Processor Time counters displayed in Heatmap.

FIGURE 5-2 Windows Admin Center Performance Monitor heatmap

NEED MORE REVIEW? WINDOWS ADMIN CENTER ALERTS

You can learn more about alerts in Windows Admin Center at https://docs.microsoft.com/en-us/windows-server/manage/windows-admin-center/azure/azure-monitor.

Monitor by using System Insights

System Insights is a predictive analytics tool for Windows Server that predicts, using machine learning, when a server’s current resource capacity will be exceeded. By default, System Insights can predict future resource consumption for compute, networking, and storage. System Insights needs approximately 180 days of data to be able to make a prediction for the next 60 days. System Insights also includes the ability for third parties to include add-ons. System Insights will provide the following prediction status indicators:

  • OK Forecast indicates capacity will not be exceeded.
  • Warning Forecast predicts that capacity will be exceeded within 30 days.
  • Critical Forecast predicts that capacity will be exceeded in the next 7 days.
  • Error An unexpected error has occurred.
  • None Not enough data has been collected to make a forecast.
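
The status indicators above can be read as a simple decision rule. The following Python sketch is one way to express that rule. The function and parameter names are hypothetical, the inclusive boundaries at 7 and 30 days are an assumption, and the Error status is omitted because it reflects a failure rather than a forecast.

```python
from typing import Optional


def prediction_status(days_until_exceeded: Optional[int],
                      enough_data: bool = True) -> str:
    """Map a forecast to a status label.

    days_until_exceeded is None when the forecast does not predict that
    capacity will be exceeded.
    """
    if not enough_data:
        return "None"          # not enough data collected to forecast
    if days_until_exceeded is None:
        return "OK"            # capacity is not forecast to be exceeded
    if days_until_exceeded <= 7:
        return "Critical"      # exceeded within the next 7 days
    if days_until_exceeded <= 30:
        return "Warning"       # exceeded within 30 days
    return "OK"                # assumption: beyond 30 days is treated as OK
```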

NEED MORE REVIEW? SYSTEM INSIGHTS

You can learn more about System Insights at https://docs.microsoft.com/en-us/windows-server/manage/system-insights/overview.

Manage event logs

Event Viewer enables you to access recorded event information. Not only does Windows Server offer the application, security, setup, and system logs, but it also contains separate application and service logs. These logs are designed to provide information on a per-role or per-application basis, rather than having all application and role service-related events funneled into the application log. When searching for events related to a specific role service, feature, or application, check to see whether that role service, feature, or application has its own application log.

Event log filters

Event log filters enable you to view only those events that have specific characteristics. A filter applies to a single event log and only to the current Event Viewer session. If you regularly use a specific filter or set of filters to manage event logs, you should instead create a custom view. You can create log filters based on the following properties:

  • Logged Enables you to specify the time range for the filter.
  • Event Level Enables you to specify event levels. You can choose the following options: Critical, Warning, Verbose, Error, and Information.
  • Event Sources Enables you to choose the source of the event.
  • Event IDs Enables you to filter based on event ID. You can also exclude specific event IDs.
  • Keywords Enables you to specify keywords based on the contents of events.
  • User Enables you to limit events based on user.
  • Computer Enables you to limit events based on the computer.

To create a filter, perform the following steps:

  1. Open Event Viewer and select the log that you want to filter.
  2. Determine the properties of the event that you want to filter.
  3. In the Actions pane, select Filter Current Log.
  4. In the Filter Current Log dialog box, specify the filter properties.
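
Behind the Filter Current Log dialog box, Event Viewer expresses a filter as an XPath query, which you can see on the dialog box's XML tab. The Python sketch below shows how the event level, event source, and event ID properties might combine into such a query. The build_xpath helper is hypothetical and covers only those three properties.

```python
from typing import Iterable, Optional

# Numeric values that event logs use for each event level.
LEVELS = {"Critical": 1, "Error": 2, "Warning": 3, "Information": 4, "Verbose": 5}


def build_xpath(levels: Iterable[str] = (),
                event_ids: Iterable[int] = (),
                source: Optional[str] = None) -> str:
    """Build an XPath event filter from a few filter properties."""
    clauses = []
    if source:
        clauses.append(f"Provider[@Name='{source}']")
    level_terms = [f"Level={LEVELS[name]}" for name in levels]
    if level_terms:
        clauses.append("(" + " or ".join(level_terms) + ")")
    id_terms = [f"EventID={event_id}" for event_id in event_ids]
    if id_terms:
        clauses.append("(" + " or ".join(id_terms) + ")")
    if not clauses:
        return "*"  # no restrictions: match every event
    return "*[System[" + " and ".join(clauses) + "]]"


# Critical and error events with event ID 41.
query = build_xpath(levels=["Critical", "Error"], event_ids=[41])
```

Here query evaluates to *[System[(Level=1 or Level=2) and (EventID=41)]].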

Event log views

Event log views enable you to create customized views of events across any event log stored on a server, including events in the forwarded event log. Rather than looking through each event log for specific items of interest, you can create event log views that target only those specific items. For example, if there are certain events that you always want to look for, create a view and use the view rather than comb through the event logs another way. By default, Event Viewer includes a custom view named Administrative Events. This view displays critical, warning, and error events from a variety of important event logs such as the application, security, and system logs.

Views differ from filters in the following ways:

  • Persistent You can use a view across multiple Event Viewer sessions. If you configure a filter on a log, it is not available the next time you open the Event Viewer.
  • Include multiple logs A custom view can display events from separate logs. Filters are limited to displaying events from one log.
  • Exportable You can import and export event log views between computers.

The process for creating an event log view is similar to the process for creating a filter. The primary difference is that you can select events from multiple logs, and you give the event log view a name and choose a place to save it. To create an event log view, perform the following steps:

  1. Open Event Viewer.
  2. Select the Custom Views node and then select Create Custom View from the Actions menu.
  3. In the Create Custom View dialog box, select the properties of the view, including:
    • When the events are logged
    • The event level
    • Which event log to draw events from
    • Event source
    • Task category
    • Keywords
    • User
    • Computer
  4. In the Save Filter To Custom View dialog box, enter a name for the custom view and a location to save the view in. Select OK.
  5. Verify that the new view is listed as its own separate node in the Event Viewer.

You can export a custom event log view by selecting the event log view and selecting Export Custom View. Exported views can be imported on other computers running Windows Server.

Event subscriptions

Event log forwarding enables you to centralize the collection and management of events from multiple computers. Rather than having to examine each computer’s event log by remotely connecting to that computer, event log forwarding enables you to do one of the following:

  • Configure a central computer to collect specific events from source computers. Use this option in environments where you need to consolidate events from only a small number of computers.
  • Configure source computers to forward specific events to a collector computer. Use this option when you have a large number of computers that you want to consolidate events from. You configure this method by using Group Policy.

Event log forwarding enables you to configure the specific events that are forwarded to the central computer. This enables the computer to forward important events. It isn’t necessary, however, to forward all events from the source computer. If you discover something from the forwarded traffic that warrants further investigation, you can log on to the original source computer and view all the events from that computer in a normal manner.

Event log forwarding uses Windows Remote Management (WinRM) and the Windows Event Collector (wecsvc). You need to enable these services on computers that function as event forwarders and event collectors. You configure WinRM by using the winrm quickconfig command and you configure wecsvc by using the wecutil qc command. If you want to configure subscriptions from the security event log, you need to add the computer account of the collector computer to the local Administrators group on the source computer.

To configure a collector-initiated event subscription, configure WinRM and Windows Event Collector on the source and collector computers. In the Event Viewer, configure the Subscription Properties dialog box with the following information:

  • Subscription Name The name of the subscription.
  • Destination Log The log where collected events are stored.
  • Subscription Type and Source Computers: Collector Initiated Use the Select Computers dialog box to add the computers that the collector retrieves events from. The computer account of the collector must be a member of the local Administrators group or the Event Log Readers group on each source computer, depending on whether access to the security log is required.
  • Events to Collect Create a custom view to specify which events are retrieved from each of the source computers.

If you want to instead configure a source computer-initiated subscription, you need to configure the following Group Policies on the computers that act as the event forwarders:

  • Configure Forwarder Resource Usage This policy determines the maximum event forwarding rate in events per second. If this policy is not configured, events are transmitted as soon as they are recorded.
  • Configure Target Subscription Manager This policy enables you to set the location of the collector computer.

Both of these policies are located in the Computer Configuration\Policies\Administrative Templates\Windows Components\Event Forwarding node. When configuring the subscription, you must also specify the computer groups that hold the computer accounts of the computers that are forwarding events to the collector.
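
The Configure Target Subscription Manager policy takes a string value that identifies the collector. The sketch below builds a value in the commonly documented form for WinRM over HTTP on port 5985. The helper name and the collector FQDN are hypothetical, and you should verify the exact format against the policy's own help text before deploying it.

```python
def subscription_manager_value(collector_fqdn: str,
                               refresh_seconds: int = 60) -> str:
    """Build a Target Subscription Manager value for an HTTP collector.

    refresh_seconds controls how often source computers check the
    collector for subscription changes.
    """
    if not collector_fqdn:
        raise ValueError("collector_fqdn is required")
    return (
        f"Server=http://{collector_fqdn}:5985/wsman/SubscriptionManager/WEC,"
        f"Refresh={refresh_seconds}"
    )


# Hypothetical collector name.
value = subscription_manager_value("collector.contoso.internal")
```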

Event-driven tasks

Event Viewer enables you to attach tasks to specific events. A limitation of the process of creating event-driven tasks is that an example of the event that triggers the task must already be present in the event log. Tasks are triggered when an event is logged that has the same log, source, and event ID.

To attach a task to a specific event, perform the following steps:

  1. Open Event Viewer. Locate and select the event that you want to base the new task on.
  2. In the Event Viewer Actions pane, select Attach Task To This Event. The Create Basic Task Wizard opens.
  3. On the Create A Basic Task page, review the name of the task that you want to create. By default, the task is named after the event. Select Next.
  4. On the When An Event is Logged page, review the information about the event. This lists the log that the event originates from, the source of the event, and the event ID. Select Next.
  5. On the Action page, you can choose the task to perform. The Send An E-Mail and Display A Message tasks are deprecated, and you receive an error if you try to create a task using these actions. Select Next.
  6. On the Start A Program page, specify the program or script that should be automatically triggered as well as additional arguments.
  7. After you create the task, you can modify the task to specify the security context under which the task executes. By default, event tasks only run when the user is signed on, but you can also choose to configure the task to run whether the user is signed on or not.
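
The same kind of task can be created from the command line with schtasks by using an ONEVENT schedule. The sketch below only assembles the command and does not run it. The function name, task name, and script path are hypothetical, and the trigger matches on log, source, and event ID as described above.

```python
from typing import List


def build_event_task_command(task_name: str, log: str, source: str,
                             event_id: int, program: str) -> List[str]:
    """Assemble (but do not execute) a schtasks command for an event trigger."""
    # XPath query that selects events by source (provider) and event ID.
    query = f"*[System[Provider[@Name='{source}'] and EventID={event_id}]]"
    return [
        "schtasks", "/Create",
        "/TN", task_name,    # name of the scheduled task
        "/TR", program,      # program or script to run
        "/SC", "ONEVENT",    # run when a matching event is logged
        "/EC", log,          # event log (channel) to watch
        "/MO", query,        # which events in that log trigger the task
    ]


# Hypothetical example: run a script when a service terminates unexpectedly.
task = build_event_task_command(
    "ServiceCrashResponse", "System", "Service Control Manager", 7034,
    r"C:\Scripts\collect-diagnostics.cmd",
)
```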

NEED MORE REVIEW? MANAGE EVENT LOGS

You can learn more about managing event logs at https://docs.microsoft.com/en-us/previous-versions/windows/it-pro/windows-server-2008-r2-and-2008/cc722404(v=ws.11).

Deploy Azure Monitor and Log Analytics agents

Azure Monitor allows you to collect, analyze, and respond to telemetry from a variety of Azure resources as well as Windows Server VMs that are connected through the Azure Monitor or Log Analytics agent.

Azure Monitor agents

One confusing element is the ongoing shift in naming of the various agents used by Azure Monitor and Log Analytics. At various points the agents used have had different names and different functionality. Some documentation will refer to OMS agents, others Azure Monitor agents, and some Log Analytics agents.

  • Azure Monitor agent The most recent version of the agent consolidates features from all prior agents.
  • Log Analytics agent This agent can be deployed on Azure IaaS VMs or Windows Server instances in hybrid environments. Data from the Log Analytics agent is ingested by Azure Monitor logs. The Log Analytics agent will be deprecated by Microsoft in August 2024.
  • Azure Diagnostics extension This is only used with Azure IaaS VMs. The Azure Diagnostics extension sends the data that it collects to Azure Storage, Azure Event Hubs, and Azure Monitor Metrics.

NEED MORE REVIEW? AZURE MONITOR AGENT FAQ

You can learn more about the different Azure Monitor agents at https://docs.microsoft.com/en-us/azure/azure-monitor/faq#azure-monitor-agent.

Installing the agent

Installing the Azure Monitor or Log Analytics agent requires both the Log Analytics workspace ID and the key for the Log Analytics workspace. You can download the agent from the Log Analytics workspace directly and install using the wizard, where you enter the workspace ID and primary key, or from the command line, where you pass these values through in the installation command.
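
As a rough illustration of the command-line option for the Log Analytics agent, the sketch below assembles a silent installation command that passes the workspace ID and key as installer properties. The property names are given as recalled from Microsoft's agent documentation and should be checked against the current documentation; the function name and the placeholder workspace values are hypothetical. The command is built but not executed.

```python
from typing import List


def build_agent_install_command(workspace_id: str, workspace_key: str,
                                setup_path: str = "setup.exe") -> List[str]:
    """Assemble (but do not execute) a silent Log Analytics agent install."""
    if not workspace_id or not workspace_key:
        raise ValueError("both the workspace ID and the workspace key are required")
    return [
        setup_path,
        "/qn",                                      # silent install, no UI
        "NOAPM=1",
        "ADD_OPINSIGHTS_WORKSPACE=1",               # connect to a workspace
        "OPINSIGHTS_WORKSPACE_AZURE_CLOUD_TYPE=0",  # 0 = Azure commercial cloud
        f"OPINSIGHTS_WORKSPACE_ID={workspace_id}",
        f"OPINSIGHTS_WORKSPACE_KEY={workspace_key}",
        "AcceptEndUserLicenseAgreement=1",
    ]


# Placeholder values only; never hard-code a real workspace key in a script.
install = build_agent_install_command("<workspace-id>", "<workspace-key>")
```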

Network requirements

The Log Analytics agent needs to be able to connect either directly or via proxy to the following hosts on the internet via port 443:

You can also configure a Log Analytics gateway to allow connections between Log Analytics agents and Azure Monitor. The Log Analytics gateway is an HTTP forward proxy that sends data to Azure Automation and Log Analytics workspace. You use the Log Analytics gateway if you restrict which hosts can communicate from your on-premises network. Rather than having each host communicate with the Log Analytics workspace in Azure, traffic is routed from each host to the gateway that functions as a proxy. The Log Analytics gateway is only for log agent activity and does not support Azure Automation features such as runbooks and update automation.

NEED MORE REVIEW? LOG ANALYTICS AGENT

You can learn more about the Log Analytics agent at https://docs.microsoft.com/en-us/azure/azure-monitor/agents/agent-windows.

Collect performance counters to Azure

Rather than sending all possible telemetry from an on-premises server to Azure, data collection rules allow you to specify how data should be collected before it is transmitted to Azure Monitor logs.

Standard data collection rules collect and send data to Azure Monitor. To create a data collection rule, perform the following general steps:

  1. In Azure Monitor, select Data Collection Rules and then select Create.
  2. Provide a rule name, subscription, resource group, and region, and set the platform type to Windows. You can also set it to Linux, but that isn’t relevant to the AZ-801 exam.
  3. On the Resources page, select the virtual machines or connected servers that you wish to associate with the data collection rule.
  4. Select which data you want to collect. Here you select the performance counters with the option of selecting a particular set of objects as well as a sampling rate. You can also select events to collect from event logs using this method.
  5. On the Destination page, select where you want to deploy data. You can send data to multiple Log Analytics workspaces.
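
The choices made in these steps end up in a JSON definition of the data collection rule. The sketch below builds a minimal definition of that general shape as a Python dictionary, with one set of performance counters flowing to one Log Analytics workspace. The function name, the perfCounters and centralWorkspace labels, the region, and the placeholder workspace resource ID are all hypothetical, and a real rule may need additional properties.

```python
import json
from typing import Any, Dict, Sequence


def build_data_collection_rule(location: str,
                               workspace_resource_id: str,
                               counters: Sequence[str],
                               sampling_seconds: int = 60) -> Dict[str, Any]:
    """Build a minimal data collection rule body for performance counters."""
    return {
        "location": location,
        "kind": "Windows",
        "properties": {
            # What to collect and how often to sample it.
            "dataSources": {
                "performanceCounters": [{
                    "name": "perfCounters",
                    "streams": ["Microsoft-Perf"],
                    "samplingFrequencyInSeconds": sampling_seconds,
                    "counterSpecifiers": list(counters),
                }]
            },
            # Where the collected data can be sent.
            "destinations": {
                "logAnalytics": [{
                    "name": "centralWorkspace",
                    "workspaceResourceId": workspace_resource_id,
                }]
            },
            # Which streams go to which destinations.
            "dataFlows": [{
                "streams": ["Microsoft-Perf"],
                "destinations": ["centralWorkspace"],
            }],
        },
    }


rule = build_data_collection_rule(
    "australiaeast",
    "<log-analytics-workspace-resource-id>",
    [r"\Processor(_Total)\% Processor Time", r"\Memory\Available MBytes"],
)
rule_json = json.dumps(rule, indent=2)
```

Sending data to a second workspace would mean adding another entry under destinations and listing its name in the data flow.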

NEED MORE REVIEW? COLLECT PERFORMANCE COUNTERS TO AZURE

You can learn more about collecting performance counters to Azure at https://docs.microsoft.com/en-us/azure/azure-monitor/agents/data-collection-rule-overview.

Create alerts

You can use Log Analytics queries to check logs periodically and to send alerts using Azure Monitor log alerts. Alerts have the following three elements:

  • Target The Azure resource you want to monitor
  • Criteria How to evaluate the resource
  • Action Notifications or automation. Send an email, webhook, or run automation.

To create a new Azure Monitor alert rule:

  1. In the portal, select the relevant resource and then in the Resource menu select Logs.
  2. Write a query that will locate the log events for which you want to trigger an alert.
  3. From the top command bar, select + New Alert rule. The Condition tab opens, populated with your log query.
  4. In the Measurement section, select values for the Measure, Aggregation type, and Aggregation granularity fields.
  5. In the Alert Logic section, set the Alert logic: Operator, Threshold Value, and Frequency. The Preview chart shows query evaluation results over time.
  6. You can select the Review + create button at any time.
  7. On the Actions tab, select or create the required action groups.
  8. On the Details tab, define the Project details and the Alert rule details.
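
The Measurement and Alert Logic settings work together: values are grouped into time buckets of the chosen granularity, each bucket is aggregated, and the aggregate is compared with the threshold. The Python sketch below illustrates that idea on in-memory samples. It is a simplification rather than how Azure Monitor evaluates rules, and the function name and the operator and aggregation labels are hypothetical.

```python
import operator
from statistics import mean
from typing import Dict, Iterable, Tuple

AGGREGATIONS = {"Average": mean, "Maximum": max, "Minimum": min, "Total": sum}
OPERATORS = {"GreaterThan": operator.gt, "LessThan": operator.lt}


def evaluate_alert(samples: Iterable[Tuple[int, float]],
                   granularity_minutes: int,
                   aggregation: str,
                   op: str,
                   threshold: float) -> Dict[int, bool]:
    """Return, per time bucket, whether the alert condition is met.

    samples are (minute_offset, value) pairs; the keys of the result are
    the starting minute of each bucket.
    """
    buckets: Dict[int, list] = {}
    for minute, value in samples:
        buckets.setdefault(minute // granularity_minutes, []).append(value)
    aggregate = AGGREGATIONS[aggregation]
    compare = OPERATORS[op]
    return {
        index * granularity_minutes: compare(aggregate(values), threshold)
        for index, values in sorted(buckets.items())
    }


# CPU samples: average over 5-minute buckets, alert when above 90.
cpu_samples = [(0, 50.0), (1, 70.0), (5, 95.0), (6, 99.0)]
fired = evaluate_alert(cpu_samples, 5, "Average", "GreaterThan", 90)
```

With these samples, the first bucket averages 60 and does not meet the condition, while the second averages 97 and does.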

NEED MORE REVIEW? CREATE ALERTS IN AZURE MONITOR

You can learn more about creating alerts at https://docs.microsoft.com/en-us/azure/azure-monitor/alerts/alerts-log.

Monitor Azure Virtual Machines by using Azure Diagnostics extension

The Azure Diagnostics extension is an agent for Azure Monitor that collects monitoring data from an Azure IaaS VM. The Diagnostics extension is used to:

  • Collect guest metrics information to Azure Monitor metrics
  • Forward guest logs and metrics to Azure Storage for archive
  • Forward guest logs and metrics to Azure Event Hubs, where that information can be sent outside Azure

The Azure Diagnostics extension and the Log Analytics agent are both able to collect monitoring data from Azure IaaS VMs. You can use either or both concurrently. The primary differences between the Azure Diagnostics extension and the Log Analytics agent include:

  • You can only use the Diagnostics extension with Azure IaaS VMs. You can use Log Analytics in hybrid environments with on-premises Windows Server instances.
  • The Azure Diagnostics extension sends data to Azure Storage (blobs and tables), Azure Monitor Metrics, and Event Hubs. The Log Analytics agent collects data to Azure Monitor logs.
  • The Log Analytics agent is required for VM Insights and Microsoft Defender for Cloud.

The Azure Diagnostics extension can collect the following data on a Windows Server Azure IaaS VM:

  • Windows Event logs
  • Performance counters
  • IIS logs
  • Application logs
  • .NET EventSource logs
  • Event tracing for Windows
  • Crash dumps
  • File-based logs
  • Agent diagnostic logs

You install the Diagnostics extension in the Azure portal under Diagnostic settings in the Monitoring section of the virtual machine’s menu.

NEED MORE REVIEW? AZURE DIAGNOSTICS EXTENSION

You can learn more about Azure Diagnostics extension at https://docs.microsoft.com/en-us/azure/azure-monitor/agents/diagnostics-extension-overview.

Monitor Azure Virtual Machines performance by using VM Insights

VM Insights allows you to monitor the performance and health of virtual machines. You can examine processes that are running on those computers as well as the dependencies that exist to other resources. You can use VM Insights to determine if performance bottlenecks exist and if network issues are impacting workload functionality.

VM Insights stores data in Azure Monitor logs. This allows you to view data from a single VM or to leverage the functionality of Azure Monitor to examine data across multiple Azure IaaS VMs.

To deploy Azure VM Insights, you need to perform the following general steps:

  1. Create a Log Analytics workspace.
  2. Add VM Insights to the workspace. You can do this by selecting Workspace Configuration after selecting one or more virtual machines from the list of virtual machines in the Monitor menu of the Azure portal. This installs the VMInsights solution to the workspace.
  3. Install VM Insights on each virtual machine or on-premises connected machine. VM Insights requires the Log Analytics agent and the Dependency agent.

NEED MORE REVIEW? VM INSIGHTS

You can learn more about VM Insights at https://docs.microsoft.com/en-us/azure/azure-monitor/vm/vminsights-overview.

EXAM TIP

Remember to add a process object to a data collector set if you suspect a certain application is responsible for consuming excessive processor resources.

Skill 5.2: Troubleshoot Windows Server on-premises and hybrid networking

Hybrid cloud only works when on-premises resources can communicate seamlessly with resources in Azure and resources in Azure can seamlessly communicate with resources on-premises. Sometimes communication between these two locations becomes disrupted, and you need to find the appropriate tools to diagnose what has gone wrong.

This section covers how to:

  • Troubleshoot hybrid network connectivity
  • Troubleshoot Azure VPN
  • Troubleshoot on-premises connectivity

Troubleshoot hybrid network connectivity

Troubleshooting hybrid connectivity can be a challenge because it may not entirely be clear how traffic should pass back and forth between on-premises and cloud resources even when everything is functioning correctly. While you can use Azure Network Watcher to diagnose hybrid network problems, you should also use it before things go wrong so that you can understand what telemetry is generated when things are functioning normally. Azure Network Watcher provides three monitoring tools:

  • Topology The Network Watcher Topology tool generates graphical displays of your Azure virtual networks and the resources deployed on those networks. You can use this tool at the beginning of the troubleshooting process to visualize all the elements involved in the problem that you are troubleshooting.
  • Connection Monitor This tool allows you to verify that connections work between Azure resources such as IaaS VMs.
  • Network Performance Monitor This tool allows you to track latency and packet loss. You can configure alerts to trigger when latency and packet loss exceed particular thresholds.

Network Watcher is useful for troubleshooting hybrid network connectivity by providing the following six tools:

  • IP flow verify Lets you determine whether packets are allowed or denied to a specific IaaS virtual machine. This tool provides information about which network security group is causing the packet to be dropped.
  • Next hop Allows you to determine the route a packet takes from a source IP address to a destination IP address. It is useful in determining if routing tables for Azure virtual networks are incorrectly configured. If two IaaS hosts cannot communicate or an IaaS host cannot communicate with an on-premises host, this tool can help you determine if routing configuration is the issue.
  • Effective security rules Allows you to determine which specific rule in a set of network security group rules applied in multiple locations is blocking or allowing specific network traffic.
  • Packet capture Allows you to record all network traffic sent to and from an IaaS VM.
  • Connection troubleshoot Allows you to check TCP connectivity between a source and destination VM.
  • VPN troubleshoot Allows you to troubleshoot problems with virtual network gateway connections. This tool is covered in more detail in the next section.
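
To make the idea behind IP flow verify and effective security rules more concrete, the sketch below evaluates a packet against a set of network security group rules in priority order, where a lower number means a higher priority, and reports which rule decided the outcome. It is a heavily simplified model: the Rule class is hypothetical, it matches only on source prefix and destination port, and it handles inbound traffic only.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Iterable, Optional, Tuple


@dataclass
class Rule:
    """Hypothetical, simplified inbound NSG rule."""
    name: str
    priority: int               # lower number = evaluated first
    access: str                 # "Allow" or "Deny"
    source_prefix: str          # CIDR range, e.g. "10.0.0.0/8"
    dest_port: Optional[int]    # None matches any port


def ip_flow_verify(rules: Iterable[Rule], source_ip: str,
                   dest_port: int) -> Tuple[str, Optional[str]]:
    """Return (access, rule name) for the first rule that matches."""
    for rule in sorted(rules, key=lambda r: r.priority):
        in_range = ip_address(source_ip) in ip_network(rule.source_prefix)
        port_ok = rule.dest_port is None or rule.dest_port == dest_port
        if in_range and port_ok:
            return rule.access, rule.name
    # Assumption for this sketch: traffic that matches no rule is denied.
    return "Deny", None


rules = [
    Rule("AllowRdpFromOnPrem", 100, "Allow", "10.0.0.0/8", 3389),
    Rule("DenyAllInbound", 65500, "Deny", "0.0.0.0/0", None),
]
```

With these rules, RDP traffic from 10.1.2.3 is allowed by AllowRdpFromOnPrem, while the same traffic from 203.0.113.5 is dropped by DenyAllInbound.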

NEED MORE REVIEW? TROUBLESHOOT HYBRID NETWORKING

You can learn more about troubleshooting hybrid networking at https://docs.microsoft.com/en-us/learn/modules/troubleshoot-premises-hybrid-networking/.

Troubleshoot Azure VPN

The first port of call when troubleshooting an Azure VPN is the VPN troubleshoot utility included with Azure Network Watcher. You can use this tool to troubleshoot VPN gateways and connections. The capability can be called through the portal, PowerShell, Azure CLI, or REST API. When called, Network Watcher diagnoses the health of the gateway, or connection, and returns the appropriate results. The request is a long-running transaction. The preliminary results returned give an overall picture of the health of the resource.

The following list contains the values returned with the troubleshoot API:

  • startTime This value is the time the troubleshoot API call started.
  • endTime This value is the time when the troubleshooting ended.
  • code This value is UnHealthy if there is at least one diagnosis failure.
  • results Results is a collection of results returned on the connection or the virtual network gateway.
    • id This value is the fault type.
    • summary This value is a summary of the fault.
    • detailed This value provides a detailed description of the fault.
    • recommendedActions This property is a collection of recommended actions to take.
      • actionText This value contains the text describing what action to take.
      • actionUri This value provides the URI to documentation on how to act.
      • actionUriText This value is a short description of the action text.
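
Because the troubleshoot API returns these values as structured data, it is straightforward to turn a response into a short report. The Python sketch below walks a response that has the fields listed above. The sample response, including its fault ID, text, timestamps, and URI, is invented for illustration and is not real Network Watcher output.

```python
from typing import Any, Dict, List


def summarize_troubleshoot_result(result: Dict[str, Any]) -> List[str]:
    """Turn a troubleshoot response into human-readable report lines."""
    lines = [
        f"Overall: {result.get('code', 'unknown')} "
        f"({result.get('startTime')} to {result.get('endTime')})"
    ]
    for fault in result.get("results", []):
        lines.append(f"- {fault['id']}: {fault['summary']}")
        for action in fault.get("recommendedActions", []):
            lines.append(f"  * {action['actionText']} ({action['actionUri']})")
    return lines


# Invented sample response for illustration only.
SAMPLE = {
    "startTime": "2024-01-01T10:00:00Z",
    "endTime": "2024-01-01T10:04:30Z",
    "code": "UnHealthy",
    "results": [{
        "id": "ExampleFault",
        "summary": "The connection could not be established.",
        "detailed": "Longer description of the fault.",
        "recommendedActions": [{
            "actionText": "Check the on-premises VPN device configuration.",
            "actionUri": "https://example.com/vpn-help",
            "actionUriText": "VPN help",
        }],
    }],
}

report = summarize_troubleshoot_result(SAMPLE)
```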

NEED MORE REVIEW? TROUBLESHOOT AZURE VPN

You can learn more about troubleshooting Azure VPN at https://docs.microsoft.com/en-us/azure/vpn-gateway/vpn-gateway-troubleshoot.

VPN client troubleshooting

If you are using the VPN client to allow a host to access resources on an Azure virtual network and you then make a change to the topology of your Azure virtual network, such as adding new resources on the extended network or configuring network peering, you will need to re-download the VPN client and reinstall it to access those resources. This is because the VPN client is configured for the network topology at the time you download and install it and will not be reconfigured automatically as that network topology changes. Downloading a new version of the VPN client and reinstalling it will resolve many problems related to point-to-site VPNs because this ensures that you have the most current certificates.

Other point-to-site VPN problems can include:

  • If you are notified that the message received was unexpected or badly formatted, check if the user-defined routes for the default route on the gateway subnet are configured correctly and that the root certificate public key is present in the Azure VPN gateway.
  • There is a limit to the number of VPN clients that can be connected simultaneously that depends on the VPN gateway SKU. Once the maximum number is reached, new connections cannot be established unless an existing connection is terminated.
  • If the VPN client cannot access network file shares when connected via point-to-site VPN, you may need to disable the caching of domain credentials on the client.
  • If the point-to-site VPN client is unable to resolve the FQDN of on-premises domain resources, you’ll need to update the DNS settings for the Azure virtual network. By default, these will be set to Azure DNS servers. You will have to alter them so that they point to a DNS server that can resolve or forward queries for on-premises resources as well as Azure resources. The simplest method of doing this is to deploy your own DNS server on the Azure virtual network.

NEED MORE REVIEW? POINT-TO-SITE TROUBLESHOOTING

You can learn more about troubleshooting point-to-site Azure VPNs at https://docs.microsoft.com/en-us/azure/vpn-gateway/vpn-gateway-troubleshoot-vpn-point-to-site-connection-problems.

Troubleshoot Azure site-to-site VPN

Azure site-to-site VPNs are used to connect on-premises networks to Azure. You should take the following steps to troubleshoot site-to-site VPNs:

  • Ensure the VPN device used for the on-premises side of the VPN connection is validated. This can be done on the Overview page of the VPN gateway.
  • Ensure that the shared key for the on-premises VPN devices matches that used by the Azure VPN gateway.
  • Verify that the IP definition for the Local Network Gateway object in Azure matches the IP address assigned to the on-premises device.
  • Ensure that the Azure gateway IP definition configured on the on-premises device matches the IP address assigned to the Azure gateway.
  • Verify that the virtual network address spaces match between the Azure virtual network and the on-premises network configuration.
  • Verify that subnets match between the local network gateway and on-premises definition for the on-premises network.
  • Run a health probe by navigating to the URL https://<MyVirtualNetworkGatewayIP>:8081/healthprobe and review the certificate warning. If you receive a response, the VPN gateway is considered healthy.

NEED MORE REVIEW? SITE-TO-SITE VPN TROUBLESHOOTING

You can learn more about troubleshooting site-to-site Azure VPNs at https://docs.microsoft.com/en-us/azure/vpn-gateway/vpn-gateway-troubleshoot-site-to-site-cannot-connect.

Troubleshoot on-premises connectivity

The main causes of network connectivity problems are:

  • IP address misconfigurations Check that the IP address, subnet mask, and default gateway are correctly configured. A host can have limited connectivity if the subnet mask or default gateway is incorrectly configured.
  • Routing failures Some connectivity problems occur because traffic cannot seamlessly pass back and forth between on-premises and cloud workloads. This may be because topologies in either network have changed without VPNs and routing tables being updated.
  • Name resolution failures Check that the names assigned to resources in on-premises and cloud locations can be resolved in both locations. Failures can be due to incorrect client DNS server configuration or to misconfiguration and failures of the DNS servers or services themselves.
  • Hardware failure Check that cables are plugged in and functioning (sometimes a plugged-in cable has failed so always swap with another). If a router or switch has failed, multiple clients will be impacted.
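
To see why an incorrect subnet mask produces only partial connectivity, consider the following Python sketch. It uses the standard ipaddress module to work out which targets a host would treat as local given the mask it was configured with. The addresses and masks are hypothetical and chosen only to illustrate the point.

```python
import ipaddress


def treated_as_local(host_ip: str, mask: str, target_ip: str) -> bool:
    """True if the host, using its configured mask, considers the target to be on its own subnet."""
    configured = ipaddress.ip_interface(f"{host_ip}/{mask}")
    return ipaddress.ip_address(target_ip) in configured.network


# Hypothetical host on a subnet that really uses 255.255.255.0,
# but the host was configured with 255.255.255.128 by mistake.
HOST = "10.10.20.50"
WRONG_MASK = "255.255.255.128"

for target in ("10.10.20.100", "10.10.20.200"):
    local = treated_as_local(HOST, WRONG_MASK, target)
    print(target, "delivered directly" if local else "sent to the default gateway")
```

With the wrong mask, the host tries to reach 10.10.20.200 through its default gateway even though both hosts share the same physical subnet. Whether that traffic arrives depends on the gateway, which is why the symptoms of a bad mask tend to be intermittent rather than total.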

You can use the following command-line and PowerShell utilities to diagnose problems with on-premises networks:

  • Ping.exe Allows you to check ICMP connectivity between two hosts. Use Test-NetConnection in PowerShell to perform similar tasks.
  • Tracert.exe Allows you to check the path between two hosts on different subnets.
  • Pathping.exe A combination of ping and traceroute that allows you to view connectivity and route information. Use Test-NetConnection in PowerShell to perform similar tasks.
  • Netstat.exe Allows you to view network connection information for the local host. Use Get-NetTCPConnection in PowerShell to perform similar tasks.
  • Arp.exe Allows you to manage the contents of the Address Resolution Protocol cache that maps IP addresses to physical addresses.
  • Ipconfig.exe Allows you to view IP address configuration information. You can use Get-NetIPAddress and Clear-DNSClientCache to perform some of the same tasks from PowerShell.
  • Nslookup.exe Allows you to perform DNS queries. Use Resolve-DNSName in PowerShell to perform similar tasks.
  • Route.exe Allows you to manage routing information. You can use Get-NetRoute to perform similar tasks from PowerShell.

NEED MORE REVIEW? TROUBLESHOOT WINDOWS SERVER NETWORKING

You can learn more about troubleshooting Windows Server networking at https://docs.microsoft.com/en-us/troubleshoot/windows-server/networking/networking-overview.

Troubleshoot DNS

DNS troubleshooting can be separated into two broad categories: client-side problems and server-side problems. If only one person in your organization has a problem that is DNS related, then it’s probably a client-side problem. If multiple clients have a problem, then it’s probably a server issue. If you experience client-side problems, check which DNS servers the client is configured to use and use the appropriate PowerShell or command-line utilities to diagnose DNS server functionality.

You can use the following PowerShell and command-line utilities to diagnose DNS problems:

  • Resolve-DNSName Attempts to perform DNS resolution using the servers configured for use by the DNS client. You can use it to attempt to resolve FQDNs and IP addresses. You can use the -Server parameter to specify an alternate server. This allows you to determine if a resolution problem is specific to a DNS server. This command is the PowerShell equivalent of nslookup.exe.
  • Nslookup.exe Allows you to perform DNS resolution queries. Nslookup.exe is a cross-platform utility that allows you to perform lookups against local and remote DNS servers based on FQDN or IP address.
  • Get-DnsClientServerAddress Use this PowerShell command to determine which DNS servers are used by a network interface.
    • Ipconfig /all Use this command to determine which DNS servers are used by the computer. This is the older command-line utility that also works to extract DNS server information.
  • Clear-DNSClientCache Removes all entries in the client’s DNS cache. Use this if a host’s address might have changed but the client is still using a cached DNS record. After clearing the cache, a new DNS query will occur and the most recent DNS record will be retrieved.
    • Ipconfig /flushdns This is the older command-line utility that removes all entries in the client’s DNS cache.

Common DNS server problems you might need to troubleshoot include:

  • DNS records are missing in a DNS zone. Depending on how DNS scavenging is configured, DNS records can be removed from a DNS zone if they are dynamically created or the NoRefresh and Refresh intervals are set too low. Additional causes of DNS records disappearing are:
    • In IPv6 environments, if DHCP option 81 is configured, DNS A records may be deleted when AAAA records are registered.
    • When active dynamic DHCP leases are converted to reservations, the A and PTR records are deleted by design. You will need to manually create these records.
  • Query response delays can occur if forwarders or conditional forwarders are no longer reachable. You should regularly check if DNS servers that are configured as the target of forwarders or conditional forwarders are able to resolve DNS queries.

If DNS zones are Active Directory integrated and records that are present on other domain controllers that also host the DNS role are not present on the local domain controller, check that there are no issues blocking Active Directory replication.

DNS SERVER TESTS

You can configure a DNS server to perform manual or automatic tests on the Monitoring tab of the DNS server’s properties, as shown in Figure 5-3.

This screenshot shows the Monitoring tab of a DNS server's properties. The option to perform simple queries against the DNS server and recursive queries is enabled.

FIGURE 5-3 DNS Monitoring

You can configure the DNS server to perform simple queries, where it will attempt to resolve a DNS query against itself and also a recursive query against other DNS servers that it is configured to use in its client settings. You can configure these tests to occur on a periodic basis and view the results either in this dialog box or in the DNS event logs.

DNS EVENT LOGS

The DNS server log is located in the Applications and Services Logs folder in Event Viewer. Depending on how you configure event logging on the Event Logging tab of the DNS server’s properties, this event log records information including:

  • Changes to the DNS service—for example, when the DNS server service is stopped or started.
  • Zone loading and signing events.
  • Modifications to DNS server configuration.
  • DNS warning and error events.

By default, the DNS server records all these events. It’s also possible to configure the DNS server to only log errors, or errors and warning events. The key with any type of logging is that you should only enable logging for information that you might need to review at some time. Many administrators log everything “just in case,” even though they will only ever be interested in a specific type of event.

In the event that you need to debug how a DNS server is performing, you can enable debug logging on the Debug Logging tab of the DNS server’s properties dialog box. Debug logging is resource intensive, and you should only use it when you have a specific problem related to the functionality of the DNS server. You can configure debug logging to use a filter so that only traffic from specific hosts is recorded, rather than traffic from all hosts that interact with the DNS server.

DNS SOCKET POOL

DNS socket pool is a technology that makes cache-tampering and spoofing attacks more difficult by using source port randomization when issuing DNS queries to remote DNS servers. To spoof the DNS server with an incorrect record, the attacker needs to guess which randomized port was used as well as the randomized transaction ID issued with the query. A DNS server running on Windows Server uses a socket pool of 2,500 by default. You can use the dnscmd command to vary the socket pool between 0 and 10,000. For example, to set the socket pool size to 4,000, issue the following command:

dnscmd /config /socketpoolsize 4000

You must restart the DNS service before the reconfigured socket pool size is used.
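
The effect of the socket pool size on an attacker's odds can be estimated with simple arithmetic. The sketch below assumes that every port in the pool and every 16-bit transaction ID is equally likely to be chosen, which is a simplification; the function name is illustrative only.

```python
TRANSACTION_IDS = 2 ** 16  # DNS transaction IDs are 16-bit values


def guess_space(socket_pool_size: int) -> int:
    """Number of (source port, transaction ID) combinations an attacker must guess among."""
    if socket_pool_size < 1:
        raise ValueError("this sketch expects a pool size of at least 1")
    return socket_pool_size * TRANSACTION_IDS


for size in (2500, 4000, 10000):
    print(f"socket pool {size}: 1 chance in {guess_space(size):,} per forged response")
```

Increasing the pool size scales the guess space linearly, so moving from the default of 2,500 to the maximum of 10,000 makes a blind spoofing attempt four times less likely to succeed.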

DNS CACHE LOCKING

DNS cache locking enables you to control when information stored in the DNS server’s cache can be overwritten. For example, when a recursive DNS server responds to a query for a record that is hosted on another DNS server, it caches the results of that query so that it doesn’t have to contact the remote DNS server if the same record is queried again within the TTL (Time to Live) value of the resource record. DNS cache locking prevents record data in a DNS server’s cache from being overwritten until a configured percentage of the TTL value has expired. By default, the DNS Cache Locking value is set to 100, but you can reset it using the Set-DNSServerCache cmdlet with the LockingPercent option. For example, to set the Cache Locking value to 80%, issue the following command and then restart the DNS server service:

Set-DNSServerCache -LockingPercent 80
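
The relationship between the TTL and the locking percentage is a simple proportion. As a minimal sketch, assuming a hypothetical record cached with a one-hour TTL, the following Python function works out how long the cached entry is protected from being overwritten:

```python
def locked_seconds(ttl_seconds: int, locking_percent: int) -> float:
    """Seconds after caching during which a cached record cannot be overwritten."""
    if not 0 <= locking_percent <= 100:
        raise ValueError("locking percent must be between 0 and 100")
    return ttl_seconds * locking_percent / 100


# Hypothetical record cached with a TTL of 3,600 seconds.
for percent in (100, 80):
    print(f"{percent}%: locked for {locked_seconds(3600, percent):.0f} of 3600 seconds")
```

At the default value of 100, the record stays locked for its entire TTL; at 80, it can be overwritten during the final fifth of its lifetime.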

DNS RECURSION

DNS servers on Windows Server perform recursive queries on behalf of clients by default. This means that when the client asks the DNS server to find a record that isn’t stored in a zone hosted by the DNS server, the DNS server goes out and finds the result of that query and passes it back to the client. It’s possible for nefarious third parties to use recursion as a denial-of-service (DoS) attack vector, slowing a DNS server to the point where it becomes unresponsive. You can disable recursion on the Advanced tab of the DNS server’s properties, either to mitigate this risk or as a part of your troubleshooting process.

NETMASK ORDERING

Netmask ordering ensures that the DNS server returns the host record on the requesting client’s subnet if such a record exists. For example, imagine that the following host records existed on a network that used 24-bit subnet masks:

If netmask ordering is enabled and a client with the IP address 10.10.20.50 performs a lookup of wsus.contoso.com, the DNS server always returns the record 10.10.20.105 because this record is on the same subnet as the client. If netmask ordering is not enabled, then the DNS server returns records in a round-robin fashion. If the requesting client is not on the same network as any of the host records, then the DNS server also returns records in a round-robin fashion. Netmask ordering is useful for services such as Windows Server Update Services (WSUS) that you might have at each branch office. When you use it, the DNS server directs the client in the branch office to a resource on the local subnet when one exists. Netmask ordering is enabled by default on Windows Server DNS servers.
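
The selection logic can be approximated in a few lines of Python. This is a simplified model rather than the DNS server's actual implementation: it moves any record on the client's 24-bit subnet to the front and leaves the remaining records in their existing order instead of rotating them. The address 10.10.20.105 comes from the example above; the other two addresses are hypothetical.

```python
import ipaddress


def netmask_order(records: list[str], client_ip: str, prefix_len: int = 24) -> list[str]:
    """Return records with any on the client's subnet first; other records keep their order."""
    client_net = ipaddress.ip_interface(f"{client_ip}/{prefix_len}").network
    local = [r for r in records if ipaddress.ip_address(r) in client_net]
    remote = [r for r in records if ipaddress.ip_address(r) not in client_net]
    return local + remote


# 10.10.20.105 is from the text; the other two addresses are made up for the example.
WSUS_RECORDS = ["10.10.10.105", "10.10.20.105", "10.10.30.105"]

print(netmask_order(WSUS_RECORDS, "10.10.20.50"))  # client shares a subnet with one record
print(netmask_order(WSUS_RECORDS, "10.10.40.50"))  # client shares a subnet with none
```

When no record matches the client's subnet, the function has nothing to prioritize, which corresponds to the case where the DNS server falls back to round-robin.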

ANALYZE ZONE LEVEL STATISTICS

You can understand how a DNS zone is being utilized by clients by viewing DNS statistics. You can do this on computers running the Windows Server operating system by using the Get-DnsServerStatistics cmdlet. Information that you can view using this cmdlet includes:

  • Cache statistics View information about the number of requests that the DNS server satisfies from cache
  • DNSSEC statistics Provides data about successful and failed DNSSEC validations
  • Error statistics Detailed information about the number of errors, including bad keys, bad signatures, refusals, and unknown errors
  • Master statistics Contains information about zone transfer statistics
  • Query statistics Information about queries made to the DNS server
  • Record statistics Data about the number of records in the cache and memory utilization
  • Recursion statistics Information about how the DNS server solves recursive queries

You can view statistics related to a specific zone by using the -ZoneName parameter. For example, if you wanted to view the statistics of the australia.adatum.com zone, you would issue the following command from an elevated Windows PowerShell prompt on a computer that hosts the DNS server role:

Get-DnsServerStatistics -ZoneName australia.adatum.com

NEED MORE REVIEW? TROUBLESHOOT WINDOWS SERVER DNS

You can learn more about troubleshooting Windows Server DNS at https://docs.microsoft.com/en-us/troubleshoot/windows-server/networking/troubleshoot-dns-guidance.

Troubleshoot DHCP

If a client computer has been assigned an APIPA address instead of an address appropriate to its network, there is likely something wrong with the DHCP server. If clients are no longer receiving DHCP address leases from a DHCP server, check the following:

  • Is the DHCP server service running? You can check this in the services console or by running the get-service dhcpserver PowerShell command.
  • Is the DHCP server authorized? You can check the list of authorized DHCP servers by using the following PowerShell command (substitute your domain name for tailwindtraders.org) from an elevated PowerShell prompt:

    Get-ADObject -SearchBase "cn=configuration,dc=tailwindtraders,dc=org" -Filter "objectclass -eq 'dhcpclass' -AND Name -ne 'dhcproot'" | select name

  • Are IP address leases available? If the scope is exhausted because all IP addresses are leased, new clients will be unable to obtain IP addresses.
  • Determine if a DHCP relay is available if clients on remote subnets are unable to acquire IP addresses from the DHCP server. Ensure that the DHCP relay can be pinged from the DHCP server.
  • Verify that the DHCP server is listening on UDP ports 67 and 68. If WDS is deployed on the DHCP server, this may interfere with the functioning of the DHCP server role.
  • If IPsec is used in the environment, ensure that the DHCP server is configured with the IPsec exemption.
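
The APIPA symptom mentioned at the start of this section is easy to detect programmatically, because APIPA addresses always fall in the 169.254.0.0/16 range. The following Python sketch classifies a list of addresses; the addresses themselves are hypothetical examples of what you might collect from client computers.

```python
import ipaddress

APIPA_RANGE = ipaddress.ip_network("169.254.0.0/16")


def is_apipa(address: str) -> bool:
    """True if the IPv4 address falls in the APIPA (link-local) range."""
    return ipaddress.ip_address(address) in APIPA_RANGE


# Hypothetical addresses collected from client computers.
for addr in ("169.254.17.8", "10.10.20.50"):
    print(addr, "APIPA: no DHCP lease obtained" if is_apipa(addr) else "not APIPA")
```

If several clients on the same subnet report APIPA addresses at once, the checks in the list above are a sensible place to start.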

NEED MORE REVIEW? TROUBLESHOOT WINDOWS SERVER DHCP

You can learn more about troubleshooting DHCP at https://docs.microsoft.com/en-us/troubleshoot/windows-server/networking/troubleshoot-dhcp-guidance.

EXAM TIP

A host with an incorrectly configured subnet mask may be able to contact some hosts on the same IP subnet but not others.

Skill 5.3: Troubleshoot Windows Server virtual machines in Azure

The most common workload in Azure is IaaS VMs. Running an IaaS VM in Azure is different than running a workload in an environment that you fully manage. While most Windows Server IaaS VMs will run in Azure without any challenges, if everything worked all the time, none of us would have jobs in IT because there would be very little to do. This objective details the tools and technologies that you can use to diagnose and resolve problems that may occur with Windows Server IaaS VMs.

This section covers how to:

  • Troubleshoot deployment failures
  • Troubleshoot booting failures
  • Troubleshoot VM performance issues
  • Troubleshoot VM extension issues
  • Troubleshoot disk encryption issues
  • Troubleshoot storage
  • Troubleshoot VM connection issues

Troubleshoot deployment failures

There are a variety of scenarios where you might be unable to deploy an IaaS VM in Azure. These include:

  • Invalid template error Azure allows you to deploy VMs from Bicep files or Azure Resource Manager (ARM) templates. Errors can occur when deploying IaaS VMs using this method if there is a syntax error present in the Bicep file or template, parameters are incorrectly specified, parameters have been assigned an invalid value, or a circular dependency exists. Previously, a deployment would fail if more than five resource groups were referenced in a template, but this limitation has been removed and a deployment can now include up to 800 resource groups.
  • Allocation failure If you get an AllocationFailed or a ZonalAllocationFailed error, the datacenter in which the IaaS VM is to be deployed may be experiencing resource constraints that limit the size of VMs that can be deployed. These errors are generally transient as more capacity is added to Azure datacenters. You can also resolve the deployment problems by attempting a deployment in another Azure region. Allocation failures can also occur when you attempt to start an existing VM that was previously deallocated where the VM is configured to use an older VM size that is no longer supported. As newer generation hardware is deployed in Azure datacenters, older VM SKUs are retired and you will need to resize VMs to a newer supported SKU.
  • Quota error Each Azure subscription has a quota of CPU cores that can be assigned to IaaS VMs. If you exceed this quota, you will be unable to deploy new IaaS VMs. You can choose to either deallocate an existing IaaS VM or contact support to have your subscription’s CPU core quota increased.
  • OS Image error A VM deployment failure will occur if you are attempting to deploy a VM from an image that was incorrectly captured. Incorrect capture can occur if appropriate preparation steps are not taken or incorrect settings are configured during the capture process.

NEED MORE REVIEW? DEPLOYMENT TROUBLESHOOTING

You can learn more about IaaS VM deployment troubleshooting at https://docs.microsoft.com/en-us/azure/azure-resource-manager/troubleshooting/common-deployment-errors.

Troubleshoot booting failures

If an Azure IaaS VM enters a non-bootable state, you can use Console Output and Screenshot support to diagnose the issue. You can enable boot diagnostics during VM creation. To enable boot diagnostics on an existing IaaS virtual machine, perform the following steps:

  1. Open the Virtual Machine in the Azure portal.
  2. Under Help select Boot diagnostics, and then choose Settings.
  3. In Boot Diagnostics Settings, select whether to use a managed storage account or a custom storage account. If you choose a custom storage account, you must ensure that you choose one that does not use premium storage and that is not configured as zone-redundant storage.

Azure Serial Console

Azure Serial Console allows you to access a text-based console to interact with IaaS VMs through the Azure portal. The serial connections connect to the COM1 serial port of the Windows Server IaaS VM and allow access independent of the network or operating system state. This functionality replicates the Emergency Management Services (EMS) access that you can configure for Windows Server, which you are most likely to use in the event that an error has occurred that doesn’t allow you to access the contents of the server using traditional administrative methods.

You can use the Serial Console for an IaaS VM as long as the following prerequisites have been met:

  • Boot diagnostics are enabled for the IaaS VM.
  • The Azure account accessing the Serial Console is assigned the Virtual Machine Contributor role for the IaaS VM and the boot diagnostics storage account.
  • A local user account is present on the IaaS VM that supports password authentication.

Boot failure messages

Most boot failure troubleshooting involves looking at the boot diagnostic screenshot of the IaaS VM, creating a repair VM, creating a snapshot of the failed IaaS VM OS disk, mounting the snapshot, and using the appropriate tool to resolve the issue on the mounted disk before reattaching it to the failed VM. The process to do this is covered later in this chapter. Mounting a snapshot of the failed VM’s OS disk and repairing it allows you to troubleshoot the following boot failures:

  • Boot Error – Disk Read Error Occurred Mount the OS disk snapshot copy of the failed VM on a repair VM and use diskpart to set the boot partition to active. Then run chkdsk /f against the mounted disk image. Dismount the repaired OS disk from the repair VM and mount the repaired OS disk on the original VM.
  • Checking file system error If the IaaS VM boot screenshot shows that the Check Disk process is running with the message Scanning and repairing drive (C:) or Checking file system on C:, then wait to determine whether the check completes successfully, at which point the VM may restart normally. If the IaaS VM is unable to exit the Check Disk process, attach a snapshot of the OS disk to a recovery VM and perform a Check Disk operation using chkdsk.exe /f. Dismount the repaired OS disk from the repair VM and mount the repaired OS disk on the original VM.
  • Critical process disabled If the IaaS VM does not boot and boot diagnostics shows the error #0x000000EF with the message Critical Process Died, create and access a repair VM, mount a snapshot of the failed IaaS VM’s boot disk, and run the command sfc /scannow /offbootdir=<BOOT DISK DRIVE>:\ /offwindir=<BROKEN DISK DRIVE>:\windows where <BOOT DISK DRIVE> is the boot partition of the broken VM, and <BROKEN DISK DRIVE> is the OS partition of the broken VM. Dismount the repaired OS disk from the repair VM and mount the repaired OS disk on the original VM.
  • Failed boot error code C01A001D If the IaaS VM cannot boot and boot diagnostics shows a Windows Update operation in progress that is failing with the error code C01A001D, you will need to create and mount a snapshot of the failed IaaS VM’s OS disk to a recovery VM and delete any unnecessary files before reattaching the snapshot to the failed VM. This error is caused when there is not enough space on the OS disk to apply updates.
  • Failed boot no bootable disk If the IaaS VM cannot boot and boot diagnostics shows the message “This is not a bootable disk. Please insert a bootable floppy and press any key to try again …” you should, after considering the use of floppy disks in the cloud, mount the OS disk snapshot copy of the failed VM on a repair VM. Use diskpart to set the boot partition to active. Then run chkdsk /f against the mounted disk image. Dismount the repaired OS disk from the repair VM and mount the repaired OS disk on the original VM.

If the Boot Configuration Data on the failed VM appears corrupted, perform the following steps from the repair VM against the mounted OS disk before detaching it and connecting it to the failed IaaS VM:

  1. Enumerate the Windows Boot Loader Identifier with

    bcdedit /store <Boot partition>:\boot\bcd /enum

  2. Repair the Boot Configuration Data by running the following commands, replacing <Windows partition> with the partition that contains the Windows folder, <Boot partition> with the partition that contains the “Boot” system folder, and <Identifier> with the value determined in step 1:

    bcdedit /store <Boot partition>:\boot\bcd /set {bootmgr} device partition=<Boot partition>:

    bcdedit /store <Boot partition>:\boot\bcd /set {bootmgr} integrityservices enable

    bcdedit /store <Boot partition>:\boot\bcd /set {<Identifier>} device partition=<Windows partition>:

    bcdedit /store <Boot partition>:\boot\bcd /set {<Identifier>} integrityservices enable

    bcdedit /store <Boot partition>:\boot\bcd /set {<Identifier>} recoveryenabled Off

    bcdedit /store <Boot partition>:\boot\bcd /set {<Identifier>} osdevice partition=<Windows partition>:

    bcdedit /store <Boot partition>:\boot\bcd /set {<Identifier>} bootstatuspolicy IgnoreAllFailures

NEED MORE REVIEW? IAAS VM BOOT

You can learn more about troubleshooting IaaS VM boot at https://docs.microsoft.com/en-us/troubleshoot/azure/virtual-machines/boot-error-troubleshoot.

Repair VMs and failed boot disks

You can create a special repair VM from Cloud Shell that provides you with the necessary tools to repair a failed boot disk by performing the following steps:

  1. Ensure that the az vm repair commands are available by running the command

    az extension add -n vm-repair

  2. Run the az vm repair create command to create a copy of the failed OS disk, create a repair VM in a new resource group, and attach the OS disk copy to the VM. This VM will be the same size and region as the nonfunctional VM specified in the command. If the failed VM uses Azure Disk Encryption, use the --unlock-encrypted-vm command switch. The following command will create a repair VM for the VM named FailedVMName in the resource group ResourceGroupName with the admin account prime and the password Claused1234:

    az vm repair create -g ResourceGroupName -n FailedVMName --repair-username prime --repair-password 'Claused1234' --verbose

  3. You can then either connect to the repair VM normally through RDP or PowerShell, or you can run a repair script using the az vm repair run command. To view the available repair scripts, run the following command:

    az vm repair list-scripts

  4. Once you have fixed the issue that was stopping the VM from booting, run the az vm repair restore command to swap the repaired OS disk with the original VM’s OS disk. For example, to swap the fixed OS disk created in step 2, run this command:

    az vm repair restore -g ResourceGroupName -n FailedVMName --verbose

NEED MORE REVIEW? AZURE VIRTUAL MACHINE REPAIR

You can learn more about IaaS VM repair at https://docs.microsoft.com/en-us/troubleshoot/azure/virtual-machines/repair-windows-vm-using-azure-virtual-machine-repair-commands.

Troubleshoot VM performance issues

The performance diagnostics tool allows you to troubleshoot performance issues, including high CPU usage, low disk space, or memory pressure. Diagnostics can be run and viewed directly from the Azure portal. To install and run performance diagnostics on a Windows Server IaaS VM, perform the following steps:

  1. Select the VM that you wish to diagnose.
  2. In the VM page, select Performance diagnostics.
  3. Either select an existing storage account or create a new storage account to store performance and diagnostics data. A single storage account can be used for multiple Azure IaaS VMs in the same region.
  4. Select Install performance diagnostics. After the installation completes, select Run Diagnostics.

Azure Performance Diagnostics allows you to run several different analysis scenarios. You should choose the analysis scenario that best describes the issue you are attempting to diagnose. Scenarios include:

  • Quick performance analysis Checks for known issues, runs a best practice analysis, and collects diagnostics data.
  • Performance analysis Does all the quick performance analysis checks as well as examines high resource consumption. Appropriate for diagnosing high CPU, memory, and disk utilization.
  • Advanced performance analysis Does all the quick performance analysis and performance analysis checks. This analysis is useful for complex issues and runs over a longer duration.
  • Azure Files analysis Performs all the checks outlined previously but will also capture a network trace and SMB traffic counters. Useful for diagnosing problems related to SMB file shares on Azure IaaS VMs.

NEED MORE REVIEW? TROUBLESHOOT VM PERFORMANCE ISSUES

You can learn more about VM performance at https://docs.microsoft.com/en-us/troubleshoot/azure/virtual-machines/performance-diagnostics.

Troubleshoot VM extension issues

You can view the status of the extensions installed on a VM using the Get-AZVM PowerShell cmdlet from an Azure Cloud Shell session or by viewing the Extensions and Applications area on a virtual machine’s properties page. You can retrieve information about an Azure IaaS VM’s extension by running the following command from Azure Cloud Shell:

Get-AzVMExtension -ResourceGroupName $RGName -VMName $vmName -Name $ExtensionName

When determining whether an issue is caused by the agent or by an extension, first verify that the following three services, on which the Azure VM agent relies, are running:

  • RDAgent
  • Windows Azure Guest Agent
  • Microsoft Azure Telemetry Service

If you have verified that these three services are running, you can then check C:\WindowsAzure\logs\WaAppAgent.log to view information about extensions. Log data will include the following extension operations:

  • Enable
  • Install
  • Start
  • Disable

You can determine if an extension is failing by searching through the WaAppAgent.log for the word “error.” You can also troubleshoot extensions by examining each extension’s logs. Each extension has its own log files in an IaaS VM’s C:\WindowsAzure\logs\Plugins folder. Extension settings and status files are stored in the C:\Packages\Plugins folder on a VM. The Azure Diagnostics Extension writes data to an Azure storage account. The diagnostics extension does not write data to a Log Analytics workspace.
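
Searching a log for the word “error” is straightforward to script. The following Python sketch is one possible approach: it accepts any iterable of lines and returns the matching lines with their line numbers. The sample log entries are invented for illustration and do not reflect the real format of WaAppAgent.log.

```python
import re
from typing import Iterable

ERROR_PATTERN = re.compile(r"error", re.IGNORECASE)


def find_error_lines(lines: Iterable[str]) -> list[tuple[int, str]]:
    """Return (line number, text) for every line that mentions 'error', ignoring case."""
    return [
        (number, line.rstrip())
        for number, line in enumerate(lines, start=1)
        if ERROR_PATTERN.search(line)
    ]


# Hypothetical log excerpt; real agent log entries are formatted differently.
SAMPLE_LOG = [
    "[INFO] Plugin enable started\n",
    "[ERROR] Plugin enable failed\n",
    "[INFO] Heartbeat sent\n",
]

hits = find_error_lines(SAMPLE_LOG)
for number, text in hits:
    print(f"line {number}: {text}")
```

Because the function takes any iterable of lines, you can pass it an open file handle for a copy of the agent log or for an individual extension's log without changing the code.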

NEED MORE REVIEW? VM EXTENSION ISSUES

You can learn more about troubleshooting VM extension issues at https://docs.microsoft.com/en-us/troubleshoot/azure/virtual-machines/support-agent-extensions.

Troubleshoot disk encryption issues

Disk encryption errors generally occur when the encryption keys, stored in a key vault, become inaccessible to an IaaS VM due to a configuration change. When troubleshooting Azure IaaS VM disk encryption issues, check the following:

  • Ensure that network security groups allow the VM to connect to Azure Active Directory endpoints and Azure Key Vault endpoints.
  • Ensure that the Azure Key Vault is located in the same Azure region and subscription as the IaaS VM.
  • Verify that the Key Encryption Key (KEK) is still present and has not been removed from Azure Key Vault.
  • Ensure that the VM name, data disks, and keys adhere to Key Vault resource naming restrictions.

You can use the Disable-AzVMDiskEncryption cmdlet followed by the Remove-AzVMDiskEncryptionExtension cmdlet from an Azure Cloud Shell PowerShell session to disable and remove Azure Disk Encryption. You can use the az vm encryption disable Azure CLI command from an Azure Cloud Shell session to disable Azure VM disk encryption.

If you encounter BitLocker errors that block the OS disk from booting, you can unlock the OS disk by deallocating the VM and then starting it. When you do this, the BitLocker Encryption Key (BEK) file is recovered from the Azure Key Vault and placed on the encrypted disk.

NEED MORE REVIEW? IAAS VM DISK ENCRYPTION

You can learn more about IaaS VM disk encryption at https://docs.microsoft.com/en-us/azure/virtual-machines/windows/disk-encryption-overview.

Troubleshoot storage

The main disk roles for Azure IaaS VMs are the operating system disk, data disks, and temporary disks. These have the following properties:

  • The operating system (OS) disk hosts the Windows Server operating system. An Azure IaaS OS disk has a maximum capacity of 4,095 gigabytes. You cannot extend an OS disk if Azure Disk Encryption is enabled. Generation 2 VMs support GPT and MBR partition types on the OS disk.
  • Data disks are managed disks that are attached to an Azure IaaS VM and are registered as SCSI drives. Data disks have a maximum capacity of 32,767 gigabytes. Multiple data disks can be combined using storage spaces to support larger volumes. The number of data disks that can be attached to an Azure IaaS VM depends on the VM SKU.
  • Temporary disks can be used for short-term storage. Data on a temporary disk can be lost when Azure performs maintenance events, during unscheduled outages, or if you redeploy a VM. When an Azure IaaS VM performs a standard reboot, data on the temporary disk will remain. Important data that needs to be stored should not be located on a temporary disk.
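As a rough planning aid, the data disk limit above lets you estimate how many disks a large volume needs. The following is a minimal sketch that assumes a simple storage spaces layout with no resiliency overhead. The function name is hypothetical, and you would still need to compare the result against the data disk limit of your VM SKU.

```python
MAX_DATA_DISK_GB = 32767  # maximum data disk capacity quoted in the text


def data_disks_needed(volume_gb, disk_gb=MAX_DATA_DISK_GB):
    """Return the minimum number of data disks needed to hold volume_gb.

    Assumes capacity simply adds up across disks (no mirroring or parity).
    """
    if volume_gb <= 0:
        raise ValueError("volume_gb must be positive")
    if not 0 < disk_gb <= MAX_DATA_DISK_GB:
        raise ValueError("disk_gb must be between 1 and the maximum disk size")
    # Integer ceiling division avoids floating-point rounding surprises.
    return (volume_gb + disk_gb - 1) // disk_gb


if __name__ == "__main__":
    print(data_disks_needed(100000))
```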

Managed disk snapshots are read-only, crash-consistent copies of an entire managed disk. You can use an existing managed disk snapshot to create a new managed disk. Snapshots can be created with the New-AzSnapshot PowerShell cmdlet or the az snapshot create Azure CLI command from an Azure Cloud Shell session.

NEED MORE REVIEW? MORE ABOUT IAAS VM DISKS

You can learn more about Azure IaaS VM disks at https://docs.microsoft.com/en-us/azure/virtual-machines/faq-for-disks.

Troubleshoot VM connection issues

Once an IaaS VM is deployed, you need to consider how to allow administrative connections to the IaaS VM. You’ll need to ensure that any network security groups and firewalls between your administrative host and the target IaaS VM are configured to allow the appropriate administrative traffic, though tools such as Just-in-Time VM access and Azure Bastion can automate that process. As a general rule, you shouldn’t open the Remote Desktop port (TCP port 3389) or a remote PowerShell port (HTTP port 5985, HTTPS port 5986) so that any host from any IP address on the internet has access. If you must open these ports on a network security group applied at the IaaS VM’s network adapter level, try to limit the scope of any rule to the specific public IP address or subnet that you will be initiating the connection from. When troubleshooting network connectivity to an Azure IaaS VM, start by using the Test Your Connection option on the Connect page of the VM’s properties in the Azure portal.
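The general rule about not exposing management ports to the whole internet can be expressed as a simple check. The sketch below is illustrative only: the rule format (a dictionary with port and source keys) is a hypothetical simplification and not the schema that Azure uses for network security group rules.

```python
import ipaddress

# Management ports named in the text: RDP and remote PowerShell (HTTP, HTTPS).
MANAGEMENT_PORTS = {3389, 5985, 5986}


def is_overexposed(rule):
    """Return True if an inbound rule opens a management port to any address.

    rule is a dict with hypothetical keys:
      "port"   - destination port as an int
      "source" - a CIDR string such as "203.0.113.0/24", or a wildcard
    """
    if rule["port"] not in MANAGEMENT_PORTS:
        return False
    source = rule["source"]
    if source in ("*", "Any", "Internet"):
        return True
    # A prefix length of 0 (0.0.0.0/0 or ::/0) matches every address.
    return ipaddress.ip_network(source, strict=False).prefixlen == 0


if __name__ == "__main__":
    rules = [
        {"port": 3389, "source": "*"},
        {"port": 5986, "source": "203.0.113.0/24"},
    ]
    for rule in rules:
        print(rule, "overexposed" if is_overexposed(rule) else "scoped")
```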

Connecting with an Azure AD Account

When you deploy an IaaS VM, you usually configure a username and password for a local Administrator account. If the computer is standalone and not AD DS or Azure AD DS joined, you have to decide whether you want to limit access to the local accounts that you have configured on the IaaS VM itself or whether you also want to allow local sign-on using Azure AD accounts.

Sign-on using Azure AD is supported for IaaS VMs running Windows Server 2019 and Windows Server 2022. The IaaS VMs need to be configured on a virtual network that allows access to the following endpoints over TCP port 443:

You can enable Azure AD logon for a Windows Server IaaS VM when creating the IaaS VM using the Azure portal or when creating an IaaS VM using Azure Cloud Shell. When you do this, the AADLoginForWindows extension is enabled on the IaaS VM. If you have an existing Windows Server IaaS VM, you can use the following Azure CLI command to install and configure the AADLoginForWindows extension (where ResourceGroupName and VMName are unique to your deployment):

az vm extension set --publisher Microsoft.Azure.ActiveDirectory --name AADLoginForWindows --resource-group ResourceGroupName --vm-name VMName

Once the IaaS VM has the AADLoginForWindows extension configured and enabled, you can determine what permissions an Azure AD user account has on the IaaS VM by assigning the account one of the following Azure roles:

  • Virtual Machine Administrator Login Accounts assigned this role are able to sign on to the IaaS VM with local Administrator privileges.
  • Virtual Machine User Login Accounts assigned this role are able to sign on to the VM with regular user privileges.

Remote PowerShell

You can initiate a remote PowerShell session from hosts on the internet. By default, remote PowerShell uses TCP port 5985 for HTTP connections and TCP port 5986 for HTTPS connections. These ports must be accessible to the host you are making your connection from for the connection to be successfully established.

Another option is to run a Cloud Shell session in a browser and perform PowerShell remote administration in this manner. Cloud Shell is a browser-based CLI and a lot simpler to use than adding the Azure CLI to your local computer. There is a Cloud Shell icon in the top panel of the Azure portal.

You can enable remote PowerShell on an Azure IaaS Windows VM by performing the following steps from Cloud Shell:

  1. Ensure that Cloud Shell has PowerShell enabled by running the pwsh command.
  2. At the PowerShell prompt in Cloud Shell, type the following command to enter local Administrator credentials for the Azure IaaS Windows VM:

    $cred=get-credential

  3. At the PowerShell prompt in Cloud Shell, type the following command to enable PowerShell remoting on the Windows Server IaaS VM, where the VM name is 2022-IO-A and the resource group that hosts the VM is 2022-IO-RG:

    Enable-AzVMPSRemoting -Name 2022-IO-A -ResourceGroupName 2022-IO-RG -Protocol https -OSType Windows

  4. Once this command has completed executing, you can use the Enter-AzVM cmdlet to establish a remote PowerShell session. For example, run this command to connect to the IaaS VM named 2022-IO-A in resource group 2022-IO-RG:

    Enter-AzVM -Name 2022-IO-A -ResourceGroupName 2022-IO-RG -Credential $cred

Azure Bastion

Azure Bastion allows you to establish an RDP session to a Windows Server IaaS VM through a standards-compliant browser such as Microsoft Edge or Google Chrome rather than having to use a remote desktop client. You can think of Azure Bastion as “jumpbox as a service” because it allows access to IaaS VMs that do not have a public IP address. Before the release of Azure Bastion, the only way to gain access to an IaaS VM that didn’t have a public IP address was either through a VPN to the virtual network that hosted the VM or by deploying a jumpbox VM with a public IP address, from which you then created a secondary connection to the target VM. Bastion also supports SSH connections to Linux IaaS VMs and to Windows Server IaaS VMs on which you have configured the SSH server service.

Prior to deploying Azure Bastion, you need to create a special subnet named AzureBastionSubnet on the virtual network that hosts your IaaS VMs. Once you deploy Azure Bastion, the service will manage the network security group configuration to allow you to successfully make connections.

Just-in-Time VM access

Rather than have management ports, such as TCP port 3389 used for Remote Desktop Protocol, open to hosts on the internet all the time, Just-in-Time (JIT) VM access allows you to open a specific management port for a limited duration and only to a small range of IP addresses. You only need to use JIT if you require management port access to an Azure IaaS VM from a host on the internet. If the IaaS VM only has a private IP address or you can get by using a browser-based RDP session, then Azure Bastion is likely a better option. JIT also allows an organization to log who has requested access, so you can figure out exactly which member of your team was the one who signed in and messed things up, which led to you writing up a report about the service outage. JIT VM access requires that you use the Azure Security Center, which incurs an extra monthly cost per IaaS VM. Keep that in mind if you are thinking about configuring JIT.

Windows Admin Center in the Azure portal

Windows Admin Center (WAC), available in the Azure portal, allows you to manage Windows Server IaaS VMs. When you deploy Windows Admin Center in the Azure portal, WAC will be deployed on each Azure IaaS VM that you wish to manage. Once this is done, you can navigate directly to an Azure portal blade containing the WAC interface instead of loading Windows Admin Center up directly on your administrative workstation or jumpbox server.

NEED MORE REVIEW? OVERVIEW OF AZURE BASTION

You can learn more about Azure Bastion at https://docs.microsoft.com/en-us/azure/bastion/bastion-overview.

EXAM TIP

Remember that the Azure Diagnostics extension writes data to a storage account and not to a Log Analytics workspace.

Skill 5.4: Troubleshoot Active Directory

Active Directory Domain Services is wonderful when everything works perfectly. When things aren’t working perfectly, AD DS can be a ball of frustration. Problems with AD DS can range from your assistant accidentally deleting a large number of user accounts, to domain controllers not replicating properly, to hybrid authentication not working nearly as effectively as it does in the product documentation.

This section covers how to:

  • Restore objects from AD Recycle Bin
  • Recover Active Directory database using Directory Services Restore Mode
  • Troubleshoot Active Directory replication
  • Recover SYSVOL
  • Troubleshoot hybrid authentication issues
  • Troubleshoot on-premises Active Directory

Restore objects from AD Recycle Bin

Active Directory Recycle Bin allows you to restore items that have been deleted from Active Directory but that are still present within the database because the tombstone lifetime has not been exceeded. Active Directory Recycle Bin requires that the forest functional level be set to Windows Server 2008 R2 or higher. You can’t use the Active Directory Recycle Bin to restore items that were deleted before you enabled Active Directory Recycle Bin.

Once it’s activated, you can’t deactivate the Active Directory Recycle Bin. There isn’t any great reason to want to deactivate AD Recycle Bin once it’s activated. You don’t have to use it to restore deleted items should you still prefer to go through the authoritative restore process.

To activate the Active Directory Recycle Bin, perform the following steps:

  1. Open the Active Directory Administrative Center and select the domain for which you want to enable the Recycle Bin.
  2. In the Tasks pane, select Enable Recycle Bin, as shown in Figure 5-4.
This screenshot shows the Active Directory Administrative Center. The Enable Recycle Bin task is selected in the list of tasks.

FIGURE 5-4 Enable Recycle Bin

After you have enabled the AD Recycle Bin, you can restore an object from the newly available Deleted Objects container. This is, of course, assuming that the object was deleted after the Recycle Bin was enabled and assuming that the tombstone lifetime value has not been exceeded. To recover the object, select the object in the Deleted Objects container and then select Restore or Restore To. Figure 5-5 shows a deleted item being selected that can then be restored to its original location. The Restore To option allows you to restore the object to another available location, such as another OU.

This screenshot shows the Deleted Objects container; a deleted account is shown.

FIGURE 5-5 Deleted Objects container

NEED MORE REVIEW? RESTORE ACCOUNTS WITH AD RECYCLE BIN

You can learn more about restoring deleted user accounts at https://docs.microsoft.com/en-us/troubleshoot/windows-server/identity/retore-deleted-accounts-and-groups-in-ad.

Recover Active Directory database using Directory Services Restore Mode

Directory Services Restore Mode (DSRM) allows you to perform an authoritative restore of deleted objects from the AD DS database. You must perform an authoritative restore of deleted items because if you don’t, the restored item is deleted the next time the AD database synchronizes with other domain controllers where the item is marked as deleted. Authoritative restores are covered later in this chapter. You configure the Directory Services Restore Mode password on the Domain Controller Options page of the Active Directory Domain Services Configuration Wizard, as shown in Figure 5-6. Note that even though a computer running Windows Server 2022 is being configured as a domain controller, the maximum forest and domain functional levels are Windows Server 2016. This is because there is no Windows Server 2019 or Windows Server 2022 domain or forest functional level.

This screenshot shows the Domain Controller Options page of the Active Directory Domain Services Configuration Wizard.

FIGURE 5-6 Configuring Domain Controller Options

In the event that you forget the DSRM password, which, in theory, should be unique for each domain controller in your organization, you can reset it by running ntdsutil.exe from an elevated command prompt and entering the following commands at the ntdsutil.exe prompt, at which point you are prompted to enter a new DSRM password:

set dsrm password

Reset password on server null

Tombstone lifetime

Sometimes an Active Directory object, such as a user account or even an entire OU, is accidentally or, on occasion, maliciously deleted. Rather than go through the process of re-creating the deleted item or items, it’s possible to restore the items. Deleted items are retained within the AD DS database for a period of time specified as the tombstone lifetime. You can recover a deleted item without having to restore the item from a backup of Active Directory as long as the item was deleted within the tombstone lifetime window.

The default tombstone lifetime for an Active Directory environment at the Windows Server 2008 forest functional level or higher is 180 days. You can check the value of the tombstone lifetime by issuing the following command from an elevated command prompt (replacing dc=Contoso,dc=Internal with the suffix of your organization’s forest root domain):

Dsquery * "cn=Directory Service,cn=Windows NT,cn=Services,cn=Configuration,dc=Contoso,dc=Internal" -scope base -attr tombstonelifetime
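Once you know the tombstone lifetime, deciding whether a deleted item is still recoverable without a backup is simple date arithmetic. The following minimal sketch illustrates the idea. The function name is hypothetical, and treating the final day as still inside the window is an assumption made for this example.

```python
from datetime import datetime, timedelta


def within_tombstone_lifetime(deleted_at, now, tombstone_days=180):
    """Return True if an item deleted at deleted_at is still inside the window.

    tombstone_days defaults to the 180-day value discussed in the text.
    The boundary is treated as inclusive, which is an assumption.
    """
    return now - deleted_at <= timedelta(days=tombstone_days)


if __name__ == "__main__":
    deleted = datetime(2024, 1, 1)
    print(within_tombstone_lifetime(deleted, datetime(2024, 6, 1)))
    print(within_tombstone_lifetime(deleted, datetime(2024, 7, 15)))
```

If the check fails, the item has aged out of the database, and recovering it requires restoring the AD DS database from a backup.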

For most organizations the 180-day default is fine, but some administrators might want to increase or decrease this value to give them a greater or lesser window for easily restoring deleted items. You can change the default tombstone lifetime by performing the following steps:

  1. From an elevated command prompt or PowerShell session, type ADSIEdit.msc.
  2. From the Action menu, select Connect to. In the Connection Settings dialog box, ensure that Configuration is selected under Select a well known Naming Context, as shown in Figure 5-7, and then select OK.
    This screenshot shows the Connection Settings dialog, with the Configuration context selected.

    FIGURE 5-7 Connection settings

  3. Navigate to, and then right-click the CN=Services, CN=Windows NT, CN=Directory Service node and select Properties.
  4. In the list of attributes, select tombstoneLifetime, as shown in Figure 5-8, and select Edit.
  5. Enter the new value, and then select OK twice.
This screenshot shows the Directory Service Properties dialog box. The tombstoneLifetime attribute is selected and is set to 180.

FIGURE 5-8 Tombstone lifetime

Authoritative restore

An authoritative restore is performed when you want the items you are recovering to overwrite items that are in the current Active Directory database. If you don’t perform an authoritative restore, Active Directory assumes that the restored data is simply out of date and overwrites it when it is synchronized from another domain controller. If you perform a normal restore of an item that was backed up last Tuesday and then deleted the following Thursday, the item is deleted again the next time the Active Directory database is synchronized. You do not need to perform an authoritative restore if you only have one domain controller (DC) in your organization because there is no other domain controller that can overwrite the changes.

Authoritative restore is useful in the following scenarios:

  • You haven’t enabled Active Directory Recycle Bin.
  • You have enabled Active Directory Recycle Bin, but the object you want to restore was deleted before you enabled Active Directory Recycle Bin.
  • You need to restore items that are older than the tombstone lifetime of the AD DS database.

To perform an authoritative restore, you need to reboot a DC into Directory Services Restore Mode. If you want to restore an item that is older than the tombstone lifetime of the AD DS database, you also need to restore the AD DS database. You can do this by restoring the system state data on the server. You’ll likely need to take the DC temporarily off the network to perform this operation because if you restore a computer with old system state data and the DC then synchronizes, all the data that you wish to recover will be deleted again.

You can configure a server to boot into Directory Services Restore Mode from the System Configuration utility. To do this, select Active Directory repair on the Boot tab, as shown in Figure 5-9. After you’ve finished with Directory Services Restore Mode, use the same utility to restore normal boot functionality.

This screenshot shows the System Configuration utility with the Active Directory repair option selected.

FIGURE 5-9 System Configuration

To enter Directory Services Restore Mode, you need to enter the Directory Services Restore Mode password.

To perform an authoritative restore, perform the following general steps:

  1. Choose a computer that functions as a Global Catalog server. This DC functions as your restore server.
  2. Locate the most recent system state backup that contains the objects that you want to restore.
  3. Restart the restore server in DSRM. Enter the DSRM password.
  4. Restore the system state data.
  5. Use the following command to restore items (where Mercury is the object name, Planets is the OU that it is contained in, and contoso.com is the host domain):

    Ntdsutil "activate instance ntds" "authoritative restore" "restore object cn=Mercury,ou=Planets,dc=contoso,dc=com" q q

  6. If an entire OU is deleted, you can use the Restore Subtree option. For example, if you deleted the Planets OU and all the accounts that it held in the contoso.com domain, you could use the following command to restore it and all the items it contained:

    Ntdsutil "activate instance ntds" "authoritative restore" "restore subtree OU=Planets,DC=contoso,DC=com" q q

If you need to perform an authoritative restore of SYSVOL, take the AD DS DC offline, restore the system state data from backup, and then modify the msDFSR-Options attribute of the SYSVOL subscription. You can do this once you enable the Advanced Features setting in Active Directory Users and Computers and navigate to the Domain System Volume item under DFSR-LocalSettings in the Domain Controllers container. Once you have performed this step, you can reconnect the DC to the network and the restored SYSVOL will overwrite SYSVOL on other AD DS domain controllers in the domain.

Nonauthoritative restore

When you perform a nonauthoritative restore, you restore a backup of Active Directory that’s in a known good state. When rebooted, the domain controller contacts replication partners and overwrites the contents of the nonauthoritative restore with all updates that have occurred to the database since the backup was taken. Nonauthoritative restores are appropriate when the Active Directory database on a domain controller has been corrupted and needs to be recovered. You don’t use a nonauthoritative restore to recover deleted items, since any deleted items that are restored when performing the nonauthoritative restore will be overwritten when changes replicate from other DCs.

Performing a full system recovery on a DC functions in a similar way to performing a nonauthoritative restore. When the recovered DC boots, all changes that have occurred in Active Directory since the backup was taken overwrite existing information in the database.

Active Directory snapshots

You can use ntdsutil.exe to create snapshots of the Active Directory database. A snapshot is a point-in-time copy of the database. You can use tools to examine the contents of the database as it existed at that point in time. For example, in the event that user accounts were removed from a security group but those user accounts were not deleted, you could see the previous membership of that security group by mounting a previous snapshot and examining the group properties in the mounted snapshot.

It is also possible to transfer objects from the snapshot of the Active Directory database back into the version currently used with your domain's domain controllers. The AD DS service must be running to create a snapshot.

To create a snapshot, execute the following command:

Ntdsutil snapshot "Activate Instance NTDS" create quit quit

Each snapshot is identified by a GUID. You can create a scheduled task to create snapshots on a regular basis. You can view a list of all current snapshots on a domain controller by running the following command:

Ntdsutil snapshot "list all" quit quit

To mount a snapshot, make a note of the GUID of the snapshot that you want to mount and then issue the following command:

Ntdsutil "activate instance ntds" snapshot "mount {GUID}" quit quit

When mounting snapshots, you must use the {} braces with the GUID. You can also use the snapshot number associated with the GUID when mounting the snapshot with the ntdsutil command. This number is always an odd number.

When the snapshot mounts, take a note of the path associated with the snapshot. You use this path when mounting the snapshot with dsamain. For example, to use dsamain with the snapshot mounted as c:\$SNAP_201212291630_VOLUMEC$\, issue this command:

Dsamain /dbpath 'c:\$SNAP_201212291630_VOLUMEC$\Windows\NTDS\ntds.dit' /ldapport 50000

You can choose to mount the snapshot using any available TCP port number; 50000 is just easy to remember. Leave the PowerShell window open when performing this action. After the snapshot is mounted, you can access it using Active Directory Users and Computers. To do this, perform the following steps:

  1. Open Active Directory Users and Computers.
  2. Right-click the root node, and select Change Domain Controller.
  3. In the Change Directory Server dialog box, enter the name of the domain controller and the port, and select OK. You can then view the contents of the snapshot using Active Directory Users and Computers in the same way that you would the contents of the current directory.

You can dismount the snapshot by pressing Ctrl+C to close dsamain and then executing the following command to dismount the snapshot:

Ntdsutil.exe "activate instance ntds" snapshot "unmount {GUID}" quit quit

Other methods of recovering deleted items

Although the recommended way of ensuring that deleted Active Directory objects are recoverable is to enable the Active Directory Recycle Bin or to perform an authoritative restore using DSRM, you can also use tombstone reanimation to recover a deleted object. Tombstone reanimation involves using the ldp.exe utility to modify the attributes of the deleted object so that it no longer has the deleted attribute. Because it may lead to unpredictable results, you should use tombstone reanimation only if no backups of the system state data exist and you haven’t enabled the Active Directory Recycle Bin.

Although Active Directory snapshots do represent copies of the Active Directory database at a particular point in time, you should use mounted snapshots only to determine which backup contains the items you want to authoritatively restore. It is possible to export objects from snapshots and to reimport them into Active Directory using tools such as LDIFDE (LDAP Data Interchange Format Data Exchange), but this can lead to unpredictable results.

Troubleshoot Active Directory replication

Replication makes it possible for changes that are made on one AD DS domain controller to be replicated to other domain controllers in the domain and, in some cases, to other domain controllers in the forest. Rather than replicating the AD DS database in its entirety, the replication process is made more efficient by splitting the database into logical partitions. Replication occurs at the partition level, with some partitions only replicating to domain controllers within the local domain, some partitions replicating only to enrolled domain controllers, and some partitions replicating to all domain controllers in the forest. AD DS includes the following default partitions:

  • Configuration partition This partition stores forest-wide AD DS structure information, including domain, site, and domain controller location data. The configuration partition also holds information about DHCP server authorization and Active Directory Certificate Services certificate templates. The configuration partition replicates to all domain controllers in the forest.
  • Schema partition The schema partition stores definitions of all objects and attributes as well as the rules for creating and manipulating those objects. There is a default set of classes and attributes that cannot be changed, but it’s possible to extend the schema and add new attributes and classes. Only the domain controller that holds the Schema Master FSMO role is able to extend the schema. The schema partition replicates to all domain controllers in the forest.
  • Domain partition The domain partition holds information about domain-specific objects such as organizational units, domain-related settings, and user, group, and computer accounts. A new domain partition is created each time you add a new domain to the forest. The domain partition replicates to all domain controllers in a domain. All objects in every domain partition are stored in the Global Catalog, but these objects are stored only with some, not all, of their attribute values.
  • Application partition Application partitions store application-specific information for applications that store information in AD DS. There can be multiple application partitions, each of which is used by different applications. You can configure application partitions so that they replicate only to some domain controllers in a forest. For example, you can create specific application partitions to be used for DNS replication so that DNS zones replicate to some, but not all, domain controllers in the forest.

Domains running at the Windows Server 2008 and higher functional level support attribute-level replication. Rather than replicate the entire object when a change is made to an attribute on that object, such as when group membership changes for a user account, only the attribute that changes is replicated to other domain controllers. Attribute-level replication substantially reduces the amount of data that needs to be transmitted when objects stored in AD DS are modified.

Understanding multimaster replication

AD DS uses multimaster replication. This means that any writable domain controller is able to make modifications of the AD DS database and to have those modifications propagate to the other domain controllers in the domain. Domain controllers use pull replication to acquire changes from other domain controllers. A domain controller may pull changes after being notified by replication partners that changes are available. A domain controller notifies its first replication partner that a change has occurred within 15 seconds and additional replication partners every 3 seconds after the previous notification. Domain controllers also periodically poll replication partners to determine whether changes are available so that those changes can be pulled and applied to the local copy of the relevant partition. By default, polling occurs once every 60 minutes. You can alter this schedule by editing the properties of the connection object in the Active Directory Sites and Services console.
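The change notification timing described above can be worked through with a little arithmetic. The sketch below simply applies the default intervals from the text (15 seconds for the first partner, then 3 seconds per additional partner). The function name is hypothetical, and the results are upper bounds rather than exact times.

```python
def notification_offsets(partner_count, first_delay=15, subsequent_delay=3):
    """Return the latest time, in seconds after a change, at which each
    replication partner is notified, in notification order."""
    if partner_count < 0:
        raise ValueError("partner_count cannot be negative")
    return [first_delay + index * subsequent_delay
            for index in range(partner_count)]


if __name__ == "__main__":
    # A domain controller with four replication partners.
    print(notification_offsets(4))  # prints [15, 18, 21, 24]
```

With these defaults, a domain controller with four replication partners has notified all of them within 24 seconds of the change.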

Knowledge Consistency Checker (KCC)

The Knowledge Consistency Checker (KCC) runs on each domain controller. The KCC is responsible for creating and optimizing the replication paths between domain controllers located at a specific site. In the event that a domain controller is added or removed from a site, the KCC automatically reconfigures the site’s replication topology. The KCC topology organization process occurs every 15 minutes by default. Although you can change this value by editing the registry, you can also trigger an update using the repadmin command-line tool with the /kcc switch.

Store and forward replication

AD DS supports store and forward replication. For example, say the Canberra and Melbourne branch offices are enrolled in a custom application partition. These branch offices aren’t connected to each other, but they are connected to the Sydney head office. In this case, changes made to objects stored in the application partition at Canberra can be pulled by the domain controller in Sydney. The Melbourne domain controller can then pull those changes from the domain controller in Sydney.
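The Canberra, Sydney, and Melbourne example can be modeled as a small graph in which a change travels one replication hop at a time. This is a conceptual sketch only: the site names come from the example above, while the function name and the dictionary of links are hypothetical simplifications of real replication topology.

```python
from collections import deque


def replication_hops(links, origin):
    """Return how many pull-replication hops a change needs to reach each site.

    links maps each site to the sites it is directly connected to.
    Sites that cannot be reached from origin are omitted from the result.
    """
    hops = {origin: 0}
    queue = deque([origin])
    while queue:
        site = queue.popleft()
        for neighbor in links.get(site, []):
            if neighbor not in hops:
                # The neighbor pulls the change from the current site.
                hops[neighbor] = hops[site] + 1
                queue.append(neighbor)
    return hops


if __name__ == "__main__":
    links = {
        "Sydney": ["Canberra", "Melbourne"],
        "Canberra": ["Sydney"],
        "Melbourne": ["Sydney"],
    }
    print(replication_hops(links, "Canberra"))
```

In this model, a change made in Canberra reaches Sydney in one hop and Melbourne in two, even though Canberra and Melbourne have no direct connection.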

Conflict resolution

In an environment that supports multimaster replication, it’s possible that updates may be made to the same object at the same time in two or more different places. Active Directory includes sophisticated technologies that minimize the chance that these conflicts will cause problems, even when conflicting updates occur in locations that are distant from each other.

Each domain controller tracks updates by using update sequence numbers (USNs). Each time a domain controller processes an update, whether the update was performed locally or acquired through replication, it increments the USN and associates the new value with the update. USNs are unique to each domain controller because each domain controller processes a different number of updates from every other domain controller.

When conflicting updates are made to the same attribute, the domain controller that wrote the most recent change, known as the last writer, wins. Because each domain controller’s clock might not be precisely synchronized with every other domain controller’s clock, last write isn’t simply determined by a comparison of time stamps. Similarly, because USNs are unique to each domain controller, a direct comparison of USNs is not made. Instead, the conflict resolution algorithm looks at the attribute version number. This is a number that indicates how many times the attribute has changed and is calculated using USNs. When the same attribute has been changed on different domain controllers, the attribute with the higher attribute version number wins. If the attribute version number is the same, the attribute modification time stamps are compared, with the most recent change being deemed authoritative.
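The comparison order described above (attribute version number first, then modification time stamp) can be sketched as follows. This is a conceptual model and not how AD DS is implemented: the AttributeWrite class and its field names are hypothetical, and the case where both values tie is not covered in the text, so the sketch simply returns its first argument.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class AttributeWrite:
    """One version of an attribute as written on a domain controller."""
    value: str
    version: int          # attribute version number
    timestamp: datetime   # attribute modification time stamp
    origin: str           # hypothetical label for the originating DC


def resolve_conflict(first, second):
    """Return the write that wins under the rules described in the text."""
    # Rule 1: the higher attribute version number wins.
    if first.version != second.version:
        return first if first.version > second.version else second
    # Rule 2: on equal versions, the more recent time stamp wins.
    if first.timestamp != second.timestamp:
        return first if first.timestamp > second.timestamp else second
    # Full tie: not described in the text; this sketch keeps the first write.
    return first


if __name__ == "__main__":
    sydney = AttributeWrite("Sales", 3, datetime(2024, 5, 1, 9, 0), "SYD-DC1")
    perth = AttributeWrite("Marketing", 2, datetime(2024, 5, 1, 9, 5), "PER-DC1")
    print(resolve_conflict(sydney, perth).value)  # prints Sales
```

Note that the Sydney write wins even though its time stamp is older, because the version number is compared first.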

If you add or move an object to a container that was deleted on another domain controller at the same time, the object is moved to the LostAndFound container. You can view this container when you enable the Advanced Features option in the Active Directory Users and Computers console.

RODC replication

The key difference between an RODC and a writable domain controller is that RODCs aren’t able to update the Active Directory database and that they only host password information for a subset of security principals. When a client in a site that only has RODCs needs to make a change to the Active Directory database, that change is forwarded to a writable domain controller in another site. When considering replication, remember that all RODC-related replication is incoming and that other domain controllers do not pull updates from the AD DS database hosted on an RODC.

RODCs use the usual replication schedule to pull updates from writable domain controllers except in certain cases, in which RODCs perform inbound replication using a replicate-single-object (RSO) operation. These cases include:

  • The password of a user whose account password is stored on the RODC is changed.
  • A DNS record update occurs where the DNS client performing the update attempts to use the RODC to process the update and is then redirected by the RODC to a writable DC that hosts the appropriate Active Directory–integrated DNS zone.
  • Client attributes, including client name, DnsHostName, OsName, OsVersionInfo, supported encryption types, and LastLogonTimeStamp, are updated.

These updates occur outside the usual replication schedule because they involve objects and attributes that are important to security. An example is when a user at a site that uses RODCs calls the service desk to have their password reset. The service desk staff member, located in another site, resets the password using a writable domain controller. If a special RSO operation weren’t performed, the user would have to wait for the change to replicate to the site before being able to sign on with the newly reset password.

Monitor and manage replication

You can use the Active Directory Sites and Services console to trigger replication. You can trigger replication on a specific domain controller by right-clicking the connection object and selecting Replicate Now. When you do this, the domain controller replicates with all of its replication partners.

You can also monitor replication as it occurs using DirectoryServices performance counters in Performance Monitor. Through Performance Monitor, you can view inbound and outbound replication, including the number of inbound objects in the queue and pending synchronizations.

Repadmin

You can use the repadmin command-line tool to manage and monitor replication. This tool is especially useful for diagnosing where problems exist in a replication topology. For example, you can use repadmin with the following switches:

  • replsummary Generates information showing when replication between partners has failed. You can also use this switch to view information about the largest intermission between replication events.
  • showrepl Views specific inbound replication traffic, including objects that were replicated and the date stamps associated with that traffic.
  • prp Determines which user account passwords are being stored on an RODC.
  • kcc Forces the KCC to recalculate a domain controller’s inbound replication topology.
  • queue Enables you to display inbound replication requests that a domain controller must make to reach a state of convergence with source replication partners.
  • replicate Forces replication of a specific directory partition to a specific destination domain controller.
  • replsingleobj Use this switch when you need to replicate a single object between two domain controllers.
  • rodcpwdrepl Enables you to populate RODCs with the passwords of specific users.
  • showutdvec Displays the highest USN value recorded for committed replication operations on a specific DC.

NEED MORE REVIEW? TROUBLESHOOTING AD DS REPLICATION

You can learn more about troubleshooting AD DS replication at https://docs.microsoft.com/en-us/windows-server/identity/ad-ds/manage/troubleshoot/troubleshooting-activedirectory-replication-problems.

Recover SYSVOL

SYSVOL (system volume) is a special set of folders on each AD DS domain controller that store group policy templates, junction points, and startup and logon scripts. The default location on each AD DS domain controller is the %SYSTEMROOT%\SYSVOL\sysvol folder. When in a healthy state, SYSVOL replicates to each domain controller in an AD DS domain.

In some cases when replication is failing and you are unable to successfully perform maintenance tasks on SYSVOL, you may choose to rebuild SYSVOL. You should only rebuild SYSVOL in the event that the other domain controllers in a domain have a healthy and functioning SYSVOL.

To rebuild SYSVOL, perform the following general steps:

  1. Identify a replication partner of the AD DS DC that has SYSVOL in a healthy state.
  2. Restart the impacted AD DS DC in DSRM.
  3. Stop the DFS Replication Service and Netlogon Service on the impacted AD DS DC.
  4. Delete the SYSVOL folder on the AD DS DC.
  5. Create a copy of the SYSVOL folder on the healthy replication partner.
  6. In the copy of the SYSVOL folder on the replication partner, update the path for <JUNCTION> from the healthy DC’s FQDN to the FQDN of the DC you are repairing.
  7. Use robocopy to copy the duplicate SYSVOL folder off the healthy replication partner to the location on the impacted AD DS DC that you removed the problematic SYSVOL folder from in Step 4.
  8. Restart the AD DS DC in normal mode and determine whether the problems with SYSVOL have been resolved.

NEED MORE REVIEW? RECOVER SYSVOL

You can learn more about recovering SYSVOL at https://docs.microsoft.com/en-us/previous-versions/windows/it-pro/windows-server-2008-r2-and-2008/cc816596(v=ws.10).

Troubleshoot hybrid authentication issues

The process of being able to use a single identity for on-premises and cloud resources is rarely straightforward. The first step in troubleshooting hybrid authentication involves diagnosing where the problem might occur. Is it an issue of identity synchronization? Is it an issue related to password hashes or pass-through authentication? Is the cause a problem related to AAD single sign-on?

Troubleshoot identity synchronization

When you configure Azure AD Connect to synchronize identities from an on-premises instance of AD DS with Azure AD, a Synchronization Errors report becomes available in the Azure portal. Errors that may be displayed in this report include the following:

  • Data mismatch errors Azure AD does not allow two or more objects to have the same value for certain attributes, including proxyAddresses, userPrincipalName, onPremisesSecurityIdentifier, and objectId. Identify which two objects are in conflict and determine which one should be present in Azure AD. This may involve changing the attribute in one on-premises object or deleting that object.
  • Object type mismatch Occasionally two objects of different types (user, group, or contact) have the same attribute values. An example is a mail-enabled security group being configured with the same proxyAddresses attribute value as a user account. Determine which objects have the duplicate attribute and remediate by changing one of the attribute values.
  • Data validation errors This error can occur when the userPrincipalName attribute value has invalid or unsupported characters or doesn’t follow the appropriate format.
  • Access violation errors Occurs when Azure AD Connect attempts to update a cloud-only object.
  • Large object errors An attribute value exceeds the size limit or length limit. Often occurs with the userCertificate, userSMIMECertificate, thumbnailPhoto, and proxyAddresses attributes.
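
When tracking down a data mismatch or object type mismatch error, it can help to scan an export of on-premises objects for attribute values that are shared by more than one object. The following Python sketch shows one way to do that. It is illustrative only: the find_duplicates function, the dictionary layout, and the sample accounts are hypothetical, and treating values as case-insensitive is an assumption made for this example.

```python
from collections import defaultdict

def find_duplicates(objects, attribute):
    """Map each attribute value (lowercased) to the names of objects sharing it."""
    owners = defaultdict(list)
    for obj in objects:
        values = obj.get(attribute, [])
        if isinstance(values, str):      # single-valued attribute
            values = [values]
        for value in values:
            # Assumption: values are compared case-insensitively.
            owners[value.lower()].append(obj["name"])
    # Keep only values claimed by more than one object.
    return {value: names for value, names in owners.items() if len(names) > 1}

# Hypothetical export of two on-premises user objects.
directory = [
    {"name": "Bob Smith", "userPrincipalName": "bob@contoso.com",
     "proxyAddresses": ["SMTP:bob@contoso.com"]},
    {"name": "Bob Stone", "userPrincipalName": "bobs@contoso.com",
     "proxyAddresses": ["smtp:bob@contoso.com"]},
]

print(find_duplicates(directory, "proxyAddresses"))
```

Here both accounts claim the same proxy address, so the function reports that value along with the two object names, which tells you which objects to compare before changing or removing one of the values.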

NEED MORE REVIEW? IDENTITY SYNCHRONIZATION

You can learn more about troubleshooting identity synchronization at https://docs.microsoft.com/en-us/azure/active-directory/hybrid/tshoot-connect-sync-errors.

Troubleshoot password hash synchronization

Azure AD Connect includes several troubleshooting-related PowerShell scripts. To allow these scripts to run, open a PowerShell session on the server on which Azure AD Connect is installed and ensure that the execution policy is set to RemoteSigned using the following command:

Set-ExecutionPolicy RemoteSigned

You can then start the Azure AD Connect wizard and select Troubleshoot from the Additional Tasks page. This will open a troubleshooting menu in PowerShell. From this menu you can then select Troubleshoot password hash synchronization. The tests in the submenu include troubleshooting the issue where no passwords are synchronized or when one specific object’s password is not synchronized.

The Password hash synchronization does not work at all task performs the following tests:

  • Verifies that password hash synchronization is enabled for the Azure AD tenant
  • Verifies that the Azure AD Connect server is not in staging mode
  • Verifies that the on-premises AD DS connector has the password hash synchronization feature enabled
  • Searches the Windows Application Event logs for password hash synchronization heartbeat activity
  • Validates that the AD DS accounts used by the connector have the appropriate username, password, and permissions

If you select the Password is not synchronized for a specific user account option, the following checks are performed:

  • Checks the state of the Active Directory object in the Active Directory connector space, Metaverse, and Azure AD connector space
  • Verifies that there are synchronization rules with password hash synchronization enabled and applied to the Active Directory object
  • Determines the last attempt to synchronize the password for the object

Azure AD Connect will not synchronize temporary passwords to Azure AD. If an account has the Change password at next logon option set, it will be necessary for the user to change their password before hash synchronization can occur to replicate a hash of the password to Azure.

NEED MORE REVIEW? TROUBLESHOOT PASSWORD HASH SYNCHRONIZATION

You can learn more about troubleshooting hybrid password hash synchronization at https://docs.microsoft.com/en-us/azure/active-directory/hybrid/tshoot-connect-password-hash-synchronization.

Troubleshoot pass-through authentication

The first step in troubleshooting pass-through authentication is checking the status of the feature in the Azure AD Connect blade of the Azure AD admin center. Pass-through authentication provides the following user sign-in error messages that can help with troubleshooting:

  • Unable to connect to Active Directory Ensure that agent servers are members of the same AD forest as the users whose passwords need to be validated. Ensure users are able to connect to Active Directory. (AADSTS80001)
  • A timeout occurred connecting to Active Directory Verify that Active Directory is available and is responding to requests from the agents. (AADSTS80002)
  • Validation encountered unpredictable WebException A transient error. Retry the request. If it continues to fail, contact support. (AADSTS80005)
  • An error occurred communicating with Active Directory Check the agent logs for more information and verify that Active Directory is operating as expected. (AADSTS80007)
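
If you triage sign-in failures from exported messages or tickets, a small lookup keyed on the AADSTS code can turn an error string into the suggested first check. The Python sketch below is a convenience illustration only: the PTA_SIGN_IN_ERRORS table and the suggest_fix function are hypothetical names, the remediation text paraphrases the list above, and the sample message is made up for the example.

```python
import re

# Remediation hints paraphrased from the pass-through authentication
# sign-in error list. The table and function names are hypothetical.
PTA_SIGN_IN_ERRORS = {
    "AADSTS80001": "Check that agent servers are in the same AD forest as the "
                   "users and that Active Directory is reachable.",
    "AADSTS80002": "Verify that Active Directory is available and responding "
                   "to requests from the agents.",
    "AADSTS80005": "Transient error: retry, and contact support if it keeps failing.",
    "AADSTS80007": "Check the agent logs and verify that Active Directory is "
                   "operating as expected.",
}

def suggest_fix(message: str) -> str:
    """Find the first AADSTS code in a message and return the matching hint."""
    match = re.search(r"AADSTS\d+", message)
    if match is None:
        return "No AADSTS code found in the message."
    code = match.group()
    return PTA_SIGN_IN_ERRORS.get(code, f"{code} is not in the lookup table.")

print(suggest_fix("Sign-in failed with AADSTS80005 for user@contoso.com"))
```

Codes that are not in the table fall through to a generic message, so the helper stays safe to run over mixed sign-in errors.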

Users may also get an invalid username/password error in the event that their on-premises UserPrincipalName (UPN) differs from their Azure UPN. If your tenancy has an Azure AD Premium license, you can view the sign-in activity report in Azure AD to diagnose pass-through authentication issues. The sign-in activity report will provide the following sign-in failure telemetry:

  • User's Active Directory password has expired Resolve by resetting the user's password in your on-premises AD DS.
  • No Authentication Agent available Resolve by installing and registering an Authentication Agent for pass-through authentication.
  • Authentication Agent’s password validation request timed out Check if the computer that hosts the Authentication Agent can communicate with on-premises AD DS.
  • Invalid response received by Authentication Agent If this occurs for multiple users, check Active Directory Domain Services replication.
  • Authentication Agent unable to connect to Active Directory Check if the computer that hosts the Authentication Agent can communicate with on-premises AD DS.
  • Authentication Agent unable to decrypt password Uninstall the existing Authentication Agent and install and register a new Authentication Agent.
  • Authentication Agent unable to retrieve decryption key Uninstall the existing Authentication Agent and install and register a new Authentication Agent.

You also need to ensure that the Authentication Agent or the computer that hosts Azure AD Connect is able to communicate with both on-premises AD DS and Azure. Communication with Azure can occur either directly or through a proxy.

NEED MORE REVIEW? TROUBLESHOOT PASS-THROUGH AUTHENTICATION

You can learn more about troubleshooting hybrid pass-through authentication at https://docs.microsoft.com/en-us/azure/active-directory/hybrid/tshoot-connect-pass-through-authentication.

Azure Active Directory Seamless Single Sign-On

You can check whether Seamless SSO is enabled for an Azure AD tenant on the Azure AD Connect blade of the Azure Active Directory admin center in the Azure portal. If the Azure AD tenancy has an Azure AD Premium license associated with it, the sign-in activity report in the Azure Active Directory admin center can provide you with sign-in failure reasons. Problems can occur with Azure AD SSO under the following circumstances:

  • If Seamless SSO has been disabled and then reenabled on a tenancy as part of a troubleshooting process, users may not be able to perform SSO for up to 10 hours due to their existing Kerberos tickets remaining valid.
  • Users that are members of too many AD DS groups are unable to perform SSO because their Kerberos tickets become too large to process. You may need to reduce the number of groups a user is a member of to successfully perform SSO. Sign-in error code 81001 will list “user’s Kerberos ticket too large.”
  • Seamless SSO isn’t supported if more than 30 AD DS forests are being synchronized with a single Azure AD tenant.
  • If forest trust relationships exist for on-premises AD DS forests, enabling SSO in one forest enables SSO in all forests.
  • The policy that enables Seamless SSO has a 25,600-character limit. That limit applies to everything included in the policy including forest names. Extremely long forest names with multiple forests can impact the policy limit.
  • If sign-in error code 81010 reports “Seamless SSO failed because user’s Kerberos ticket has expired or is invalid,” the user will need to sign on from a domain-joined device on the on-premises network. You can use the klist purge command to remove existing tickets and then retry the logon.
  • Ensure that the device’s time is synchronized with Active Directory and that the computer that hosts the PDC emulator role has time synchronized with an accurate external time source.
  • Use the klist command from a command prompt to verify that tickets issued for the AZUREADSSOACC computer account are present.

NEED MORE REVIEW? TROUBLESHOOT AZURE ACTIVE DIRECTORY SEAMLESS SINGLE SIGN-ON

You can learn more about troubleshooting Azure Active Directory Seamless Single Sign-On at https://docs.microsoft.com/en-us/azure/active-directory/hybrid/tshoot-connect-sso.

Troubleshoot on-premises Active Directory

A common cause of AD DS problems is DNS. AD DS domain controllers and clients use DNS to locate resources in AD DS. One of the simplest and most effective troubleshooting steps you can take when trying to resolve problems with AD DS is to check which DNS servers have been configured for AD DS domain controllers and the servers and clients that need to interact with them. Other troubleshooting strategies covered in this section include DCDiag, database optimization, and performing metadata cleanup.

DCDiag

DCDiag.exe is a tool that is built into Windows Server that you can use from an elevated command prompt to check the health of a specific domain controller. You can run it locally on an AD DS domain controller or run it remotely using the /s parameter against a remote domain controller in the same forest. You can also use it with the /e parameter to check all servers in the enterprise. The /c parameter runs all tests except the DCPromo and RegisterInDNS tests. You can use the dcdiag command with the /fix parameter to fix Service Principal Names (SPNs) related to the Machine Account object of a domain controller.

DCDiag checks include the following, which can be run separately or all at once:

  • DNS Tests including network connectivity, DNS client configuration, domain controller registration in DNS, service availability, zone configuration, zone delegation, dynamic update, CNAME, and SRV record checks.
  • Connectivity Determines if domain controllers present in AD DS can be contacted.
  • Replication checks Determine if any replication errors have occurred.
  • NCSecDesc Verify that security descriptors on naming context heads are configured appropriately for replication.
  • NetLogons Determine if permissions are configured appropriately for replication.
  • Advertising Checks to verify that each DC advertises the roles it is able to perform.
  • KnowsOfRoleHolders Checks that each DC can contact each FSMO role holder.
  • Intersite Determines if failovers have occurred that block intersite replication.
  • FSMOCheck Checks if an FSMO role holder can contact the Kerberos Key Distribution Center server, time server, and Global Catalog server.
  • RidManager Determines whether the relative identifier master FSMO role is accessible and functioning correctly.
  • MachineAccount Determines whether the DC’s computer account is properly registered and services are advertised. Can repair if this account is missing or if account flags are incorrect.
  • Services Determines if all necessary AD DS DC services are running.
  • OutboundSecureChannels Checks that secure channels exist to all other domain controllers in the forest.
  • ObjectsReplicated Verifies that the Machine Account and the Directory System Agent objects have replicated.
  • KCCEvent Checks that the Knowledge Consistency Checker completes without error.
  • Systemlog Checks the system log for critical errors.
  • CheckSDRefDom Verifies security descriptors for application directory partitions.
  • VerifyReplicas Checks that application directory partitions are properly replicating to designated replicas.
  • CrossRefValidation Verifies integrity of cross-references.
  • Topology Ensures that the KCC has a fully connected topology for all domain controllers.
  • CutOffServers Determines if any AD DS DC is not receiving replication because replication partners are nonresponsive.
  • Sysvolcheck Checks SYSVOL health.
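
If you run DCDiag from automation rather than interactively, it can help to assemble the command line in one place. The following Python sketch builds an argument list from the parameters described above (/s, /e, /c, /fix, and /test). The build_dcdiag_args function and the server name DC1 are hypothetical, and the sketch only constructs the command; running it requires a Windows computer with dcdiag.exe available.

```python
def build_dcdiag_args(server=None, test=None, enterprise=False,
                      comprehensive=False, fix=False):
    """Return a dcdiag argument list, for example to pass to subprocess.run()."""
    args = ["dcdiag"]
    if server:
        args.append(f"/s:{server}")   # run against a remote domain controller
    if enterprise:
        args.append("/e")             # check all servers in the enterprise
    if comprehensive:
        args.append("/c")             # run the comprehensive set of tests
    if fix:
        args.append("/fix")           # repair SPNs on the DC's machine account
    if test:
        args.append(f"/test:{test}")  # run only the named test
    return args

# DC1 is a hypothetical domain controller name.
command = build_dcdiag_args(server="DC1", test="DNS")
print(" ".join(command))
```

Keeping the switches in a list rather than a single string avoids quoting problems when the list is later handed to a process launcher from an elevated session.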

Active Directory database optimization

There are several steps you can take to optimize your Active Directory database, including defragmenting the database, performing a file integrity check, and performing a semantic integrity check.

When you defragment the Active Directory database, a new copy of the database file, Ntds.dit, is created. You can defragment the Active Directory database or perform other operations only if the database is offline. You can take the Active Directory database offline by stopping the AD DS service, which you can do from the Services console or by issuing the following command from an elevated PowerShell prompt:

Stop-Service NTDS -Force

You use the ntdsutil.exe utility to perform the defragmentation using the following command:

ntdsutil.exe "activate instance ntds" files "compact to c:\" quit quit

After the defragmentation has completed, copy the defragmented database over the original located at C:\windows\NTDS\ntds.dit and delete all log files in the C:\windows\NTDS folder.

You can check the integrity of the file that stores the database using ntdsutil.exe by issuing the following command from an elevated prompt when the AD DS service is stopped:

ntdsutil.exe "activate instance ntds" files integrity quit quit

To verify that the AD DS database is internally consistent, you can run a semantic consistency check. The semantic check can also repair the database if problems are detected. You can perform a semantic check using ntdsutil.exe by issuing the following command:

ntdsutil.exe "activate instance ntds" "semantic database analysis" "verbose on" "go fixup" quit quit

Active Directory metadata cleanup

The graceful way to remove a domain controller is to run the Active Directory Domain Services Configuration Wizard to remove AD DS. You can also remove the domain controller gracefully by using the Uninstall-ADDSDomainController cmdlet. When you do this, the domain controller is removed, all references to the domain controller in Active Directory are also removed, and any FSMO roles that the domain controller hosted are transferred to other DCs in the domain.

Active Directory metadata cleanup is necessary if a domain controller has been forcibly removed from Active Directory. Here’s an example: an existing domain controller catches fire or is accidentally thrown out of a window by a systems administrator having a bad day. When this happens, references to the domain controller within Active Directory remain. These references, especially if the domain controller hosted FSMO roles, can cause problems if not removed. Metadata cleanup is the process of removing these references.

If you use the Active Directory Users and Computers or Active Directory Sites and Services console to delete the computer account of a domain controller, the metadata associated with the domain controller are cleaned up. The console will prompt you when you try to delete the account of a domain controller that can’t be contacted. You confirm that you can’t contact the domain controller. When you do this, metadata cleanup occurs automatically.

To remove server metadata using ntdsutil, issue the following command, where <ServerName> is the distinguished name of the domain controller whose metadata you want to remove from Active Directory:

Ntdsutil "metadata cleanup" "remove selected server <ServerName>"

NEED MORE REVIEW? TROUBLESHOOT AD DS

You can learn more about troubleshooting AD DS at https://docs.microsoft.com/en-us/windows-server/identity/ad-ds/deploy/troubleshooting-domain-controller-deployment.

EXAM TIP

Remember that you can view how Active Directory Domain Services looked at a previous point in time, including what the group memberships were, by mounting a previous snapshot and then connecting to that snapshot using Active Directory Users and Computers.

Chapter summary

  • You can monitor Windows Server performance using Performance Monitor. This allows you to view point-in-time information for configured performance counters.
  • Data collector sets allow you to collect performance counters and telemetry over time.
  • System Insights allows you to use machine learning to predict when CPU, storage, or network resources on a server might reach capacity.
  • Azure Monitor lets you collect performance counters to Azure and create alerts. You need the Log Analytics workspace ID and key to install the agent.
  • You can use the Network Watcher to troubleshoot hybrid connections.
  • When troubleshooting VM boot problems, you can create a copy of the OS boot disk and use a recovery VM to perform diagnosis and repair before replacing the original OS disk with the repaired disk.
  • When troubleshooting disk encryption issues ensure that there is connectivity between the IaaS VM and Azure Key Vault.
  • You can only restore objects from AD Recycle Bin after it has been enabled. You can use Directory Services Restore Mode to restore deleted items from a backup.
  • You can use repadmin.exe to troubleshoot Active Directory replication.
  • You can use dcdiag.exe to troubleshoot on-premises AD DS domain controllers.

Thought experiment

In this thought experiment, demonstrate your skills and knowledge of the topics covered in this chapter. You can find answers to this thought experiment in the next section.

You are in the process of examining an AD DS deployment that is reaching 20 years of age. In the last 20 years, the site topology has changed dramatically as new branch offices and sites have been added and removed, and domain controllers have been deployed, upgraded, demoted, and have sometimes suffered catastrophic hardware failure. As part of your process, you want to examine the health of each domain controller, determine how replication is functioning, and perform maintenance operations on the Active Directory databases. With this in mind, answer the following questions:

  1. What command-line utility can you use to check the health of an AD DS domain controller to ensure that all appropriate AD DS DNS records are registered for each AD DS DC?
  2. What command-line utility can you use to determine when replication last occurred between AD DS DC replication partners?
  3. What tool and procedure do you use to defragment the AD DS database and perform a semantic integrity check?

Thought experiment answers

This section contains the solution to the thought experiment. Each answer explains why the answer choice is correct.

  1. You use dcdiag.exe to assess the health of an AD DS domain controller. You can also use this utility to determine if all relevant DNS records related to the AD DS DC are present.
  2. You can use repadmin.exe with the replsummary switch to determine when replication last occurred between specific replication partners.
  3. You stop NTDS on the domain controller with the Stop-Service PowerShell cmdlet. You use ntdsutil.exe to mount and defragment the AD DS database file ntds.dit. You can also use ntdsutil.exe to perform a semantic integrity check on the file. You then copy the defragmented and checked file back to the original C:\windows\NTDS location after removing the original and restart NTDS on the AD DS DC.