Chapter 5. Troubleshooting Data Replication Issues

Within a Cisco Unified Communications Manager (CUCM) cluster, database replication is one of the functions most crucial to optimal performance, because the database is constantly changing. The cluster consists of the CUCM publisher and subscribers. The terms publisher and subscriber are derived from database operations and functionality. The publisher maintains the primary database and pushes it to the subscribers; that is, the subscribers carry an active subscription to the information published and offered by the publisher. It seems obvious, but the source of the terminology is often unknown. Knowing that it is all about database replication helps demystify the idea of the cluster: a set of nodes that together provide the actual private branch exchange (PBX) functionality.

Chapter Objectives

This chapter examines the basics of the database, the replication relationships, and how to break, re-establish, and repair those relationships. Upon completing this chapter, you will be able to

• Review the basics of database replication in CUCM

• Identify database replication issues in CUCM

• Describe the procedure to resolve database replication issues in a CUCM cluster

Database Replication Overview

Cisco CallManager, as it was known in the early days, was built on a Microsoft SQL Server database running on a Microsoft Windows Server platform. With Cisco CallManager 5.0, the product moved to a Linux-based operating system and the Informix database infrastructure, and it was renamed Cisco Unified Communications Manager (CUCM). Now, with the recent Collaboration System Release (CSR) 11, it has become far more than it was at its inception. What began as an IP PBX is now the foundation of a rich collaboration architecture that has far exceeded mere voice call control functionality.

What does all of this mean? It means that the underlying architecture and all the peripheral applications, services, clients, and other functions are even more reliant upon the health of this database.


Note

It is important to address a key aspect of working with database replication: patience.

The processes and commands discussed in this chapter may run quickly, but the resulting changes take time. Depending on cluster and database size, full replication can take anywhere from minutes to hours. Synchronization status is designated using a numerical value in the range of 0–4. Patience is key here. Wait for status 2 (good) on every node.


Cluster Overview

A cluster consists of one or more call control servers (nodes) connected via a data network to provide call-processing services to registered devices. The cluster provides other ancillary services, such as Trivial File Transfer Protocol (TFTP), computer telephony interface (CTI) services, media services, and more. The first node in the cluster is called the publisher. The publisher is the owner of the database. It has full read/write capabilities and control over the database. Subsequent nodes joining the cluster are known as subscribers. That is, they request access to the database and wish to be updated any time there is a change. As the database is updated by the publisher, those updates are sent to the Subscriber nodes.

Each node in the cluster is capable of providing call-processing services. However, each node is not an independent entity. In the days of time-division multiplexed (TDM) private branch exchange (PBX) call processing, each PBX was a fully independent call-processing entity (administered independently as well). The cluster is the PBX. The fact that these call-processing nodes can be spread out across the network while still being part of a single, all-encompassing call control element was one of the most compelling factors leading to the widespread adoption of IP telephony. Redundancy could be added at multiple layers and across geography without having to administer additional PBXs.

To function properly, CUCM needs to be able to retrieve configuration settings for registered devices. All these settings are stored in a database by using an IBM Informix Dynamic Server (IDS). This database is a repository for all things related to call control such as service parameters, features, device configurations, and of course, the dial plan.

As mentioned, the publisher keeps the master copy of the database. The subscribers merely read it. This relationship mandates a certain amount of traffic flow among the servers, along with very limited latency (80 ms roundtrip max). In fact, there are multiple traffic flows, depending on the purpose of the node.

IDS traffic flows to every node, regardless of its purpose. Some Subscriber nodes do not perform call processing: in many clusters, there are dedicated TFTP server nodes, whereas other nodes are dedicated to music on hold (MoH), conferencing, or other services. This traffic flows in a hub-and-spoke topology; database replication traffic flows only between the publisher and each of the Subscriber nodes, never directly between Subscriber nodes.

This traffic flow is not to be confused with call-processing user-facing feature replication traffic, which flows in a full mesh among all nodes, regardless of their function. Database modifications for user-facing call-processing features are made on the Subscriber nodes to which each endpoint/device is registered. So, the subscribers get to make some limited updates to the database. These updates must be replicated to all other nodes in the cluster to maintain database integrity and maintain redundancy for features and services. These features include

• Call Forward All (CFA)

• Message Waiting Indicator (MWI)

• Privacy Enable/Disable

• Extension Mobility (EM) login/logout

• Hunt Group login/logout

• Device Mobility

• Certificate Authority Proxy Function (CAPF) status for end/application users

• Credential hacking and authentication

Another traffic type flows among the call-processing subscribers. This traffic is Intra-Cluster Communications Signaling (ICCS). ICCS flows in a full mesh among all the call-processing nodes. This allows for a faster, real-time information exchange for exceedingly frequent changes of call and device states among the subscribers. It enables optimal call routing among the devices registered to the cluster. Figure 5-1 shows a graphical representation of database replication (IDS) and ICCS traffic flows.

Figure 5-1. Database Replication Overview

Image

Note

Network Time Protocol (NTP) is a crucial part of maintaining replication. The subscribers acquire NTP from the publisher. The publisher should obtain it from a highly reliable source. Cisco recommends that a Linux NTP server or Cisco IOS device be used to provide NTP service to the publisher. Use of a Microsoft Windows–based time service is not supported because Windows uses Simple Network Time Protocol (SNTP) to which the publisher cannot synchronize. The publisher should be synchronized to a Stratum 1, 2, or 3 time source. If the stratum is 5 or higher, the publisher will generate alarms.


Checking Replication State

The first step in identifying a replication issue is to know where to view the current state of replication. You must have some understanding of how replication errors may manifest in terms of configuration or user-reported issues.

You can check replication status from a number of places within the system. This includes the CLI, Cisco Unified Reporting Tool, and Real Time Monitoring Tool (RTMT). In all three cases, the replication status is shown based on a numeric Replication State:

• 0 — Replication is in initialization state

• 1 — Replicates have been created, but their count is incorrect

• 2 — Replication is good

• 3 — Replication is bad

• 4 — Replication setup failed
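These state codes frequently show up in scripts that screen-scrape CLI or RTMT output. As a minimal sketch (the dictionary and helper names are illustrative, not part of any Cisco API), the mapping looks like this:

```python
# Illustrative mapping of CUCM replication state codes (0-4) to their
# meanings, as reported by the CLI, Cisco Unified Reporting, and RTMT.
REPLICATION_STATES = {
    0: "Replication is in initialization state",
    1: "Replicates have been created, but their count is incorrect",
    2: "Replication is good",
    3: "Replication is bad",
    4: "Replication setup failed",
}

def is_healthy(state: int) -> bool:
    """Only state 2 indicates healthy replication on a node."""
    return state == 2

print(is_healthy(2))  # True
print(is_healthy(4))  # False
```

Remember the earlier note on patience: a node may sit at state 0 or 1 for some time after setup before settling at 2.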

From the CLI, issue the utils dbreplication runtimestate command. Example 5-1 shows the output of this command. Be patient. This command takes some time to return all the associated output.

Example 5-1. utils dbreplication runtimestate Command Output


admin:utils dbreplication runtimestate

Server Time: Sun Jan 10 13:44:03 CST 2016

Cluster Replication State: Replication status command started at: 2016-01-10-13-39
     Replication status command ENDED. Checked 692 tables out of 692
     Last Completed Table: devicenumplanmapremdestmap
    No Errors or Mismatches found.

     Use 'file view activelog cm/trace/dbl/sdi/ReplicationStatus.2016_01_10_13_39_38.out' to see the details


DB Version: ccm11_0_1_20000_2
Repltimeout set to: 300s
PROCESS option set to: 1

Cluster Detailed View from cucmpub (3 Servers):

                                      PING      DB/RPC/   REPL.    Replication    REPLICATION SETUP
SERVER-NAME         IP ADDRESS        (msec)    DbMon?    QUEUE    Group ID       (RTMT) & Details
-----------         ----------        ------    -------   -----    -----------    ------------------
cucmsub2            172.16.100.8      0.421     Y/Y/Y     0        (g_5)          (2) Setup Completed
cucmsub             172.16.100.2      0.207     Y/Y/Y     0        (g_4)          (2) Setup Completed
cucmpub             172.16.100.1      0.039     Y/Y/Y     0        (g_2)          (2) Setup Completed




admin:

Another commonly used command is utils dbreplication status. It shows the state (active/inactive) of replication among the nodes but does not show the status in terms of the 0–4 numerical values discussed. Example 5-2 shows the output of the command as issued for all nodes: utils dbreplication status all.

Example 5-2. utils dbreplication status all Command Output


admin:utils dbreplication status all
 Replication status check is now running in background.
Use command 'utils dbreplication runtimestate' to check its progress

The final output will be in file
cm/trace/dbl/sdi/ReplicationStatus.2016_01_10_14_36_49.out

Please use "file view activelog cm/trace/dbl/sdi/ReplicationStatus.2016_01_10_14_36_49.out " command to see the output
admin:file view activelog cm/trace/dbl/sdi/ReplicationStatus.2016_01_10_14_36_49.out

Sun Jan 10 14:36:49 2016 main()  DEBUG:  -->
Sun Jan 10 14:36:57 2016 main()  DEBUG:  Replication cluster summary:
SERVER                 ID STATE    STATUS     QUEUE  CONNECTION CHANGED
-----------------------------------------------------------------------
g_2_ccm11_0_1_20000_2    2 Active   Local           0
g_4_ccm11_0_1_20000_2    4 Active   Connected       0 Jan  1 16:01:48
g_5_ccm11_0_1_20000_2    5 Active   Connected       0 Jan  1 15:48:24
Sun Jan 10 14:37:19 2016 main()  DEBUG:  <--

end of the file reached
options: q=quit, n=next, p=prev, b=begin, e=end (lines 1 - 8 of 8) :
admin:

In Example 5-2, the highlighted column shows that replication is active among the nodes. The node on which the command was entered shows as Local in the connection column, while peer nodes show as Connected (hopefully). If the connection to a peer is lost, it shows as Dropped instead.
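When checking replication health across many nodes, the cluster summary in the ReplicationStatus file can be screened programmatically. The sketch below assumes the column layout shown in Example 5-2, with the third node changed to Dropped for demonstration; field spacing may differ between CUCM versions, and the parsing logic is illustrative only:

```python
import re

# Sample text mirroring the replication cluster summary format from
# Example 5-2 (the Dropped entry is fabricated for illustration).
summary = """\
g_2_ccm11_0_1_20000_2    2 Active   Local           0
g_4_ccm11_0_1_20000_2    4 Active   Connected       0 Jan  1 16:01:48
g_5_ccm11_0_1_20000_2    5 Active   Dropped         0 Jan  1 15:48:24
"""

# Flag any node whose connection state is neither Local nor Connected.
problem_nodes = []
for line in summary.splitlines():
    m = re.match(r"(\S+)\s+\d+\s+\S+\s+(\S+)", line)
    if m and m.group(2) not in ("Local", "Connected"):
        problem_nodes.append((m.group(1), m.group(2)))

print(problem_nodes)  # [('g_5_ccm11_0_1_20000_2', 'Dropped')]
```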

In the Cisco Unified Reporting Tool, a couple of reports deal specifically with database replication. From a status standpoint, the focus here is only on one of them: the Unified CM Database Status report. This report is CPU and time intensive. In general, it should be run only during off hours, as it takes a minimum of 10 seconds per node in the cluster to run.

Open a browser to the publisher and select the Cisco Unified Reporting option from the drop-down box in the top-right corner. Then click Go. Once there, log in (if necessary) and click System Reports. The list of reports is shown in the left column. Select the Cisco Unified CM Database Status report. Once it is selected, you might get a blank screen with three icons. One of those icons resembles a piece of paper with a bar graph. Click that icon to generate a new report. Figure 5-2 shows an example of the report generated.

Figure 5-2. Database Replication Status Report

Image

The report verifies connectivity among the nodes, replication status, name resolution, and much more. In Figure 5-2, the Unified CM Database Status box shows that all servers have the same replication count and that status is good. The View Details link was clicked to expand the output to show that status on all counts is 2 (good). The report also directs the viewer to the Database Summary screen in RTMT. Figure 5-3 shows the Database Summary screen in the Voice/Video section of RTMT.

Figure 5-3. Database Summary Screen in RTMT

Image

In Figure 5-3, five graphs are visible. They represent Change Notification Requests Queued in DB, Change Notification Requests Queued in Memory, Total Number of Connection Clients, Replicates Created, and Replication Status. A line is shown on the graph for each node and one for the cluster overall. Each is represented by a different color. In the table at the bottom of the figure, you can see the nodes and their statuses. As expected, all nodes show a status of 2.

RTMT performance counters can also show the status of database replication per node. In the System section of RTMT, click Performance. Drop down the list of counters for each node and select Number of Replicates Created and State of Replication. Double-click the Number of Replicates Created and the Replicate_State for the node, if you want to see both. This discussion is most interested in Replicate_State. Figure 5-4 shows both of the performance counters selected for the publisher and two subscribers.

Figure 5-4. Replication Performance Counters in RTMT

Image

In Figure 5-4, the number of replicates created across all nodes is identical. Additionally, the replication status of all nodes is shown as 2, as expected.

Database Replication Issues

Occasionally, database replication experiences anomalies, issues, and/or problems (though not necessarily in that order). The symptoms, as reported by users, may be intermittent, sporadic, and difficult, if not impossible, to reproduce. The cluster architecture allows a significant degree of autonomy for each server node in terms of fulfilling its role without excessive updates from the publisher. Administrative and/or configuration changes should be made to the publisher. If they cannot be replicated to the other nodes within the cluster, there will be issues in terms of database state that might result in operational issues.

IDS is a robust system and maintains a very high degree of reliability. However, some situations that might arise will impact the system’s capability to function as designed. They include

Network Connectivity: Reachability must be maintained between the nodes across the IP network.

Network Bandwidth: Replication is a real-time process and must be prioritized in terms of QoS and bandwidth availability end-to-end.

DNS: The replication process makes extensive use of DNS, and any misconfiguration or delayed response times may impact performance.

Excessive CPU Utilization of Peer Nodes: Peer nodes may be experiencing sustained periods of excessively high CPU utilization, which can preclude their capability to process replication updates.

NTP: Replication relies heavily on NTP to track and process replication information and ensure full synchronization.

Network Connectivity

It is obvious why network connectivity has the potential to cause replication issues; how to adequately troubleshoot it may not be. A simple ping from your workstation won't necessarily show an accurate picture of the traffic flow, because it tests the path between your workstation and the node(s) in question rather than the path between the nodes themselves. The same is true of traceroute. Example 5-3 shows a simple ping from a workstation to the Publisher node.

Example 5-3. ping Command Output


C:>ping 172.16.100.1

Pinging 172.16.100.1 with 32 bytes of data:
Reply from 172.16.100.1: bytes=32 time<1ms TTL=63
Reply from 172.16.100.1: bytes=32 time<1ms TTL=63
Reply from 172.16.100.1: bytes=32 time<1ms TTL=63
Reply from 172.16.100.1: bytes=32 time<1ms TTL=63

Ping statistics for 172.16.100.1:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 0ms, Maximum = 0ms, Average = 0ms
C:>
C:>ping cucmpub.mydomain.com

Pinging cucmpub.mydomain.com [172.16.100.1] with 32 bytes of data:
Reply from 172.16.100.1: bytes=32 time<1ms TTL=63
Reply from 172.16.100.1: bytes=32 time<1ms TTL=63
Reply from 172.16.100.1: bytes=32 time<1ms TTL=63
Reply from 172.16.100.1: bytes=32 time<1ms TTL=63

Ping statistics for 172.16.100.1:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 0ms, Maximum = 0ms, Average = 0ms

C:>

Example 5-3 shows sub-1 ms response time. It also shows that, from the workstation perspective, the publisher is reachable both by IP address and DNS name. But, again, the fallacy in the output is that it is from the workstation perspective and not reflective of the traffic flow between the nodes. This may seem elementary, but it has come up so many times over the years as a valid, relevant issue in troubleshooting. The thought process is correct; the commands just need to be entered from the server(s) in question. So, SSH to or open the VMware console of the publisher and subscriber(s), and use the utils network ping and utils network traceroute commands.

Example 5-4 shows a ping from the Publisher to a Subscriber node using the Publisher CLI using the utils network ping command.

Example 5-4. utils network ping Command Output


admin:utils network ping 172.16.100.2
PING 172.16.100.2 (172.16.100.2) 56(84) bytes of data.
64 bytes from 172.16.100.2: icmp_seq=1 ttl=64 time=0.175 ms
64 bytes from 172.16.100.2: icmp_seq=2 ttl=64 time=0.187 ms
64 bytes from 172.16.100.2: icmp_seq=3 ttl=64 time=0.205 ms
64 bytes from 172.16.100.2: icmp_seq=4 ttl=64 time=0.126 ms

--- 172.16.100.2 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3001ms
rtt min/avg/max/mdev = 0.126/0.173/0.205/0.030 ms


admin:utils network ping cucmsub.mydomain.com
PING cucmsub.mydomain.com (172.16.100.2) 56(84) bytes of data.
64 bytes from cucmsub.mydomain.com (172.16.100.2): icmp_seq=1 ttl=64 time=0.081 ms
64 bytes from cucmsub.mydomain.com (172.16.100.2): icmp_seq=2 ttl=64 time=0.055 ms
64 bytes from cucmsub.mydomain.com (172.16.100.2): icmp_seq=3 ttl=64 time=0.144 ms
64 bytes from cucmsub.mydomain.com (172.16.100.2): icmp_seq=4 ttl=64 time=0.137 ms

--- cucmsub.mydomain.com ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 2999ms
rtt min/avg/max/mdev = 0.055/0.104/0.144/0.038 ms

admin:

In Example 5-4, note that pings were issued both by IP address and by DNS name. This verifies that DNS is functioning on the Publisher node. Also note the time shown by each ping response: rather than merely stating <1 ms, the result is specific to 1/1000 ms. That is much more relevant to your needs in terms of the true timing involved.

The specificity and usefulness of the utils network traceroute command are similar to ping. Example 5-5 shows a basic use of the command on the Publisher node.

Example 5-5. utils network traceroute Command Output


admin:utils network traceroute 172.16.100.8
traceroute to 172.16.100.8 (172.16.100.8), 30 hops max, 60 byte packets
 1  cucmsub2.mydomain.com (172.16.100.8)  0.310 ms  0.266 ms  0.335 ms

admin:

Notice, in the example, that the issuance of the command by IP address also resulted in its being resolved in DNS. Like the ping command, the timing is represented to 1/1000 ms.

In the case of a complete loss of connectivity, the pings will come back with less than optimal results as shown in Example 5-6.

Example 5-6. utils network ping Failure


admin:utils network ping cucmsub2
PING cucmsub2.mydomain.com (172.16.100.8) 56(84) bytes of data.
From cucmpub.mydomain.com (172.16.100.1) icmp_seq=1 Destination Host Unreachable

--- cucmsub2.mydomain.com ping statistics ---
4 packets transmitted, 0 received, +1 errors, 100% packet loss, time 13001ms


Error running command:

Executed command unsuccessfully

admin:

Checking the replication status using the utils dbreplication runtimestate command shows a dreary picture of events as shown in Example 5-7.

Example 5-7. utils dbreplication runtimestate Command Output


admin:utils dbreplication runtimestate

Server Time: Sun Jan 10 14:49:51 CST 2016

Cluster Replication State: Replication status command started at: 2016-01-10-14-46
     Replication status command ENDED. Checked 692 tables out of 692
     Last Completed Table: devicenumplanmapremdestmap
     No Errors or Mismatches found.

     Use 'file view activelog cm/trace/dbl/sdi/ReplicationStatus.2016_01_10_14_46_07.out' to see the details


DB Version: ccm11_0_1_20000_2
Repltimeout set to: 300s
PROCESS option set to: 1

Cluster Detailed View from cucmpub (3 Servers):

                                      PING      DB/RPC/   REPL.    Replication    REPLICATION SETUP
SERVER-NAME         IP ADDRESS        (msec)    DbMon?    QUEUE    Group ID       (RTMT) & Details
-----------         ----------        ------    -------   -----    -----------    ------------------
cucmsub             172.16.100.2      0.087     Y/Y/Y     0        (g_4)          (2) Setup Completed
cucmpub             172.16.100.1      0.018     Y/Y/Y     0        (g_2)          (2) Setup Completed
cucmsub2            172.16.100.8      N/A       -/N/-     592      (g_5)          (-) DB Active-Dropped




admin:

In Example 5-7, the highlighted line shows the loss of the subscriber. This seems to be a total loss of either reachability between the nodes or the node itself. The ping column shows N/A, which means it is unreachable. So, the Replication Setup (RTMT) & Details column shows no status and DB Active-Dropped.

Network Bandwidth

So far, two types of traffic have been discussed regarding replication: database replication and ICCS. A minimum of 1.544 Mbps is required between sites for ICCS traffic flow, and an additional minimum of 1.544 Mbps is required for other interserver traffic, including database replication. Depending on the deployment model in use for the collaboration system architecture, more bandwidth may be required (for example, the Remote Failover model). These minimums cover only replication and ICCS among CUCM nodes; they do not include other services or applications. This is strictly the high-priority traffic for call control, not all the intercluster traffic. These bandwidth guidelines apply to clusters supporting up to 10,000 busy hour call attempts (BHCA). If more than 10,000 BHCA is required, the formula for bandwidth calculation is

Total Bandwidth (Mbps) = (Total BHCA/10,000) ∗ (1 + 0.006 ∗ Delay), where Delay = RTT delay in ms
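As a worked example of the formula above (the traffic figures are illustrative, not from any real deployment): a cluster handling 40,000 BHCA at the maximum allowed 80 ms roundtrip delay needs roughly 5.92 Mbps of intracluster bandwidth.

```python
def intracluster_bandwidth_mbps(total_bhca: float, rtt_delay_ms: float) -> float:
    """Bandwidth formula from the text, for clusters above 10,000 BHCA:
    (Total BHCA / 10,000) * (1 + 0.006 * Delay), Delay = RTT in ms."""
    return (total_bhca / 10_000) * (1 + 0.006 * rtt_delay_ms)

# 40,000 BHCA at the 80 ms roundtrip maximum discussed in this chapter:
bw = intracluster_bandwidth_mbps(40_000, 80)
print(round(bw, 2))  # 5.92
```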

Intracluster traffic flows consist of multiple traffic types. These traffic types vary in how they’re classified by the system, either priority or best effort. Priority ICCS traffic is marked with IP Precedence 3 (DSCP 24/PHB CS3). Best effort is marked as IP Precedence 0 (DSCP 0/PHB BE). The traffic types are as follows:

Database Traffic, which provides configuration information: This type is best effort but may be reclassified as IP Precedence 1 if needed (for example, extensive use of extension mobility).

Firewall Management Traffic: This type authenticates subscribers to the publisher to gain access to the database. It is best effort but may be reclassified as IP Precedence 1 if required.

ICCS Real-Time Traffic: This type addresses signaling, call admission control (CAC), and other call signaling. ICCS maintains a TCP connection among all nodes running the Cisco CallManager service. This is marked as priority traffic.

CTI Manager Real-Time Traffic: This type is used for CTI devices involved in calls or for controlling/monitoring other devices on the CUCM servers. This is marked as priority traffic.

CTI ICCS traffic requirements for CTI Manager over-the-WAN deployments have not been included in the count so far. They are calculated as follows:

CTI ICCS bandwidth (Mbps) = (Total BHCA/10,000) ∗ 0.53

For those deployments where J/TAPI applications are remote in relation to the CUCM subscriber providing call control, additional math is required for calculating the Quick Buffer Encoding (QBE) J/TAPI bandwidth:

J/TAPI bandwidth (Mbps) = (Total BHCA/10,000) ∗ 0.28
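Applying the two formulas above to the same illustrative 40,000 BHCA load (the function names are for demonstration only):

```python
def cti_iccs_bandwidth_mbps(total_bhca: float) -> float:
    """CTI ICCS over-the-WAN requirement: (Total BHCA / 10,000) * 0.53."""
    return (total_bhca / 10_000) * 0.53

def jtapi_bandwidth_mbps(total_bhca: float) -> float:
    """Remote J/TAPI (QBE) requirement: (Total BHCA / 10,000) * 0.28."""
    return (total_bhca / 10_000) * 0.28

# An illustrative 40,000 BHCA deployment:
print(round(cti_iccs_bandwidth_mbps(40_000), 2))  # 2.12
print(round(jtapi_bandwidth_mbps(40_000), 2))     # 1.12
```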

As mentioned, there is a requirement to maintain a maximum roundtrip time of 80 ms between any two given CUCM nodes in the cluster regardless of their function within the cluster.

Many factors impact bandwidth, including shared line appearances across a WAN, CTI, and J/TAPI. Yet bandwidth is not the only concern, and a big pipe is not a reason to skip QoS configuration. QoS is an end-to-end prioritization mechanism for mission-critical traffic types, covering router/switch ingress interfaces, internal processing, egress interfaces, and more; it is not merely a tool for stretching scarce bandwidth. Bandwidth is crucial, but so is the means by which traffic is prioritized through each device in the path, not just across the link between devices.

DNS

Domain Name System (DNS) has transitioned from optional to required over the past few years, as the Cisco Collaboration System has matured. Numerous services rely on the capability to quickly resolve A Records, SRV Records, and more. The need for highly available DNS servers is crucial to the health of IDS and ICCS. IDS uses DNS extensively for replication. Misconfigured, unreachable DNS causes issues for database replication.

One preferred manner of testing DNS is through the utils diagnose test command from the CLI. DNS is just one of the aspects of the server health it tests. Example 5-8 shows the output of the utils diagnose test command.

Example 5-8. utils diagnose test Command Output


admin:utils diagnose test

Log file: platform/log/diag3.log

Starting diagnostic test(s)
===========================
test - disk_space          : Passed (available: 1533 MB, used: 12494 MB)
skip - disk_files          : This module must be run directly and off hours
test - service_manager     : Passed
test - tomcat              : Passed
test - tomcat_deadlocks    : Passed
test - tomcat_keystore     : Passed
test - tomcat_connectors   : Passed
test - tomcat_threads      : Passed
test - tomcat_memory       : Passed
test - tomcat_sessions     : Passed
skip - tomcat_heapdump     : This module must be run directly and off hours
test - validate_network    : Passed
test - raid                : Passed
test - system_info         : Passed (Collected system information in diagnostic log)
test - ntp_reachability    : Warning
The host 204.235.61.9 is not reachable, or its NTP service is down.
The host 173.49.198.27 is not reachable, or its NTP service is down.

Some of the configured external NTP servers are not reachable.
It is recommended that for better time synchronization all of
the NTP servers be reachable.

Please use the OS Admin GUI to add/remove NTP servers.

test - ntp_clock_drift     : Passed
test - ntp_stratum         : Failed
The reference NTP server is a stratum 5 clock.
NTP servers with stratum 5 or worse clocks are deemed unreliable.
Please consider using an NTP server with better stratum level.

Please use OS Admin GUI to add/delete NTP servers.
skip - sdl_fragmentation   : This module must be run directly and off hours
skip - sdi_fragmentation   : This module must be run directly and off hours

Diagnostics Completed

 The final output will be in Log file: platform/log/diag3.log

 Please use 'file view activelog platform/log/diag3.log' command to see the output

In Example 5-8, the test runs checks on disk space, disk files, Tomcat services, network connectivity, the RAID system, and NTP. The DNS check falls under the validate_network test. Any DNS errors are reported there. Also note that one or more NTP servers are unreachable (see the section “NTP,” later in this chapter).

Another means of verifying DNS reachability among the nodes is the utils network host command. It can confirm both forward and reverse lookup as shown in Example 5-9.

Example 5-9. utils network host Command Output


admin:utils network host 172.16.100.2
Local Resolution:
172.16.100.2 resolves locally to cucmsub.mydomain.com

External Resolution:
2.100.16.172.in-addr.arpa domain name pointer cucmsub.mydomain.com.
admin:
admin:utils network host cucmsub
Local Resolution:
cucmsub.mydomain.com resolves locally to 172.16.100.2

External Resolution:
cucmsub.mydomain.com has address 172.16.100.2
admin:

Rerun the command to test resolution for all nodes in the cluster.
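The forward/reverse consistency that utils network host verifies can also be checked off-box. The sketch below uses stub resolvers loaded with the sample values from Example 5-9 so it does not depend on a live DNS server; the helper name is illustrative, and against real DNS you would pass the socket defaults instead:

```python
import socket

def check_dns_consistency(hostname,
                          forward=socket.gethostbyname,
                          reverse=lambda ip: socket.gethostbyaddr(ip)[0]):
    """Roughly what 'utils network host' verifies: the name resolves to
    an address, and that address resolves back to the same name."""
    ip = forward(hostname)
    return reverse(ip).rstrip(".").lower() == hostname.lower()

# Stub resolvers stand in for a working DNS server (values from Example 5-9):
fwd = {"cucmsub.mydomain.com": "172.16.100.2"}.get
rev = {"172.16.100.2": "cucmsub.mydomain.com"}.get
print(check_dns_consistency("cucmsub.mydomain.com", fwd, rev))  # True
```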

Excessive CPU Utilization on Peer Nodes

Verifying CPU utilization on various nodes is best done via the RTMT, where the nodes can all be seen at the same time. This process is covered in detail in the “CPU and Memory” section of Chapter 3, “Using Troubleshooting and Monitoring Tools.” It is only briefly reviewed here. In RTMT, the CPU and Memory screen is accessed in the System section as shown in Figure 5-5.

Figure 5-5. RTMT CPU and Memory Screen

Image

In Figure 5-5, the memory and CPU usage of all nodes is shown in line graph form. Each node is represented by a different color line on the graph. You can see that a couple of the nodes have CPU spikes. In particular, one node seems to be relatively busy with some process because it momentarily spiked to near 100 percent. Because the spike wasn't sustained for an extended period, it is unlikely to have interfered with replication.

NTP

The cluster relies on time stamps to ensure that all relevant information has been processed efficiently and in the order in which it was received. The most up-to-date information will have the newest time stamps. In a real-time replication scenario, such as cluster architecture, timing is everything. The synchronization of the clocks of all nodes is critical to proper functionality. The Network Time Protocol provides that synchronization service.

The NTP Watchdog in CUCM polls the configured NTP server(s) once per minute on VMware and every 30 minutes on physical machines, because the clock on virtual servers is less reliable than on physical servers. The NTP Watchdog forces a restart of the NTP service if the time is offset by more than 3 seconds; the NTP daemon keeps time corrected on a millisecond scale, but the service must be restarted to absorb such a large correction.
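The watchdog's decision logic described above can be sketched as follows (the threshold comes from the text; the function name and return strings are purely illustrative):

```python
# ntpd slews millisecond-scale offsets on its own; an offset beyond
# 3 seconds forces a restart of the NTP service, per the text above.
NTPD_RESTART_THRESHOLD_SEC = 3.0

def watchdog_action(offset_seconds: float) -> str:
    """Decide what the NTP Watchdog would do for a given clock offset."""
    if abs(offset_seconds) > NTPD_RESTART_THRESHOLD_SEC:
        return "restart ntpd"
    return "let ntpd slew the clock"

print(watchdog_action(0.006))  # let ntpd slew the clock
print(watchdog_action(-4.2))   # restart ntpd
```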

The publisher, especially because it’s now running on a virtual machine, should always be configured to pull time from a physical server, such as a Cisco IOS device or a Linux server. Subscriber nodes always pull time from the publisher.

You can verify the NTP service by using the utils diagnose test command, as shown previously in Example 5-8. Another command used to verify NTP is utils ntp status, as shown in Example 5-10.

Example 5-10. utils ntp status Command Output


admin:utils ntp status
ntpd (pid 9880) is running...

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*172.16.1.1      67.198.37.16     4 u  727 1024  377    1.400   -6.209   4.256
 204.235.61.9    .INIT.          16 u    - 1024    0    0.000    0.000   0.000
 173.49.198.27   .INIT.          16 u    - 1024    0    0.000    0.000   0.000


synchronised to NTP server (172.16.1.1) at stratum 5
   time correct to within 141 ms
   polling server every 1024 s

Current time in UTC is : Sun Jan 10 22:37:47 UTC 2016
Current time in America/Chicago is : Sun Jan 10 16:37:47 CST 2016
admin:

In Example 5-10, the internal NTP server is reachable, and time is in sync. The two external NTP servers are still in the .INIT. state, indicating that contact has not been established, likely due to the use of NTPv3 on the external NTP servers rather than NTPv4. Also of particular interest in the command output is the clock stratum. The time is synchronized to a stratum 5 clock. Cisco recommends that the publisher sync to a clock with stratum of 3 or less for optimal performance.
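When screening many nodes, the stratum can be pulled out of the utils ntp status output and compared against the stratum-3 guidance above. A minimal sketch using the synchronization line from Example 5-10 (the parsing is illustrative and assumes that line's wording):

```python
import re

# Synchronization line as it appears in 'utils ntp status' (Example 5-10).
ntp_status = "synchronised to NTP server (172.16.1.1) at stratum 5"

m = re.search(r"at stratum (\d+)", ntp_status)
stratum = int(m.group(1)) if m else None

print(stratum)                               # 5
print(stratum is not None and stratum <= 3)  # False -> use a better source
```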

To monitor the NTP conversations on the node, issue the utils network capture port 123 command. This begins a packet capture for NTP. Example 5-11 shows the output of this command.

Example 5-11. NTP Packet Capture on CUCM


admin:utils network capture port 123
Executing command with options:
 size=128                count=1000              interface=eth0
 src=                    dest=                   port=123
 ip=
16:41:01.594373 IP cucmsub2.mydomain.com.48284 > cucmpub.mydomain.com.ntp: NTPv4,
Client, length 48
16:41:01.596012 IP cucmpub.mydomain.com.ntp > cucmsub2.mydomain.com.48284: NTPv4, Server, length 48
16:41:01.596284 IP cucmsub2.mydomain.com.48284 > cucmpub.mydomain.com.ntp: NTPv4, Client, length 48
16:41:01.596733 IP cucmpub.mydomain.com.ntp > cucmsub2.mydomain.com.48284: NTPv4, Server, length 48
16:41:01.596931 IP cucmsub2.mydomain.com.48284 > cucmpub.mydomain.com.ntp: NTPv4, Client, length 48
16:41:01.597371 IP cucmpub.mydomain.com.ntp > cucmsub2.mydomain.com.48284: NTPv4, Server, length 48
16:41:01.597568 IP cucmsub2.mydomain.com.48284 > cucmpub.mydomain.com.ntp: NTPv4, Client, length 48
16:41:01.597623 IP cucmpub.mydomain.com.ntp > cucmsub2.mydomain.com.48284: NTPv4, Server, length 48
16:41:01.625517 IP cucmsub.mydomain.com.34728 > cucmpub.mydomain.com.ntp: NTPv4, Client, length 48
16:41:01.625929 IP cucmpub.mydomain.com.ntp > cucmsub.mydomain.com.34728: NTPv4, Server, length 48

In Example 5-11, NTPv4 traffic can be seen flowing among the cluster nodes, displayed using DNS names. NTPv4 is required for CUCM 9.x and higher.

Other relevant commands specific to troubleshooting NTP on CUCM include

utils diagnose module ntp_reachability

utils diagnose module ntp_clock_drift

utils diagnose module ntp_stratum

The output of these commands is included in the utils diagnose test command output; they exist simply to let you run a specific check in isolation if desired.

In troubleshooting NTP issues, take into account the following:

• Ensure that the NTP server is reachable.

• Ensure that the stratum of the NTP server is acceptable (1–3).

• Subscribers out of sync may indicate a publisher reachability issue.

If necessary, issue the utils ntp restart command to reinitialize the NTP service on a given node. This step is necessary whenever a large correction (3 seconds or more) is required.

Resolving CUCM Replication Issues

Replication is performed by a specific set of scripts running on the CUCM nodes. The process is performed by the Cisco database replicator (CDR). The overall method is rather straightforward. It includes a number of predictable steps that are performed at installation time for the node in question:

Step 1. Define the publisher and set it up to begin replication (RTMT=0).

Step 2. Define a template on the publisher and “realize” it to tell it which tables to replicate (RTMT=2).

Step 3. Define the subscriber (RTMT=0).

Step 4. Realize the template on each Subscriber node to tell it the tables for which it will get/send data.

Step 5. Synchronize data using CDR Check or CDR Sync. After the CDR Check passes, RTMT=2.

Step 6. Repeat steps 3–5 for each subscriber.
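
The six steps above can be sketched as a tiny state tracker that records the RTMT state each step leaves behind. This is an illustrative model of the sequence as described, not actual CUCM setup code; node names and the function are hypothetical.

```python
# Hypothetical model of the installation-time replication setup sequence
# (Steps 1-6 above), tracking the RTMT state for each node.

def setup_cluster(subscribers):
    states = {}
    states["publisher"] = 0       # Step 1: publisher defined (RTMT=0)
    states["publisher"] = 2       # Step 2: template defined and realized (RTMT=2)
    for sub in subscribers:       # Step 6: repeat Steps 3-5 per subscriber
        states[sub] = 0           # Step 3: subscriber defined (RTMT=0)
        # Step 4: template realized on the subscriber (state unchanged)
        states[sub] = 2           # Step 5: CDR Check passes (RTMT=2)
    return states

print(setup_cluster(["cucmsub", "cucmsub2"]))
```

When everything succeeds, every node ends at RTMT state 2, which is exactly what a healthy utils dbreplication runtimestate display shows.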

Understanding this basic process is exceedingly helpful in troubleshooting. It is crucial to recognize key issues that may be preventing the servers from replicating. Check these issues before running any of the commands; they include

• Verify server/cluster connectivity for the necessary TCP/UDP port ranges in use for intracluster communications. You can find a complete list of these ports here:

http://www.cisco.com/c/en/us/td/docs/voice_ip_comm/cucm/admin/11_0_1/sysConfig/CUCM_BK_C733E983_00_cucm-system-configuration-guide/CUCM_BK_C733E983_00_cucm-system-configuration-guide-transformed_chapter_0111101.html#CUCM_TP_C22BA64A_00

• Check the hosts file(s) that will be in use when replication is initialized. These files are as follows:

/etc/hosts: Local file used to map IP addresses to hostnames

/home/informix/.rhosts: List of trusted hostnames to be used in setting up database connections

$INFORMIXDIR/etc/sqlhosts: Full list of CUCM nodes for replication

• Verify proper DNS functionality throughout the cluster

All this information is available in the Cisco Unified Reporting Tool’s Unified CM Database Replication Status report.
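
Because the hosts files listed above must agree on every node, a quick cross-check of the hostname-to-IP mappings each node holds can expose the classic cause of replication setup failure. The helper below is a hypothetical sketch (the function name is invented; the hostnames and addresses are the lab cluster's from this chapter's examples).

```python
# Hypothetical cross-check of hostname-to-IP mappings gathered from each
# node's hosts files. Any hostname that resolves differently on two nodes
# (or is missing on one) is reported as a mismatch.

def find_host_mismatches(per_node_hosts):
    """per_node_hosts: {node: {hostname: ip}} -> list of mismatch strings."""
    mismatches = []
    all_names = set()
    for mapping in per_node_hosts.values():
        all_names.update(mapping)
    for name in sorted(all_names):
        ips = {node: m.get(name) for node, m in per_node_hosts.items()}
        if len(set(ips.values())) > 1:   # disagreement (or missing entry)
            mismatches.append(f"{name}: {ips}")
    return mismatches

cluster = {
    "cucmpub": {"cucmpub": "172.16.100.1", "cucmsub": "172.16.100.2"},
    "cucmsub": {"cucmpub": "172.16.100.1", "cucmsub": "172.16.100.2"},
}
print(find_host_mismatches(cluster))  # empty list when every node agrees
```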

Resolving replication is not something that you should attempt solo the first few times. Work through the issues of troubleshooting replication with the Technical Assistance Center (TAC) unless you’re working on a lab system and there is no potential for service impact to end users.

Repairing Replication

As mentioned previously, you should check the status of replication throughout the cluster by using the utils dbreplication runtimestate command. If you have sufficient reason to believe that replication is experiencing issues (that is, nodes not at the 2 state), the utils dbreplication repair command may be warranted. However, the use of the command varies based on the size and state of the cluster. The command is typically issued only from the publisher.

For a cluster with 5000 phones or fewer, the utils dbreplication repair all command is safe to use. On larger clusters, or on clusters with only one node misbehaving, use the utils dbreplication repair [nodename] command. Be patient. Depending on the size of the database, this command can take hours (sometimes up to a full day) to complete fully. Monitor the status of the command using the utils dbreplication runtimestate command. Example 5-12 shows the use of the utils dbreplication repair all command and a brief view of the output file it creates.
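
That decision (repair clusterwide on small clusters, target a single node otherwise) can be captured in a small helper. This is a hypothetical sketch of the guidance above; the function name is invented, and it merely returns the CLI command string you would run on the publisher.

```python
# Hypothetical helper reflecting the guidance above: repair clusterwide on
# clusters of 5000 phones or fewer; otherwise, repair the misbehaving node.

def repair_command(phone_count, bad_node=None):
    """Return the CUCM CLI repair command appropriate for this cluster."""
    if bad_node is not None:
        return f"utils dbreplication repair {bad_node}"
    if phone_count <= 5000:
        return "utils dbreplication repair all"
    raise ValueError("On large clusters, repair one node at a time")

print(repair_command(1200))                        # small cluster
print(repair_command(20000, bad_node="cucmsub2"))  # large cluster, one node
```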

Example 5-12. utils dbreplication repair all Command and File Output


admin:utils dbreplication repair all
  -------------------- utils dbreplication repair --------------------
 chmod: changing permissions of `/var/log/active/cm/trace/dbl/sdi/replication_scripts_output.log': Operation not permitted

 Replication Repair is now running in the background.
 Use command ‘utils dbreplication runtimestate' to check its progress

 Output will be in file cm/trace/dbl/sdi/ReplicationRepair.2016_01_11_22_35_02.out

Please use "file view activelog cm/trace/dbl/sdi/ReplicationRepair.2016_01_11_22_35_02.out " command to see the output

admin:file view activelog cm/trace/dbl/sdi/ReplicationRepair.2016_01_11_22_35_02.out


 utils dbreplication repair output

 To determine if replication is suspect, look for the following:
         (1) Number of rows in a table do not match on all nodes.
         (2) Non-zero values occur in any of the other output columns for a table

 SERVER                 ID STATE    STATUS     QUEUE  CONNECTION CHANGED
 -----------------------------------------------------------------------
 g_2_ccm11_0_1_20000_2    2 Active   Local           0
 g_4_ccm11_0_1_20000_2    4 Active   Connected       0 Jan  1 16:01:48
 g_5_ccm11_0_1_20000_2    5 Active   Connected       0 Jan 10 15:04:34
 Mon Jan 11 22:35:12 2016 dbllib.getReplServerName  DEBUG:  -->
 Mon Jan 11 22:35:24 2016 dbllib.getReplServerName  DEBUG:  replservername: g_2_ccm11_0_1_20000_2
 Mon Jan 11 22:35:24 2016 dbllib.getReplServerName  DEBUG:  <--

 -------------------------------------------------


 No Errors or Mismatches found.

 options: q=quit, n=next, p=prev, b=begin, e=end (lines 1 - 20 of 8325) :
 Replication status is good on all available servers.

 Jan 11 2016 22:36:03 ------   Table scan for ccmdbtemplate_g_2_ccm11_0_1_20000_2_1_141_typedberrors start  --------

 Node                  Rows     Extra   Missing  Mismatch Processed
 ---------------- --------- --------- --------- --------- ---------
 g_2_ccm11_0_1_20000_2      1602         0         0         0         0
 g_4_ccm11_0_1_20000_2      1602         0         0         0         0
 g_5_ccm11_0_1_20000_2      1602         0         0         0         0


 Jan 11 2016 22:36:04 ------   Table scan for ccmdbtemplate_g_2_ccm11_0_1_20000_2_1_141_typedberrors end   ---------


 Jan 11 2016 22:36:04 ------   Table scan for ccmdbtemplate_g_2_ccm11_0_1_20000_2_1_342_typeroutingdatabasecachetimer start  --------

 Node                  Rows     Extra   Missing  Mismatch Processed
 ---------------- --------- --------- --------- --------- ---------
 g_2_ccm11_0_1_20000_2        97         0         0         0         0
 g_4_ccm11_0_1_20000_2        97         0         0         0         0

 options: q=quit, n=next, p=prev, b=begin, e=end (lines 21 - 40 of 8325) :

Obviously, there is a great deal of output in this file. Example 5-12 shows only the first 40 of 8325 lines. The key aspects are highlighted. In this case, there are no mismatches. That information is shown early in the file. If there were mismatches, they would be detailed throughout the output file.

Resetting Replication

Replication can be reset clusterwide or per node. The more specific options are typically preferred as first resorts, with the nuclear option as a last resort. Resetting replication begins with stopping replication and then resetting it. Replication can be stopped on any or all subscribers as well as on the publisher, either node-by-node or all at once. If it is stopped node-by-node, stop it on the publisher only after it is completely stopped on all subscribers. That is, the utils dbreplication stop command must have been issued on each subscriber, followed by waiting out the repltimeout, which is 300 seconds (5 minutes) by default. That timer starts after the stop is issued on the last subscriber. In CUCM 7.x, the utils dbreplication stop all command was added; it takes care of all these issues in one command issued from the publisher.


Note

The utils dbreplication stop all and utils dbreplication reset all commands replace the utils dbreplication clusterreset command, which was deprecated in CUCM 9.0(1). Although it is still in the list, the utils dbreplication clusterreset command is nonfunctional and simply results in an error message.


Keep in mind that it will still wait out the value of the repltimeout before returning the prompt to you on the screen. You can see the value of the timer by entering the show tech repltimeout command or set it to a different value by using the utils dbreplication setrepltimeout [time in seconds] command.
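
The node-by-node timing rule can be illustrated with a trivial calculation: the publisher should not be stopped until repltimeout seconds have elapsed after the stop on the last subscriber. The sketch below is hypothetical (the function name and epoch timestamps are invented for the example); the 300-second default comes from the text.

```python
# Hypothetical illustration of the repltimeout rule: when stopping
# replication node-by-node, the publisher may be stopped only after the
# repltimeout has elapsed following the stop on the LAST subscriber.

DEFAULT_REPLTIMEOUT = 300  # seconds (5 minutes), the CUCM default per the text

def earliest_publisher_stop(sub_stop_times, repltimeout=DEFAULT_REPLTIMEOUT):
    """Given per-subscriber stop times (epoch seconds), return the earliest
    time it is safe to issue utils dbreplication stop on the publisher."""
    return max(sub_stop_times) + repltimeout

stops = {"cucmsub": 1000, "cucmsub2": 1120}      # cucmsub2 was stopped last
print(earliest_publisher_stop(stops.values()))   # 1120 + 300 = 1420
```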


Warning

There is no “Are you sure?” prompt when using the utils dbreplication stop all command! It is immediately executed. Make sure you really want to do it before you press Enter after typing the command!


Example 5-13 shows the output of the utils dbreplication stop all command.

Example 5-13. utils dbreplication stop all Command Output


admin:utils dbreplication stop all


********************************************************************************************
 This command will delete the marker file(s) so that automatic replication setup is stopped
 It will also stop any replication setup currently executing

********************************************************************************************

 Deleted the marker file, auto replication setup is stopped

 Service Manager is running
 A Cisco DB Replicator[STOPPING]
 A Cisco DB Replicator[STOPPING]
 Commanded Out of Service
 A Cisco DB Replicator[NOTRUNNING]
 Service Manager is running
 A Cisco DB Replicator[STARTING]
 A Cisco DB Replicator[STARTING]
 A Cisco DB Replicator[STARTED]
 Will stop PUB and SUBs: all
 Stopping Sub: 172.16.100.2
          Stop replication on sub Completed
 Stopping Sub: 172.16.100.8
          Stop replication on sub Completed
 Killed dblrpc process - 26445

 Completed replication process cleanup

 Please run the command ‘utils dbreplication runtimestate' and make sure all nodes are
 RPC reachable before a replication reset is executed
 admin:

In Example 5-13, the A Cisco DB Replicator service can be seen stopping on both subscribers; the dblrpc process is then killed on the publisher. The output also notes that the utils dbreplication runtimestate command should be used to monitor progress. Example 5-14 shows the output of utils dbreplication runtimestate after the stop command was issued.

Example 5-14. Stopping Replication


admin:utils dbreplication runtimestate

 Server Time: Mon Jan 11 23:24:19 CST 2016

 Cluster Replication State: Replication status command started at: 2016-01-11-22-49
      Replication status command ENDED. Checked 692 tables out of 692
      Last Completed Table: devicenumplanmapremdestmap
      No Errors or Mismatches found.

      Use ‘file view activelog cm/trace/dbl/sdi/ReplicationStatus.2016_01_11_22_49_57.out' to see the details


 DB Version: ccm11_0_1_20000_2
 Repltimeout set to: 300s
 PROCESS option set to: 1

 Cluster Detailed View from cucmpub (3 Servers):

                                       PING      DB/RPC/   REPL.    Replication    REPLICATION SETUP
 SERVER-NAME         IP ADDRESS        (msec)    DbMon?    QUEUE    Group ID       (RTMT) & Details
 -----------         ----------        ------    -------   -----    -----------    ------------------
 cucmsub2            172.16.100.8      0.369     Y/Y/Y     0        (g_5)          (2) Setup Completed
 cucmsub             172.16.100.2      0.086     Y/Y/Y     0        (g_4)          (2) Setup Completed
 cucmpub             172.16.100.1      0.019     Y/Y/Y     0        (g_2)          (2) Setup Completed


 admin:

In Example 5-14, note that reachability between the nodes is still fine and that the database tables all match. However, the highlighted portion shows that replication has ended. Also note that each node has an associated Replication Group ID (2, 4, 5); this information is useful in tracking the status of the reset process.
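
When tracking a cluster through a stop/reset cycle, it can help to pull each node's group ID and RTMT state out of the "Cluster Detailed View" table programmatically. The parser below is a hypothetical sketch written against the sample output in Example 5-14; the regular expression targets that layout, not a documented format, and rows still showing "(-)" during a reset are simply skipped.

```python
# Hypothetical parser for the Cluster Detailed View table printed by
# utils dbreplication runtimestate (layout as seen in Example 5-14).
import re

ROW = re.compile(r"^\s*(\S+)\s+(\d+\.\d+\.\d+\.\d+).*\(g_(\d+)\)\s+\((\d+)\)")

def parse_runtimestate(text):
    """Return {node: {"group": id, "rtmt": state}} for rows with known state."""
    nodes = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:  # rows showing "(-)" while a reset is pending will not match
            name, _ip, group, state = m.groups()
            nodes[name] = {"group": int(group), "rtmt": int(state)}
    return nodes

sample = """\
 cucmsub2            172.16.100.8      0.369     Y/Y/Y     0        (g_5)          (2) Setup Completed
 cucmpub             172.16.100.1      0.019     Y/Y/Y     0        (g_2)          (2) Setup Completed
"""
print(parse_runtimestate(sample))
```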

After the replication process is stopped on all nodes, you can reset it by using the utils dbreplication reset all command. As tiring as it may be to see the same advice—“Be patient!”—throughout the chapter, patience is necessary. This is not a fast process and will take an hour or more, depending on the size of the cluster and number of nodes involved. Monitor it using the utils dbreplication runtimestate command. Example 5-15 shows the output of the utils dbreplication reset all command.

Example 5-15. utils dbreplication reset all Command Output


admin:utils dbreplication reset all

 This command will try to start Replication reset and will return in 1-2 minutes.
 Background repair of replication will continue after that for 1 hour.
 Please watch RTMT replication state. It should go from 0 to 2. When all subs
 have an RTMT Replicate State of 2, replication is complete.
 If Sub replication state becomes 4 or 1, there is an error in replication setup.
 Monitor the RTMT counters on all subs to determine when replication is complete.
 Error details if found will be listed below

 OK [172.16.100.8]
 OK [172.16.100.2]

 Reset command completed successfully on:
 --> cucmsub
 --> cucmsub2

 Reset completed: 2     Failed: 0
 Duration: 6.32 minute
 Use CLI to see detail: ‘file view activelog cm/trace/dbl/sdi/dbl_repl_output_util.log'
 admin:

Viewing the file output created by the command shows a verification of reachability, deletion of the replicate, and redefinition of the replicate. The output mentions that all subscribers should go to State 0 and then to State 2. If one or more nodes remain at State 0 for more than 4 hours, reinitiate a stop and reset on the problematic node(s); the same applies if a node shows State 4. Each subscriber node is identified by its Replication Group ID, as shown in the output of the utils dbreplication runtimestate command in Example 5-14. Monitoring the reset progress with the utils dbreplication runtimestate command enables you to see the progression of each node through the various states. Example 5-16 shows the utils dbreplication runtimestate command output during the reset process.
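
The monitoring rule just described (stuck at State 0 beyond 4 hours, or any appearance of State 4, means another stop/reset on that node) reduces to a simple decision function. This is a hypothetical illustration; the function name is invented, and the thresholds come from the text.

```python
# Hypothetical decision helper for monitoring a replication reset: a node
# stuck at RTMT state 0 for more than 4 hours, or sitting at state 4,
# warrants another stop/reset on that node.

FOUR_HOURS = 4 * 3600  # seconds

def needs_reinit(rtmt_state, seconds_in_state):
    """True if the node should get another stop/reset cycle."""
    if rtmt_state == 4:                 # setup failure: reinit immediately
        return True
    if rtmt_state == 0 and seconds_in_state > FOUR_HOURS:
        return True                     # stuck defining/syncing too long
    return False

print(needs_reinit(0, 600))        # recently entered state 0: keep waiting
print(needs_reinit(0, 5 * 3600))   # stuck past 4 hours: reinit
print(needs_reinit(4, 60))         # state 4: reinit
```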

Example 5-16. Monitoring Reset Progression


admin:utils dbreplication runtimestate

 Server Time: Mon Jan 11 23:40:52 CST 2016

 Cluster Replication State: PUB SETUP Started at 2016-01-11-23-34
      Setup Progress: 1 node(s) added to the replication network
      Setup Errors: No errors


 DB Version: ccm11_0_1_20000_2
 Repltimeout set to: 300s
 PROCESS option set to: 1

 Cluster Detailed View from cucmpub (3 Servers):

                                       PING      DB/RPC/   REPL.    Replication    REPLICATION SETUP
 SERVER-NAME         IP ADDRESS        (msec)    DbMon?    QUEUE    Group ID       (RTMT) & Details
 -----------         ----------        ------    -------   -----    -----------    ------------------
 cucmsub             172.16.100.2      0.125     Y/Y/Y     --       (-)            (-) Waiting...
 cucmsub2            172.16.100.8      0.377     Y/Y/Y     --       (-)            (0) Defining...
 cucmpub             172.16.100.1      0.020     Y/Y/Y     0        (g_2)          (2) Setup Completed



 admin:

In Example 5-16, cucmsub is waiting for replication to begin, while cucmsub2 is in a defining state (RTMT=0). Setup on the Publisher node is complete, at the RTMT=2 state. Each subscriber should move to state 0 and then to state 2 after the replication reset is completed. If one or more nodes do not, more troubleshooting is needed. But, again, be patient: the output of the utils dbreplication runtimestate command will be in somewhat constant flux, so keep watching it. Example 5-17 shows the difference a 10-minute wait can make. Take a look at the subscribers' replication states.

Example 5-17. Monitoring Replication Reset


admin:utils dbreplication runtimestate

 Server Time: Mon Jan 11 23:50:10 CST 2016

 Cluster Replication State: BROADCAST SYNC Started on 2 server(s) at: 2016-01-11-23-49
      Use CLI to see detail: ‘file view activelog cm/trace/dbl/20160111_234950_dbl_repl_output_Broadcast.log'

 DB Version: ccm11_0_1_20000_2
 Repltimeout set to: 300s
 PROCESS option set to: 1

 Cluster Detailed View from cucmpub (3 Servers):

                                       PING      DB/RPC/   REPL.    Replication    REPLICATION SETUP
 SERVER-NAME         IP ADDRESS        (msec)    DbMon?    QUEUE    Group ID       (RTMT) & Details
 -----------         ----------        ------    -------   -----    -----------    ------------------
 cucmsub             172.16.100.2      0.086     Y/Y/Y     0        (g_4)          (0) Syncing...
 cucmsub2            172.16.100.8      0.392     Y/Y/Y     0        (g_5)          (0) Syncing...
 cucmpub             172.16.100.1      0.016     Y/Y/Y     0        (g_2)          (2) Setup Completed



 admin:

In Example 5-17, note that both subscribers are in an RTMT=0 state, Syncing. This is excellent progress. Keep watching until all three nodes are at the RTMT=2 state, Setup Completed.

As with the stop command, there is no “Are you sure?” prompting when you enter the reset command. So, be sure you really mean it when you press Enter.

The addition of the all parameter to utils dbreplication stop and utils dbreplication reset was quite useful. If replication is okay on all but one node, however, focus on that one node. As an example, the cluster in use for much of the construction of this book is a three-node cluster. If the publisher and Subscriber 1 both show Replication State 2 in RTMT, while Subscriber 2 shows Replication State 4, it is worth exploring a reset of just Subscriber 2's replication.

To do so, enter utils dbreplication stop on Subscriber 2 only. Remember, there is no safety net prompt to ask if you are sure. After you press Enter, replication stops on that node. Example 5-18 shows the output from Subscriber 2.

Example 5-18. Stop Replication on Subscriber 2 Only


admin:utils dbreplication stop

 ********************************************************************************************
 This command will delete the marker file(s) so that automatic replication setup is stopped
 It will also stop any replication setup currently executing
 ********************************************************************************************

 Deleted the marker file, auto replication setup is stopped

 Service Manager is running
 Commanded Out of Service
 A Cisco DB Replicator[NOTRUNNING]
 Service Manager is running
 A Cisco DB Replicator[STARTING]
 A Cisco DB Replicator[STARTING]
 A Cisco DB Replicator[STARTED]
 Killed dblrpc process - 14313

 Completed replication process cleanup

 Please run the command ‘utils dbreplication runtimestate' and make sure all nodes are
 RPC reachable before a replication reset is executed
 admin:

From the publisher, enter the utils dbreplication reset [nodename] command. For the hostname, use the hostname of Subscriber 2. The command becomes utils dbreplication reset cucmsub2. Example 5-19 shows the command output on the publisher.

Example 5-19. Reset Replication for Subscriber 2 from Publisher


admin:utils dbreplication reset cucmsub2

 Repairing of replication is in progress.
 Background repair of replication will continue after that for 30 minutes....
 OK [172.16.100.8]

 Reset completed: 0     Failed: 0
 Duration: 2.9 minute
 Use CLI to see detail: ‘file view activelog cm/trace/dbl/sdi/dbl_repl_output_util.log'
 admin:

Example 5-20 shows the log file generated by the reset detailed for the preceding example.

Example 5-20. Reset of Subscriber 2 Replication from Publisher


admin:file view activelog cm/trace/dbl/sdi/dbl_repl_output_util.log

Tue Jan 12 00:04:58 2016 replutil  DEBUG:  -->
Tue Jan 12 00:04:58 2016 replutil  DEBUG:  task to do [teardown]
Tue Jan 12 00:04:58 2016 replutil  DEBUG:  hostname [cucmsub2]
Tue Jan 12 00:04:58 2016 replutil  DEBUG:   Inside task == teardown
Tue Jan 12 00:04:58 2016 replutil.getList  DEBUG:  -->
Tue Jan 12 00:04:58 2016 replutil.getList  DEBUG:  Inside getList
Tue Jan 12 00:04:58 2016 replutil.getList  DEBUG:  Inside : getList(hostname) :: and if (Hostname != None )
Tue Jan 12 00:05:25 2016 replutil.getList  DEBUG:  <--
Tue Jan 12 00:05:25 2016 replutil  DEBUG:  Starting replication reset on node: cucmsub2
Tue Jan 12 00:05:25 2016 replutil.cdrDelete  DEBUG:  -->
Tue Jan 12 00:05:25 2016 replutil.cdrDelete  DEBUG:
Inside cdrDelete()
Tue Jan 12 00:05:25 2016 replutil.cdrDelete  DEBUG:  length of groupnameslist is [1]
Tue Jan 12 00:05:25 2016 dbllib.cdrdeleteserver  DEBUG:  -->

Tue Jan 12 00:05:46 2016 dbllib.cdrdeleteserver  DEBUG:  Executing su - informix -c "source /usr/local/cm/db/informix/local/ids.env; cdr delete server -f --connect=g_5_ccm11_0_1_20000_2 g_5_ccm11_0_1_20000_2"
Tue Jan 12 00:05:46 2016 dbllib.cdrdeleteserver  DEBUG:  Successfully deleted g_5_ccm11_0_1_20000_2 remotely
Tue Jan 12 00:05:47 2016 dbllib.cdrdeleteserver  DEBUG:  Executing su - informix -c "source /usr/local/cm/db/informix/local/ids.env; cdr delete server -f g_5_ccm11_0_1_20000_2"
Tue Jan 12 00:05:47 2016 dbllib.cdrdeleteserver  DEBUG:  Successfully deleted g_5_ccm11_0_1_20000_2
Tue Jan 12 00:05:47 2016 dbllib.cdrdeleteserver  DEBUG:  <--
Tue Jan 12 00:05:47 2016 replutil.cdrDelete  DEBUG:  Successfully deleted the g_5_ccm11_0_1_20000_2 server from replication network
Tue Jan 12 00:06:47 2016 replutil.cdrDelete  DEBUG:  <--
Tue Jan 12 00:06:47 2016 replutil  DEBUG:  <--
Tue Jan 12 00:06:48 2016 replutil  DEBUG:  -->
Tue Jan 12 00:06:48 2016 replutil  DEBUG:  task to do [setup]
Tue Jan 12 00:06:48 2016 replutil  DEBUG:  hostname [cucmsub2]
Tue Jan 12 00:06:48 2016 replutil  DEBUG:   Inside task == setup

Tue Jan 12 00:06:48 2016 replutil.getList  DEBUG:  -->
Tue Jan 12 00:06:48 2016 replutil.getList  DEBUG:  Inside getList
Tue Jan 12 00:06:48 2016 replutil.getList  DEBUG:  Inside : getList(hostname) :: and if (Hostname != None )
Tue Jan 12 00:07:23 2016 replutil.getList  DEBUG:  <--
Tue Jan 12 00:07:23 2016 replutil.cdrDefine  DEBUG:  -->
Tue Jan 12 00:07:23 2016 replutil.cdrDefine  DEBUG:  Inside cdrDefine method
Tue Jan 12 00:07:23 2016 replutil.cdrDefine  DEBUG:  Inside cdrDefine

Tue Jan 12 00:07:23 2016 replutil.cdrDefine  DEBUG:  val is g_5_ccm11_0_1_20000_2
Tue Jan 12 00:07:23 2016 replutil.cdrDefine  DEBUG:  cmd is [/usr/local/cm/bin/dbl mkrepl --delsub g_5_ccm11_0_1_20000_2]
Tue Jan 12 00:07:52 2016 replutil.cdrDefine  DEBUG:  <--
Tue Jan 12 00:07:52 2016 replutil  DEBUG:  Reset completed: 0     Failed: 0
Tue Jan 12 00:07:52 2016 replutil  DEBUG:  <--

end of the file reached
options: q=quit, n=next, p=prev, b=begin, e=end (lines 101 - 105 of 105) :
admin:

In Example 5-20, key aspects are highlighted. The teardown task begins, followed by the reset and redefinition of the database on cucmsub2. Even though this was performed on only one node, it still took considerable time to complete, and processing continues for some time after the command returns. Keep monitoring until the node returns to the RTMT=2 state.

If the node does not change from RTMT=0 to RTMT=2 within 4 hours, check the relevant services using the utils service list command to make sure they are started. The services in question are A Cisco DB, A Cisco DB Replicator, and Cisco Database Layer Monitor. If any one of these services is not started, start it and retry the reset. If the services are all running, and connectivity between the nodes in the cluster is good, the replication process itself may be corrupted. On the affected subscriber node, issue the utils dbreplication stop command followed by the utils dbreplication dropadmindb command; this drops a corrupted syscdr database so that it can be rebuilt from scratch. Then, on the publisher CLI, issue the utils dbreplication reset [nodename] command and monitor the output of the utils dbreplication runtimestate command again.
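
That escalation path (check services first, then connectivity, and only then fall through to the dropadmindb recovery) can be sketched as follows. This is a hypothetical illustration of the procedure described above; the function and status dictionary are invented, while the three service names come from the text.

```python
# Hypothetical sketch of the escalation path above: verify the three
# database services, then reachability, and only then recommend the
# stop + dropadmindb recovery on the affected subscriber.

REQUIRED_SERVICES = (
    "A Cisco DB",
    "A Cisco DB Replicator",
    "Cisco Database Layer Monitor",
)

def next_action(service_status, node_reachable=True):
    """service_status: {service name: True if started} -> recommended step."""
    stopped = [s for s in REQUIRED_SERVICES if not service_status.get(s)]
    if stopped:
        return f"start services: {', '.join(stopped)}; then retry the reset"
    if not node_reachable:
        return "fix connectivity before touching replication"
    return "run utils dbreplication stop, then utils dbreplication dropadmindb"

status = {s: True for s in REQUIRED_SERVICES}
status["A Cisco DB Replicator"] = False
print(next_action(status))  # a stopped service is the first thing to fix
```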

Chapter Summary

Replication troubleshooting can be a tricky business on the best of days and on the smallest of clusters. We cannot stress enough that this level of troubleshooting is quite advanced and involved. On a production system, it is highly recommended that TAC be involved until such time as you gain a comfort level with the commands and quirks surrounding replication. This chapter is by no means an exhaustive resource on replication; in fact, it merely scratches the surface. If you have access to a lab system and can build a multinode environment, it is exceedingly useful to work with the commands discussed in this chapter and get a feel for how the database replication process goes and the time it takes, even for a small cluster.

Tearing down and rebuilding replication node-by-node, as well as clusterwide, can provide a great deal of comfort and general know-how when it comes to understanding how Cisco Collaboration Systems behave in the best and worst of times. Don't rush the process. Fully re-establishing replication to the desired state can take a great deal of time. Above all, please be patient.

References

For additional information, refer to the following:

• Command Line Interface Guide for Cisco Unified Communications Solutions

http://www.cisco.com/c/en/us/td/docs/voice_ip_comm/cucm/cli_ref/11_0_1/CUCM_BK_C93262BC_00_cucm-cli-reference-guide-1101.html

• Troubleshooting CUCM Database Replication in Linux Appliance Model

https://supportforums.cisco.com/document/52421/troubleshooting-cucm-database-replication-linux-appliance-model

• Steps to Troubleshoot Database Replication

http://www.cisco.com/c/en/us/support/docs/unified-communications/unified-communications-manager-callmanager/200396-Steps-to-Troubleshoot-Database-Replicati.html

Review Questions

Use these questions to review what you’ve learned in this chapter. The answers appear in Appendix A, “Answers to Chapter Review Questions.”

1. A cluster contains which of the following components?

a. Publisher

b. Subscribers

c. TFTP Server(s)

d. PBX

e. CTI Server

2. Device-specific settings are stored in a database using which of the following?

a. PostGRE

b. SQL

c. MySQL

d. IBM IDS

3. The master copy of the database is maintained by which of the following?

a. Subscriber

b. DBMaster

c. Publisher

d. TFTP Server

4. Which type of traffic flows among CUCM nodes in a cluster to communicate call and device state among the nodes?

a. IDS

b. ICCS

c. SCCP

d. SIP

5. CUCM subscribers obtain NTP synchronization from where?

a. Publisher

b. Cisco IOS Router

c. MS Windows Server

d. External NTP Source

6. Which of the following should never be used for CUCM NTP synchronization?

a. Publisher

b. Cisco IOS Router

c. MS Windows Server

d. External NTP Source

7. Which Replication_State indicates that replication is initializing?

a. 0

b. 1

c. 2

d. 3

e. 4

8. Which Replication_State indicates replicate creation but an incorrect count?

a. 0

b. 1

c. 2

d. 3

e. 4

9. Which Replication_State indicates replication setup failure?

a. 0

b. 1

c. 2

d. 3

e. 4

10. Which two CUCM CLI commands show current Replication_State?

a. utils dbreplication runtimestate

b. show dbreplication runtimestate

c. utils dbreplication status

d. show dbreplication status

11. Which CUCM CLI command is useful in verifying network connectivity and reachability?

a. utils network status

b. show status

c. utils network ping

d. show network eth0

12. Which CUCM CLI command is preferred for testing DNS?

a. utils network nslookup

b. utils network traceroute

c. utils diagnose test

d. utils diagnose dns

13. Which CUCM CLI command displays network time information?

a. show ntp synchronization

b. utils ntp status

c. show ntp status

d. network ntp status

14. Which command should be run only from the CUCM publisher?

a. utils dbreplication repair all

b. utils dbreplication status

c. utils dbreplication runtimestate

d. utils dbreplication diagnose

15. Which command can be issued only from the publisher and stops clusterwide replication on all nodes?

a. utils dbreplication stop

b. utils dbreplication stop all

c. utils dbreplication clusterreset

d. utils dbreplication status
