Within a Cisco Unified Communications Manager (CUCM) cluster, database replication is one of the most crucial functions for optimal performance because the database is ever-changing. The cluster consists of the CUCM publisher and subscribers, terms derived from database operations and functionality: the publisher maintains the primary database and pushes it to the subscribers. That is, the subscribers carry an active subscription to the information published and offered by the publisher. The origin of the terminology may seem obvious, but it is often unknown. Knowing that it's all about database replication helps demystify the idea of the cluster: a set of nodes that together comprise the actual private branch exchange (PBX) functionality.
This chapter examines the basics of the database, the replication relationships, and how to break, re-establish, and repair those relationships. Upon completing this chapter, you will be able to
• Review the basics of database replication in CUCM
• Identify database replication issues in CUCM
• Describe the procedure to resolve database replication issues in a CUCM cluster
Cisco CallManager, as it was known in the early days, was based on a Microsoft SQL Server database infrastructure running on a Microsoft Windows Server platform. With Cisco CallManager 5.0, the product was converted to a Linux-based operating system and the Informix database infrastructure and was renamed Cisco Unified Communications Manager (CUCM). Now, with the recent Collaboration System Release (CSR) 11, it has become far more than it was at its inception. It began as an IP PBX; it is now the foundation of a rich collaboration architecture that far exceeds mere voice call control functionality.
What does all of this mean? It means that the underlying architecture and all the peripheral applications, services, clients, and other functions are even more reliant upon the health of this database.
Note
It is important to address a key aspect of working with database replication: patience.
The processes and commands discussed in this chapter may run quickly, but the resulting changes take time. Depending on cluster and database size, full replication can take anywhere from minutes to hours. Synchronization status is designated using a numerical value in the range of 0–4. Patience is key here. Wait for status 2 (good) on every node.
A cluster consists of one or more call control servers (nodes) connected via a data network to provide call-processing services to registered devices. The cluster provides other ancillary services, such as Trivial File Transfer Protocol (TFTP), computer telephony interface (CTI) services, media services, and more. The first node in the cluster is called the publisher. The publisher is the owner of the database. It has full read/write capabilities and control over the database. Subsequent nodes joining the cluster are known as subscribers. That is, they request access to the database and wish to be updated any time there is a change. As the database is updated by the publisher, those updates are sent to the Subscriber nodes.
Each node in the cluster is capable of providing call-processing services. However, each node is not an independent entity. In the days of time-division multiplexed (TDM) private branch exchange (PBX) call processing, each PBX was a fully independent call-processing entity (administered independently as well). The cluster is the PBX. The fact that these call-processing nodes can be spread out across the network while still being part of a single, all-encompassing call control element was one of the most compelling factors leading to the widespread adoption of IP telephony. Redundancy could be added at multiple layers and across geography without having to administer additional PBXs.
To function properly, CUCM needs to be able to retrieve configuration settings for registered devices. All these settings are stored in a database by using an IBM Informix Dynamic Server (IDS). This database is a repository for all things related to call control such as service parameters, features, device configurations, and of course, the dial plan.
As mentioned, the publisher keeps the master copy of the database. The subscribers merely read it. This relationship mandates a certain amount of traffic flow among the servers, along with very limited latency (80 ms roundtrip max). In fact, there are multiple traffic flows, depending on the purpose of the node.
IDS traffic flows to every node, regardless of its purpose. Some Subscriber nodes do not perform call processing. In many clusters, there are dedicated TFTP server nodes, whereas other nodes are dedicated to music on hold (MoH), conferencing, or other services. This traffic flows in a hub-and-spoke topology. The database replication traffic flows only to/from the publisher from each of the Subscriber nodes, not between Subscriber nodes.
This traffic flow is not to be confused with call-processing user-facing feature replication traffic, which flows in a full mesh among all nodes, regardless of their function. Database modifications for user-facing call-processing features are made on the Subscriber nodes to which each endpoint/device is registered. So, the subscribers get to make some limited updates to the database. These updates must be replicated to all other nodes in the cluster to maintain database integrity and maintain redundancy for features and services. These features include
• Call Forward All (CFA)
• Message Waiting Indicator (MWI)
• Privacy Enable/Disable
• Extension Mobility (EM) login/logout
• Hunt Group login/logout
• Device Mobility
• Certificate Authority Proxy Function (CAPF) status for end/application users
• Credential hacking and authentication
Another traffic type flows among the call-processing subscribers. This traffic is Intra-Cluster Communications Signaling (ICCS). ICCS flows in a full mesh among all the call-processing nodes. This allows for a faster, real-time information exchange for exceedingly frequent changes of call and device states among the subscribers. It enables optimal call routing among the devices registered to the cluster. Figure 5-1 shows a graphical representation of database replication (IDS) and ICCS traffic flows.
Figure 5-1. Database Replication Overview
Note
Network Time Protocol (NTP) is a crucial part of maintaining replication. The subscribers acquire NTP from the publisher. The publisher should obtain it from a highly reliable source. Cisco recommends that a Linux NTP server or Cisco IOS device be used to provide NTP service to the publisher. Use of a Microsoft Windows–based time service is not supported because Windows uses Simple Network Time Protocol (SNTP) to which the publisher cannot synchronize. The publisher should be synchronized to a Stratum 1, 2, or 3 time source. If the stratum is 5 or higher, the publisher will generate alarms.
The first step in identifying a replication issue is to know where to view the current state of replication. You must have some understanding of how replication errors may manifest in terms of configuration or user-reported issues.
You can check replication status from a number of places within the system, including the CLI, the Cisco Unified Reporting Tool, and the Real Time Monitoring Tool (RTMT). In all three cases, the replication status is shown based on a numeric Replication State:
• 0 — Replication is in initialization state
• 1 — Replicates have been created, but their count is incorrect
• 2 — Replication is good
• 3 — Replication is bad
• 4 — Replication setup failed
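The state codes above lend themselves to a simple lookup. The following is a minimal, hypothetical sketch (the node names and states are illustrative, not from any real cluster) of how an administrator's script might flag any node that is not in the "good" state:

```python
# Map the documented CUCM replication state codes (0-4) to their meanings
# and report any node that is not in state 2 (good). The node names used
# in the example run are hypothetical.

REPLICATION_STATES = {
    0: "Initializing",
    1: "Replicates created, but count incorrect",
    2: "Good",
    3: "Bad",
    4: "Setup failed",
}

def unhealthy_nodes(node_states):
    """Return (node, description) pairs for every node not in state 2."""
    return [(node, REPLICATION_STATES[state])
            for node, state in node_states.items() if state != 2]

if __name__ == "__main__":
    states = {"cucmpub": 2, "cucmsub": 2, "cucmsub2": 4}
    for node, desc in unhealthy_nodes(states):
        print(f"{node}: replication state is '{desc}' -- investigate")
```

Remember the note about patience: state 0 or 1 during setup is not necessarily a failure; only sustained non-2 states warrant investigation.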
From the CLI, issue the utils dbreplication runtimestate command. Example 5-1 shows the output of this command. Be patient. This command takes some time to return all the associated output.
Example 5-1. utils dbreplication runtimestate Command Output
admin:utils dbreplication runtimestate

Server Time: Sun Jan 10 13:44:03 CST 2016

Cluster Replication State: Replication status command started at: 2016-01-10-13-39
    Replication status command ENDED. Checked 692 tables out of 692
    Last Completed Table: devicenumplanmapremdestmap
    No Errors or Mismatches found.

    Use 'file view activelog cm/trace/dbl/sdi/ReplicationStatus.2016_01_10_13_39_38.out' to see the details

DB Version: ccm11_0_1_20000_2
Repltimeout set to: 300s
PROCESS option set to: 1

Cluster Detailed View from cucmpub (3 Servers):

                            PING     DB/RPC/  REPL.  Replication  REPLICATION SETUP
SERVER-NAME  IP ADDRESS     (msec)   DbMon?   QUEUE  Group ID     (RTMT) & Details
-----------  ----------     ------   -------  -----  -----------  ------------------
cucmsub2     172.16.100.8   0.421    Y/Y/Y    0      (g_5)        (2) Setup Completed
cucmsub      172.16.100.2   0.207    Y/Y/Y    0      (g_4)        (2) Setup Completed
cucmpub      172.16.100.1   0.039    Y/Y/Y    0      (g_2)        (2) Setup Completed

admin:
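The per-server table at the end of that output can be screen-scraped when you need to check many clusters. The following is a hypothetical sketch of such a parser; the regular expression assumes the row layout shown in Example 5-1 and would need hardening for other CUCM versions:

```python
import re

# Pull each node's RTMT replication state out of the per-server table
# printed by 'utils dbreplication runtimestate'. The row format assumed
# here matches Example 5-1; treat this as a sketch, not a robust parser.

ROW = re.compile(r"^(\S+)\s+(\d+\.\d+\.\d+\.\d+)\s+.*\((\d)\)\s+Setup Completed")

def parse_runtimestate(output):
    """Return {server_name: rtmt_state} for rows matching the table format."""
    states = {}
    for line in output.splitlines():
        m = ROW.match(line.strip())
        if m:
            states[m.group(1)] = int(m.group(3))
    return states

sample = """\
cucmsub2     172.16.100.8   0.421    Y/Y/Y    0      (g_5)        (2) Setup Completed
cucmsub      172.16.100.2   0.207    Y/Y/Y    0      (g_4)        (2) Setup Completed
cucmpub      172.16.100.1   0.039    Y/Y/Y    0      (g_2)        (2) Setup Completed
"""

states = parse_runtimestate(sample)
print(all(s == 2 for s in states.values()))   # healthy only if every node is 2
```

A cluster is healthy only when every node reports state 2; any other value on any node warrants the checks discussed later in this chapter.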
Another commonly used command is utils dbreplication status. It shows the state (active/inactive) of replication among the nodes but does not show the status in terms of the 0–4 numerical values discussed. Example 5-2 shows the output of the command as issued for all nodes: utils dbreplication status all.
Example 5-2. utils dbreplication status all Command Output
admin:utils dbreplication status all
Replication status check is now running in background.
Use command 'utils dbreplication runtimestate' to check its progress
The final output will be in file cm/trace/dbl/sdi/ReplicationStatus.2016_01_10_14_36_49.out
Please use "file view activelog cm/trace/dbl/sdi/ReplicationStatus.2016_01_10_14_36_49.out" command to see the output
admin:file view activelog cm/trace/dbl/sdi/ReplicationStatus.2016_01_10_14_36_49.out

Sun Jan 10 14:36:49 2016 main() DEBUG: -->
Sun Jan 10 14:36:57 2016 main() DEBUG: Replication cluster summary:
SERVER                  ID STATE   STATUS     QUEUE  CONNECTION CHANGED
-----------------------------------------------------------------------
g_2_ccm11_0_1_20000_2   2  Active  Local      0
g_4_ccm11_0_1_20000_2   4  Active  Connected  0      Jan  1 16:01:48
g_5_ccm11_0_1_20000_2   5  Active  Connected  0      Jan  1 15:48:24
Sun Jan 10 14:37:19 2016 main() DEBUG: <--
end of the file reached
options: q=quit, n=next, p=prev, b=begin, e=end (lines 1 - 8 of 8) :
admin:
In Example 5-2, the highlighted column shows that replication is active among the nodes. The node on which the command was entered shows as Local in the status column, while peer nodes show as Connected (hopefully). If the connection to a peer is lost, it shows as Dropped instead.
In the Cisco Unified Reporting Tool, a couple of reports deal specifically with database replication. From a status standpoint, the focus here is only on one of them: the Unified CM Database Status report. This report is CPU and time intensive. In general, it should be run only during off hours, as it takes a minimum of 10 seconds per node in the cluster to run.
Open a browser to the publisher and select the Cisco Unified Reporting option from the drop-down box in the top-right corner. Then click Go. Once there, log in (if necessary) and click System Reports. The list of reports is shown in the left column. Select the Cisco Unified CM Database Status report. Once it is selected, you might get a blank screen with three icons. One of those icons resembles a piece of paper with a bar graph. Click that icon to generate a new report. Figure 5-2 shows an example of the report generated.
Figure 5-2. Database Replication Status Report
The report verifies connectivity among the nodes, replication status, name resolution, and much more. In Figure 5-2, the Unified CM Database Status box shows that all servers have the same replication count and that status is good. The View Details link was clicked to expand the output to show that status on all counts is 2 (good). The report also directs the viewer to the Database Summary screen in RTMT. Figure 5-3 shows the Database Summary screen in the Voice/Video section of RTMT.
Figure 5-3. Database Summary Screen in RTMT
In Figure 5-3, five graphs are visible. They represent Change Notification Requests Queued in DB, Change Notification Requests Queued in Memory, Total Number of Connection Clients, Replicates Created, and Replication Status. A line is shown on the graph for each node and one for the cluster overall. Each is represented by a different color. In the table at the bottom of the figure, you can see the nodes and their statuses. As expected, all nodes show a status of 2.
RTMT performance counters can also show the status of database replication per node. In the System section of RTMT, click Performance. Drop down the list of counters for each node and select Number of Replicates Created and State of Replication. Double-click the Number of Replicates Created and the Replicate_State for the node, if you want to see both. This discussion is most interested in Replicate_State. Figure 5-4 shows both of the performance counters selected for the publisher and two subscribers.
Figure 5-4. Replication Performance Counters in RTMT
In Figure 5-4, the number of replicates created across all nodes is identical. Additionally, the replication status of all nodes is shown as 2, as expected.
Occasionally, database replication experiences anomalies, issues, and/or problems (though not necessarily in that order). The symptoms, as reported by users, may be intermittent, sporadic, and difficult, if not impossible, to reproduce. The cluster architecture allows a significant degree of autonomy for each server node in terms of fulfilling its role without excessive updates from the publisher. Administrative and/or configuration changes should be made to the publisher. If they cannot be replicated to the other nodes within the cluster, there will be issues in terms of database state that might result in operational issues.
IDS is a robust system and maintains a very high degree of reliability. However, some situations that might arise will impact the system’s capability to function as designed. They include
• Network Connectivity: Reachability must be maintained between the nodes across the IP network.
• Network Bandwidth: Replication is a real-time process and must be prioritized in terms of QoS and bandwidth availability end-to-end.
• DNS: The replication process makes extensive use of DNS, and any misconfiguration or delayed response times may impact performance.
• Excessive CPU Utilization of Peer Nodes: Peer nodes may be experiencing sustained periods of excessively high CPU utilization, which can preclude their capability to process replication updates.
• NTP: Replication relies heavily on NTP to track and process replication information and ensure full synchronization.
It is obvious why network connectivity has the potential to cause replication issues. However, the ability to adequately troubleshoot it may not be. A simple ping from your workstation won’t necessarily show an accurate picture of the traffic flow. The reason is that it is between your workstation and the node(s) in question rather than between the nodes themselves. The same is true for traceroute command usage. Example 5-3 shows a simple ping from a workstation to the Publisher node.
Example 5-3. ping Command Output
C:\>ping 172.16.100.1

Pinging 172.16.100.1 with 32 bytes of data:
Reply from 172.16.100.1: bytes=32 time<1ms TTL=63
Reply from 172.16.100.1: bytes=32 time<1ms TTL=63
Reply from 172.16.100.1: bytes=32 time<1ms TTL=63
Reply from 172.16.100.1: bytes=32 time<1ms TTL=63

Ping statistics for 172.16.100.1:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 0ms, Maximum = 0ms, Average = 0ms

C:\>ping cucmpub.mydomain.com

Pinging cucmpub.mydomain.com [172.16.100.1] with 32 bytes of data:
Reply from 172.16.100.1: bytes=32 time<1ms TTL=63
Reply from 172.16.100.1: bytes=32 time<1ms TTL=63
Reply from 172.16.100.1: bytes=32 time<1ms TTL=63
Reply from 172.16.100.1: bytes=32 time<1ms TTL=63

Ping statistics for 172.16.100.1:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 0ms, Maximum = 0ms, Average = 0ms

C:\>
Example 5-3 shows sub-1 ms response time. It also shows that, from the workstation perspective, the publisher is reachable both by IP address and DNS name. But, again, the fallacy in the output is that it reflects the workstation's perspective, not the traffic flow between the nodes. This may seem elementary, but it has come up so many times over the years as a valid, relevant issue in troubleshooting. The thought process is correct; the commands just need to be entered from the server(s) in question. So, SSH to or open the VMware console of the publisher and subscriber(s), and use the utils network ping and utils network traceroute commands.
Example 5-4 shows a ping from the Publisher to a Subscriber node using the Publisher CLI using the utils network ping command.
Example 5-4. utils network ping Command Output
admin:utils network ping 172.16.100.2
PING 172.16.100.2 (172.16.100.2) 56(84) bytes of data.
64 bytes from 172.16.100.2: icmp_seq=1 ttl=64 time=0.175 ms
64 bytes from 172.16.100.2: icmp_seq=2 ttl=64 time=0.187 ms
64 bytes from 172.16.100.2: icmp_seq=3 ttl=64 time=0.205 ms
64 bytes from 172.16.100.2: icmp_seq=4 ttl=64 time=0.126 ms

--- 172.16.100.2 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3001ms
rtt min/avg/max/mdev = 0.126/0.173/0.205/0.030 ms

admin:utils network ping cucmsub.mydomain.com
PING cucmsub.mydomain.com (172.16.100.2) 56(84) bytes of data.
64 bytes from cucmsub.mydomain.com (172.16.100.2): icmp_seq=1 ttl=64 time=0.081 ms
64 bytes from cucmsub.mydomain.com (172.16.100.2): icmp_seq=2 ttl=64 time=0.055 ms
64 bytes from cucmsub.mydomain.com (172.16.100.2): icmp_seq=3 ttl=64 time=0.144 ms
64 bytes from cucmsub.mydomain.com (172.16.100.2): icmp_seq=4 ttl=64 time=0.137 ms

--- cucmsub.mydomain.com ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 2999ms
rtt min/avg/max/mdev = 0.055/0.104/0.144/0.038 ms

admin:
In Example 5-4, note that pings were issued both by IP address and by DNS name. This verifies that DNS is functioning on the Publisher node. Also note the time shown by each ping response. It's not merely stating <1 ms; the result is specific to 1/1000 ms. That is much more relevant to your needs in terms of the true timing involved.
The specificity and usefulness of the utils network traceroute command are similar to ping. Example 5-5 shows a basic use of the command on the Publisher node.
Example 5-5. utils network traceroute Command Output
admin:utils network traceroute 172.16.100.8
traceroute to 172.16.100.8 (172.16.100.8), 30 hops max, 60 byte packets
 1  cucmsub2.mydomain.com (172.16.100.8)  0.310 ms  0.266 ms  0.335 ms
admin:
Notice, in the example, that even though the command was issued by IP address, the address was still resolved in DNS. As with the ping command, the timing is represented to 1/1000 ms.
In the case of a complete loss of connectivity, the pings will come back with less than optimal results as shown in Example 5-6.
Example 5-6. utils network ping Failure
admin:utils network ping cucmsub2
PING cucmsub2.mydomain.com (172.16.100.8) 56(84) bytes of data.
From cucmpub.mydomain.com (172.16.100.1) icmp_seq=1 Destination Host Unreachable

--- cucmsub2.mydomain.com ping statistics ---
4 packets transmitted, 0 received, +1 errors, 100% packet loss, time 13001ms

Error running command: Executed command unsuccessfully
admin:
Checking the replication status using the utils dbreplication runtimestate command shows a dreary picture of events as shown in Example 5-7.
Example 5-7. utils dbreplication runtimestate Command Output
admin:utils dbreplication runtimestate
Server Time: Sun Jan 10 14:49:51 CST 2016
Cluster Replication State: Replication status command started at: 2016-01-10-14-46
Replication status command ENDED. Checked 692 tables out of 692
Last Completed Table: devicenumplanmapremdestmap
No Errors or Mismatches found.
Use 'file view activelog cm/trace/dbl/sdi/ReplicationStatus.2016_01_10_14_46_07.out' to see the details
DB Version: ccm11_0_1_20000_2
Repltimeout set to: 300s
PROCESS option set to: 1
Cluster Detailed View from cucmpub (3 Servers):
PING DB/RPC/ REPL. Replication REPLICATION SETUP
SERVER-NAME IP ADDRESS (msec) DbMon? QUEUE Group ID (RTMT) & Details
----------- ---------- ------ ------- ----- ----------- ------------------
cucmsub 172.16.100.2 0.087 Y/Y/Y 0 (g_4) (2) Setup Completed
cucmpub 172.16.100.1 0.018 Y/Y/Y 0 (g_2) (2) Setup Completed
cucmsub2 172.16.100.8 N/A -/N/- 592 (g_5) (-) DB Active-Dropped
admin:
In Example 5-7, the highlighted line shows the loss of the subscriber. This seems to be a total loss of either reachability between the nodes or the node itself. The ping column shows N/A, which means it is unreachable. So, the Replication Setup (RTMT) & Details column shows no status and DB Active-Dropped.
So far, two types of traffic have been discussed regarding replication: database replication and ICCS. A minimum of 1.544 Mbps is required between sites for ICCS traffic flow, and an additional minimum of 1.544 Mbps is required for other interserver traffic, including database replication. Depending on the deployment model in use for the collaboration system architecture, more bandwidth may be required (for example, the Remote Failover model). These minimums speak only of replication and ICCS specifically among CUCM nodes. They do not include other services or applications. This is strictly the high-priority traffic for call control (not all the intercluster traffic). These bandwidth guidelines are the rule for clusters supporting up to 10,000 busy hour call attempts (BHCA). If more than 10,000 BHCA is required, the formula for bandwidth calculation is
Total Bandwidth (Mbps) = (Total BHCA/10,000) ∗ (1 + 0.006 ∗ Delay), where Delay = RTT delay in ms
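The formula above can be sketched as a short calculation. The BHCA and delay inputs below are illustrative values, not requirements from any particular deployment:

```python
# The intracluster bandwidth formula:
#   Total Bandwidth (Mbps) = (Total BHCA / 10,000) * (1 + 0.006 * Delay)
# where Delay is the round-trip delay in ms between nodes.

def intracluster_bandwidth_mbps(total_bhca, rtt_delay_ms):
    """Bandwidth needed for ICCS/replication traffic above 10,000 BHCA."""
    return (total_bhca / 10_000) * (1 + 0.006 * rtt_delay_ms)

# Example: a 25,000 BHCA cluster with 40 ms round-trip delay between nodes:
# 2.5 * (1 + 0.24) = 3.10 Mbps
bw = intracluster_bandwidth_mbps(25_000, 40)
print(f"{bw:.2f} Mbps")
```

Note how the delay term works: the longer the round-trip time between nodes (up to the 80 ms maximum), the more bandwidth the same call load requires.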
Intracluster traffic flows consist of multiple traffic types. These traffic types vary in how they’re classified by the system, either priority or best effort. Priority ICCS traffic is marked with IP Precedence 3 (DSCP 24/PHB CS3). Best effort is marked as IP Precedence 0 (DSCP 0/PHB BE). The traffic types are as follows:
• Database Traffic, which provides configuration information: This type is best effort but may be reclassified as IP Precedence 1 if needed (for example, extensive use of extension mobility).
• Firewall Management Traffic: This type authenticates subscribers to the publisher to gain access to the database. It is best effort but may be reclassified as IP Precedence 1 if required.
• ICCS Real-Time Traffic: This type addresses signaling, call admission control (CAC), and other call signaling. ICCS maintains a TCP connection among all nodes running the Cisco CallManager service. This is marked as priority traffic.
• CTI Manager Real-Time Traffic: This type is used for CTI devices involved in calls or for controlling/monitoring other devices on the CUCM servers. This is marked as priority traffic.
CTI ICCS traffic requirements for CTI Manager over the WAN deployments have not been included in the count so far. It calculates as follows:
CTI ICCS bandwidth (Mbps) = (Total BHCA/10,000)*0.53
For those deployments where J/TAPI applications are remote in relation to the CUCM subscriber providing call control, additional math is required for calculating the Quick Buffer Encoding (QBE) J/TAPI bandwidth:
J/TAPI bandwidth (Mbps) = (Total BHCA/10,000)*0.28
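The two additional formulas can be combined into one sketch. The 20,000 BHCA figure below is purely illustrative:

```python
# The CTI ICCS and J/TAPI QBE bandwidth formulas, as stated above. These
# are added on top of the base intracluster bandwidth when CTI Manager or
# J/TAPI applications are remote from the call-processing subscriber.

def cti_iccs_bandwidth_mbps(total_bhca):
    """CTI ICCS bandwidth (Mbps) = (Total BHCA / 10,000) * 0.53"""
    return (total_bhca / 10_000) * 0.53

def jtapi_bandwidth_mbps(total_bhca):
    """J/TAPI QBE bandwidth (Mbps) = (Total BHCA / 10,000) * 0.28"""
    return (total_bhca / 10_000) * 0.28

bhca = 20_000
print(f"CTI ICCS: {cti_iccs_bandwidth_mbps(bhca):.2f} Mbps")   # 1.06 Mbps
print(f"J/TAPI:   {jtapi_bandwidth_mbps(bhca):.2f} Mbps")      # 0.56 Mbps
```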
As mentioned, there is a requirement to maintain a maximum roundtrip time of 80 ms between any two given CUCM nodes in the cluster regardless of their function within the cluster.
Many factors impact bandwidth, including shared line appearances across a WAN, CTI, and J/TAPI. Yet bandwidth is not the only concern. A big, fat pipe is not a reason to skip QoS configuration. QoS is an end-to-end prioritization mechanism for mission-critical traffic types on router/switch ingress interfaces, internal processing, egress interfaces, and more; it is not merely a tool for stretching scarce bandwidth. Bandwidth is crucial, but so is the means by which traffic is prioritized across the device in question, not just the link between devices.
Domain Name System (DNS) has transitioned from optional to required over the past few years, as the Cisco Collaboration System has matured. Numerous services rely on the capability to quickly resolve A Records, SRV Records, and more. The need for highly available DNS servers is crucial to the health of IDS and ICCS. IDS uses DNS extensively for replication. Misconfigured, unreachable DNS causes issues for database replication.
One preferred manner of testing DNS is through the utils diagnose test command from the CLI. DNS is just one of the aspects of the server health it tests. Example 5-8 shows the output of the utils diagnose test command.
Example 5-8. utils diagnose test Command Output
admin:utils diagnose test

Log file: platform/log/diag3.log

Starting diagnostic test(s)
===========================
test - disk_space          : Passed (available: 1533 MB, used: 12494 MB)
skip - disk_files          : This module must be run directly and off hours
test - service_manager     : Passed
test - tomcat              : Passed
test - tomcat_deadlocks    : Passed
test - tomcat_keystore     : Passed
test - tomcat_connectors   : Passed
test - tomcat_threads      : Passed
test - tomcat_memory       : Passed
test - tomcat_sessions     : Passed
skip - tomcat_heapdump     : This module must be run directly and off hours
test - validate_network    : Passed
test - raid                : Passed
test - system_info         : Passed (Collected system information in diagnostic log)
test - ntp_reachability    : Warning
The host 204.235.61.9 is not reachable, or its NTP service is down.
The host 173.49.198.27 is not reachable, or its NTP service is down.
Some of the configured external NTP servers are not reachable. It is
recommended that for better time synchronization all of the NTP servers
be reachable. Please use the OS Admin GUI to add/remove NTP servers.
test - ntp_clock_drift     : Passed
test - ntp_stratum         : Failed
The reference NTP server is a stratum 5 clock. NTP servers with stratum 5
or worse clocks are deemed unreliable. Please consider using an NTP server
with better stratum level. Please use OS Admin GUI to add/delete NTP servers.
skip - sdl_fragmentation   : This module must be run directly and off hours
skip - sdi_fragmentation   : This module must be run directly and off hours

Diagnostics Completed

The final output will be in Log file: platform/log/diag3.log

Please use 'file view activelog platform/log/diag3.log' command to see the output
In Example 5-8, the test runs checks on disk space, disk files, Tomcat services, network connectivity, the RAID system, and NTP. The DNS check falls under the validate_network test. Any DNS errors are reported there. Also note that one or more NTP servers are unreachable (see the section “NTP,” later in this chapter).
Another means of verifying DNS reachability among the nodes is the utils network host command. It can confirm both forward and reverse lookup as shown in Example 5-9.
Example 5-9. utils network host Command Output
admin:utils network host 172.16.100.2
Local Resolution:
172.16.100.2 resolves locally to cucmsub.mydomain.com

External Resolution:
2.100.16.172.in-addr.arpa domain name pointer cucmsub.mydomain.com.

admin:
admin:utils network host cucmsub
Local Resolution:
cucmsub.mydomain.com resolves locally to 172.16.100.2

External Resolution:
cucmsub.mydomain.com has address 172.16.100.2

admin:
Rerun the command to test resolution for all nodes in the cluster.
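When checking resolution for every node from an administration workstation, the same forward/reverse test can be scripted. The following is a hypothetical sketch; the hostnames and addresses are the example values used throughout this chapter and must be replaced with your own, and the lookups use the workstation's resolver rather than the nodes' (so run the CLI command for a node-side view):

```python
import socket

# Verify forward (A record) and reverse (PTR record) resolution for each
# cluster node. Hostnames/IPs below are this chapter's example values.

NODES = {
    "cucmpub.mydomain.com":  "172.16.100.1",
    "cucmsub.mydomain.com":  "172.16.100.2",
    "cucmsub2.mydomain.com": "172.16.100.8",
}

def check_dns(hostname, expected_ip):
    """Report whether forward and reverse lookups agree for one node."""
    try:
        forward = socket.gethostbyname(hostname)
        reverse = socket.gethostbyaddr(expected_ip)[0]
    except socket.error as err:
        return f"{hostname}: lookup failed ({err})"
    if forward != expected_ip:
        return f"{hostname}: forward lookup returned {forward}"
    if reverse.rstrip(".") != hostname:
        return f"{hostname}: reverse lookup returned {reverse}"
    return f"{hostname}: forward and reverse lookups consistent"

for host, ip in NODES.items():
    print(check_dns(host, ip))
```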
Verifying CPU utilization on various nodes is best done via the RTMT, where the nodes can all be seen at the same time. This process is covered in detail in the “CPU and Memory” section of Chapter 3, “Using Troubleshooting and Monitoring Tools.” It is only briefly reviewed here. In RTMT, the CPU and Memory screen is accessed in the System section as shown in Figure 5-5.
Figure 5-5. RTMT CPU and Memory Screen
In Figure 5-5, the memory and CPU usage of all nodes is shown in line graph form. Each node is represented by a different color line on the graph. You can see that a couple of the nodes have CPU spikes. In particular, one node seems to be relatively busy with some process because it momentarily spiked to near 100 percent. Because the spike wasn't sustained for an extended period, it is unlikely to have interfered with replication.
The cluster relies on time stamps to ensure that all relevant information has been processed efficiently and in the order in which it was received. The most up-to-date information will have the newest time stamps. In a real-time replication scenario, such as cluster architecture, timing is everything. The synchronization of the clocks of all nodes is critical to proper functionality. The Network Time Protocol provides that synchronization service.
The NTP Watchdog in CUCM polls the configured NTP server(s) once per minute on VMware and every 30 minutes on physical machines because the clock on virtual servers is less reliable than on physical servers. The NTP daemon keeps time corrected on a millisecond scale, but if the time is offset by more than 3 seconds, the NTP Watchdog forces a restart of the NTP service, because a correction that large requires a service restart.
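The watchdog decision reduces to a simple threshold. The following is a minimal sketch of that logic only (not the actual watchdog implementation); the 3-second threshold is the value cited above:

```python
# Small offsets are slewed gradually by the NTP daemon; an offset beyond
# 3 seconds forces a restart of the NTP service to step the clock.

RESTART_THRESHOLD_S = 3.0

def watchdog_action(offset_seconds):
    """Return the corrective action for a measured clock offset."""
    if abs(offset_seconds) > RESTART_THRESHOLD_S:
        return "restart ntp service"      # large step correction needed
    return "let ntpd slew the clock"      # millisecond-scale correction

print(watchdog_action(0.006))   # let ntpd slew the clock
print(watchdog_action(-4.2))    # restart ntp service
```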
The publisher, especially because it’s now running on a virtual machine, should always be configured to pull time from a physical server, such as a Cisco IOS device or a Linux server. Subscriber nodes always pull time from the publisher.
You can verify the NTP service by using the utils diagnose test command, as shown previously in Example 5-8. Another command used to verify NTP is utils ntp status, as shown in Example 5-10.
Example 5-10. utils ntp status Command Output
admin:utils ntp status
ntpd (pid 9880) is running...

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*172.16.1.1      67.198.37.16     4 u  727 1024  377    1.400   -6.209   4.256
 204.235.61.9    .INIT.          16 u    - 1024    0    0.000    0.000   0.000
 173.49.198.27   .INIT.          16 u    - 1024    0    0.000    0.000   0.000

synchronised to NTP server (172.16.1.1) at stratum 5
   time correct to within 141 ms
   polling server every 1024 s

Current time in UTC is : Sun Jan 10 22:37:47 UTC 2016
Current time in America/Chicago is : Sun Jan 10 16:37:47 CST 2016
admin:
In Example 5-10, the internal NTP server is reachable, and time is in sync. The two external NTP servers are still in the .INIT. state, indicating that contact has not been established, likely due to the use of NTPv3 on the external NTP servers rather than NTPv4. Also of particular interest in the command output is the clock stratum. The time is synchronized to a stratum 5 clock. Cisco recommends that the publisher sync to a clock with stratum of 3 or less for optimal performance.
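The stratum check can be automated by scraping the synchronization line of that output. This hypothetical sketch assumes the line format shown in Example 5-10 and applies the guidance that the publisher should sync to stratum 3 or better:

```python
import re

# Extract the synchronization stratum from 'utils ntp status' output text
# and flag anything worse than the recommended stratum.

def check_stratum(ntp_status_output, max_acceptable=3):
    """Classify the stratum reported on the 'synchronised ... at stratum N' line."""
    m = re.search(r"at stratum (\d+)", ntp_status_output)
    if not m:
        return "not synchronized to any NTP server"
    stratum = int(m.group(1))
    if stratum <= max_acceptable:
        return f"stratum {stratum}: acceptable"
    return f"stratum {stratum}: too high -- use a better time source"

sample = "synchronised to NTP server (172.16.1.1) at stratum 5"
print(check_stratum(sample))    # stratum 5: too high -- use a better time source
```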
To monitor the NTP conversations on the node, issue the utils network capture port 123 command. This begins a packet capture for NTP. Example 5-11 shows the output of this command.
Example 5-11. NTP Packet Capture on CUCM
admin:utils network capture port 123
Executing command with options:
 size=128   count=1000   interface=eth0
 src=       dest=        port=123
 ip=

16:41:01.594373 IP cucmsub2.mydomain.com.48284 > cucmpub.mydomain.com.ntp: NTPv4, Client, length 48
16:41:01.596012 IP cucmpub.mydomain.com.ntp > cucmsub2.mydomain.com.48284: NTPv4, Server, length 48
16:41:01.596284 IP cucmsub2.mydomain.com.48284 > cucmpub.mydomain.com.ntp: NTPv4, Client, length 48
16:41:01.596733 IP cucmpub.mydomain.com.ntp > cucmsub2.mydomain.com.48284: NTPv4, Server, length 48
16:41:01.596931 IP cucmsub2.mydomain.com.48284 > cucmpub.mydomain.com.ntp: NTPv4, Client, length 48
16:41:01.597371 IP cucmpub.mydomain.com.ntp > cucmsub2.mydomain.com.48284: NTPv4, Server, length 48
16:41:01.597568 IP cucmsub2.mydomain.com.48284 > cucmpub.mydomain.com.ntp: NTPv4, Client, length 48
16:41:01.597623 IP cucmpub.mydomain.com.ntp > cucmsub2.mydomain.com.48284: NTPv4, Server, length 48
16:41:01.625517 IP cucmsub.mydomain.com.34728 > cucmpub.mydomain.com.ntp: NTPv4, Client, length 48
16:41:01.625929 IP cucmpub.mydomain.com.ntp > cucmsub.mydomain.com.34728: NTPv4, Server, length 48
In Example 5-11, NTPv4 traffic can be seen flowing among the cluster nodes, which are identified by their DNS names. NTPv4 is required for CUCM 9.x and higher.
Other relevant commands specific to troubleshooting NTP on CUCM include
• utils diagnose module ntp_reachability
• utils diagnose module ntp_clock_drift
• utils diagnose module ntp_stratum
The output of all three modules is included in the utils diagnose test command output; the individual commands simply let you run a specific check in isolation if desired.
In troubleshooting NTP issues, take into account the following:
• Ensure that the NTP server is reachable.
• Ensure that the stratum of the NTP server is acceptable (1–3).
• Subscribers out of sync may indicate a publisher reachability issue.
If necessary, issue the utils ntp restart command to reinitialize the NTP service on a given node. This step is necessary whenever a large correction (3 seconds or more) is required.
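The restart guidance above can be expressed as a simple threshold test. This is a hypothetical sketch, assuming the offset column of utils ntp status is read in milliseconds:

```python
def needs_ntp_restart(offset_ms: float, threshold_ms: float = 3000.0) -> bool:
    """A correction of 3 seconds (3000 ms) or more calls for 'utils ntp restart';
    smaller offsets are slewed gradually by ntpd without intervention."""
    return abs(offset_ms) >= threshold_ms

# The selected peer in Example 5-10 reported an offset of -6.209 ms:
small_drift = needs_ntp_restart(-6.209)    # ntpd corrects this on its own
large_drift = needs_ntp_restart(-4500.0)   # 4.5 s off; restart NTP on the node
```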
Replication is performed by a specific set of scripts running on the CUCM nodes. The process is performed by the Cisco database replicator (CDR). The overall method is rather straightforward. It includes a number of predictable steps that are performed at installation time for the node in question:
Step 1. Define the publisher and set it up to begin replication (RTMT=0).
Step 2. Define a template on the publisher and “realize” it to tell it which tables to replicate (RTMT=2).
Step 3. Define the subscriber (RTMT=0).
Step 4. Realize the template on each Subscriber node to tell it the tables for which it will send and receive data.
Step 5. Synchronize data using CDR Check or CDR Sync. After the CDR Check passes, RTMT=2.
Step 6. Repeat steps 3–5 for each subscriber.
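The RTMT state values referenced in the steps above recur throughout this chapter, so a small lookup table is handy when scripting health checks. This sketch covers only the values the chapter discusses; it is a hedged summary, not Cisco's full state list:

```python
# RTMT replication-state values as used in the setup steps above
RTMT_STATES = {
    0: "Initializing (defining/syncing; replication not yet established)",
    2: "Setup completed; replication is good",
    4: "Replication setup failed",
}

def describe_rtmt_state(state: int) -> str:
    """Translate an RTMT replication-state number into a short description."""
    return RTMT_STATES.get(state, "See Cisco documentation for this value")
```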
Understanding the basic process is exceedingly helpful in troubleshooting. It is crucial to recognize key issues that may be preventing the servers from replicating. Check the following before running any of the replication commands:
• Verify server/cluster connectivity for the TCP/UDP port ranges used for intracluster communications. You can find a complete list of these ports in the Cisco Unified Communications Manager TCP and UDP Port Usage guide.
• Check the hosts file(s) that will be in use when replication is initialized. These files are as follows:
• /etc/hosts: Local file used to map IP addresses to hostnames
• /home/informix/.rhosts: List of trusted hostnames to be used in setting up database connections
• $INFORMIXDIR/etc/sqlhosts: Full list of CUCM nodes for replication
• Verify proper DNS functionality throughout the cluster
All this information is available in the Cisco Unified Reporting Tool’s Unified CM Database Replication Status report.
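A common replication blocker is an inconsistent hostname-to-IP mapping across the files listed above. The following sketch is a hypothetical helper that parses hosts-file-style text so that the mappings from each node can be compared; a node name that resolves to two different IPs across the cluster is a red flag:

```python
def hosts_map(hosts_text: str) -> dict:
    """Parse /etc/hosts-style text into a {hostname: ip} mapping."""
    mapping = {}
    for line in hosts_text.splitlines():
        line = line.split("#")[0].strip()  # drop comments and blank lines
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            mapping[name] = ip
    return mapping

# Hypothetical hosts entries modeled on the lab cluster used in this chapter
pub = hosts_map("""\
172.16.100.1 cucmpub cucmpub.mydomain.com
172.16.100.2 cucmsub   # subscriber 1
172.16.100.8 cucmsub2  # subscriber 2""")
```

Collect the same mapping from each node's files and compare the dictionaries; any disagreement should be resolved before replication is reset.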
Resolving replication is not something that you should attempt solo the first few times. Work through the issues of troubleshooting replication with the Technical Assistance Center (TAC) unless you’re working on a lab system and there is no potential for service impact to end users.
As mentioned previously, you should check the status of replication throughout the cluster by using the utils dbreplication runtimestate command. If you have sufficient reason to believe that replication is experiencing issues (that is, nodes not at the 2 state), the utils dbreplication repair command may be warranted. However, the use of the command varies based on the size and state of the cluster. The command is typically issued only from the publisher.
For a cluster with 5000 phones or fewer, the utils dbreplication repair all command is safe to use. On larger clusters, or on clusters with only one node misbehaving, use the utils dbreplication repair [nodename] command. Be patient. Depending on the size of the database, this command can take hours (sometimes up to a full day) to complete fully. Monitor the status of the command using the utils dbreplication runtimestate command. Example 5-12 shows the use of the utils dbreplication repair all command and a brief view of the output file it creates.
Example 5-12. utils dbreplication repair all Command and File Output
admin:utils dbreplication repair all

-------------------- utils dbreplication repair --------------------

chmod: changing permissions of `/var/log/active/cm/trace/dbl/sdi/replication_scripts_output.log': Operation not permitted
Replication Repair is now running in the background.
Use command 'utils dbreplication runtimestate' to check its progress
Output will be in file cm/trace/dbl/sdi/ReplicationRepair.2016_01_11_22_35_02.out
Please use "file view activelog cm/trace/dbl/sdi/ReplicationRepair.2016_01_11_22_35_02.out " command to see the output
admin:file view activelog cm/trace/dbl/sdi/ReplicationRepair.2016_01_11_22_35_02.out
utils dbreplication repair output

To determine if replication is suspect, look for the following:
(1) Number of rows in a table do not match on all nodes.
(2) Non-zero values occur in any of the other output columns for a table

SERVER                 ID   STATE    STATUS      QUEUE   CONNECTION CHANGED
-----------------------------------------------------------------------
g_2_ccm11_0_1_20000_2  2    Active   Local       0
g_4_ccm11_0_1_20000_2  4    Active   Connected   0       Jan  1 16:01:48
g_5_ccm11_0_1_20000_2  5    Active   Connected   0       Jan 10 15:04:34

Mon Jan 11 22:35:12 2016 dbllib.getReplServerName DEBUG: -->
Mon Jan 11 22:35:24 2016 dbllib.getReplServerName DEBUG: replservername: g_2_ccm11_0_1_20000_2
Mon Jan 11 22:35:24 2016 dbllib.getReplServerName DEBUG: <--
-------------------------------------------------
No Errors or Mismatches found.
options: q=quit, n=next, p=prev, b=begin, e=end (lines 1 - 20 of 8325) :
Replication status is good on all available servers.
Jan 11 2016 22:36:03 ------ Table scan for ccmdbtemplate_g_2_ccm11_0_1_20000_2_1_141_typedberrors start --------
Node                   Rows      Extra     Missing   Mismatch  Processed
----------------       --------- --------- --------- --------- ---------
g_2_ccm11_0_1_20000_2  1602      0         0         0         0
g_4_ccm11_0_1_20000_2  1602      0         0         0         0
g_5_ccm11_0_1_20000_2  1602      0         0         0         0
Jan 11 2016 22:36:04 ------ Table scan for ccmdbtemplate_g_2_ccm11_0_1_20000_2_1_141_typedberrors end ---------
Jan 11 2016 22:36:04 ------ Table scan for ccmdbtemplate_g_2_ccm11_0_1_20000_2_1_342_typeroutingdatabasecachetimer start --------
Node                   Rows      Extra     Missing   Mismatch  Processed
----------------       --------- --------- --------- --------- ---------
g_2_ccm11_0_1_20000_2  97        0         0         0         0
g_4_ccm11_0_1_20000_2  97        0         0         0         0
options: q=quit, n=next, p=prev, b=begin, e=end (lines 21 - 40 of 8325) :
Obviously, there is a great deal of output in this file. Example 5-12 shows only the first 40 of 8325 lines. The key aspects are highlighted. In this case, there are no mismatches. That information is shown early in the file. If there were mismatches, they would be detailed throughout the output file.
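The choice between repairing the whole cluster and repairing a single node can be codified. This is a hypothetical decision helper reflecting the guidance above (the 5000-phone threshold and the always-from-the-publisher rule come from the text; the function name and parameters are illustrative):

```python
def repair_command(phone_count: int, suspect_nodes: list) -> str:
    """Pick the replication repair command to issue from the publisher:
    'repair all' on small clusters with multiple suspect nodes; otherwise
    repair the misbehaving node(s) individually."""
    if phone_count <= 5000 and len(suspect_nodes) > 1:
        return "utils dbreplication repair all"
    # Large cluster, or only one node misbehaving: target each node
    return "; ".join(f"utils dbreplication repair {n}" for n in suspect_nodes)
```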
Replication can be reset clusterwide or per node. The more specific option is typically preferred as a first resort, with the clusterwide "nuclear option" as a last resort. Resetting replication begins with stopping replication and then resetting it. Replication can be stopped on any or all subscribers as well as on the publisher, either node-by-node or all at once. If it is stopped node-by-node, it should be stopped on the publisher only after it has been completely stopped on all subscribers. That is, the utils dbreplication stop command must be issued on each subscriber, followed by waiting out the repltimeout, which is 300 seconds (5 minutes) by default. That timer starts after the stop is issued on the last subscriber. In CUCM 7.x, the utils dbreplication stop all command was added; it takes care of all these steps in one command issued from the publisher.
Note
The utils dbreplication stop all and utils dbreplication reset all commands replace the utils dbreplication clusterreset command, which was deprecated in CUCM 9.0(1). Although it is still in the list, the utils dbreplication clusterreset command is nonfunctional and simply results in an error message.
Keep in mind that the utils dbreplication stop all command still waits out the repltimeout value before returning the prompt to you on the screen. You can see the value of the timer by entering the show tech repltimeout command or set it to a different value by using the utils dbreplication setrepltimeout [time in seconds] command.
Warning
There is no “Are you sure?” prompt when using the utils dbreplication stop all command! It is immediately executed. Make sure you really want to do it before you press Enter after typing the command!
Example 5-13 shows the output of the utils dbreplication stop all command.
Example 5-13. utils dbreplication stop all Command Output
admin:utils dbreplication stop all
********************************************************************************************
This command will delete the marker file(s) so that automatic replication setup is stopped
It will also stop any replication setup currently executing
********************************************************************************************
Deleted the marker file, auto replication setup is stopped
Service Manager is running
A Cisco DB Replicator[STOPPING]
A Cisco DB Replicator[STOPPING]
Commanded Out of Service
A Cisco DB Replicator[NOTRUNNING]
Service Manager is running
A Cisco DB Replicator[STARTING]
A Cisco DB Replicator[STARTING]
A Cisco DB Replicator[STARTED]
Will stop PUB and SUBs: all
Stopping Sub: 172.16.100.2
Stop replication on sub Completed
Stopping Sub: 172.16.100.8
Stop replication on sub Completed
Killed dblrpc process - 26445
Completed replication process cleanup
Please run the command 'utils dbreplication runtimestate' and make sure all nodes are RPC reachable before a replication reset is executed
admin:
In Example 5-13, the A Cisco DB Replicator service is seen stopping on both subscribers. Then the process is killed on the publisher. It also notes that the utils dbreplication runtimestate command should be used to monitor the progression. Example 5-14 shows the output of the utils dbreplication runtimestate after the stop command was issued.
Example 5-14. Stopping Replication
admin:utils dbreplication runtimestate
Server Time: Mon Jan 11 23:24:19 CST 2016

Cluster Replication State: Replication status command started at: 2016-01-11-22-49
     Replication status command ENDED. Checked 692 tables out of 692
     Last Completed Table: devicenumplanmapremdestmap
     No Errors or Mismatches found.

     Use 'file view activelog cm/trace/dbl/sdi/ReplicationStatus.2016_01_11_22_49_57.out' to see the details

DB Version: ccm11_0_1_20000_2
Repltimeout set to: 300s
PROCESS option set to: 1

Cluster Detailed View from cucmpub (3 Servers):

                              PING     DB/RPC/  REPL.  Replication  REPLICATION SETUP
SERVER-NAME   IP ADDRESS      (msec)   DbMon?   QUEUE  Group ID     (RTMT) & Details
-----------   ----------      ------   -------  -----  -----------  ------------------
cucmsub2      172.16.100.8    0.369    Y/Y/Y    0      (g_5)        (2) Setup Completed
cucmsub       172.16.100.2    0.086    Y/Y/Y    0      (g_4)        (2) Setup Completed
cucmpub       172.16.100.1    0.019    Y/Y/Y    0      (g_2)        (2) Setup Completed
admin:
In Example 5-14, note that reachability is still fine between the nodes, and they all match in terms of the database tables. However, the highlighted portion shows that replication has ended. Note that each node has a Replication Group ID associated (2, 4, 5). This information is useful in tracking the status of the reset process.
After the replication process is stopped on all nodes, you can reset it by using the utils dbreplication reset all command. As tiring as it may be to see the same advice—“Be patient!”—throughout the chapter, patience is necessary. This is not a fast process and will take an hour or more, depending on the size of the cluster and number of nodes involved. Monitor it using the utils dbreplication runtimestate command. Example 5-15 shows the output of the utils dbreplication reset all command.
Example 5-15. utils dbreplication reset all Command Output
admin:utils dbreplication reset all
This command will try to start Replication reset and will return in 1-2 minutes.
Background repair of replication will continue after that for 1 hour.
Please watch RTMT replication state. It should go from 0 to 2. When all subs
have an RTMT Replicate State of 2, replication is complete.
If Sub replication state becomes 4 or 1, there is an error in replication setup.
Monitor the RTMT counters on all subs to determine when replication is complete.
Error details if found will be listed below
OK [172.16.100.8]
OK [172.16.100.2]
Reset command completed successfully on:
--> cucmsub
--> cucmsub2
Reset completed: 2 Failed: 0
Duration: 6.32 minute
Use CLI to see detail: 'file view activelog cm/trace/dbl/sdi/dbl_repl_output_util.log'
admin:
Viewing the file output created by the command shows a verification of reachability, deletion of the replicate, and redefinition of the replicate. The output notes that all subscribers should go to State 0 and then to State 2. If one or more nodes remain at State 0 for more than 4 hours, reinitiate a stop and reset on the problematic node(s); the same applies if a node reports State 4. Each Subscriber node is identified by its Replication Group ID, as shown in the output of the utils dbreplication runtimestate command in Example 5-14. Monitoring the reset progress with the utils dbreplication runtimestate command enables you to see the progression of each node through the various states. Example 5-16 shows the utils dbreplication runtimestate command output during the reset process.
Example 5-16. Monitoring Reset Progression
admin:utils dbreplication runtimestate
Server Time: Mon Jan 11 23:40:52 CST 2016

Cluster Replication State: PUB SETUP Started at 2016-01-11-23-34
     Setup Progress: 1 node(s) added to the replication network
     Setup Errors: No errors

DB Version: ccm11_0_1_20000_2
Repltimeout set to: 300s
PROCESS option set to: 1

Cluster Detailed View from cucmpub (3 Servers):

                              PING     DB/RPC/  REPL.  Replication  REPLICATION SETUP
SERVER-NAME   IP ADDRESS      (msec)   DbMon?   QUEUE  Group ID     (RTMT) & Details
-----------   ----------      ------   -------  -----  -----------  ------------------
cucmsub       172.16.100.2    0.125    Y/Y/Y    --     (-)          (-) Waiting...
cucmsub2      172.16.100.8    0.377    Y/Y/Y    --     (-)          (0) Defining...
cucmpub       172.16.100.1    0.020    Y/Y/Y    0      (g_2)        (2) Setup Completed
admin:
In Example 5-16, cucmsub is waiting for replication to begin while cucmsub2 is in a defining state (RTMT=0). The setup on the Publisher node is complete and at RTMT=2 state. The state of each subscriber should go to 0 and then to 2 after the replication reset is completed. If it does not do so on one or more nodes, more troubleshooting is needed. But, again, be patient. The output of the utils dbreplication runtimestate command will be in somewhat constant flux. Keep watching it. Example 5-17 shows the difference a 10-minute wait can make. Take a look at the subscribers’ replication states.
Example 5-17. Monitoring Replication Reset
admin:utils dbreplication runtimestate
Server Time: Mon Jan 11 23:50:10 CST 2016

Cluster Replication State: BROADCAST SYNC Started on 2 server(s) at: 2016-01-11-23-49
     Use CLI to see detail: 'file view activelog cm/trace/dbl/20160111_234950_dbl_repl_output_Broadcast.log'

DB Version: ccm11_0_1_20000_2
Repltimeout set to: 300s
PROCESS option set to: 1

Cluster Detailed View from cucmpub (3 Servers):

                              PING     DB/RPC/  REPL.  Replication  REPLICATION SETUP
SERVER-NAME   IP ADDRESS      (msec)   DbMon?   QUEUE  Group ID     (RTMT) & Details
-----------   ----------      ------   -------  -----  -----------  ------------------
cucmsub       172.16.100.2    0.086    Y/Y/Y    0      (g_4)        (0) Syncing...
cucmsub2      172.16.100.8    0.392    Y/Y/Y    0      (g_5)        (0) Syncing...
cucmpub       172.16.100.1    0.016    Y/Y/Y    0      (g_2)        (2) Setup Completed
admin:
In Example 5-17, note that both subscribers are in an RTMT=0 state, Syncing. This is excellent progress. Keep watching until all three nodes are at the RTMT=2 state, Setup Completed.
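Watching this output repeatedly is tedious, so it is natural to script the check. This sketch parses the detailed-view table of utils dbreplication runtimestate (in the format seen in Examples 5-14 through 5-17) and reports whether every node has reached the RTMT=2 state; the helper names are hypothetical:

```python
import re

def rtmt_states(runtimestate_output: str) -> dict:
    """Map each server name to its '(N)' RTMT state from the detailed view."""
    states = {}
    for line in runtimestate_output.splitlines():
        # server-name, IP address, ..., (group id), (rtmt state) details
        m = re.match(r"(\S+)\s+\d+\.\d+\.\d+\.\d+\s.*\((\S+)\)\s+\((\S+)\)", line)
        if m:
            states[m.group(1)] = m.group(3)
    return states

def replication_good(states: dict) -> bool:
    """True only when every node reports the RTMT=2 (Setup Completed) state."""
    return bool(states) and all(s == "2" for s in states.values())

# Table rows as seen in Example 5-17, mid-sync
sample = """\
cucmsub       172.16.100.2    0.086    Y/Y/Y    0      (g_4)        (0) Syncing...
cucmsub2      172.16.100.8    0.392    Y/Y/Y    0      (g_5)        (0) Syncing...
cucmpub       172.16.100.1    0.016    Y/Y/Y    0      (g_2)        (2) Setup Completed"""
```

Against the Example 5-17 snapshot, the publisher shows state 2 while both subscribers are still at 0, so replication_good correctly reports that the sync is not yet complete.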
As with the stop command, there is no “Are you sure?” prompting when you enter the reset command. So, be sure you really mean it when you press Enter.
The addition of the all parameter to utils dbreplication stop and utils dbreplication reset was quite useful. If replication is okay on all but one node, focus on that one node. As an example, the cluster in use for much of the construction of this book is a three-node cluster. If the publisher and Subscriber 1 both show Replication State 2 in RTMT while Subscriber 2 shows Replication State 4, it is worth exploring a reset of just Subscriber 2's replication.
To do so, enter utils dbreplication stop on Subscriber 2 only. Remember, there is no safety net prompt to ask if you are sure. After you press Enter, replication stops on that node. Example 5-18 shows the output from Subscriber 2.
Example 5-18. Stop Replication on Subscriber 2 Only
admin:utils dbreplication stop
********************************************************************************************
This command will delete the marker file(s) so that automatic replication setup is stopped
It will also stop any replication setup currently executing
********************************************************************************************
Deleted the marker file, auto replication setup is stopped
Service Manager is running
Commanded Out of Service
A Cisco DB Replicator[NOTRUNNING]
Service Manager is running
A Cisco DB Replicator[STARTING]
A Cisco DB Replicator[STARTING]
A Cisco DB Replicator[STARTED]
Killed dblrpc process - 14313
Completed replication process cleanup
Please run the command 'utils dbreplication runtimestate' and make sure all nodes are RPC reachable before a replication reset is executed
admin:
From the publisher, enter the utils dbreplication reset [nodename] command. For the hostname, use the hostname of Subscriber 2. The command becomes utils dbreplication reset cucmsub2. Example 5-19 shows the command output on the publisher.
Example 5-19. Reset Replication for Subscriber 2 from Publisher
admin:utils dbreplication reset cucmsub2
Repairing of replication is in progress.
Background repair of replication will continue after that for 30 minutes....
OK [172.16.100.8]
Reset completed: 0 Failed: 0
Duration: 2.9 minute
Use CLI to see detail: ‘file view activelog cm/trace/dbl/sdi/dbl_repl_output_util.log'
admin:
Example 5-20 shows the log file generated by the reset performed in the preceding example.
Example 5-20. Reset of Subscriber 2 Replication from Publisher
admin:file view activelog cm/trace/dbl/sdi/dbl_repl_output_util.log
Tue Jan 12 00:04:58 2016 replutil DEBUG: -->
Tue Jan 12 00:04:58 2016 replutil DEBUG: task to do [teardown]
Tue Jan 12 00:04:58 2016 replutil DEBUG: hostname [cucmsub2]
Tue Jan 12 00:04:58 2016 replutil DEBUG: Inside task == teardown
Tue Jan 12 00:04:58 2016 replutil.getList DEBUG: -->
Tue Jan 12 00:04:58 2016 replutil.getList DEBUG: Inside getList
Tue Jan 12 00:04:58 2016 replutil.getList DEBUG: Inside : getList(hostname) :: and if (Hostname != None )
Tue Jan 12 00:05:25 2016 replutil.getList DEBUG: <--
Tue Jan 12 00:05:25 2016 replutil DEBUG: Starting replication reset on node: cucmsub2
Tue Jan 12 00:05:25 2016 replutil.cdrDelete DEBUG: -->
Tue Jan 12 00:05:25 2016 replutil.cdrDelete DEBUG: Inside cdrDelete()
Tue Jan 12 00:05:25 2016 replutil.cdrDelete DEBUG: length of groupnameslist is [1]
Tue Jan 12 00:05:25 2016 dbllib.cdrdeleteserver DEBUG: -->
Tue Jan 12 00:05:46 2016 dbllib.cdrdeleteserver DEBUG: Executing su - informix -c "source /usr/local/cm/db/informix/local/ids.env; cdr delete server -f --connect=g_5_ccm11_0_1_20000_2 g_5_ccm11_0_1_20000_2"
Tue Jan 12 00:05:46 2016 dbllib.cdrdeleteserver DEBUG: Successfully deleted g_5_ccm11_0_1_20000_2 remotely
Tue Jan 12 00:05:47 2016 dbllib.cdrdeleteserver DEBUG: Executing su - informix -c "source /usr/local/cm/db/informix/local/ids.env; cdr delete server -f g_5_ccm11_0_1_20000_2"
Tue Jan 12 00:05:47 2016 dbllib.cdrdeleteserver DEBUG: Successfully deleted g_5_ccm11_0_1_20000_2
Tue Jan 12 00:05:47 2016 dbllib.cdrdeleteserver DEBUG: <--
Tue Jan 12 00:05:47 2016 replutil.cdrDelete DEBUG: Successfully deleted the g_5_ccm11_0_1_20000_2 server from replication network
Tue Jan 12 00:06:47 2016 replutil.cdrDelete DEBUG: <--
Tue Jan 12 00:06:47 2016 replutil DEBUG: <--
Tue Jan 12 00:06:48 2016 replutil DEBUG: -->
Tue Jan 12 00:06:48 2016 replutil DEBUG: task to do [setup]
Tue Jan 12 00:06:48 2016 replutil DEBUG: hostname [cucmsub2]
Tue Jan 12 00:06:48 2016 replutil DEBUG: Inside task == setup
Tue Jan 12 00:06:48 2016 replutil.getList DEBUG: -->
Tue Jan 12 00:06:48 2016 replutil.getList DEBUG: Inside getList
Tue Jan 12 00:06:48 2016 replutil.getList DEBUG: Inside : getList(hostname) :: and if (Hostname != None )
Tue Jan 12 00:07:23 2016 replutil.getList DEBUG: <--
Tue Jan 12 00:07:23 2016 replutil.cdrDefine DEBUG: -->
Tue Jan 12 00:07:23 2016 replutil.cdrDefine DEBUG: Inside cdrDefine method
Tue Jan 12 00:07:23 2016 replutil.cdrDefine DEBUG: Inside cdrDefine
Tue Jan 12 00:07:23 2016 replutil.cdrDefine DEBUG: val is g_5_ccm11_0_1_20000_2
Tue Jan 12 00:07:23 2016 replutil.cdrDefine DEBUG: cmd is [/usr/local/cm/bin/dbl mkrepl --delsub g_5_ccm11_0_1_20000_2]
Tue Jan 12 00:07:52 2016 replutil.cdrDefine DEBUG: <--
Tue Jan 12 00:07:52 2016 replutil DEBUG: Reset completed: 0 Failed: 0
Tue Jan 12 00:07:52 2016 replutil DEBUG: <--
end of the file reached
options: q=quit, n=next, p=prev, b=begin, e=end (lines 101 - 105 of 105) :
admin:
In Example 5-20, key aspects are highlighted. The teardown task begins, followed by the reset and redefining of the database on cucmsub2. Even though this was performed on only one node, it still took considerable time to complete. It will continue processing for some time after the command is entered. Keep monitoring until it returns to the RTMT=2 state.
If it does not change from RTMT=0 to RTMT=2 within 4 hours, check the relevant services using the utils service list command to make sure they are started. The services in question include A Cisco DB, A Cisco DB Replicator, and Cisco Database Layer Monitor. If any one of these services is not started, start it and retry the reset. If the services are all running, and connectivity shows to be good between the nodes within the cluster, a corruption may exist within the replication process itself. On the affected Subscriber node, issue the utils dbreplication stop command followed by the utils dbreplication dropadmindb command. This forces a corrupted syscdr to restart from scratch. On the publisher CLI, issue the utils dbreplication reset [nodename] command and monitor the output of the utils dbreplication runtimestate command again.
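The stuck-node recovery flow above can be sketched as a small decision helper. This is a hypothetical illustration of the logic described in this section, not a Cisco tool; the function name and parameters are assumptions:

```python
# Services that must be started for replication setup to proceed
REQUIRED_SERVICES = ("A Cisco DB", "A Cisco DB Replicator",
                     "Cisco Database Layer Monitor")

def next_action(hours_at_zero: float, services_started: dict,
                connectivity_ok: bool) -> str:
    """Decide the next troubleshooting step for a node stuck at RTMT=0."""
    if hours_at_zero < 4:
        return "wait"  # be patient; resets routinely take hours
    stopped = [s for s in REQUIRED_SERVICES if not services_started.get(s)]
    if stopped:
        return "start services and retry reset: " + ", ".join(stopped)
    if connectivity_ok:
        # Services up and network fine: likely a corrupted syscdr
        return ("on subscriber: utils dbreplication stop, then "
                "utils dbreplication dropadmindb; on publisher: "
                "utils dbreplication reset [nodename]")
    return "troubleshoot network connectivity first"
```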
Replication troubleshooting can be a tricky business on the best of days and on the smallest of clusters. We cannot stress enough that this level of troubleshooting is quite advanced and involved. On a production system, it is highly recommended that TAC be involved until such time you gain a comfort level with the commands and quirks surrounding replication. This chapter is by no means an exhaustive resource on replication. In fact, it merely scratches the surface. If you have access to a lab system and can build a multinode environment, it is exceedingly useful to work with the commands discussed in this chapter and get a feel for how the database replication process goes and the time it takes, even for a small cluster.
Tearing down and rebuilding replication, node-by-node as well as clusterwide, can provide a great deal of comfort and general know-how when it comes to understanding how Cisco Collaboration Systems behave in the best and worst of times. Don't rush the process. Fully re-establishing replication to the desired state can take a great deal of time. Above all, please be patient.
For additional information, refer to the following:
• Command Line Interface Guide for Cisco Unified Communications Solutions
• Troubleshooting CUCM Database Replication in Linux Appliance Model
• Steps to Troubleshoot Database Replication
Use these questions to review what you’ve learned in this chapter. The answers appear in Appendix A, “Answers to Chapter Review Questions.”
1. A cluster contains which of the following components?
a. Publisher
b. Subscribers
c. TFTP Server(s)
d. PBX
e. CTI Server
2. Device-specific settings are stored in a database using which of the following?
a. PostgreSQL
b. SQL
c. MySQL
d. IBM IDS
3. The master copy of the database is maintained by which of the following?
a. Subscriber
b. DBMaster
c. Publisher
d. TFTP Server
4. Which type of traffic flows among CUCM nodes in a cluster to communicate call and device state among the nodes?
a. IDS
b. ICCS
c. SCCP
d. SIP
5. CUCM subscribers obtain NTP synchronization from where?
a. Publisher
b. Cisco IOS Router
c. MS Windows Server
d. External NTP Source
6. Which of the following should never be used for CUCM NTP synchronization?
a. Publisher
b. Cisco IOS Router
c. MS Windows Server
d. External NTP Source
7. Which Replication_State indicates that replication is initializing?
a. 0
b. 1
c. 2
d. 3
e. 4
8. Which Replication_State indicates replicate creation but an incorrect count?
a. 0
b. 1
c. 2
d. 3
e. 4
9. Which Replication_State indicates replication setup failure?
a. 0
b. 1
c. 2
d. 3
e. 4
10. Which two CUCM CLI commands show current Replication_State?
a. utils dbreplication runtimestate
b. show dbreplication runtimestate
c. utils dbreplication status
d. show dbreplication status
11. Which CUCM CLI command is useful in verifying network connectivity and reachability?
a. utils network status
b. show status
c. utils network ping
d. show network eth0
12. Which CUCM CLI command is preferred for testing DNS?
a. utils network nslookup
b. utils network traceroute
c. utils diagnose test
d. utils diagnose dns
13. Which CUCM CLI command displays network time information?
a. show ntp synchronization
b. utils ntp status
c. show ntp status
d. network ntp status
14. Which command should be run only from the CUCM publisher?
a. utils dbreplication repair all
b. utils dbreplication status
c. utils dbreplication runtimestate
d. utils dbreplication diagnose
15. Which command can be issued only from the publisher and stops clusterwide replication on all nodes?
a. utils dbreplication stop
b. utils dbreplication stop all
c. utils dbreplication clusterreset
d. utils dbreplication status