Crashes and hangs can happen with any device running software, and the NetScaler is no different in this regard. While a large percentage of them get picked up during testing, the complexity involved in covering all use cases and packet combinations means that some will make their way to customers. The good news is that most are fixed by the next revision of the software. This is one of the biggest reasons to stay current in terms of NetScaler builds.
Let's start by differentiating crashes from hangs. While their impact on your application's availability can be the same, the underlying issues are very different, and how you approach them as an administrator is different as well.
A NetScaler crash can happen due to several reasons:
/var/crash
Most administrators will notice a crash in the form of an unexpected reboot or a failover. One way to verify whether the issue was due to a crash is to look for newly created files in /var/core or /var/crash.
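The check for newly created files can be done from the shell with the standard find utility. The following is a minimal sketch rather than an official Citrix tool: the function name recent_dumps and the choice of a window expressed in minutes are assumptions made for illustration.

```shell
#!/bin/sh
# recent_dumps MINUTES DIR...
# Print every regular file under the given directories that was modified
# within the last MINUTES minutes. Directories that do not exist are skipped.
# Hypothetical helper for illustration only; it is not part of NetScaler.
recent_dumps() {
    minutes="$1"
    shift
    for dir in "$@"; do
        [ -d "$dir" ] || continue
        # -mmin -N matches files modified less than N minutes ago.
        find "$dir" -type f -mmin "-$minutes" -print
    done
}
```

For example, running recent_dumps 1440 /var/core /var/crash lists anything written to either directory in roughly the last 24 hours. If the output includes files dated around the time of the reboot or failover, a crash is the likely cause.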
You will need to engage Citrix Technical Support to help identify the root cause for the crash and get advice on corrective steps, which will often involve upgrading to a build that contains the fix. To facilitate the investigation, capture the following information to share with the engineer:
- The core file from /var/core or /var/crash that matches the time of the issue
- A techsupport file
While waiting for the engagement to complete, consider reverting to an earlier build if the crash is seen immediately after an upgrade. If you are running a badly out-of-date build, consider upgrading to the latest build after checking its release notes for fixes to similar issues, or alternatively consider one of the Citrix-certified Safe Harbor builds (see the upcoming section about the various build types).
A hang is a situation where the NetScaler is stuck, either because two functions are mutually waiting on each other (a deadlock) or because a process is running in a never-ending loop. One sign of a hang is when the device appears to power up but doesn't respond to any input. There are also cases where the device continues to handle traffic while being unreachable via GUI, SSH, or console.
In the case of a hang you will not see any core dumps. A reboot will almost certainly restore access to the unit, but it should not be the first line of troubleshooting as this will result in important diagnostic information being lost. You should instead attempt to dump a core.
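Dumping a core by hand, as described next, involves collecting the PID of every packet engine from ps -aux output, which is easy to get wrong under pressure. The following is a minimal sketch of that one step. It rests on two assumptions that you should confirm against your own ps output: that packet engine processes have NSPPE in the command column, and that ps -aux uses the usual 11-column layout with the PID in column 2 and the command in column 11. The names pe_pids and PE_PATTERN are hypothetical.

```shell
#!/bin/sh
# pe_pids: read `ps -aux` output on stdin and print the PIDs of the packet
# engine processes, space-separated on a single line.
# ASSUMPTION: packet engines carry "NSPPE" in the command column (field 11).
# Set PE_PATTERN to override this if your build names them differently.
pe_pids() {
    awk -v pat="${PE_PATTERN:-NSPPE}" '
        # Match on the command column only, so that helper processes whose
        # arguments happen to contain the pattern are not picked up.
        $11 ~ pat { printf "%s%s", sep, $2; sep = " " }
        END { if (sep != "") print "" }
    '
}

# Usage on the appliance shell (shown as comments, not executed here):
#   ps -aux | pe_pids
#   kill -6 $(ps -aux | pe_pids)
```

Printing the PIDs on one line makes it easy to review them before passing them to kill -6, instead of aborting processes blindly.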
You can dump a core by aborting one of the packet engines from the console. Here is a quick set of steps, taken from the knowledge base article CTX207598, on how to do this:
1. Run the shell command pb_policy -o abort. This tells the NetScaler to dump cores if packet engines are interrupted.
2. Run ps -aux and note down the PIDs of all the packet engines.
3. Run the kill -6 command and list the PIDs of all packet engines in the command. For example, kill -6 325 326 327 328.
4. Set pb_policy back to its default by running the shell command pb_policy -d. This is important, as the abort mode of running the system is performance-intensive.
GA (General Availability) builds are available for all Citrix customers to use. This is a good thing: it means that there is a very large install base for such builds, so any issues present have a greater chance of being reported and fixed in the next iteration. GA builds are of two types:
Features introduced in .e builds make it into the maintenance builds of the next release of code. For example, features introduced in 10.5.e became available in the regular 11.0 version.
11.0 releases introduced naming changes in the form of .M builds (maintenance builds, containing bug and security fixes) and .F builds (which introduce new features). Then, there are the special builds:
A Safe Harbor build is a GA build that has been available publicly for at least six months and on which customers have reported very few issues. Citrix clearly calls out these builds on the download page so they are easy to notice. At the time of writing, the latest Safe Harbor build available is 10.5 56.22. There are no 11.0 Safe Harbor builds yet, but this is subject to change.
An NDPP build is a build for certain sectors, such as government agencies. Such organizations are governed by regulations that require the security status of networking devices (such as NetScaler) to be independently verified as conforming to a specific standard, namely the NDPP standard. This particular build has passed that standard.