Chapter 1. Introduction to Observability Tools

Bryan Cantrill’s foreword describes operating systems as “proprietary black boxes, welded shut to even the merely curious.” Bryan paints a realistic view of the not-too-distant past, when only a small portion of the software stack was visible or observable. Those attempting to understand why a system wasn’t meeting its prescribed service-level and response-time goals faced considerable complexity: the performance analyst had to work with only a small set of hardwired performance statistics, which, ironically, were chosen decades ago by kernel developers as a means of debugging the kernel’s implementation. As a result, performance measurement and diagnosis became an art of inference and, in some cases, guessing.

Today, Solaris has a rich set of observability facilities, aimed at the administrator, application developer, and operating system developer. These facilities are built on a flexible observability framework and, as a result, are highly customizable. You can liken this to the TiVo[1] revolution that transformed television viewing: Rather than being locked into a fixed set of program schedules, viewers can now watch what they want, when they want; in other words, TiVo put the viewer, rather than the program provider, in control. In a similar way, the Solaris observability tools can be targeted at specific problems, converging on what’s important to solve each particular problem quickly and concisely.

In Part One we describe the methods we typically use for measuring system utilization and diagnosing performance problems. In Part Two we introduce the frameworks upon which these methods build. In Part Three we discuss the facilities for debugging within Solaris.

This chapter previews the material explored in more detail in subsequent chapters.

Observability Tools

The commands, tools, and utilities used for observing system performance and behavior can be categorized in terms of the information they provide and the source of the data. They include the following.

  • Kernel-statistics-gathering tools. Report kstats, or kernel statistics, collected by means of counters. Examples are vmstat, mpstat, and netstat.

  • Process tools. Provide system process listings and statistics for individual processes and threads. Examples are prstat, ptree, and pfiles.

  • Forensic tools. Track system calls and perform in-depth analysis of targets such as applications, kernels, and core files. Examples are truss and MDB.

  • Dynamic tools. Fully instrument running applications and the kernel. DTrace is an example.

In combination, these utilities constitute a rich set of tools that provide much of the information required to find bottlenecks in system performance, debug troublesome applications, and even help determine what caused a system to crash—after the fact! But which tool is right for the task at hand? The answer lies in determining the information needed and matching it to the tools available. Sometimes a single tool provides this information. Other times you may need to turn detective, using one set of tools, say, DTrace, to dig out the information you need in order to zero in on specific areas where other tools like MDB can perform in-depth analysis.

Determining which tool to use to find the relevant information about the system at hand can sometimes be as confusing to the novice as the results the tool produces. Which particular command or utility to use depends both on the nature of the problem you are investigating and on your goal. Typically, a systemwide view is the first place to start (the “stat” commands), along with a full process view (prstat(1M)). Drilling down on a specific process or set of processes typically involves the use of several of these commands, along with dtrace(1M) and/or mdb(1).
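
As a rough sketch of that first pass (the five-second interval is an arbitrary, illustrative choice):

    $ vmstat 5          # memory, run queue, and summarized CPU utilization every 5 seconds
    $ mpstat 5          # per-processor utilization and event rates
    $ iostat -xnz 5     # extended per-device disk statistics, omitting idle devices
    $ netstat -i 5      # interface packet and error counts, reported per interval
    $ prstat 5          # full process view, sorted by CPU consumption

From there, the process tools, dtrace(1M), and mdb(1) described below take over.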

Kstat Tools

The kernel statistics (kstat) utilities extract information that the kernel continuously maintains in its kstat framework: counters incremented upon the occurrence of specific events, such as the execution of a system call or a disk I/O. The individual commands and utilities built on kstats can be summarized as follows. (Consult the individual man pages and the following chapters for information on the use of these commands and the data they provide.)

  • mpstat(1M). Per-processor statistics and utilization.

  • vmstat(1M). Memory, run queue, and summarized processor utilization.

  • iostat(1M). Disk I/O subsystem operations, bandwidth, and utilization.

  • netstat(1M). Network interface packet rates, errors, and collisions.

  • kstat(1M). Name-based output of kstat counter values.

  • sar(1). Catch-all reporting of a broad range of system statistics; often regularly scheduled to collect statistics that assist in producing reports on such vital signs as daily CPU utilization.

The utilities listed above extract data values from the underlying kstats and report per-second rates for a variety of system events. The exception is netstat(1M), which reports raw counts per sampling interval (the interval specified on the command line) rather than normalizing to per-second rates. With these tools, you can observe the utilization level of the system’s hardware resources (processors, memory, disk storage, network interfaces) and can track specific events systemwide, to aid your understanding of the load and application behavior.
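
The raw counters behind these tools can also be read by name with kstat(1M). A minimal sketch, using the unix:0:system_misc kstat (present on most systems; kstat -l lists what your system actually exports):

    $ kstat -n system_misc                  # process count, load averages, and other systemwide counters
    $ kstat -p unix:0:system_misc:nproc     # one statistic, in parseable module:instance:name:statistic form
    $ kstat -p -n system_misc 5             # repeat every 5 seconds; note these are raw counters, not rates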

Process Tools

Information on running processes is available with two tools and their options.

  • ps(1). Process status. List the processes on the system, optionally displaying extended per-process information.

  • prstat(1M). Process status. Monitor processes on the system, optionally displaying process and thread-level microstate accounting and per-project statistics for resource management.
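
For example, the following invocations show the thread microstate and per-project views mentioned above (the five-second interval and the process name myapp are placeholders):

    $ prstat -mL 5                      # per-thread (LWP) microstate accounting for all processes
    $ prstat -Lmp $(pgrep -x myapp)     # the same view, restricted to one process (assumes a single match)
    $ prstat -J 5                       # summarize by project, useful under resource management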

Per-process information is available through a set of tools collectively known as the ptools, or process tools. These utilities are built on the process file system, procfs, located under /proc.

  • pargs(1). Display process argument list.

  • pflags(1). Display process flags.

  • pcred(1). Display process credentials.

  • pldd(1). Display process shared object library dependencies.

  • psig(1). Display process signal dispositions.

  • pstack(1). Display process stack.

  • pmap(1). Display process address space mappings.

  • pfiles(1). Display process opened files with names and flags.

  • ptree(1). Display process family tree.

  • ptime(1). Time process execution.

  • pwdx(1). Display process working directory.
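
As a brief sketch of how these are typically strung together (myapp is a placeholder name, and pgrep(1), covered in the next list, is assumed to return a single match):

    $ pid=$(pgrep -x myapp)     # find the process of interest
    $ pargs $pid                # how was it invoked?
    $ pfiles $pid               # which files and sockets does it have open?
    $ pstack $pid               # where are its threads executing right now?
    $ pmap -x $pid              # how is its address space laid out?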

Process control is available with various ptools.

  • pgrep(1). Search for a process name string, and return the PID.

  • pkill(1). Send a kill signal or specified signal to a process or process list.

  • pstop(1). Stop a process.

  • prun(1). Start a process that has been stopped.

  • pwait(1). Wait for a process to terminate.

  • preap(1). Reap a zombie (defunct) process.
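
For example, a misbehaving process can be frozen while it is examined, then resumed (again using the placeholder PID from above):

    $ pstop $pid        # freeze the process
    $ pstack $pid       # examine it while it is stopped
    $ prun $pid         # resume execution
    $ pwait $pid        # block until the process exits
    $ preap $pid        # reap it if it is left as a zombie (defunct)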

Forensic Tools

Powerful process- and thread-level tracing and debugging facilities included in Solaris 10 and OpenSolaris provide another level of visibility into process- or thread-execution flow and behavior.

  • truss(1). Trace functions and system calls.

  • mdb(1). Debug or control processes.

  • dtrace(1M). Trace, analyze, control, and debug processes.

  • plockstat(1M). Track user-level locks in processes and threads.
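
A sketch of this process-level view follows; the PID variable and the ten-second plockstat window are illustrative choices only:

    $ truss -p $pid                  # trace the system calls of a running process
    $ truss -c -p $pid               # count system calls instead; ^C prints the summary
    $ plockstat -A -e 10 -p $pid     # user-level lock contention and hold events for 10 seconds
    $ dtrace -p $pid -n 'syscall:::entry /pid == $target/ { @[probefunc] = count(); }'
                                     # count this process's system calls by name; ^C prints the aggregation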

Several tools enable you to trace, observe, and analyze the kernel and its interaction with applications.

  • dtrace(1M). Trace, monitor, and observe the kernel.

  • lockstat(1M). Track kernel locks and profile the kernel.

  • mdb(1) and kmdb(1). Analyze and debug the running kernel, applications, and core files.
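
For instance, the following give a quick kernel-level view (these require root privileges; the 997 Hz sampling rate and five-second windows are arbitrary choices):

    # dtrace -n 'profile-997 /arg0/ { @[stack()] = count(); }'   # sample kernel stacks at 997 Hz; ^C to print
    # lockstat -kIW -D 20 sleep 5                                # profile the kernel: top 20 sampled functions over 5 seconds
    # lockstat sleep 5                                           # kernel lock contention events during a 5-second window
    # echo "::memstat" | mdb -k                                  # summarize physical memory usage from the live kernel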

Finally, several utilities read hardware performance counters, providing visibility into low-level processor and system utilization and behavior.

  • cputrack(1). Track CPU hardware counter events for a single process.

  • cpustat(1M). Track per-processor hardware counters.

  • busstat(1M). Track interconnect bus hardware counters.
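
As a sketch (counter event names are processor specific; the pic0/pic1 names shown are placeholders in the UltraSPARC style, and the -h option lists the events your CPU actually supports):

    $ cputrack -h                                      # list the counter events available on this processor
    $ cputrack -c pic0=Instr_cnt,pic1=Cycle_cnt ls     # count events while one command runs (placeholder event names)
    # cpustat -c pic0=Instr_cnt,pic1=Cycle_cnt 5       # sample the same counters systemwide every 5 seconds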

Drill-Down Analysis

To see how these tools may be used together, let us introduce the strategy of drill-down analysis (also called drill-down monitoring): begin by examining the entire system, then narrow the investigation to specific areas based on the findings. The following steps describe a drill-down analysis strategy.

  1. Monitoring. Using a system to record statistics over time. This data may reveal long-term patterns that are missed when the regular stat tools are run ad hoc. Monitoring may involve SunMC, SNMP-based tools, or sar (a crontab sketch for scheduling sar collection follows this list).

  2. Identification. Narrowing the investigation to particular resources and identifying possible bottlenecks. This step may involve the kstat and procfs tools.

  3. Analysis. Further examination of particular system areas. This step may make use of truss, DTrace, and MDB.
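
For the monitoring step, sar collection is conventionally scheduled from the sys user’s crontab; entries along the following lines ship commented out in the stock crontab and can be enabled (the times shown are the usual defaults and can be adjusted):

    0 * * * 0-6 /usr/lib/sa/sa1                               # hourly snapshot, every day
    20,40 8-17 * * 1-5 /usr/lib/sa/sa1                        # extra samples during business hours
    5 18 * * 1-5 /usr/lib/sa/sa2 -s 8:00 -e 18:01 -i 1200 -A  # daily report covering the working day

Once data is being collected, sar -u reports CPU utilization from the current daily file, and sar -u 5 12 takes twelve 5-second samples interactively.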

Note that there is no one tool to rule them all; while DTrace has the capability for both monitoring and identifying problems, it is best suited for deeper analysis. Identification may be best served by the kstat counters, which are already available and maintained.

It is also worth noting that many sites run critical applications for which additional tools are appropriate. For example, monitoring a critical Web server with ping(1M) alone may not be effective; a tool that simulates client activity while measuring response time and checking for expected content may prove far more useful.

About Part One

In this book, we present specific examples of how and when to use the various tools and utilities in order to understand system behavior and identify problems, and we introduce some of our analysis concepts. We do not attempt to provide a comprehensive guide to performance analysis; rather, we describe the various tools and utilities listed previously, provide extensive examples of their use, and explain the data and information produced by the commands.

We use terms like utilization and saturation to help quantify resource consumption. Utilization measures how busy a resource is and is usually represented as a percentage average over a time interval. Saturation is often a measure of work that has queued waiting for the resource and can be measured as both an average over time and at a particular point in time. For some resources that do not queue, saturation may be synthesized by error counts. Other terms that we use include throughput and hit ratio, depending on the resource type.

Identifying which terms are appropriate for a resource type helps illustrate its characteristics; for example, we can measure CPU utilization and CPU cache hit ratio. The appropriate terms for each resource are defined where that resource is discussed.

We’ve included tools from three primary locations; the reference location for these tools is at http://www.solarisinternals.com.

  • Tools bundled with Solaris: based on Kstat, procfs, DTrace, etc.

  • Tools from solarisinternals.com: Memtool and others.

  • Tools from Brendan Gregg: DTraceToolKit and K9Toolkit.

Chapter Layout

The next chapters on performance tools cover the following key topics:

This list can also serve as an overall checklist of possible problem areas to consider. If you have a performance problem and are unsure where to start, it may help to work through these sections one by one.



[1] TiVo was among the first digital video recorders for home use. It automatically records programs to hard disk according to the user’s viewing and selection preferences.
