Monitoring
This chapter provides preferred practices for managing day-to-day monitoring of IBM Scale Out Network Attached Storage (SONAS) cluster resources by using internal tools, the graphical user interface (GUI), and the command-line interface (CLI). You can also monitor components outside of the SONAS appliance, such as Fibre Channel switches, zone configurations, and gateway storage devices.
However, manage those components separately by using the documentation and tools that are appropriate and specific to each device. For back-end storage device monitoring, review the administration guides for the relevant storage products.
If you are new to your SONAS or IBM Storwize V7000 Unified solution, keep this chapter handy and use the daily, weekly, and monthly checklists at the end of this chapter as a guide for walking through health checks. As you become more proficient, you will find your own preferred methods of validating system health and readiness.
As a component of preferred practices, avoid becoming too comfortable with learned processes, and continue to challenge yourself to learn more about system health, logs, and code changes on all of your systems from end-to-end. Provide your suggestions to IBM through your account and technical resources so that IBM can continue to shape products to your needs.
This chapter contains the following information:
8.1, Monitoring daily work
8.2, Weekly tasks to add to daily monitoring tasks
8.3, Monthly checks for trends, growth planning, and maintenance review
8.4, Monitoring with IBM Tivoli Storage Productivity Center
8.1 Monitoring daily work
This section provides step-by-step monitoring suggestions to add to your daily, weekly, or monthly activity routines. It is a high-level overview of monitoring tools that is intended to help you confirm that everything is functioning optimally, or to point out what might need deeper investigation. It highlights relevant monitoring shortcuts and tips along the way to help you quickly assess your clusters’ status.
For more information, see the IBM SONAS Implementation Guide, SG24-7962 and Scale Out Network Attached Storage Monitoring, SG24-8207 IBM Redbooks publications.
8.1.1 Daily checks
The following list shows the health checks you should make daily:
Cluster health
Cluster node health
Cluster services health
File system capacity
File system and independent file set inode capacity
Storage pool capacity
Backup job status
Replication task status
Events list review
User quota checks
8.1.2 Health and system component check
See the daily monitoring chapter in the Scale Out Network Attached Storage Monitoring, SG24-8207 IBM Redbooks publication for step-by-step information about the daily activity routine for health and system component check.
File system capacity
When you run out of file system capacity, the only way to mitigate the situation is to delete data, or add Network Shared Disks (NSDs) to the file system.
Figure 8-1 shows how to run this check from the GUI.
Figure 8-1 GUI showing file system capacity (from the tab selection)
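If you prefer the CLI, the same capacity information can be read from any node with root access by checking the GPFS mount point directly. This is a minimal check, assuming that the file system is mounted under /ibm as in the examples later in this chapter:
# df -h /ibm/gpfs0
When the Use% value approaches 100%, it is time to delete data or add NSDs.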
Storage pool capacity
When you run out of storage pool capacity, the only way to mitigate the situation is to migrate data to a different pool, or add NSDs to that specific storage pool.
Figure 8-2 shows how to monitor storage pool capacity from the GUI.
Figure 8-2 Monitoring storage pool capacity
Figure 8-3 shows how to monitor storage pool capacity from the CLI.
[[email protected] ~]# lspool -d gpfs0 -r
EFSSG0015I Refreshing data.
Filesystem Name Size Usage Available fragments Available blocks Disk list
gpfs0 system 53.23 TB 0% 2.32 MB 53.23 TB DCS3700_360080e50002ea78c000014fb521cdbeb;DCS3700_360080e50002ea78c000014fd521cdc32;DCS3700_360080e50002ee7bc00001304521cdeaa;DCS3700_360080e50002ee7bc00001306521cdef4
EFSSG1000I The command completed successfully.
Figure 8-3 Monitoring storage pool capacity from the CLI
Node file system capacity
When the root file systems on nodes in your cluster (or your Tivoli Storage Manager server) fill up, the product can become unstable. The only way to mitigate that situation is to find what is filling the file system and compress or remove it. However, this task does require some caution. File system capacity can be checked from the CLI.
From the root directory of the node, as the root user, run the df -h command, as shown in Figure 8-4.
Filesystem Size Used Avail Use% Mounted on
/dev/sdb2 19G 11G 7.2G 59% /
tmpfs 24G 4.0K 24G 1% /dev/shm
/dev/sda1 276G 302M 261G 1% /ftdc
/dev/sdb1 9.1G 150M 8.5G 2% /persist
/dev/sdb5 103G 2.7G 95G 3% /var
/dev/sdb6 9.2G 158M 8.6G 2% /var/ctdb/persistent
/dev/gpfs0 54T 7.0G 54T 1% /ibm/gpfs0
Figure 8-4 Sample df -h output
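If the root file system or /var on a node is unexpectedly full, standard tools can help locate the largest consumers before you engage IBM Support. This is a minimal sketch, run as root, and it is intended for inspection only; do not compress or remove files without guidance from IBM Support:
# du -sk /var/* 2>/dev/null | sort -n | tail -5
# du -sk /var/log/* 2>/dev/null | sort -n | tail -5
The du -sk output is in KB, so the last lines that sort prints are the largest directories.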
Quotas
When a quota reaches a hard limit, the user cannot write to the quota-protected space unless other data is removed (to free capacity), or the quota is modified (to expand the limitation).
Figure 8-5 shows how to view the quota from the GUI.
Figure 8-5 Sortable quotas status view in the GUI
Figure 8-6 shows how to check the quota with the CLI. You can optionally use the --force option to prevent the display of the confirmation prompt. In the example, quotas are checked on the file system named gpfs0.
# chkquota gpfs0 --force
The system displays information similar to the following output:
gpfs0: Start quota check
1 % complete on Wed Dec 1 11:25:09 2010
...
100 % complete on Wed Dec 1 11:26:41 2010
Finished scanning the inodes for gpfs0.
Merging results from scan.
Figure 8-6 Forcing quota checks with the CLI
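The chkquota command recalculates quota usage. To simply list the current usage and limits without a recalculation, the lsquota CLI command can also be used. This is a minimal example; the exact columns that are displayed vary by code level:
# lsquota
Review any entries whose usage is approaching the hard limit, and either free capacity or adjust the limit (for example, with the setquota command).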
Backup scratch pool tape availability
When the backup server runs out of scratch pool tapes, it fails backups when the last writable tape is full and there is no space left to write data.
The check for the number of available scratch tapes uses the RUN Q_SCRATCH command, as shown in Figure 8-7. Run this command only from the Tivoli Storage Manager server as the dsadmin user.
Scratch Tapes
Check for the number of available scratch tapes.
 
tsm: TSM1CLIENT> RUN Q_SCRATCH
“Query_Scratch” is a user-defined server command script.
Figure 8-7 Commands run from the Tivoli Storage Manager server to check quantity of scratch tapes
Backup job status
When backup jobs are failing, the system might not have data that can be efficiently restored from tools outside of the cluster. The only way to mitigate this risk is to determine the reason why backups are failing, close the gap, and catch up with a new incremental backup. Monitoring of backup task success begins with the GUI or the CLI command lsjobstatus.
To determine whether a Tivoli Storage Manager backup session is running, run the lsjobstatus -j backup -r command. See Example 8-1 on page 265.
Example 8-1 Example of the output for the lsjobstatus -j backup -r command
File system Job Job id Status Start time End time/Progress RC Message
gpfs0 backup 34 running 10/17/12 8:13:49 AM MST backing up 776/602200 Errors=0 Expire=562/569
EFSSG1000I The command completed successfully
The lsjobstatus command shows the running and completed jobs for a file system, and the primary node where the job started. By default, only the running jobs are shown, but this display can be modified with the --done and --all options. The command lists the start and end times for the jobs. See Example 8-2.
Example 8-2 Sample lsjobstatus output with common options
[[email protected] ~]# lsjobstatus -j backup --all
File system Job Job id Status Start time End time/Progress RC Message
gpfs0 backup 0 done(+) 2/3/12 4:50:30 PM EST 2/3/12 4:56:50 PM EST 0 EFSSG1000I
........
gpfs0 backup 854 done(-) 8/25/13 2:00:05 AM EDT 8/25/13 2:21:46 AM EDT 1 Errors=51 Expire=7962/7962 Backup=559/965 EFSSA0797C Backup partially failed. Please check the logs for details.
gpfs0 backup 855 done(-) 8/26/13 2:00:03 AM EDT 8/26/13 2:27:53 AM EDT 1 Errors=31 Expire=26/26 Backup=379/759 EFSSA0797C Backup partially failed. Please check the logs for details.
gpfs0 backup 859 done(-) 8/27/13 2:00:05 AM EDT 8/27/13 6:30:38 AM EDT 1 Errors=32 Expire=27996/27996 Backup=52300/50442 EFSSA0797C Backup partially failed. Please check the logs for details.
gpfs0 backup 860 running 8/28/13 2:00:05 AM EDT Errors=6102 Expire=266610/266610 Backup=447529/452892 Errors=6102 Expire=266610/266610 Backup=447529/452892
EFSSG1000I The command completed successfully.
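When the CLI is run from a node shell, the output can be filtered with standard tools so that only problem runs are displayed. This is a minimal sketch that relies on the "partially failed" wording shown in the previous example; adjust the pattern if your code level words the message differently:
# lsjobstatus -j backup --all | grep -i "partially failed"
Any matching lines point you to the backup logs, which you can review with the showlog and showerrors commands that are described later in this chapter.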
Check replication tasks
When replication fails, the changes from the last incremental backup get larger and larger as time goes by, extending the time that it takes to complete the corrective action and the next incremental asynchronous replication. This leaves your system more vulnerable to being inconsistent if you have a disaster recovery requirement. The only way to mitigate the situation is to repair the issues that are causing failure, and catch up with the next incremental replication.
To display the status of the selected asynchronous replication in the management GUI complete the following tasks:
1. Log on to the SONAS GUI.
2. Select Copy Services → Replication.
3. Right-click an asynchronous replication, and then select Details.
In the CLI, use the lsrepl command to display the status of asynchronous replications, as shown in Example 8-3.
Example 8-3 Sample lsrepl output display
$ lsrepl
Filesystem Log ID Status Description Last Updated Time
gpfs0 20120217023214 FINISHED 8/8 Asynchronous replication process finished 2/17/12 02:33 AM
gpfs0 20120221004341 FINISHED 8/8 Asynchronous replication process finished 2/21/12 00:45 AM
gpfs0 20120217023214 FINISHED 8/8 Asynchronous replication process finished 2/21/12 01:33 AM
gpfs0 20120217023116 FAILED The replication was aborted due to a critical error.
Please use '--clearlock' option the next time, to remove the lock file. 2/21/12 03:44 AM
gpfs0 20120221004341 FINISHED 8/8 Asynchronous replication process finished 2/21/12 04:45 AM
gpfs0 20120221004341 FINISHED The replication completed successfully through failover recovery of node pairs. 2/21/12 05:45 AM
EFSSG1000I The command completed successfully.
You can use the --status option to display the status of processes that are involved in the asynchronous replications. This option can be used to investigate a node name to be used in the runreplrecover command. By default, this option displays the number of active (running), available (not banned), and banned replication processes for each node, as shown in the output in Example 8-4.
Example 8-4 lsrepl sample output with the --status option
[email protected] ~]$ lsrepl gpfs0 --status
Filesystem: gpfs0
Log ID: 20120429064041
Source Target Active Procs Available Procs Total Procs
int001st001 10.0.100.141 2 3 3
int002st001 10.0.100.143 3 3 3
You can use the --process option to display the status of active (running), available (not banned), and banned replication processes for each node. See Example 8-5.
Example 8-5 lsrepl sample output with the --process and the --status option
[[email protected] ~]$ lsrepl gpfs0 --status --process
Filesystem: gpfs0
Log ID: 20120429064041
Index Source Target Repl Status Health Status
1 int002st001 10.0.100.143 active available
2 int002st001 10.0.100.143 active available
3 int002st001 10.0.100.143 active available
4 int001st001 10.0.100.141 inactive available
5 int001st001 10.0.100.141 active available
6 int001st001 10.0.100.141 active available
You can use the --progress option to display information about the transferred size, progress, transfer rate, elapsed time, remaining size, and remaining time for each process running in the asynchronous replication. See Example 8-6.
Example 8-6 lsrepl sample output with the --progress option
[[email protected] ~]$ lsrepl gpfs0 --progress
Filesystem: gpfs0
Log ID: 20120429064041
Mon Apr 16 07:56:46 CEST 2012
interval 1 sec, remain loop: 26
display rsync progress information
================================================================================
 
PROC #: NODE-PAIR <HEALTH_STATUS>
FILE-PATH
FILE: XFER-SIZE(TOTAL) PROG(%) XFER-RATE ELAPSED
REMAIN(TIME)
--------------------------------------------------------------------------------
Proc 1: int002st001->10.0.100.144 <available>
dir/file3 65,536,000(500.00MB) 12.50% 10.79MB/s 0:00:07
437.50MB(0:00:41)
Proc 2: int001st001->10.0.100.143 <available>
dir/file4 98,435,072(500.00MB) 18.77% 7.16MB/s 0:00:10
406.12MB(0:00:58)
Proc 3: int003st001->10.0.100.145 <available>
dir/file5 75,202,560(500.00MB) 14.34% 6.51MB/s 0:00:08
428.28MB(0:01:07)
Proc 4: mgmt002st001->10.0.100.141 <available>
dir/file1 43,548,672(500.00MB) 8.31% 6.74MB/s 0:00:06
458.46MB(0:01:09)
Proc 5: mgmt001st001->10.0.100.142 <available>
dir/file2 115,736,576(500.00MB) 22.07% 9.50MB/s 0:00:13
389.62MB(0:00:42)
--------------------------------------------------------------------------------
Overall Progress Information: 0 of 8 files comp
XFER-SIZE(TOTAL) PROG(%) XFER-RATE ELAPSED REMAIN(TIME)
   80MB( 2.45GB) 15.09% 41.36MB/s 0:00:10 2.08GB(0:01:06)
Monitoring health from the CLI
See the chapter about monitoring as a daily administration task from Scale Out Network Attached Storage Monitoring, SG24-8207 for CLI commands that can be used to monitor
the cluster.
SONAS log review
SONAS log review is important for understanding the status of the system and diagnosing any issues. With a quick search for logs in the SONAS section of the IBM Knowledge Center, you can find the commands for monitoring the different logs. That material also explains the logging system and provides information about monitoring logs from other perspectives.
The Using IBM SONAS logs topic in the SONAS section of the IBM Knowledge Center is a good starting point.
It can also be helpful to become familiar with common log entries when there are no issues. This can help you avoid spending time looking at entries that do not indicate problems.
The showlog CLI command is useful for troubleshooting or validating job or task success. It can be used to show the log file for the specified job. Example 8-7 shows common uses of the showlog command.
Example 8-7 Common use of the showlog command
showlog 15
- Shows the log for job with jobID 15.
 
showlog backup:gpfs0
- Shows the backup log for the latest backup job done for file system gpfs0.
 
showlog 15 -count 20
- Shows only last 20 lines of the log for job with jobID 15.
 
showlog backup:gpfs0 -t 03.05.2011 14:18:21.184
- Shows the backup log taken of file system gpfs0 at date and time specified.
Example 8-8 shows common uses of the showerrors CLI command.
Example 8-8 Sample syntax of the showerrors command
showerrors 15
- Shows the error log for job with jobID 15.
 
showerrors backup:gpfs0
- Shows the backup error log for the latest backup job done for file system gpfs0.
 
showerrors 15 -count 20
- Shows only the last 20 lines of the error log for job with jobID 15.
 
showerrors backup:gpfs0 -t 03.05.2011 14:18:21.184
- Shows the backup error log taken of file system gpfs0 at the date and time specified.
Output from the lsaudit CLI command
Audit logs help to track the commands that are run in the GUI and the CLI by the SONAS administrators on the system. Figure 8-8 shows sample output from the lsaudit command.
Figure 8-8 The lsaudit command output
Audit logs are valuable for tracking what administrators run from the GUI or the CLI, and when they run it. One of the reasons that administrators should use personal accounts with administrator privileges, rather than the default admin account, is to ensure that the audit log identifies the person who issued a command rather than a group that shares the account. You can also download the audit logs from the system in the Support view in the Settings menu, as shown in Figure 8-9.
Figure 8-9 Image of the log download from the Support menu
Output from the lsjobstatus CLI command
The lsjobstatus command displays the status of jobs that are currently running or already finished. For example, you can check whether an ILM policy or backup job completed, or whether it completed with errors. See Figure 8-10.
Figure 8-10 The lsjobstatus --all command
The lsjobstatus command can be used to find the Job ID of a specific process. This information can be used to get more details about any specific job or task. It can help identify trace information that is reported when you are using the showlog command.
The lsjobstatus command can be used to list the running and completed backups. Specify a file system to show the completed and running backups for only that file system as shown in Example 8-9. The output of the lsjobstatus CLI command shows only backup session results for the past seven days.
Example 8-9 List the backups for file system gpfs0
[[email protected] ~]# lsjobstatus gpfs0
Filesystem Date Message
gpfs0 20.01.2010 02:00:00.000 EFSSG0300I The filesystem gpfs0 backup started.
gpfs0 19.01.2010 16:08:12.123 EFSSG0702I The filesystem gpfs0 backup was done successfully.
gpfs0 15.01.2010 02:00:00.000 EFSSG0300I The filesystem gpfs0 backup started.
Output from the lsbackupfs CLI command
The lsbackupfs CLI command displays backup configurations as shown in Example 8-10. For each file system, the display includes the following information:
The file system
The Tivoli Storage Manager server
The interface nodes
The status of the backup
The start time of the backup
The end time of the most recently completed backup
The status message from the last backup
The last update
Example 8-10 Example of a backup that started on 1/20/2010 at 2 a.m.
# lsbackupfs
File system TSM server List of nodes Status Start time End time Message Last update
gpfs0 SONAS_SRV_2 int001st001,int002st001 RUNNING 1/20/10 2:00 AM 1/19/10 11:15 AM INFO: backup successful (rc=0). 1/20/10 2:00 AM
Monitoring capacity
See the chapter about monitoring as a daily administration task from Scale Out Network Attached Storage Monitoring, SG24-8207 for information about monitoring capacity-related tasks.
8.1.3 Monitoring inode use and availability
See the chapter about monitoring as a daily administration task from Scale Out Network Attached Storage Monitoring, SG24-8207 for information about monitoring inode use and availability.
8.2 Weekly tasks to add to daily monitoring tasks
Tip: Every file system has a root file set, and every independent file set has its own inode space; capacity and inode consumption (including snapshot and cache data) are reported per file set. When a file system or independent file set runs out of inodes, writes to it stop. In that case, the chfs or chfset command can be used to increase the maximum and preallocated inode values.
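The Tip above names the SONAS CLI chfs and chfset commands; their inode-related option names vary by code level, so check the command help on your system. With root access, the same change can be made at the GPFS layer. This is a minimal sketch, assuming file system gpfs0 and a hypothetical independent file set named projects; the first value is the new maximum number of inodes, and the optional second value is the number to preallocate:
# mmchfs gpfs0 --inode-limit 3000000:1500000
# mmchfileset gpfs0 projects --inode-limit 500000:250000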
Along with daily tasks, you can add the following things to check weekly (preferably before the weekend):
All daily checks
SONAS cluster node root file system capacity
SONAS audit log review
SONAS storage node NSD resource and workload balance analysis
SONAS interface node network saturation levels
SONAS back-end storage health reviews (gateway solutions)
Tivoli Storage Manager server root file system capacity
Tivoli Storage Manager server scratch tape count
Tivoli Storage Manager server error and log review
Open technical service request progress validations
8.2.1 Cluster node root file system capacity
When the root file system of one of the nodes fills up, that node can crash or become unstable. Occasional monitoring of this capacity can prevent surprises. To mitigate an issue before it occurs, preemptively review (with IBM Support) any node file systems that are at dangerously high watermarks, so that you understand the preferred options for freeing the required capacity. See Example 8-11.
Example 8-11 Sample output of cndsh df -h command run on every cluster node
[[email protected] ~]# cndsh df -h
mgmt001st001: Filesystem Size Used Avail Use% Mounted on
mgmt001st001: /dev/sdb2 19G 11G 7.2G 59% /
mgmt001st001: tmpfs 24G 4.0K 24G 1% /dev/shm
mgmt001st001: /dev/sda1 276G 302M 261G 1% /ftdc
mgmt001st001: /dev/sdb1 9.1G 150M 8.5G 2% /persist
mgmt001st001: /dev/sdb5 103G 2.7G 95G 3% /var
mgmt001st001: /dev/sdb6 9.2G 158M 8.6G 2% /var/ctdb/persistent
mgmt001st001: /dev/gpfs0 54T 7.0G 54T 1% /ibm/gpfs0
mgmt002st001: Filesystem Size Used Avail Use% Mounted on
mgmt002st001: /dev/sda2 19G 9.3G 8.2G 54% /
mgmt002st001: tmpfs 24G 4.0K 24G 1% /dev/shm
mgmt002st001: /dev/sda7 103G 222M 98G 1% /ftdc
mgmt002st001: /dev/sda1 9.1G 150M 8.5G 2% /persist
mgmt002st001: /dev/sda5 103G 2.4G 96G 3% /var
mgmt002st001: /dev/sda6 9.2G 158M 8.6G 2% /var/ctdb/persistent
mgmt002st001: /dev/gpfs0 54T 7.0G 54T 1% /ibm/gpfs0
strg001st001: Filesystem Size Used Avail Use% Mounted on
strg001st001: /dev/sda2 19G 5.2G 13G 30% /
strg001st001: tmpfs 3.8G 0 3.8G 0% /dev/shm
strg001st001: /dev/sda7 103G 213M 98G 1% /ftdc
strg001st001: /dev/sda1 9.1G 150M 8.5G 2% /persist
strg001st001: /dev/sda5 103G 1.8G 96G 2% /var
strg001st001: /dev/sda6 9.2G 149M 8.6G 2% /var/ctdb/persistent
strg002st001: Filesystem Size Used Avail Use% Mounted on
strg002st001: /dev/sda2 19G 5.2G 13G 30% /
strg002st001: tmpfs 3.8G 0 3.8G 0% /dev/shm
strg002st001: /dev/sda7 103G 213M 98G 1% /ftdc
strg002st001: /dev/sda1 9.1G 150M 8.5G 2% /persist
strg002st001: /dev/sda5 103G 1.7G 96G 2% /var
strg002st001: /dev/sda6 9.2G 149M 8.6G 2% /var/ctdb/persistent
8.2.2 Audit log review
When working with a team of storage administrators on a large SONAS solution, it can be helpful to review the work that is done by the team by auditing the logs weekly. This information can be useful for cross training, measuring trends of common work patterns, or evaluating training opportunities or process improvements.
Audit logs can be captured from either the GUI or the CLI. The following examples explain how this information can be obtained.
Figure 8-11 shows how to review the audit logs with the GUI.
Figure 8-11 GUI access to the Admin Audit Log
Example 8-12 shows CLI access to the audit log by using the lsaudit CLI command.
Example 8-12 CLI access to the audit log
[[email protected] ~]# lsaudit
INFO : 01.08.2013 21:29:27.353 root@unknown(60715) CLI mkusergrp Administrator --role admin RC=0
INFO : 01.08.2013 21:29:27.699 root@unknown(60715) CLI mkusergrp SecurityAdmin --role securityadmin RC=0
INFO : 01.08.2013 21:29:28.022 root@unknown(60715) CLI mkusergrp ExportAdmin --role exportadmin RC=0
INFO : 01.08.2013 21:29:28.351 root@unknown(60715) CLI mkusergrp StorageAdmin --role storageadmin RC=0
INFO : 01.08.2013 21:29:28.679 root@unknown(60715) CLI mkusergrp SystemAdmin --role systemadmin RC=0
INFO : 01.08.2013 21:29:29.008 root@unknown(60715) CLI mkusergrp Monitor --role monitor RC=0
INFO : 01.08.2013 21:29:29.333 root@unknown(60715) CLI mkusergrp CopyOperator --role copyoperator RC=0
INFO : 01.08.2013 21:29:29.656 root@unknown(60715) CLI mkusergrp SnapAdmin --role snapadmin RC=0
INFO : 01.08.2013 21:29:30.003 root@unknown(60715) CLI mkusergrp Support --role admin RC=0
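For a weekly review, the audit log can be filtered with standard shell tools when lsaudit is run from a node shell. This is a minimal sketch that relies on the date format and command names shown in the output above; substitute your own dates, user IDs, and command names:
# lsaudit | grep "01.08.2013"
# lsaudit | grep -i "mkusergrp"
Filtering on a personal administrator account rather than a shared one makes it clear who ran each command, which reinforces the value of personal administrator accounts described earlier in this chapter.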
8.2.3 Monitoring performance and workload balance
When the performance becomes sluggish, or complaints are made about performance, your first response should be to capture a clear and concise articulation of the following information:
Who is complaining
What they are complaining about
What details summarize the user, client, share, storage, data, and timing
An accurate description of the complaint
If you understand the performance indicators of normal behavior, you can more quickly recognize the points of interest in serious deviations. More detail is provided later in this chapter.
Performance monitoring is always a complex topic. Several components offer a simplified, high-level reference. However, realize that a client issue can sometimes originate at the client system itself, so it can be advantageous to develop methods or checkpoints for quickly validating client concerns.
The easiest way to evaluate the SONAS system performance is to use the built-in Performance Center in SONAS. It collects the status of SONAS environment components every second, and enables you to look back on system performance averages historically.
The Performance Center is provided in both the GUI and CLI, and they use the same status log data when it is collected. You can view performance graphs of a specific part of your system in near real-time for cluster operations, front-end and back-end devices, and protocol services.
In cases of more complex configurations, such as SONAS gateways, it can be useful to use a centralized monitoring system, such as Tivoli Storage Productivity Center, to have a whole view of the storage environment.
Performance monitoring is somewhat basic for NAS systems in Tivoli Storage Productivity Center. However, it facilitates centralized monitoring of the block storage systems, the SAN network devices, and the SONAS gateway solution. These tools can chart relevant drops in performance that can help triangulate root cause indicators.
8.2.4 Performance monitoring with the SONAS GUI
See the daily monitoring information in the Scale Out Network Attached Storage Monitoring, SG24-8207 IBM Redbooks publication for information about performance monitoring-related tasks.
In addition to these tasks, in some cases, privileged user access can be authorized by support for advanced diagnostic review.
Top
Top is a simple and useful tool for monitoring the workload and performance on the interface nodes, storage nodes, and clients. If you run the top command on either interface nodes or storage nodes, you can see how hard the nodes are working, and which daemons or processes are running at the top of the list, as shown in Figure 8-12.
Figure 8-12 Top command output
For example, the report provides the following information:
The mmfsd activity shows the GPFS daemon on every node.
The smbd activity shows heavy CIFS load on the front end.
The nfsd activity shows heavy NFS load on the front end.
Front-end performance
Ensure that as many interface nodes as possible are sharing the network, and that all subnet ports are up. The easiest way to check these configurations is the lsnwinterface command.
The lsnwinterface -x command
Figure 8-13 shows an example of the lsnwinterface -x command.
Figure 8-13 lsnwinterface output
The lsnwinterface command shows whether the Internet Protocol (IP) addresses that are assigned for external client access are evenly distributed across all interface nodes.
The lsnwinterface -x command additionally shows whether all subordinate ports on your network bonds are up, and whether the distribution of IPs is evenly balanced.
Make sure that the interfaces are up and that the IP addresses are balanced across all of the active interface nodes.
In the next step, you can go deeper in your monitoring methods. With root access, you can use non-SONAS-specific commands, such as dstat, to monitor the front-end performance, as shown in Figure 8-14.
The dstat -a 1 command
Figure 8-14 shows an example of the dstat command output.
Figure 8-14 Example dstat command output
If you run this command on each interface node in parallel, you can compare the use of nodes. This command shows information about processor use, disk, network (send and receive), paging, and system type information.
The send and receive data can show whether any interface node is more heavily loaded by clients. If you have multiple IPs on each node, you can move one of the IPs to a different node that is less busy.
 
Tip: If this view shows that you have reached the physical limit of network interfaces, you might need either more NICs on your interface nodes or more interface nodes to maximize your network capabilities. Low processor idle statistics can also indicate that you might not have enough interface nodes to manage the current workload.
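To run this comparison across several nodes in one step, the cndsh distributed shell that is used elsewhere in this chapter can wrap the same command. This is a minimal sketch, assuming that dstat is installed on every node; the per-node output is interleaved, so a short, bounded run (here, five one-second samples) is easiest to read:
# cndsh dstat -a 1 5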
Back-end performance
To monitor the back-end system performance, one commonly used tool is the iostat command. It also requires root access on the SONAS system:
iostat -xm /dev/dm* 1
This command shows the dynamic multipath device activity (reads, writes, queue-size, I/O wait, and device use), as shown in Figure 8-15.
Figure 8-15 iostat -xm /dev/dm* 1 command output
In this example, you should see a fairly even I/O pattern across the four NSDs that work together in the target file system.
If the utilization of the devices is fairly high, it might be an indication that you have not provided enough spindles behind the NSDs. Alternatively, it might mean that you have not provided enough NSDs to distribute this workload, or that the file system's NSD or parameter settings are not optimal for the I/O pattern of the client workload.
 
Remember: If the percent utilization and average queue size are high on the NSDs of a file system, it is a good indicator that more disks should be added to the file system.
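To relate the dm-* multipath devices in the iostat output to GPFS NSD names, and therefore to file systems and storage pools, the underlying GPFS command mmlsnsd can be used with root access. This is a minimal sketch; the SONAS CLI lsdisk command presents similar per-NSD information (the columns vary by code level):
# mmlsnsd -m
# lsdisk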
Weekly checkups on the Tivoli Storage Manager server
Run the following weekly checkups on the Tivoli Storage Manager server to monitor scheduled operations and to ensure that client and server scheduled operations are completing successfully, and that no problems exist:
Tivoli Storage Manager Server root file system capacity
Ensure that the Root, Database, and Log directories do not fill up.
Tivoli Storage Manager Server error log review and scratch tape count
Client events
Check that the backup and archive schedules did not fail using the following Tivoli Storage Manager command. This check is only for Tivoli Storage Manager server initiated schedules:
tsm: TSM1URMC> Query Event * * BEGINDate=-1 BEGINTime=17:00 Format=Detailed
In this case, the first asterisk (*) is for the domain name. The second asterisk (*) is for the schedule name.
Administrative events
Check that the administrative command schedules did not fail using the following Tivoli Storage Manager command:
tsm: TSM1URMC> Query Event * Type=Administrative BEGINDate=TODAY Format=Detailed
The asterisk (*) is for the schedule name.
Scratch tapes
Check the number of available scratch tapes using the following Tivoli Storage Manager command. Note that Query_Scratch is a user-defined server command script.
tsm: TSM1URMC> RUN Q_SCRATCH
Read-only tapes
Check for any tapes with access of read-only using the following Tivoli Storage Manager command. If any tapes are in read only (RO) mode, check the Tivoli Storage Manager Server activity log for related errors:
tsm> Query VOLume ACCess=READOnly
Unavailable tape
Check for any tapes with access of unavailable using the following Tivoli Storage Manager command. If any tapes are in this mode, check the Tivoli Storage Manager Server activity log for related errors and take appropriate actions:
tsm: TSM1URMC> Query VOLume ACCess=UNAVailable
Open technical service request progress validations
Any SONAS-related technical service calls on the NAS solution or its clients should be logged and tracked for team review, knowledge transfer, and post-event process improvement. With every platform of complex technology, the quickest road to improvement comes from study of failures, issues, and events.
8.3 Monthly checks for trends, growth planning, and maintenance review
The following are monthly checks you should perform on your SONAS for monitoring and planning purposes:
File system growth trend analysis
File set growth trend analysis
File system storage pool growth trend analysis:
 – Capturing growth trends and analyzing storage pool and file set growth is important for planning incremental growth. This information is obtainable from cndump files.
 – Capturing dumps once a month and charting valuable statistics from the data can help you understand trends, which prevents unplanned surprises and off-hours calls to service. (A minimal capture sketch follows this list.)
 – The output of the cndump is a compressed collection of log files and command outputs that support services use to analyze your SONAS cluster health. Take the time to become familiar with its output and value for collecting the previously mentioned points of interest. A monthly review schedule seems adequate for most high-end clients. It might also be a good tool for reviewing these trends with your technical advisors or IBM account teams for planning future growth demands.
System performance trend analysis
Cataloging some basic performance watermarks and trends from your weekly performance reviews adds value to your monthly report and trend analysis. Consider expansion if and when you near saturation points on the front-end or back-end devices in your SONAS cluster.
User and client satisfaction survey
At the end of the day, the clients who use the services day-in and day-out are the best resources for consistent feedback on solution satisfaction. It might be beneficial to establish a high-level, simplified service satisfaction survey for your client base that can be reviewed on a monthly or semiannual basis. Any sense of dissatisfaction can be further escalated and better understood for focused, effective response.
IBM technical advisors and other vendor bug and patch advisory meetings
Most IBM account teams are happy to meet monthly to discuss or review solution feedback, growth analysis, needs changes, and so on. In many cases, clients have assigned technical advisors that actively follow their needs and product developments to keep both sides well-informed.
Maintenance and process improvement considerations review:
 – Internal NAS team meetings are a common place for reviewing monthly trend analysis, cluster growth trend maintenance plans, and to discuss ideas for process improvements.
 – A strong, informed team with continuous process improvement, and a stable NAS solution with adequate capacity, are the keys to success with a Scale Out NAS requirement.
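As a lighter-weight complement to the cndump review that is mentioned above, a small monthly capture of capacity data makes trend charting easy. This is a minimal sketch, assuming root access on the active management node, a writable target directory of your choosing (/ftdc/trend here is only an example), and the file system name gpfs0; adjust the names to match your environment:
# mkdir -p /ftdc/trend
# lspool -d gpfs0 -r > /ftdc/trend/lspool-gpfs0-$(date +%Y%m%d).txt
# df -h /ibm/gpfs0 > /ftdc/trend/df-gpfs0-$(date +%Y%m%d).txt
Comparing these files month over month shows pool and file system growth rates, which feed directly into growth planning discussions with your IBM account team.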
8.4 Monitoring with IBM Tivoli Storage Productivity Center
For a complete guide to using Tivoli Productivity Center to monitor SONAS, see the Scale Out Network Attached Storage Monitoring, SG24-8207 IBM Redbooks publication.
8.4.1 Summary for monitoring SONAS daily, weekly, and monthly
This checklist summarizes the preferred practices for SONAS monitoring:
Daily Checks
o Cluster health
o Cluster node health
o Cluster services health
o File system capacity
o File system and independent file set inode capacity
o Storage pool capacity
o Backup job status
o Replication task status
o Events list review
o User quota checks
Weekly Checks (preferably prior to weekend)
o All daily checks
o SONAS cluster node root file system capacity
o SONAS audit log review
o SONAS storage node NSD resource and workload balance analysis
o SONAS interface node network saturation levels
o SONAS back-end storage health reviews (gateway solutions)
o Tivoli Storage Manager server root file system capacity
o Tivoli Storage Manager server scratch tape count
o Tivoli Storage Manager server error and log review
o Open technical service request progress validations
Monthly checks (trends, growth planning, and maintenance review)
o File system growth trend analysis
o File set growth trend analysis
o File system storage pool growth trend analysis
o System performance trend analysis
o User and client satisfaction survey
o IBM TA and other vendor bug and patch advisory meetings
o Process improvement considerations review
o Maintenance and project planning meetings
 