Troubleshooting and support
This chapter explains how to open a problem management record (PMR) for software and hardware support, and how to use the IBM Scale Out Network Attached Storage (SONAS) section of the IBM Knowledge Center for troubleshooting guidance. It also explains how to prepare for an efficient service escalation so that you get the correct support attention quickly, and how to work with IBM teams, such as technical advisors and development, when service challenges become critical.
Also see the Start here for troubleshooting topic in the IBM Knowledge Center:
This chapter describes the following topics:
9.1 Creating PMRs
In any case where there is a technical product issue, a problem ticket is required for official service and troubleshooting assistance. This ticket is called a PMR, and it is a requirement to open one to get assistance with diagnostics in SONAS. IBM uses the PMR system to track support activity for client-owned IBM products and applications. SONAS call home services might automatically create a PMR, or you can manually create a PMR to track maintenance issues and resolutions from onset to close.
In some cases, the SONAS support team opens a code defect in parallel with the PMR activity if the problem is suspected to be a coding defect, if an anomaly is discovered, or if an error exists in the documentation. This section walks you through the process to create a hardware PMR and a software PMR, and provides the phone numbers and sample scenarios to guide you through the process.
As issues develop, it is the client’s responsibility to set and communicate the appropriate severity level. Clients are encouraged to keep their assigned technical advisor informed of issues as they arise to ensure cooperative awareness, timely escalation and resolution, and proper risk mitigation. The IBM team that is assigned to every SONAS installation is committed to client success, and can escalate to any level to maintain the highest level of customer satisfaction.
Note that in the case of a SONAS that is installed as a gateway with separate storage (such as IBM XIV, IBM Storwize V7000, or IBM DS8000), there might be a need to establish a PMR for either the SONAS stack or the storage stack independently. The products in a gateway configuration are separate, and are not bundled under the same warranty or serial number.
9.1.1 Software PMR
Complete the following steps to open a software PMR:
1. Call IBM at 1-800-IBM-SERV, which is 1-800-426-7378.
You are prompted to respond if you are calling about a Lenovo product or an IBM product:
 – Press 1 for a Lenovo product.
 – Press 2 for an IBM product.
In this scenario, press 2.
2. You are asked if you have a five-digit premium service number. This number does not apply in this case.
3. You are asked if it is a hardware or software problem:
 – Press 1 for a hardware problem.
 – Press 2 for a software problem.
In this scenario, press 2.
4. You are asked if the system is IBM AIX or Other:
 – Press 1 for AIX.
 – Press 2 for Other.
In this scenario, press 2.
5. You are transferred to a live representative, who asks for the following information:
 – Customer ID, which is your 7-digit customer number.
 – Component ID, which is 5639SN100 for SONAS.
 – Software version of SONAS that you are running, for example 1.3.1.1. If you do not know the version, the PMR can still be opened without it.
 – Your name, phone number, and email address.
 – Severity of the problem.
A PMR is opened based on the information you provide.
9.1.2 Hardware PMR
The following steps take you through the prompts to open a hardware PMR.
You go through the same phone steps as in 9.1.1, “Software PMR” on page 280, except that you select 1 for hardware. You then provide the machine type 2851 and the serial number of any of your SONAS servers:
1. Call IBM at 1-800-IBM-SERV, which is 1-800-426-7378.
You are prompted to respond if you are calling about a Lenovo product or an IBM product:
 – Press 1 for a Lenovo product.
 – Press 2 for an IBM product.
In this scenario, press 2.
2. You are asked if you have a five-digit premium service number. This number does not apply in this case.
3. You are asked if it is a hardware or software problem:
 – Press 1 for a hardware problem.
 – Press 2 for a software problem.
In this scenario, press 1.
4. You are asked if the system is AIX or Other:
 – Press 1 for AIX.
 – Press 2 for Other.
In this scenario, press 2.
5. You are transferred to a live representative, who asks for the following information:
 – Customer ID, which is your 7-digit customer number
 – Machine type, which is 2851, and a serial number for any of the SONAS servers
 – Your name, phone number, and email address
 – Severity of the problem
You can easily get the serial number by using the lsnode -v command as shown in Figure 9-1 on page 282. Notice that under the Serial Number column for mgmt001st001, you see KQRGMZV.
The machine type and serial number are required to pass hardware entitlement. After entitlement is verified, IBM opens a hardware PMR, which is routed to SONAS L1 support.
If you need an IBM service support representative (SSR) dispatched on site, the request must always be made through a hardware PMR. An SSR cannot be dispatched by using a software PMR.
Serial numbers of the SONAS nodes can be obtained with the lsnode -v command, as shown in Figure 9-1.
Figure 9-1 Use the lsnode -v command output to display a SONAS serial number for a PMR
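The figure is not reproduced here in text form, so the following is a minimal sketch of the relevant part of the lsnode -v output (the columns are abridged, and all values except the KQRGMZV serial number mentioned above are illustrative):
[furby.storage.tucson.ibm.com]$ lsnode -v
Hostname      ...  Serial number  ...
mgmt001st001  ...  KQRGMZV        ...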
9.2 Network troubleshooting
This section describes methods to gather diagnostic information about your SONAS network.
9.2.1 Checking network interface availability
You have several options for checking network availability by using the IBM SONAS graphical user interface (GUI) or the command-line interface (CLI):
1. Complete the following steps to use the GUI:
a. In the GUI, select Settings → Network → Public Network Interface. The status should be up for each network interface for all the nodes, as shown in Figure 9-2.
Figure 9-2 Public Network Interfaces status check
b. By expanding each of the network interfaces, you can view the status and other properties of the subordinate interface.
2. In the CLI, check the status of the ethX0 interface (the bonded interface that connects the interface nodes to the customer network):
a. Open the CLI.
b. Use the lsnwinterface command to display the status of the interfaces and their assigned Internet Protocol (IP) addresses.
The lsnwinterface -x command shows the port status, including the subordinate ports, as shown in Example 9-1.
In the Up/Down column, UP indicates a valid connection.
Node int001st001 shows one of its six 1 gigabit Ethernet (GbE) subordinate ports (speed 1000) up to keep networking alive on that node, whereas int002st001 shows one of its two 10 GbE subordinate ports (speed 10000) up.
The lsnwinterface command also displays all IP addresses that are actively assigned to interface nodes, as shown for int003st001.
Example 9-1 Sample lsnwinterface command output
Node Interface MAC Master/Subordinate Bonding mode Up/Down IP-Addresses Speed MTU
int001st001 ethX0 00:21:5e:08:8b:44 MASTER balance-alb (6) UP 9.11.136.52 N/A 1500
int001st001 ethXsl0_0 00:15:17:cc:7b:06 SUBORDINATE DOWN 1000 1500
int001st001 ethXsl0_1 00:21:5e:08:8b:46 SUBORDINATE DOWN 1000 1500
int001st001 ethXsl0_2 00:21:5e:08:8b:44 SUBORDINATE UP 1000 1500
int001st001 ethXsl0_3 00:15:17:cc:7b:04 SUBORDINATE DOWN 1000 1500
int001st001 ethXsl0_4 00:15:17:cc:7b:07 SUBORDINATE DOWN 1000 1500
int001st001 ethXsl0_5 00:15:17:cc:7b:05 SUBORDINATE DOWN 1000 1500
int002st001 eth2 00:15:17:cc:79:cd DOWN N/A 1500
int002st001 eth3 00:15:17:cc:79:cc DOWN N/A 1500
int002st001 eth4 00:15:17:cc:79:cf DOWN N/A 1500
int002st001 eth5 00:15:17:cc:79:ce DOWN N/A 1500
int002st001 eth8 00:21:5e:08:88:b8 DOWN N/A 1500
int002st001 eth9 00:21:5e:08:88:ba DOWN N/A 1500
int002st001 ethX0 00:c0:dd:11:36:98 MASTER active-backup (1) UP 9.11.136.51 N/A 1500
int002st001 ethXsl0_0 00:c0:dd:11:36:98 SUBORDINATE DOWN 10000 1500
int002st001 ethXsl0_1 00:c0:dd:11:36:98 SUBORDINATE UP 10000 1500
int003st001 eth2 00:15:17:c5:cf:a5 DOWN N/A 1500
int003st001 eth3 00:15:17:c5:cf:a4 DOWN N/A 1500
int003st001 eth4 00:15:17:c5:cf:a7 DOWN N/A 1500
int003st001 eth5 00:15:17:c5:cf:a6 DOWN N/A 1500
int003st001 eth8 e4:1f:13:8d:d1:6c DOWN N/A 1500
int003st001 eth9 e4:1f:13:8d:d1:6e DOWN N/A 1500
int003st001 ethX0 00:c0:dd:12:1e:58 MASTER active-backup (1) UP 9.11.136.50 N/A 1500
int003st001 ethXsl0_0 00:c0:dd:12:1e:58 SUBORDINATE DOWN 10000 1500
int003st001 ethXsl0_1 00:c0:dd:12:1e:58 SUBORDINATE UP 10000 1500
mgmt001st001 eth3 00:21:5e:08:8b:be DOWN N/A 1500
mgmt001st001 ethX0 00:21:5e:08:8b:bc UP N/A 1500
3. If the network interface is not available, perform a visual inspection of the cabling to ensure that it is plugged in. For example, if you have no connectivity between nodes and switches, check the external Ethernet cabling. If that cabling is in place, check the internal InfiniBand cabling next. If all of the cabling is sound, work upstream: check intranet availability, for example, or external Internet availability. If none of these checks leads to a resolution of the problem, contact your next level of support.
9.2.2 Collecting network data
This section explains how to collect network data.
The starttrace command
The starttrace command starts a trace to monitor network traffic, system calls, or both. The monitoring can generate a huge amount of log data in a short time. The following conditions are checked and must be met before the starttrace command is accepted:
Nodes must have at least 1.1 gigabytes (GB) free storage in the /var/tmp folder.
The management node must have at least (1.2 * <number of nodes> * 0.66) GB plus 100 megabytes (MB) contingency of storage capacity in its /ftdc folder to store the collected and compressed log files from the interface nodes after the tracing ends.
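For example, on the six-node cluster shown later in Figure 9-9, the management node needs about 1.2 * 6 * 0.66 = 4.75 GB plus the 100 MB contingency, or roughly 4.85 GB, free in /ftdc. A minimal sketch of checking the free space with standard Linux tools (root access to the nodes is assumed) follows:
df -h /var/tmp    # on each node: at least 1.1 GB must be free
df -h /ftdc       # on the management node: enough space for the compressed trace files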
 
The starttrace command has the following options: --cifs, --nfs, --network, --gpfs, --client, --systemcalls, --restart-service, --duration <duration>. For more information, see the starttrace topic in the SONAS section of the IBM Knowledge Center:
Example 9-2 shows an example of running the starttrace command for network diagnostics.
Example 9-2 Example of running a starttrace command to capture network diagnostics
[furby.storage.tucson.ibm.com]$ starttrace --network --client 9.11.213.38 --duration 1
In Example 9-2, a trace is run for one minute (the duration that is specified), and the compressed file is placed in the /ftdc directory on completion with the following naming convention:
trace_<date-time-stamp>_network_<trace-ID-number>_<user>.tgz
The lstrace command
The lstrace command lists all known traces that were previously created by the starttrace command and that are not yet finished. Example 9-3 shows the output from a network trace followed by an lstrace command.
Example 9-3 Sample output from running a network trace followed by an lstrace command
[furby.storage.tucson.ibm.com]$ starttrace --network --client 9.11.213.38 --duration 1
TraceID=5248875006284398592
Logfilename=trace_20141003134237_network_5248875006284398592_admin.tgz
EFSSG1000I The command completed successfully.
 
[furby.storage.tucson.ibm.com]$ lstrace
TraceID User Cifs Network Nfs Syscalls Starttime Endtime Logfilename Size ClientIP
5248875006284398592 admin no yes no no 20141003134237 20141003134337 trace_20141003134237_network_5248875006284398592_admin.tgz 0 9.11.213.38
EFSSG1000I The command completed successfully.
The lstrace command does not list a trace that has completed. To see logs for traces that were previously run, you must have admin-privileged access to the GUI.
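If you have root access to the management node, you can also confirm directly that the compressed trace file was written to /ftdc (a sketch; the file name pattern follows the naming convention described earlier in this section, and the user suffix might differ on your system):
ls -lh /ftdc/trace_*_admin.tgz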
9.2.3 Working with the network and protocol trace commands
This section provides additional examples of network and protocol trace commands.
Starting a trace
In Example 9-4, the starttrace command is initiated to collect all Common Internet File System (CIFS) and network traces from the client 9.11.213.38.
Example 9-4 An example of starting network tracing with the CLI
[furby.storage.tucson.ibm.com]$ starttrace --cifs --client 9.11.213.38 --duration 1
Caution. The ip address 9.11.213.38 has a cifs connection.
Starting traces for the given clients will stop their connections prior to tracing.
Open files on these connections might get corrupted so please close them first.
Do you really want to perform the operation (yes/no - default no):yes
TraceID=5248878942638702592
Logfilename=trace_20141003134632_cifs_5248878942638702592_admin.tgz
EFSSG1000I The command completed successfully.
Listing running traces
Example 9-5 shows how to list running traces using the lstrace command.
Example 9-5 Listing running traces
[furby.storage.tucson.ibm.com]$ lstrace
TraceID User Cifs Network Nfs Syscalls Starttime Endtime Logfilename Size ClientIP
5248878942638702592 admin yes no no no 20141003134632 20141003134732 trace_20141003134632_cifs_5248878942638702592_admin.tgz 0 9.11.213.38
EFSSG1000I The command completed successfully.
The TraceID column identifies the ID number of the trace. This ID is required to stop the trace.
The Starttime and Endtime values identify the data collection duration. The default value is 10 minutes.
The Logfilename identifies the file name of the collected data. The file can be found in the /ftdc directory, or you can download it from the GUI. See “Downloading the trace files using the GUI” on page 286.
Stopping a running trace
The stoptrace command stops traces that were created by the starttrace command. You can stop a specific trace by its TraceID, or stop all traces, by using this command.
Example 9-6 shows an example of the stoptrace command.
Example 9-6 Stopping a trace example
# stoptrace 4266819943737196544
EFSSG1000I The command completed successfully.
Downloading the trace files using the GUI
Complete the following steps to download the trace files using the GUI:
1. Log in to the GUI and select Support → Download Logs, as shown in Figure 9-3.
Figure 9-3 GUI expansion of the full log view
2. Click the Show full log listing link, as shown in Figure 9-3. The list of downloadable logs displays, as shown in Figure 9-4.
Figure 9-4 Sample GUI view of the list of downloadable logs full list
The compressed file contains configuration files, log files, and the network dump files, as shown in Figure 9-5.
Figure 9-5 Sample compressed file view
The .pcap files can be used with any network analysis program, such as Wireshark.
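If you prefer the command line, the Wireshark companion tool tshark can print a one-line summary per packet from a capture. This assumes that tshark is installed on your analysis workstation; it is not a SONAS CLI command, and the file name is a placeholder:
tshark -r <capture file>.pcap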
Using the dig command
The domain information groper (dig) command is a flexible tool for interrogating Domain Name System (DNS) name servers. It performs DNS lookups and displays the answers that are returned by the queried name servers, such as A records and MX records. Example 9-7 shows sample dig command output.
Example 9-7 Sample dig command output
$ dig ADS.virtual.com
 
; <<>> DiG 9.7.3-P3-RedHat-9.7.3-2.el6_1.P3.3 <<>> ADS.virtual.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 19433
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 0
 
;; QUESTION SECTION:
;ADS.virtual.com. IN A
 
;; ANSWER SECTION:
ADS.virtual.com. 3600 IN A 10.0.0.100
ADS.virtual.com. 3600 IN A 10.0.2.100
ADS.virtual.com. 3600 IN A 10.0.1.100
 
;; Query time: 0 msec
;; SERVER: 10.0.0.100#53(10.0.0.100)
;; WHEN: Tue Dec 4 18:53:09 2012
;; MSG SIZE rcvd: 81
 
$ dig +nocmd ADS.virtual.com any +multiline +noall +answer
ADS.virtual.com. 3600 IN A 10.0.2.100
ADS.virtual.com. 3600 IN A 10.0.0.100
ADS.virtual.com. 3600 IN A 10.0.1.100
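As a quick follow-on sketch, you can restrict the query to a single record type, trim the output with +short, or direct the query at a specific name server (the domain and server address are the same ones used in Example 9-7; the answer ordering is illustrative):
$ dig +short ADS.virtual.com A
10.0.0.100
10.0.2.100
10.0.1.100
$ dig @10.0.0.100 ADS.virtual.com MX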
9.2.4 Troubleshooting GPFS and file system issues
Troubleshooting can go in many directions. In general, if you suspect a GPFS or file system issue, open a software PMR to get the correct level of attention and guidance.
Before you open the PMR, download a support package. See the IBM Redbooks publication IBM SONAS Implementation Guide, SG24-7962, for details about collecting logs for support.
Complete the following steps to collect support logs:
1. Look at the Event log in the GUI and collect any errors or indicators of issues that you might be having.
2. Collect the log Warnings or Severe Issues Report.
Run the lslog -l SEVERE and lslog -l WARNING commands, and use the grep command to search the /var/log/messages file for gpfs entries that indicate faults or long input/output (I/O) waiters. The lslog command lists log entries for the specified log level and higher (FINEST, FINER, FINE, CONFIG, INFO, WARNING, or SEVERE). Example invocations are sketched after Figure 9-8.
3. Look at the node performance with the top and iostat commands as a privileged root user.
The top command shows you how hard the node is working, and what processes and applications are working the hardest (it is often an easy way to see what is struggling). This command requires root-privileged access on cluster nodes, but it can provide a good description of what a node is spending its time and resources doing.
Figure 9-6 shows sample output from the top command on an interface node.
Figure 9-6 Sample output from the top command on an interface node
Figure 9-7 shows sample output from the top command on a storage node.
Figure 9-7 Sample output from the top command on a storage node
The iostat -xm /dev/dm* 1 command shows you 1-second updates on how busy each multipath device is on the storage node. When all of the devices are consistently hitting 100% busy, it can be an indication that the solution is disk-bound or back-end bottlenecked. See Figure 9-8.
Figure 9-8 Sample iostat output from a SONAS storage node with four multipath storage NSDs
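The following is a minimal sketch of the log and performance checks from steps 2 and 3 (the grep pattern and device wildcard are illustrative; top and iostat require root-privileged access on the node):
lslog -l SEVERE                  # list SEVERE entries from the system log
lslog -l WARNING                 # list WARNING and more severe entries
grep -i gpfs /var/log/messages   # look for GPFS fault indicators and long I/O waiters
top                              # identify the processes working the hardest on the node
iostat -xm /dev/dm* 1            # 1-second updates on multipath device utilization (storage node)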
Analyze the GPFS logs
Contact IBM support for analysis of IBM General Parallel File System (IBM GPFS) log entries. Use this procedure when reviewing GPFS log entries:
1. Download and decompress the support package.
2. Review the log file /var/adm/ras/mmfs.log.latest. The log file is ordered from oldest to newest, so the end of the log has the current GPFS information.
The GPFS log is a complex raw log file for GPFS. If you are unable to understand the conditions that are listed in the log, contact IBM support. If you already have advanced GPFS skills, this might be a good source of information.
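As a sketch of a first pass through the log (the tail length and the search keywords are illustrative assumptions, not an exhaustive list of GPFS error strings):
tail -n 50 /var/adm/ras/mmfs.log.latest                      # newest entries are at the end
grep -iE "error|unmount|expel" /var/adm/ras/mmfs.log.latest  # example failure keywords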
9.2.5 Capturing GPFS traces
The GPFS L2 or L3 support team might ask you to run a GPFS trace or gpfs.snap so that they can review trace data for an event, get detailed information, and better understand the issue. In this case, they might request that you activate GPFS tracing on the affected cluster.
 
Attention: Use this command only under the direction of your IBM service representative, because it can affect the cluster performance.
On the active management node, enter the following commands:
#mmtracectl --set --trace=def --trace-recycle=global
#mmtracectl --start
The first command updates the GPFS configuration with trace parameters. The second command starts tracing on all nodes. Trace data is saved in the /ftdc/dumps/gpfs directory on each node. These traces do not get listed in the GUI download list, and must be manually collected with root-level privileges.
Leave these settings in place until the next occurrence of the event (for example, Network File System (NFS) export failover not working, clustered trivial database (CTDB) flapping, or a GPFS-related performance issue). When the event has been captured by the trace, stop tracing by using the following command (do not run traces unless directed to do so by support):
#mmtracectl --stop
When tracing has stopped, run the following command:
#gpfs.snap
The snap capture, which contains the current GPFS data and the trace files that were generated, is found in the /tmp/gpfs.snapOut/ directory. Upload the snap data .tar file to the File Transfer Protocol (FTP) server that IBM directs you to use.
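For example, to locate the snap archive before uploading it (a sketch run as root on the node where gpfs.snap was issued):
ls -lh /tmp/gpfs.snapOut/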
After the cause of any issues is determined, you can run the following command to clear out the trace parameters:
#mmtracectl --off
 
Attention: An mmtrace is not a trivial data collection, and must be correctly stopped and shut off when collection is completed. Leaving it running under heavy workloads too long can create stability issues with SONAS node devices. You must run both of the following commands to correctly shut down the GPFS traces:
#mmtracectl --stop
#mmtracectl --off
9.2.6 Recovering from a failed file system
IBM authorized service providers can use this procedure to recover a GPFS file system after a storage unit failure has been fully addressed. Such an occurrence is rare. You can also use this procedure to become familiar with the concepts of file system recovery in case IBM SONAS support directs you to perform it.
 
Important: The IBM 2851-DR1/DE1 information applies to hardware configurations purchased before the 1.4 release. As of SONAS 1.4, for any new configurations that are sent from manufacturing, DR1/DE1s have been withdrawn and replaced by DR2/DE2 storage models.
This task has the following prerequisites:
You must be running this procedure on one of the storage nodes.
All storage node recovery steps must be completed.
The storage node must be fully functional.
All storage devices must be available.
For storage node recovery, see Troubleshooting the System x3650 server and Removing a node to perform a maintenance action in the SONAS section of the IBM Knowledge Center. This procedure provides steps to recover a GPFS file system on a storage node after a total failure of the storage unit.
 
Note: The mmnsddiscover and mmchdisk commands are used in the following procedure.
An administrative user with the privileged role can use the servicecmd service command to run these commands. For more information about the servicecmd command, see Service provider commands reference in the SONAS section of the IBM Knowledge Center.
 
Important: This procedure assumes that a storage unit failure caused a GPFS file system to unmount, because in that state GPFS cannot perform any I/O.
After you ensure that the preceding prerequisites are met, complete the following steps:
1. Verify that GPFS is running across the IBM SONAS cluster by using the mmgetstate command:
#mmgetstate -a
The system displays information similar to that shown in Figure 9-9.
Node number Node name GPFS state
------------------------------------------
1 mgmt001st001 active
2 int003st001 active
3 int001st001 active
4 int002st001 active
5 strg001st001 active
6 strg002st001 active
Figure 9-9 Sample information from the mmgetstate -a command
2. Verify that all storage devices are accessible to Linux by using the multipath command, which shows that all devices are active. In addition, verify that there are no faulty devices.
3. Run the #multipath -ll command.
The system displays information similar to Example 9-8.
Example 9-8 Results of the multipath -ll command when run on a SONAS storage node
# multipath -ll
array1_sas_89360007 (360001ff070e9c0000000001989360007) dm-0 IBM,2851-DR1
[size=3.1T][features=1 queue_if_no_path][hwhandler=0][rw]
\_ round-robin 0 [prio=50][active]
\_ 6:0:0:0 sdb 8:16 [active][ready]
\_ round-robin 0 [prio=10][enabled]
\_ 8:0:0:0 sdg 8:96 [active][ready]
array1_sas_89380009 (360001ff070e9c0000000001b89380009) dm-2 IBM,2851-DR1
[size=3.1T][features=1 queue_if_no_path][hwhandler=0][rw]
\_ round-robin 0 [prio=50][active]
\_ 6:0:0:2 sdd 8:48 [active][ready]
\_ round-robin 0 [prio=10][enabled]
\_ 8:0:0:2 sdi 8:128 [active][ready]
4. With GPFS functioning normally on all nodes in the cluster, ensure that GPFS detects the devices with the mmnsddiscover command:
#mmnsddiscover -a -N all
If GPFS is not functioning normally on all nodes, run the mmnsddiscover command with the list of functioning nodes:
#mmnsddiscover -N <node>,...
In this example, <node> lists the names of all of the functioning nodes (see the sketch after Figure 9-10 for a concrete example).
The system displays information similar to Figure 9-10.
mmnsddiscover: Attempting to rediscover the disks. This may take a while ...
mmnsddiscover: Finished.
Figure 9-10 Sample mmnsddiscover command output
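For example, if only the two storage nodes are functioning, the command might look like the following sketch (the node names are the ones used elsewhere in this chapter; adjust the list to your cluster):
#mmnsddiscover -N strg001st001,strg002st001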
5. Run the mmlsnsd -M command and verify that system devices are listed under the Device column in the output. Devices should have names that begin with /dev.
If device names do not display, GPFS cannot access the devices.
 
Note: This process can take several minutes to complete.
The system displays information similar to Figure 9-11.
#mmlsnsd -M
 
Disk name NSD volume ID Device Node name Remarks
---------------------------------------------------------------------------------------
array0_sas_888f0013 AC1F86024B4E2048 /dev/mpath/array0_sas_888f0013 strg001st001 server node
array0_sas_888f0013 AC1F86024B4E2048 /dev/mpath/array0_sas_888f0013 strg002st001 server node
array0_sas_88910015 AC1F86024B4E2044 /dev/mpath/array0_sas_88910015 strg001st001 server node
Figure 9-11 Sample output for the mmlsnsd -M command
6. Run the mmlsdisk command to display the state of the disks:
#mmlsdisk fs_name
The system displays information that is similar to Figure 9-12.
disk driver sector failure holds holds storage
name type size group metadata data status availability pool
------------ -------- ------ ------- -------- ----- ------------- ------------ ------------
array0_sas_888f0013 nsd 512 2 yes yes ready up system
array0_sas_88910015 nsd 512 2 yes yes ready up system
array0_sas_88930017 nsd 512 2 yes yes ready up system
Figure 9-12 Sample output for the mmlsdisk fs_name command
7. If all of the disks have the availability status up, go to the next step.
Otherwise, you must run the mmchdisk command:
#mmchdisk fs_name start -a
 
Important: If you need to run the mmchdisk command, be sure to rerun the mmlsdisk command to verify the availability of all of the disks before you go to the next step.
This process can take several minutes to complete.
8. Verify that the node has a file system mounted by running the mmlsmount command:
#mmlsmount fs_name -L
The system displays information similar to Figure 9-13.
File system gpfs0 is mounted on 6 nodes:
172.31.132.1 int001st001
172.31.132.2 int002st001
172.31.132.3 int003st001
172.31.136.2 mgmt001st001
172.31.134.1 strg001st001
172.31.134.2 strg002st001
Figure 9-13 Sample output for the mmlsmount fs_name -L command
If the file system is not mounted, skip the next step and go to step 10. Otherwise, continue with the next step.
9. Check the Linux system log (/var/log/messages) on all nodes in the cluster for the presence of MMFS_FSSTRUCT errors.
To complete this process, you might need to restart the node where the GPFS file system is mounted. You can often avoid a node restart by correctly halting the following processes or services. Even after stopping all the processes and services, the file system can remain mounted:
a. Stop NFS and CIFS servers by using the service nfs stop and service smb stop commands.
b. Stop the Tivoli Storage Manager and Hierarchical Storage Manager (HSM) processes by using the disengages stop command.
c. Stop the Tivoli Storage Manager backup process. You can use the stopbackup command.
d. Stop the Network Data Management Protocol (NDMP)-based backup.
e. Stop the asynchronous replication.
f. Before unmounting GPFS on all nodes, scan /proc/mounts for bind mounts that point to the lost GPFS. These mounts need to be unmounted. If the system still does not unmount, use the mmshutdown and mmstartup commands.
g. Try to unmount the GPFS file system. You can use the unmountfs command.
h. If the file system still does not unmount, use the fuser -m command to determine which processes still hold open GPFS file descriptors. These are either processes that have a file open in GPFS or processes whose working directory is in GPFS. If the latter is an interactive user shell, change its directory out of GPFS.
10. Run the fuser command on all nodes in the cluster by using the following command:
#cndsh -N all fuser -m <fs_name>
11. If all attempts to stop these services do not allow the file system to be unmounted, change the automatic mounting of the file system to no and restart the node (see step 12):
#mmchfs <fs_name> -A no
Issue the mmfsck command as follows:
#mmfsck fs_name -v -n > /tmp/mmfsck_no.out 2>&1
Review the output file (/tmp/mmfsck_no.out) for errors.
If the file contains a message that lost blocks were found, this is expected; some missing file system blocks are normal after a storage failure. In this situation, go to step 13.
However, if the mmfsck command reports more severe errors, and you are certain that running the command does no further harm to the file system, continue with the following step. Otherwise, contact IBM support.
12. Check the messages output by running the mmfsck command and viewing the exit code. A successful run of the mmfsck command generates the exit code 0. Issue the mmfsck command and save the output to a file:
#mmfsck fs_name -v -y > /tmp/mmfsck_yes.out 2>&1
Immediately after the command completes, verify its exit status by entering echo $?. It should report the value 0.
 
Verify: If the exit value is not 0, continue to run the command until it reports the value 0. Also, check the output file /tmp/mmfsck_yes.out and verify that the mmfsck command reports that all errors were corrected.
Important: On a large file system, the mmfsck command with the -y option can require several hours to run.
13. If the mmfsck command completes successfully, you can mount the file system across the cluster by using the IBM SONAS GUI, or you can issue the mmmount command:
#mmmount fs_name -a
 
Note: For more information about recovering from unmounted file systems, see the Resolving problems with missing mounted file systems topic in the SONAS section of the IBM Knowledge Center:
14. If you had to use the mmchfs command in step 11 to disable the automatic mounting of the file system when GPFS is started, undo that change now by running the following command:
#mmchfs fs_name -A yes
9.2.7 Troubleshooting authentication issues
Use the following commands for basic authentication troubleshooting:
1. Do a general check:
#chkauth -c <cluster-id>
2. Check on a specific user:
#chkauth -c <cluster-id> -i -u <sonas-domain-name>\<User-name>
3. Ping the authentication server:
#chkauth -c <cluster-id> -p
4. Test the secrets:
#chkauth -c <cluster-id> -t
5. List the user on the Active Directory server:
#chkauth -c <cluster-id> -u <sonas-domain-name>\<User-name> -p <password>
9.3 Troubleshooting authentication issues
This section describes some methods to debug an authentication issue. If your issue is not similar to any of the issues described earlier in this chapter, use one of these commands to check what might be going wrong.
9.3.1 CLI commands to check
This section lists some of the commands that you can run to determine whether output is as expected, or if something is wrong.
List the authentication configured
Check the authentication that is configured by using the lsauth command. This command provides information about the configuration. You can check for the different parameters and determine if they are set correctly, depending on the authentication that is configured. Example 9-9 shows an example for Active Directory (AD) authentication. Similarly, for all other authentication methods, you can check for the parameters.
Example 9-9 Checking authentication that is configured on the cluster
# lsauth
AUTH_TYPE = ad
idMapConfig = 10000000-299999999,1000000
domain = SONAS
idMappingMethod = auto
clusterName = bhandar
userName = Administrator
adHost = SONAS-PUNE.SONAS.COM
passwordServer = *
realm = SONAS.COM
EFSSG1000I The command completed successfully.
Check the ID mapping for users and groups
Run the chkauth command to check the user information details, such as user ID (UID) and group ID (GID) for a user or group, as shown in Example 9-10.
Example 9-10 Check user information by using the chkauth command
# chkauth -c st002.vsofs1.com -i -u VSOFS1\testsfuuser2 -p Dcw2k3dom01
Command_Output_Data UID GID Home_Directory Template_Shell
FETCH USER INFO SUCCEED 250 10000011 /var/opt/IBM/sofs/scproot /usr/bin/rssh
Check for node synchronization
Run the chkauth command to check whether the nodes are in synchronization, as shown in Example 9-11.
Example 9-11 Check node synchronization using chkauth
# chkauth
ALL NODES IN CLUSTER ARE IN SYNC WITH EACH OTHER
EFSSG1000I The command completed successfully.
Check if a user is able to authenticate successfully
Run the chkauth command to check whether a user is able to authenticate with the authentication server, as shown in Example 9-12.
Example 9-12 Check if a user can authenticate with the server using chkauth
# chkauth -c st002.vsofs1.com -a -u VSOFS1\testsfuuser2 -p Dcw2k3dom01
Command_Output_Data UID GID Home_Directory Template_Shell
AUTHENTICATE USER SUCCEED
Check if the authentication server is reachable
Run the chkauth command to check whether the authentication server is reachable, as shown in Example 9-13.
Example 9-13 Check if the authentication server is reachable using chkauth
# chkauth -c st002.vsofs1.com -p
Command_Output_Data UID GID Home_Directory Template_Shell
PING AUTHENTICATION SERVER SUCCEED
9.3.2 Logs to check
The logs contain useful information to help resolve errors and determine what has happened in the SONAS.
System logs
Check the system logs to see if there are any errors. The lslog CLI command displays the system log.
Audit logs
To check which commands ran recently, and with which parameters, run the lsaudit CLI command. This command shows all of the commands that were run, so you can review the sequence of commands and determine whether any of them were incorrect.
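As a minimal sketch of both checks from the CLI (the WARNING level used with lslog here is just an example):
lslog -l WARNING   # display system log entries at the WARNING level and higher
lsaudit            # list recently run CLI commands and their parameters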
9.3.3 More logs to collect and check
If the preceding logs do not help, contact IBM support. It is advisable to collect the following logs, which help with further analysis and debugging:
Samba debug 10 logs
Run the following commands to collect the Samba logs:
starttrace --cifs --client <client IP address>
Re-create the issue on a Windows client, and then run the following command:
stoptrace <traceID>
See the online SONAS documentation for the starttrace and stoptrace commands before you use them.
UID and GID information for the user
Along with the preceding logs, also collect UID and GID information for the users you see problems with. You can run the chkauth command to get the information:
# chkauth -i -u <Username>
Collect cndump information.
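The cndump data collection is typically run from the command line with root access on the management node; the following is a sketch under that assumption (options are omitted, and you should run it only when support requests it, because it gathers a large amount of data):
# cndump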
9.3.4 Active Directory users denied access due to mapping entry in NIS
Even after AD + Network Information Service (NIS) is successfully configured, some Active Directory users are denied access to the IBM SONAS share.
How to debug this issue
If data is inaccessible even after you confirm that sufficient access control lists (ACLs) exist for data access, check whether the UID for that user is correctly set in Active Directory. Use the following command to check whether the user or group has a UID or GID assigned:
#chkauth -i -u "<DOMAIN>\<username>"
Example 9-14 shows that the chkauth command does not return a UID or GID for the user (in this example, autouser3). Even if the SONAS cluster can resolve the user on the domain controller, the user might be present in the Active Directory server but not in the NIS server.
Example 9-14 Failed to get IDMapping for user
# chkauth -i -u "SONAS\autouser3"
EFSSG0002C Command exception found: FAILED TO FETCH USER INFO
Conclusion
There are several possible reasons for this failure.
Access is denied for users who are present in the AD server but not in the NIS server
There are multiple ways to resolve this issue:
Define a corresponding user in the NIS server. See your internal process for creating a new user in NIS.
Based on your requirements, choose one of the three available --userMap options when you configure authentication by using the cfgnis command. Denying access to unmapped users is the default behavior of the cfgnis command, so by default a valid user map is still required for every valid user.
UNIX-style names do not allow spaces or special characters in the name
For mapping Active Directory users or groups to NIS users, consider the following conversion on the NIS server:
Convert all uppercase characters to lowercase.
Replace all blank spaces with underscores.
For information about special characters to be avoided in UNIX names, see the Limitations topics in the SONAS section of the IBM Knowledge Center:
To correct this issue, define the corresponding user in the NIS server according to the UNIX-style naming conventions.
When the corresponding user is defined in the NIS server with a UNIX-style name, in this case auto_user4, you can verify it with the chkauth command, as shown in Example 9-15.
Example 9-15 Successfully displays IDMapping for user
# chkauth -i -u "SONAS\AUTO user4"
Command_Output_Data UID GID Home_Directory Template_Shell
FETCH USER INFO SUCCEEDED 21015 2100 /var/opt/IBM/sofs/scproot /usr/bin/rssh
EFSSG1000I The command completed successfully.
If the default --userMap behavior in cfgnis, which is to deny access, was used, access is denied for users who are not in NIS or who are improperly defined in NIS. However, if the --userMap option chosen was either AD_domain:AUTO or AD_domain:DEFAULT (with a specified GUEST account), those unmapped users are either given auto-incremented UIDs by SONAS or mapped to a guest user that is predefined in AD. In such cases, even though the users might be able to access the shares, they might not have the level of access that they expect.
Authentication preferred practices
The preferred practice is to define all of the AD users in the NIS server with appropriate UNIX-style names, without uppercase characters, spaces, or special characters. Do this before storing or accessing any data in the cluster shares.
9.3.5 NFS share troubleshooting
Sometimes NFS clients fail to mount NFS shares after a client IP change.
Use this information to resolve a refused mount or Stale NFS file handle response to an attempt to mount Network File System (NFS) shares after a client IP change.
After a client IP change, the df -h command on the client shows no usage information for the mounted share, as shown in Example 9-16.
Example 9-16 Command issues with refused mounts or stale NFS file handles.
Filesystem Size Used Avail Use% Mounted on
machinename: filename: - - - - /sharename
Also, you can see the following error from the ls command:
ls: .: Stale NFS file handle
Also, the hosting node in the IBM SONAS system displays the following error:
int002st001 mountd[3055867]: refused mount request from hostname for sharename (/): not exported
If you receive one of these errors, perform the following steps:
1. Access the node that hosts the active management node role with root privileges. Then run the following command to flush the NFS cache in each node:
onnode all /usr/sbin/exportfs -a
Verify that the NFS mount is successful. If the problem persists, restart the NFS service on the node that is refusing the mount requests from that client.
2. Verify that the NFS share mount is successful.
9.3.6 CIFS share troubleshooting
The following limitations do not affect many implementations. However, you might need to consider them if you are having trouble with CIFS shares:
Alternative data streams are not supported. One example is a New Technology File System (NTFS) alternative data stream from a Mac OS X operating system.
Server-side file encryption is not supported.
Level 2 opportunistic locks (oplocks) are currently not supported. This means that level 2 oplock requests are not granted from SONAS CIFS shares.
Symbolic links cannot be stored or changed and are not reported as symbolic links, but symbolic links that are created with NFS are respected as long as they point to a target under the same exported directory.
Server Message Block (SMB) signing for attached clients is not supported.
Secure Sockets Layer (SSL) communication to Active Directory is not supported.
IBM SONAS acting as the distributed file system (DFS) root is not supported.
Windows Internet Name Service (WINS) is not supported.
Retrieving quota information using NT_TRANSACT_QUERY_QUOTA is not supported.
Setting Quota information using NT_TRANSACT_SET_QUOTA is not supported.
The IBM SONAS system does not grant durable or persistent file handles.
CIFS UNIX Extensions are not supported.
Managing the IBM SONAS system using the Microsoft Management Console Computer Management Snap-in is not supported, with the following exceptions:
 – Listing shares and exports
 – Changing share or export permissions
Users must be granted permissions to traverse all of the parent folders of an export to enable access to a CIFS export.
9.3.7 Power supply LEDs
The system automatically detects issues with node hardware as they occur in the cluster. However, it is a preferred practice to have your data center representative perform a weekly walk-through of the SONAS frames and examine the cabinets for any error indicators, such as light-emitting diode (LED) warnings. The SONAS node power supply LED indicator guide in Figure 9-14 can help you validate issues with node power supplies.
Figure 9-14 Power supply LED fault light indicators