Troubleshooting
This chapter describes troubleshooting concepts that you can use to analyze problems that might occur when the cloud storage tier function is used in the TS7700C.
This chapter includes the following topics:
14.1 Network firewall problems
A TS7760C communicates with cloud Object Storage through the grid network. Your grid network firewall must allow communications on port 443 when a secure HTTPS connection to IBM COS or AWS S3 is used. If you are using a standard HTTP connection to IBM COS, port 80 must be open.
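A quick way to verify that the firewall permits this traffic is to attempt a TCP connection to the object store endpoint on the required port from a host on the grid network. The following sketch is a generic connectivity check written with the Python standard library, not a TS7700 tool; the endpoint name in the usage example is only an illustration:

```python
import socket

def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `can_connect("s3.ap-northeast-1.amazonaws.com", 443)` should return True when the firewall permits HTTPS traffic to that endpoint.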
14.2 DNS and JSON settings for AWS S3
This section describes the DNS server and AWS JSON file settings for AWS S3.
14.2.1 DNS Server
A DNS server is required when AWS S3 is used. It is not required for IBM COS. For AWS S3, the TS7760C requires a DNS server to resolve the AWS host names through the customer network. This DNS server must exist within the customer internal network, or the same network that is used for services, such as SNMP, ISKLM, MI, SYSLOG, and other customer provided services.
The DNS server resolves the host name in a URL, such as https://mybucket.s3.amazonaws.com, into an IPv4 address. The TS7700C then routes the connection through the grid network to the DNS-provided IP address. Therefore, the DNS server must resolve AWS public addresses.
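To confirm that a DNS server resolves such host names to IPv4 addresses, you can perform the same lookup that the TS7700C relies on. This sketch is a generic illustration that uses the Python standard library, not TS7700 code:

```python
import socket

def resolve_ipv4(hostname):
    """Return the sorted set of IPv4 addresses that DNS provides for hostname."""
    infos = socket.getaddrinfo(hostname, 443,
                               family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})
```

For example, `resolve_ipv4("mybucket.s3.amazonaws.com")` returns the public AWS addresses that the TS7700C then reaches over the grid network.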
14.2.2 AWS JSON file
The TS7760C must set up internal routing and firewall tables to communicate with all possible AWS addresses. The TS7760C uses an AWS-provided JSON file that contains all possible IP addresses that are used by AWS services. During the first Create Cloud URL and Cluster Association operation on a cluster, the TS7760C attempts to download the latest AWS JSON file through the customer internal network.
If the DNS server is operational and the customer network is attached to the internet, the download of the latest JSON file succeeds. If the download is unsuccessful (which is likely when the customer network is not attached to the internet), the TS7760C uses an internal JSON file to set up the routing and firewall tables to communicate over the grid network.
This initial JSON seed file is included in the TS7760C firmware and can be outdated. Therefore, the TS7760C attempts to download the latest version over the grid network after the JSON seed file is used.
Assuming the DNS server is set up correctly and the DNS provided AWS IP address for the JSON file is in the initial JSON seed table, the latest JSON file is downloaded through the grid network and used to update the routing tables and firewall. From this point forward, the TS7760C checks periodically and downloads a copy of the latest JSON file through the GRID network.
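The JSON file follows the structure of the ip-ranges.json document that AWS publishes: a list of prefixes, each tagged with a region and a service. The following sketch shows how the S3 CIDR blocks that feed the routing and firewall tables can be extracted; the embedded sample is a trimmed illustration, while the real file contains thousands of entries:

```python
import json

# Trimmed, illustrative sample of the AWS ip-ranges.json structure.
SAMPLE = """
{
  "syncToken": "1693000000",
  "createDate": "2023-08-25-12-00-00",
  "prefixes": [
    {"ip_prefix": "52.219.0.0/20", "region": "ap-northeast-1", "service": "S3"},
    {"ip_prefix": "3.5.140.0/22", "region": "ap-northeast-2", "service": "S3"},
    {"ip_prefix": "13.248.118.0/24", "region": "eu-west-1", "service": "EC2"}
  ]
}
"""

def s3_prefixes(doc, region=None):
    """Extract the S3 CIDR blocks (optionally for one region) from an ip-ranges document."""
    data = json.loads(doc)
    return [p["ip_prefix"] for p in data["prefixes"]
            if p["service"] == "S3" and (region is None or p["region"] == region)]
```

Filtering on the service and region fields in this way yields the address ranges that must be reachable through the grid network firewall.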
A rare chance exists that the persistent JSON seed file does not contain the IP address the DNS server provided for the JSON file location. IBM Support must be involved to help provide a later JSON file manually if this issue occurs.
The TS7760C can access AWS S3 Object Storage through the grid network, which must be connected to the internet.
Figure 14-1 shows connections between TS7760C, DNS server in the Customer network, and AWS.
Figure 14-1 TS7760C and AWS network communication
To check the communications from the TS7760C to the AWS object store network, you can use the Network Test ping or trace route diagnostic function from the MI menu: select the Service icon, then Network Diagnostic.
The following example IP address is used:
s3.ap-northeast-1.amazonaws.com
A sample ping window is shown in Figure 14-2.
Figure 14-2 Ping to AWS S3’s host name from Network Diagnostics
If the DNS server or JSON file setup was unsuccessful, the ping to such an external S3-based address fails.
14.3 Cloud service problems
This section describes problems that are related to a cloud Object Storage service.
14.3.1 Did your cloud SSL certificate expire?
When you use the HTTPS protocol with IBM COS, a valid trusted server certificate is required for SSL/TLS. If the certificate expires, you cannot access the IBM COS, and the health monitor (if enabled for the Cloud Account) detects the error and issues the event OP0880.
14.3.2 Did your cloud credential expire or become invalid?
If your access key or secret access key expired or is no longer valid, you cannot access the cloud, and the health monitor (if enabled for the Cloud Account) detects the error and issues the event OP0838.
14.3.3 Is your Object Storage receiving heavy requests from other devices?
If your Object Storage is shared by the TS7760C with other devices, heavy requests for the Object Storage can affect performance.
Premigration throughput to the Object Storage slows, and the premigration queue grows longer. If the queue exceeds the premigration throttling threshold, host write throttling occurs and host write performance is reduced.
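Conceptually, the throttling behaves like the following sketch: no delay is applied while the premigration queue is below its threshold, and an increasing write delay is applied as the queue overshoots it. The linear ramp and the cap used here are illustrative assumptions, not the TS7700's actual algorithm:

```python
def premigration_throttle_delay(queue_bytes, threshold_bytes, max_delay_ms=125.0):
    """Illustrative only: no delay below the threshold; the per-write delay
    grows with the queue overshoot and is capped at max_delay_ms."""
    if queue_bytes <= threshold_bytes:
        return 0.0
    overshoot = (queue_bytes - threshold_bytes) / threshold_bytes
    return min(max_delay_ms, max_delay_ms * overshoot)
```

The practical takeaway is that host write impact begins only after the queue crosses the threshold, which is why monitoring the queue depth gives early warning before throttling starts.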
To monitor this behavior, use the MI to check the health status on the Cluster Summary window and the throttling status on the Monitor Performance window.
In VEHSTATS, the premigration statistics and host write throttling statistics history are available for review. Those reports are the same as for a TS7700 with tape drives attached.
You can also use the LI REQ GRLNKACT command to analyze 15-second periods of cloud throughput.
 
14.3.4 Is your grid network receiving heavy replication throughput?
Because the TS7760C shares the grid network between Object Storage traffic and grid traffic, heavy replication and remote mount activity can reduce the maximum performance to the object store.
You can also use the LI REQ GRLNKACT command to analyze 15-second periods of cloud throughput to see how the total network throughput is being shared among grid and cloud activity.
14.3.5 Is your object store full?
If your storage becomes full, you cannot add data to the object store. A health monitor detects the error and issues the event OP0882. It is recommended that you monitor your object store’s available capacity for the TS7760C to ensure that a full condition is not reached. Because the TS7760C cannot determine the object store’s available capacity, the monitoring must be done by using other methods.
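One external monitoring approach is to periodically list the bucket or vault, total the object sizes, and alert before a fullness threshold is reached. The following sketch is a hypothetical illustration that operates on an already-retrieved list of object sizes; how you obtain that list (S3 listing APIs, provider dashboards, and so on) depends on your object store, and the 85% threshold is an assumption:

```python
def used_fraction(object_sizes, capacity_bytes):
    """Fraction of the object store's capacity consumed by the listed objects."""
    return sum(object_sizes) / capacity_bytes

def is_nearly_full(object_sizes, capacity_bytes, threshold=0.85):
    """True when usage crosses the alert threshold (85% by default, an assumption)."""
    return used_fraction(object_sizes, capacity_bytes) >= threshold
```

Alerting at such a threshold leaves time to add capacity or expire data before the OP0882 full condition is reached.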
 
14.3.6 Time difference between TS7760C and Object Storage is greater than 10 minutes
The time on the TS7760C and the cloud Object Storage must be synchronized. If the time difference between the TS7760C and IBM COS is greater than 10 minutes, a health monitor detects the error and issues the event OP0866. Use time servers within your TS7760C and IBM COS configuration to ensure that the times are synchronized.
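The 10-minute limit can be expressed as a simple skew check. This sketch compares two UTC timestamps; it illustrates the rule and is not the TS7700's internal check:

```python
from datetime import datetime, timedelta, timezone

MAX_SKEW = timedelta(minutes=10)

def clocks_in_sync(local, remote, max_skew=MAX_SKEW):
    """True when the absolute time difference is within the allowed skew."""
    return abs(local - remote) <= max_skew
```

Keeping both systems on time servers keeps the measured skew well inside this limit.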
14.4 Cloud or Grid network failures
This section describes failure warnings that can surface if the TS7760C cannot communicate with one or more provided object store URLs. Depending on the scope, the failure might lead to cloud access failure.
The following error messages are issued for various network-related failures:
CBR3750I Message from library GRIDCL20: OP0831 The Transparent Cloud
Tiering service is unreachable: MKAC01. cloud account (mkcn01),
container (auto-generated S3 URL), url (). Severity impact: SERIOUS.
CBR3762E Library GRIDCL20 intervention required.
CBR3786E VTS operations degraded in library GRIDCL20.
CBR3786E VTS operations degraded in library GRIDLIB2.
CBR3750I Message from library GRIDCL20: OP0728 Ping test to address
10.32.1.1 has an excessive packet loss. Has been in this condition for up to 10 minutes. Severity impact: WARNING.
CBR3750I Message from library GRIDCL20: OP0541 The link to gateway IP
10.32.1.1 is degraded. Severity impact: WARNING.
CBR3762E Library GRIDCL22 intervention required.
After the network issue is resolved, you see the following messages (they can be delayed by several minutes):
CBR3768I VTS operations in library GRIDCL20 no longer degraded.
CBR3768I VTS operations in library GRIDLIB2 no longer degraded.
14.4.1 Migration of volumes suspended
During a network failure where all available URLs on a TS7760C to an object store failed, migration to the cloud is suspended for that specific TS7760C cluster. After at least one URL connection reconnects, migration automatically resumes within a few minutes.
14.4.2 Recall of volumes suspended
When a job mounts a volume whose only available copy in the grid is within an object store and all TS7760C clusters have failed connections to the object store, recall from the cloud is suspended and the following messages are issued:
CBR3750I Message from library GRIDCL20: OP0846 Mount of virtual volume
1K1664 from the cloud pool MKPL01 failed. Severity impact: SERIOUS.
CBR3750I Message from library GRIDCL20: OP0831 The Transparent Cloud
Tiering service is unreachable: MKAC01. cloud account (mkcn01),
container (auto-generated S3 URL), url (). Severity impact: SERIOUS.
CBR3762E Library GRIDCL20 intervention required.
CBR4195I LACS retry possible for job COPYTEZ2: 801
IEE763I NAME= CBRLLACS CODE= 140394
CBR4000I LACS WAIT permanent error for drive 0E04.
CBR4171I Mount failed. LVOL=1K1664, LIB=GRIDLIB2, PVOL=??????, RSN=22.
IEE764I END OF CBR4195I RELATED MESSAGES
007 CBR4196D Job COPYTEZ2, drive 0E04, volser 1K1664, error code
140394. Reply 'R' to retry or 'C' to cancel.
A subset of these messages might surface if a recall attempt at one TS7760C cluster fails because of a network issue but a second TS7760C cluster succeeds.
After the network reconnects to the cloud and the TS7760C warning state is cleared, retry the mount by replying R to the CBR4196D message if the condition resulted in a failed mount attempt.
14.4.3 Delete expire processing of volumes
Deleting an object from the cloud Object Storage is done asynchronously, as with an eject request of a logical volume. A volume becomes a candidate for delete-expire after all of the following conditions are met:
The amount of time since the volume entered the scratch category is equal to or greater than the Expire Time.
The amount of time since the volume’s record data was created or last modified is greater than 12 hours.
At least 12 hours passed since the volume was migrated out of or recalled back into disk cache.
After these criteria are met, the delete-expire process handles up to 2,000 logical volume deletes per hour per TS7700, as defined by the LI REQ SETTING DELEXP count setting. After the logical volume is deleted from within the TS7760C, the object is marked pending deletion in the TS7760C database, and the background delete threads (as configured through CLDSET) request the deletions in the object store.
If the object store is unavailable, the deletions are suspended until it becomes available. The logical volume can be reused during this period, even if the previous volume instance is still marked for pending deletion.
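The candidacy rules above can be summarized as a predicate. This sketch encodes the three conditions directly; the function and parameter names are illustrative, not TS7700 internals:

```python
from datetime import datetime, timedelta

TWELVE_HOURS = timedelta(hours=12)

def is_delete_expire_candidate(now, entered_scratch, record_modified,
                               last_migrate_or_recall, expire_time):
    """True when a volume meets all three delete-expire conditions."""
    return (now - entered_scratch >= expire_time              # in scratch for >= Expire Time
            and now - record_modified > TWELVE_HOURS          # record data stable for > 12 hours
            and now - last_migrate_or_recall >= TWELVE_HOURS) # cache movement >= 12 hours ago
```

Because all three conditions must hold, a recently recalled or recently modified volume is held back from delete-expire even after its Expire Time elapses.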
14.4.4 Logical volume eject processing
When a logical volume eject is completed, any object instance in an object store that is associated with the ejected logical volume is marked for pending deletion. The eject is viewed as successful after the object is marked for pending deletion.
Asynchronously, the number of delete tasks that are defined in LI REQ CLDSET is used to delete the objects in the cloud. Any network or communication failure with the object store defers these deletions until the condition is resolved. A new instance of the logical volume can be reinserted during this period, if needed.
14.4.5 LI REQ CLDINFO command
Logical volume status on the cloud is available by using the LI REQ LVOL,<volser>,CLDINFO command, even if the TS7700 cannot connect to the Object Storage during a network failure.
14.5 Events related to a cloud
You can check events on the MI’s Events window or in the CBR3750I messages on the host console when issues occur.
For more information about events that are related to a cloud storage tier, see Chapter 12, “Monitoring the TS7700C” on page 115.
 
 
 
 