File system configuration
Managing the IBM General Parallel File System (IBM GPFS) file systems involves several tasks, such as creating, expanding, removing, mounting, unmounting, restriping, listing, and setting the attributes of the file systems. This chapter describes some of the different ways to create and modify the file system for preferred practice applications in your environment, and explains the key points of logic behind these guidelines.
It covers the preferred practice considerations for file systems in general, and reviews some conceptual configurations for different common workloads. This chapter cannot cover every specific requirement. However, the intent is to help you make better decisions in your cluster planning efforts.
This chapter describes creating, modifying, restriping, and removing a file system. It also includes use cases and typical scenarios.
Specifically, the sections that follow describe these topics in terms of preferred practices.
5.1 Getting access to file system content
You can access the file system content in a SONAS solution by using file services like Common Internet File System (CIFS) or Network File System (NFS). However, when you are using the administration graphical user interface (GUI) or command-line interface (CLI), data access is not possible. This configuration is intentional.
It is not intended that the administrators have access to file or directory content, because file or directory names have the potential to disclose guarded business information. Although access rights are covered in Chapter 2, “Authentication” on page 25, some authentication topics that pertain to file systems are also considered.
5.1.1 Access rights
CLI and GUI administrators cannot change the access control lists (ACLs) of existing files or directories. However, they can create a new directory when they define a new share or export, and assign an owner to this still-empty directory. They can also create a share or export from any path they know. However, exporting a directory does not grant any additional access. The existing ACLs remain in place. Therefore, an administrator cannot get or create access for himself or others to files solely by creating a share or export.
During share or export creation, when a new directory is created, a user can be specified as the owner of the new directory. This user owns the underlying directory and therefore acts as the security administrator of this directory. Therefore, it is important that the owner is set specifically and thoughtfully on creation of the directory (or file set) to be shared.
Access rights must be managed by the owner and the authorized users, as specified in the ACL, from a Microsoft Windows workstation that is connected to the share or export through the CIFS protocol. An NFS user can set access rights by using Portable Operating System Interface (POSIX)-style access controls, not the ACLs that the Windows operating system uses.
In addition, during the initial setup of the file system, a default ACL can be set on the file system root directory, implementing inheritance, where any newly-created directory inherits the ACLs of the default set. This configuration is important to remember, and is described further in Chapter 6, “Shares, exports, and protocol configuration” on page 171.
 
 
Restriction: Access rights and ACL management can be run only on Windows clients. The command-line utilities of the UNIX stations that deal with the POSIX bits must not be used on files that are shared with Windows or CIFS clients (multiprotocol shares), because they destroy the ACLs. This restriction applies to the UNIX commands, such as chmod.
Tip: SONAS 1.5.1 provides functions for managing ACLs in the file sets, file systems, and Shares sections of the GUI. The chacl and lsacl commands support this function in the CLI. You can use this function to manage ACLs for all clients. For more information, see IBM SONAS Implementation Guide, SG24-7962.
5.2 Creating a file system
When you are creating a file system, there are two types of parameters:
Parameters that can be changed dynamically
Parameters that cannot be changed dynamically
Changing a parameter that cannot be changed dynamically is likely to require downtime to unmount the file system. Therefore, it is important to plan the file system well before it goes into production.
5.2.1 File system block size
One key parameter that must be determined at file system creation is the file system block size. After the block size is set, the only way to change it is to create a new file system with the changed parameter and migrate the data to it. In most cases, test the block size and usage type on a test system before you apply your configuration to your production deployment.
GPFS in SONAS supports a few basic block sizes: 256 kilobytes (KB), 1 megabyte (MB), and 4 MB, with a default size of 256 KB. The block size of a file system is determined at creation by using the -b parameter to the mkfs command.
This chapter presents some guidelines for how to choose an appropriate block size for your file system. Whenever possible, it is best to test the effect of various block size settings with your specific application before you define your file system.
5.2.2 Methods to create the file system
You can use IBM SONAS CLI commands or the IBM SONAS GUI to create a file system. A maximum of 256 file systems can be mounted from an IBM SONAS system at one time. However, managing that many independent file systems offers little value in a SONAS system. This concept is explained later in this section.
Typically, creating file systems from the CLI offers more specific options and better control of the task, and more technical clients prefer the CLI (as a leading practice). However, if devices are properly provisioned, well-balanced, and easily identified by type, the GUI offers the simplest way of creating file systems in SONAS. Choose the device type and size and create the file system.
 
Tip: The CLI offers the preferred-practice flexibility and control for file system creation.
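As a hedged illustration only, the options that are described in the remainder of this section can all be supplied on one mkfs invocation from the CLI. The device name, mount point handling, and placement of the disk list are assumptions here; verify the exact syntax against the mkfs man page for your SONAS release:
# mkfs fs-device-name <list of disks> -b 256K -j scatter -R meta -q filesystem --dmapi --logplacement striped --snapdir
Each of these options is examined individually in the sections that follow.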
5.2.3 File system options
Every file system contains an independent set of Network Shared Disks (NSDs). So, reducing the number of file systems enables you to load more NSDs (logical unit numbers (LUNs) and Redundant Array of Independent Disks (RAID)-protected drive resources) behind the file systems that you create. Loading more NSDs improves your potential back-end performance by striping data across more spindle surfaces and read/write (RW) access controller arms.
Therefore, the best-performing file system is typically the one that contains the greatest number of spindles. Further performance assurance can be obtained when the file system structure is configured to best suit the file I/O type and workload access patterns that are used by that file system.
By increasing the number of spindles in the file system, you achieve the following benefits:
Maximize the RW head devices that work together to share the input/output (I/O) demands
Reduce the seek times
Improve latency
Scale capacity along with performance from the back-end storage
Consider the following file system NSD and GPFS disk options before file system creation:
Failure group assignment
NSD allocation and storage node preference
Storage pool use
Usage type
Consider the following file system options before you create file systems:
File system naming
NSD allocations
Usage types
GPFS data or metadata placement and replication
Block size
Segment size
Allocation types
Quota management
Data management application program interface (DMAPI) enablement
File system logging
Snapshot directory enablement
Inode allocations and metadata
This section examines these options as they apply to preferred practices for several common use case scenarios to help illustrate the value of these considerations.
5.2.4 File system naming option
The simplest of these options is file system naming. From the file system, there can be many independent or dependent file sets. File system naming serves different clients differently. Some clients name file systems by simply applying the generic IBM standard that typically appears in documentation, with a numeric value, such as gpfs0, gpfs1, and gpfs2. Other clients use descriptive titles to help them identify the content type, such as PACS1, Homes1, or FinanceNY.
In either case, it is left to the customer to decide. The preferred practice is to keep it short, simple, and descriptive for your own purposes. Do not include spaces or special characters that can complicate access or expand in the shell.
Every file system also requires a specified path (such as file system device gpfs0 on the specified path /ibm/gpfs0 or /client1/gpfs0). The path must follow a consistent naming convention for simplicity of overall management (for example, /ibm/gpfs0, /ibm/gpfs1, /ibm/gpfs2, and so on).
The name must be unique and is limited to 255 characters. File system names do not need to be fully qualified. For example, fs0 is as acceptable as /dev/fs0. However, file system names must be unique within a GPFS cluster. Do not specify an existing path entry such as /dev.
For IBM Storwize V7000 Unified systems, the GPFS file name is limited to 63 characters.
5.2.5 NSD allocations for a balanced file system
GPFS NSDs are RAID-protected disk volumes that are provisioned from back-end storage solutions to the SONAS GPFS Server (which are the SONAS Storage nodes in a SONAS cluster), for use in building SONAS file systems. Back-end storage systems, such as DataDirect Networks (DDN) for appliance storage, Storwize V7000, IBM XIV, IBM DS8000 and the DS8x00 series, and IBM DCS3700 are attached to NSD Server sets (and are typically configured as SONAS Storage node pairs).
An exception to this rule is in the case of DS8x00 or XIV Gen3, where performance benefits can be extended beyond storage node pair limitations by attaching to a four-node storage node set (a special solution that requires special development team consultation). When the back-end devices can support a great deal more bandwidth than can be managed by the storage node pair that they are attached to, special allowance is provided for spreading the volumes of those back-end devices across two storage node pairs rather than just one.
At installation, after adding new storage, or when new storage is provisioned to the SONAS cluster, the GPFS servers (storage nodes) discover the newly attached storage. By default the provisioned volumes are mapped as multipath devices by the Linux multipath daemon. Then, GPFS converts the volumes into NSDs. The NSD headers are placed (written) on the volume (LUN), and the devices are added to the general disk availability list in the cluster as the system storage pool devices for dataAndMetadata type storage, in failure group 1 (default).
Regardless of the disk type or disk size, the NSDs are automatically added to the system storage pool, with all the defaults mentioned previously. It is specifically for this reason that the storage use must be planned and defined correctly before it is assigned to a file system.
Ideally, the devices should be identified by controller serial number and separated into two separate failure groups (failure group 1 and failure group 2). Then, depending on the intended use of the devices, you put the devices into the intended storage pool and specify the type of data that is allowed to be placed on that device (dataOnly, metadataOnly, or dataAndMetadata). Figure 5-1 shows two XIVs behind a SONAS pair.
# lsdisk
Name File system Failure group Type Pool Status Availability Timestamp
XIV7826160_SoNAS001 gpfs0 1 dataAndMetadata system ready --- 4/7/13 3:03 AM
XIV7826160_SoNAS002 gpfs0 1 dataAndMetadata system ready --- 4/7/13 3:03 AM
XIV7826160_SoNAS003 gpfs0 1 dataAndMetadata system ready --- 4/7/13 3:03 AM
XIV7826160_SoNAS004 gpfs0 1 dataAndMetadata system ready --- 4/7/13 3:03 AM
XIV7826160_SoNAS005 gpfs0 1 dataAndMetadata system ready --- 4/7/13 3:03 AM
XIV7826160_SoNAS006 gpfs0 1 dataAndMetadata system ready --- 4/7/13 3:03 AM
XIV7826160_SoNAS007 gpfs0 1 dataAndMetadata system ready --- 4/7/13 3:03 AM
XIV7826160_SoNAS008 gpfs0 1 dataAndMetadata system ready --- 4/7/13 3:03 AM
XIV7826161_SoNAS009 gpfs0 2 dataAndMetadata system ready --- 4/7/13 3:03 AM
XIV7826161_SoNAS010 gpfs0 2 dataAndMetadata system ready --- 4/7/13 3:03 AM
XIV7826161_SoNAS011 gpfs0 2 dataAndMetadata system ready --- 4/7/13 3:03 AM
XIV7826161_SoNAS012 gpfs0 2 dataAndMetadata system ready --- 4/7/13 3:03 AM
XIV7826161_SoNAS013 gpfs0 2 dataAndMetadata system ready --- 4/7/13 3:03 AM
XIV7826161_SoNAS014 gpfs0 2 dataAndMetadata system ready --- 4/7/13 3:03 AM
XIV7826161_SoNAS015 gpfs0 2 dataAndMetadata system ready --- 4/7/13 3:03 AM
XIV7826161_SoNAS016 gpfs0 2 dataAndMetadata system ready --- 4/7/13 3:03 AM
Figure 5-1 Example lsdisk command output from the CLI
Figure 5-1 shows two XIVs behind SONAS, each providing eight volumes. The NSDs are split into two failure groups to enable GPFS replication of metadata across the two XIVs. All of the devices remain in the system pool so that they can be used for primary data placement, and they are set to allow both data and metadata to be spread across all listed devices. If you had set aside a few devices with a specific usage type of metadataOnly, they could be added to the file system with all the other devices, but only metadata would be written to them. The same rule applies if you set aside devices with usage type dataOnly.
As shown in Figure 5-2, by using the chdisk command you can set the designated pool for solid-state drives (SSDs), serial-attached SCSI (SAS), and Near Line SAS (NLSAS) drives, along with the usage type and failure group. This enables GPFS to direct specific data to specific targets.
# lsdisk
Name File system Failure group Type Pool Status Availability Timestamp
array0_sas_60001ff076d682489cb0001 gpfs0 1 metadataOnly system ready --- 4/7/13 3:03 AM (SSD for metadata)
array0_sas_60001ff076d682489cb0002 gpfs0 1 metadataOnly system ready --- 4/7/13 3:03 AM (SSD for metadata)
array0_sas_60001ff076d682489cb0003 gpfs0 1 metadataOnly system ready --- 4/7/13 3:03 AM (SSD for metadata)
array0_sas_60001ff076d682489cb0004 gpfs0 1 metadataOnly system ready --- 4/7/13 3:03 AM (SSD for metadata)
array0_sas_60001ff076d682489cb0005 gpfs0 1 dataOnly system ready --- 4/7/13 3:03 AM (SAS for Tier1 Data)
array0_sas_60001ff076d682489cb0006 gpfs0 1 dataOnly system ready --- 4/7/13 3:03 AM (SAS for Tier1 Data)
array0_sas_60001ff076d682489cb0007 gpfs0 1 dataOnly system ready --- 4/7/13 3:03 AM (SAS for Tier1 Data)
array0_sas_60001ff076d682489cb0008 gpfs0 1 dataOnly system ready --- 4/7/13 3:03 AM (SAS for Tier1 Data)
array0_sas_60001ff076d682489cb0009 gpfs0 1 dataOnly silver ready --- 4/7/13 3:03 AM (NLSAS for Tier2 Data)
array0_sas_60001ff076d682489cb0010 gpfs0 1 dataOnly silver ready --- 4/7/13 3:03 AM (NLSAS for Tier2 Data)
array0_sas_60001ff076d682489cb0011 gpfs0 1 dataOnly silver ready --- 4/7/13 3:03 AM (NLSAS for Tier2 Data)
array0_sas_60001ff076d682489cb0012 gpfs0 1 dataOnly silver ready --- 4/7/13 3:03 AM (NLSAS for Tier2 Data)
array0_sas_60001ff076d682489cb0013 gpfs0 1 dataOnly silver ready --- 4/7/13 3:03 AM (NLSAS for Tier2 Data)
array0_sas_60001ff076d682489cb0014 gpfs0 1 dataOnly silver ready --- 4/7/13 3:03 AM (NLSAS for Tier2 Data)
array0_sas_60001ff076d682489cb0015 gpfs0 1 dataOnly silver ready --- 4/7/13 3:03 AM (NLSAS for Tier2 Data)
array0_sas_60001ff076d682489cb0016 gpfs0 1 dataOnly silver ready --- 4/7/13 3:03 AM (NLSAS for Tier2 Data)
array1_sas_60001ff076d682589cc0001 gpfs0 2 metadataOnly system ready --- 4/7/13 3:03 AM (SSD for metadata)
array1_sas_60001ff076d682589cc0002 gpfs0 2 metadataOnly system ready --- 4/7/13 3:03 AM (SSD for metadata)
array1_sas_60001ff076d682589cc0003 gpfs0 2 metadataOnly system ready --- 4/7/13 3:03 AM (SSD for metadata)
array1_sas_60001ff076d682589cc0004 gpfs0 2 metadataOnly system ready --- 4/7/13 3:03 AM (SSD for metadata)
array1_sas_60001ff076d682589cc0005 gpfs0 2 dataOnly system ready --- 4/7/13 3:03 AM (SAS for Tier1 Data)
array1_sas_60001ff076d682589cc0006 gpfs0 2 dataOnly system ready --- 4/7/13 3:03 AM (SAS for Tier1 Data)
array1_sas_60001ff076d682589cc0007 gpfs0 2 dataOnly system ready --- 4/7/13 3:03 AM (SAS for Tier1 Data)
array1_sas_60001ff076d682589cc0008 gpfs0 2 dataOnly system ready --- 4/7/13 3:03 AM (SAS for Tier1 Data)
array1_sas_60001ff076d682589cc0009 gpfs0 2 dataOnly silver ready --- 4/7/13 3:03 AM (NLSAS for Tier2 Data)
array1_sas_60001ff076d682589cc0010 gpfs0 2 dataOnly silver ready --- 4/7/13 3:03 AM (NLSAS for Tier2 Data)
array1_sas_60001ff076d682589cc0011 gpfs0 2 dataOnly silver ready --- 4/7/13 3:03 AM (NLSAS for Tier2 Data)
array1_sas_60001ff076d682589cc0012 gpfs0 2 dataOnly silver ready --- 4/7/13 3:03 AM (NLSAS for Tier2 Data)
array1_sas_60001ff076d682589cc0013 gpfs0 2 dataOnly silver ready --- 4/7/13 3:03 AM (NLSAS for Tier2 Data)
array1_sas_60001ff076d682589cc0014 gpfs0 2 dataOnly silver ready --- 4/7/13 3:03 AM (NLSAS for Tier2 Data)
array1_sas_60001ff076d682589cc0015 gpfs0 2 dataOnly silver ready --- 4/7/13 3:03 AM (NLSAS for Tier2 Data)
array1_sas_60001ff076d682589cc0016 gpfs0 2 dataOnly silver ready --- 4/7/13 3:03 AM (NLSAS for Tier2 Data)
Figure 5-2 Example of three tiers of disk in one gpfs0 file system
 
Note: The data type and target definition must be set before file system creation.
The CLI command lsdisk -r -v displays the status of all NSDs (storage volumes) from the back-end storage. This command lists the device name, with its assigned failure group, storage pool, data allocation type, size and available capacity, and storage node preference. This data is critically important to building a well-balanced, high-performance SONAS file system. Consider a few of these properties and why they might be important.
If the disk is already associated with a file system, the file system device name is included in the lsdisk output.
The CLI command lsdisk -v reports the device name, file system association, failure group association, disk pool association, data allocation type, and storage node preference.
The chdisk command enables you to change the settings of disks before they are added to a file system. However, if the disk is already in a file system, it must be removed, changed, and re-added to the file system (in normal circumstances, this task can also be done non-disruptively with GPFS).
NSD allocation and storage node preference
NSD allocation and storage node preference is an important component of workload balance. The lsdisk -v command can be used to detect each NSD's storage node preference (such as strg001st001, strg002st001). This list defines the storage node that owns the primary workload of the defined NSD.
The first node in the storage node preference list always does the work unless the primary preferred node is down or otherwise unavailable. In that case, the second node that is listed in the storage node preference takes over control of access and use of the defined NSD until the primary node comes back online and GPFS regains control of those NSDs.
Number of NSDs used for the file system
A preferred-practice configuration has enough NSDs provisioned as file system data targets to fill all of the buses and channels, and to maximize both the storage node host bus adapter (HBA) port channels and the storage device controller port channels that are available, for best performance. In this regard, NSDs in multiples of four, per storage node, per NSD pool and usage type, is typically best.
SONAS performance testing has shown that a minimum of eight metadata NSDs per storage node (16 NSDs per storage node pair) yields excellent metadata scan rates. Conversely, if you have only two NSDs in a file system, you greatly limit the parallelism of I/O at the storage node level, because each storage node writes to one NSD and the file system data is striped across only two NSDs.
Usage type that is assigned to disks
The usage type that is assigned to disks defines the type of data that will be targeted on that device. The system pool is the default pool and, by default, it sends all data and metadata to devices in the system pool that are added to a file system. If you want to isolate metadata to go only to designated disk targets in the system, the usageType definition must be metadataOnly. This can be set with the chdisk command from the CLI before assigning the disk (or adding the disk) to the file system. See Example 5-1 for the syntax of the --usagetype option of the chdisk command.
Example 5-1 Example chdisk option for setting usageType
--usagetype usageType
Specifies the usage type for the specified disks. Valid usage types are dataAndMetadata, dataOnly, metadataOnly, and descOnly.
The system pool requires that some devices be added for a data usage type. Whether the devices are added as dataAndMetadata or dataOnly, the system pool needs some devices that support metadata and some devices that support data.
Any devices that are added to a pool other than the system pool are added as dataOnly, because GPFS puts all metadata on devices in the system pool.
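For example, a hedged sketch of preparing disks before they are added to a file system follows. The --usagetype option is shown in Example 5-1; the --pool and --failuregroup option names are assumptions based on the lsdisk column names, so verify them against the chdisk man page for your release:
# chdisk array0_sas_60001ff076d682489cb0001,array0_sas_60001ff076d682489cb0002 --usagetype metadataOnly --pool system --failuregroup 1
The disk names are taken from the example output in Figure 5-2.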
 
Important: Setting disk failure groups, storage pools, and allocation types before they go into a file system is critically important. They should always be balanced across storage node preference to avoid having a single storage node carrying the workload of a defined file system and all its associated file sets.
5.2.6 GPFS synchronous replication and failure group definitions
When the pools, usage types, and failure groups are established, it is time to create the file system. An even distribution of NSDs means that you have an even number of devices for each storage node preference, and that each storage pool that is included, and each data allocation type that is defined, is also evenly balanced across the two failure groups and storage node preferences.
The primary reason for using failure groups is to replicate metadata only. As mentioned in 5.2.5, “NSD allocations for a balanced file system” on page 145, your metadata is placed in failure group 1 on any device in the system pool that is defined for metadata allocation type and replicated to devices in failure group 2.
If the device is in the file system and GPFS file system replication is not enabled, the file system can use all failure groups for data or metadata. Replication is synchronous, and requires that both writes be committed before the write is acknowledged.
For example, assume that you have eight NSDs (volumes) with an assigned allocation type of data and metadata. Further assume that all eight devices are in the system (default) storage pool, but half the devices are in failure group 1 and the other half are in failure group 2. At file system creation, all devices are added.
Without further enablement of replication at the time of file system creation, the metadata and data are striped across all the NSDs in both failure groups, and no data or metadata is replicated. For GPFS to replicate metadata between the file system failure groups, replication must be defined or set by using the mkfs or chfs commands. This option can be applied or disabled without unmounting the file system. See Figure 5-3.
mkfs fs-device-name [-R { none | meta | all }]
chfs fs-device-name [-R { none | meta | all }]
Figure 5-3 Example commands for changing the GPFS replication level
Performance of data and metadata is optimal for the devices in the file system, and it scales as new, similarly defined devices are added and the file system is restriped. Data is rebalanced across more devices, because more spindles and read/write heads are available to manage distributed data access without extra write latency.
The chfs command
Replication can be set or modified with the chfs CLI command.
The chfs command has the following options for setting or changing replication:
-R { none | meta | all }
-R Sets the level of replication that is used in this file system. Optional. The variables are as follows:
The none option indicates no replication at all.
The meta option indicates that the file system metadata is synchronously mirrored across two failure groups.
The all option indicates that the file system data and metadata is synchronously mirrored across two failure groups.
If this property is changed on an existing file system, it takes effect on new files only. To apply the new replication policy to existing files, run the restripefs command.
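For example, a hedged sketch of enabling metadata replication on an existing file system and then applying it to existing files might look like the following commands. The restripefs command might require an option that selects the restripe action; check its man page before use:
# chfs gpfs0 -R meta
# restripefs gpfs0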
Using synchronous file system replication
Synchronous file system replication is a feature that is provided by the GPFS file system. The file system can use synchronous replication only if the NSDs that make up the file system exist in two or more predefined failure groups. If two or more failure groups do exist, the standard preferred practice is to replicate metadata across the failure groups.
Typically, metadata is equal to roughly 2.5 - 3% of the file system capacity, and replicating metadata adds a layer of file identification protection by enabling GPFS to have two separate copies to read from if GPFS determines that one copy is faulty or corrupted. In this case, the metadata capacity can take up to 5 or 6% of the file system capacity.
The preferred practice for most configurations is to use two failure groups for all back-end storage, based on a dual-controller technology. However, do not use two failure groups for XIV storage frames when a single XIV is defined as the back-end storage to SONAS. When two or more XIVs are configured in the SONAS solution, place the volumes from each serial number in two separate failure groups for metadata replication.
The preferred practice for highest reliability is to implement replication of both data and metadata. This practice creates multiple synchronous copies of all data and metadata on redundant RAID-protected volumes, and provides the highest level of redundancy that is practical within the immediate cluster. However, this protection also comes at the highest cost penalty in capacity and performance.
For highest performance purposes, do not implement replication. All NSDs would exist in a single (default) failure group, and the only protection of the data or metadata would be the RAID-protected volumes that make up the NSDs.
Select the configuration that is appropriate to your environment.
Figure 5-4 shows a SONAS file system that is built on default settings.
Figure 5-4 A SONAS file system that is built on default settings
5.2.7 Placing metadata on SSD
Put metadata on the fastest tier (for most implementations and services). It is common to put both data and metadata on the same storage. Consider a few scenarios to drive the point on preferred practice.
Scenario 1
The client has a cluster with three interface nodes, two storage nodes, and one storage controller with 120 SAS drives.
Here, there are not many choices:
Option 1. Put all devices in two failure groups. Put both metadata and data on all devices. If you need to speed up GPFS metadata scan rate, you must add more spindles and restripe the file system. Alternatively, add a new tier and move some performance-contentious data off the primary tier of storage to reduce the competition for metadata scans on read/write heads.
Option 2. Use several NSDs from the mix for metadata only. This means that the metadata-only devices do not get busy with data read/write I/O, and metadata scans use only those designated devices (free from normal data access contention).
Be sure that you have enough read/write heads in this approach to satisfy both data types. In solutions with a huge number of spindles, option 1 might be faster. In a solution with high user contention, option 2 might work better.
Either scenario can be improved by adding spindles to that pool and data allocation usage type. For any pool and usage type, there should be four or more devices per storage node.
Scenario 2
The client has a cluster with three interface nodes, two storage nodes, and one storage controller with 60 SAS drives and 60 NLSAS drives.
Here you have the same choice as Scenario 1, but now you can use SAS or NLSAS for metadata:
Option 1. Put all devices in two failure groups. Put both metadata and data in an all-SAS device storage pool. If you need to speed up the GPFS metadata scan rate, you need to add more spindles to that pool and restripe the file system. Put the NLSAS devices into a silver pool for tier 2 data only placement.
Option 2. Use a few SAS NSDs from the mix for metadata only. This means that the metadata-only devices do not get busy with data read/write I/O, and metadata scans use only those designated devices for that specific workload.
When you have enough read/write heads in this approach to satisfy both data types, the storage is optimally configured. In solutions with a huge number of spindles, option 1 might be faster. In a solution with high user contention, option 2 might work better.
Again, either scenario can be improved by adding spindles to that pool and data allocation type.
Scenario 3
You have a cluster with three interface nodes, two storage nodes, and one storage controller with 20 SSD drives and 100 NLSAS drives.
Here, you might use the 20 SSD drives for metadataOnly:
Option 1. Put all devices in two failure groups. Assign metadataOnly on all SSD devices in the system storage pool. If you need to speed up GPFS metadata scan rate, you need to add more devices to that pool and restripe the file system. Then, put all NLSAS NSDs in the system pool as dataOnly usage type. This placement can greatly improve the performance of GPFS scan rates for tasks that are metadata-intensive, like backup and restore, replication, snapshot deletion, or antivirus.
 
Consider: If the client tier 1 is on SAS, SSDs might not hugely improve scan rates in this case. It might be better to have more SAS spindles that share both data and metadata workloads.
Preferred practice: In all cases, when you assign a designated metadataOnly set of devices in the system storage pool, you want to make sure that each storage node sees at least eight devices of that type to optimize the node data path and channels, and the back-end storage controller cache. In some cases, 16 devices per storage node pair (for metadata) has tested as an optimal performance minimum. Therefore, sizing your storage volumes for achieving this minimum is a good exercise before putting your file system into production.
Summary
This section describes several options. When you plan to use a large number of SAS drives for your tier 1 storage, a preferred practice is to put your metadata and data on as many of the fastest tier of drive types possible, and to replicate all metadata. If you plan to use NLSAS drives for your tier 1 data storage, and you have processes planned that require fast metadata scan rates, it is best to allocate a subset of SSD drives for metadataOnly (approximately 5%) in your system pool and NLSAS for dataOnly (also in your system pool).
 
XIV Gen2: The XIV Gen2 does not support an SSD option. Therefore, placing data and metadata in the system pool is common, and a preferred practice for high performance and high reliability on that platform.
XIV Gen3: The XIV Gen3 is considerably faster than the XIV Gen2, and, although the SSD option to XIV Gen3 is not directly assigned as volumes provisioned to SONAS, using the SSD option with SONAS XIV Gen3 can significantly improve data and metadata random read performance. Allocate XIV Gen3 storage to the system pool for data and metadata when the Gen3 is used with the SSD enabled volume option. Considerable improvement can be achieved with metadata and small-file random I/O workloads.
To yield the fullest value from SONAS with XIV Gen3 or DS8x00, manage only one XIV Gen3 behind each SONAS storage node pair. Highest performance from the storage is yielded when placing the XIV behind four storage nodes (this might require special consultation with SONAS implementation experts on the XIV or DS8x00 platforms). However, in this case, half of the volumes from that XIV are mapped to one storage node pair, and the other half are mapped to the second storage node pair.
5.2.8 Block size considerations
For a preferred-practice SONAS file system configuration for small file, mixed small, and large file random or sequential work loads, you can consider a few changes from default or typical installation configurations.
If many or most of the files are small, you also need to keep the GPFS subblock size small. Otherwise, you risk wasting significant capacity for files that are smaller than the file system subblock size. This is because all files use at least one subblock for each write. So when writing a new 2 KB file to a file system that is set to 1 MB file system block size, it uses 32 KB of disk space because that is the GPFS file system subblock size (1/32 of the assigned block size) and you cannot write to a smaller space.
You can modify that file to grow up to 32 KB without using more capacity, but consider the potential waste. Conversely, files that are much larger than the block size lose performance because of the extra read/write operations that are needed to piece together the larger files. These trade-offs are not unique to SONAS; they apply to any file system layout and its performance or capacity considerations.
 
Note: For SONAS gateway solutions with an XIV back end, it is more of a one size fits all process regarding the back-end segment size of the storage. However, the file system block size can still provide an effect on performance and capacity consumption. Gen3 XIV with SSD does an excellent job with small block I/O, where XIV Gen2 does a good job for larger blocks and more sequential workloads.
XIV overall provides more persistent performance, even if there are drive failures, and the fastest recovery time to full redundancy in RAID protection. XIV Gen3 shows the highest persistent performance and client satisfaction in the field today, and it is the easiest storage to configure and maintain.
For a preferred-practice SONAS file system configuration for large-file and more sequential workloads, you can consider a different change from the default settings (see Figure 5-5).
Figure 5-5 Preferred practice for small or mixed file sizes
If many or most of the files are large (> 256 KB), keep the GPFS subblock small enough to make the best use of space, but the block size large enough to avoid losing significant performance by breaking up the writes into extra I/O operations. Writing a large file always takes multiple subblocks.
Of course, if balanced well, you achieve better performance. However, if files take a little more space than the subblock size, you can still waste a little capacity on the carry-over bits. These larger file writes and reads tend to be more sequential in nature. Sequential workloads tend to read and write more efficiently with the GPFS cluster block allocation map type.
Because XIV acknowledges writes from its large distributed cache, it is the fastest performance for back-end storage options with large file, sequential workloads of all the storage solutions that are available to SONAS today.
Now consider a file system that is designed for service where the average file size is large (> 10 MB), as shown in Figure 5-6.
Figure 5-6 Preferred practice large file system layout
In a typical configuration, you can use a 4 MB file system block size for clients that use predominantly large files. These workloads are typically sequential, and you can minimize performance reduction by writing to larger chunks in fewer transactions. Sequential workloads tend to write more efficiently to a GPFS cluster block allocation map type.
When you are deciding on the block size for a file system, consider these points:
Supported SONAS block sizes are 256 KB, 1 MB, or 4 MB. Specify this value with the character K (kilobytes) or M (megabytes); for example, 4M. The default is 256K.
The block size determines the following values:
 – The minimum disk space allocation unit. The minimum amount of space that file data can occupy is a subblock. A subblock is 1/32 of the block size.
 – The block size is the maximum size of a read or write request that the file system sends to the underlying disk driver.
 – From a performance perspective, set the block size to match the application buffer size, the RAID stripe size, or a multiple of the RAID stripe size. If the block size does not match the RAID stripe size, performance can be degraded, especially for write operations.
 – In file systems with a high degree of variance in the sizes of the files within the file system, using a small block size has a large effect on performance when you are accessing large files. In this kind of system, use a block size of 256 KB and an 8 KB subblock. Even if only 1% of the files are large, the amount of space that is taken by the large files usually dominates the amount of space that is used on disk, and the waste in the subblock that is used for small files is usually insignificant.
As the preceding figures show, when you are setting up a file system optimally, you need to be aware of the RAID configuration below it so that the data striping of the file system aligns with the design of the RAID configuration below the volumes.
The effect of block size on file system performance largely depends on the application I/O pattern:
A larger block size is often beneficial for large-file, sequential read and write workloads.
A smaller block size is likely to offer better performance for small file, random read and write, and metadata-intensive workloads.
The efficiency of many algorithms that rely on caching file data in a page pool depends more on the number of blocks that are cached than the absolute amount of data. For a page pool of a given size, a larger file system block size means that fewer blocks are cached. Therefore, when you create file systems with a block size larger than the default of 256 KB, you might also want to increase the page pool size in proportion to the block size.
Remember that this increase requires special approval through a request for price quotation (RPQ), and clearly warrants value validation through testing.
The file system block size must not exceed the value of the maxblocksize configuration parameter, which is 4096 KB (4 MB).
GPFS uses a typical file system logical structure, common in the UNIX space. A file system is essentially composed of a storage structure (data and metadata) and a file system device driver. It is in fact a primitive database that is used to store blocks of data and keep track of their location and the available space.
The file system can be journaled (all metadata transactions are logged in to a file system journal) for providing data consistency in case of system failure. GPFS is a journaled file system with in-line logs (explained later in this section).
 
Note: Network Data Management Protocol (NDMP) backup prefetch is designed to work on files that are less than or equal to 1 MB. NDMP backup prefetch does not work for a file system that has a block size that is greater than 1 MB. Therefore, do not use a larger block size if you plan to use NDMP backups.
5.2.9 Segment size considerations
Of the storage options available as back-end devices for the SONAS solution, the XIV platform is the only one that does not allow choosing the segment size or RAID type that is used for your file system storage solution.
For DDN storage, these parameters are chosen for you. Therefore, the platforms that do allow setting these parameters are the Storwize V7000, the IBM DCS3700, and the DS8x00.
The segment size of the back-end storage should clearly be aligned with the RAID type, spindle count, and file system block size for preferred practice performance.
In this explanation, RAID 6 (8 + P + Q) is used because this is a preferred practice data storage configuration for data on non-XIV storage solutions. It is a preferred practice because it offers the best overall performance and capacity preservation with two-drive fault tolerance.
Segment size effect
Now consider the science of the segment size effect.
If the file system has a 1 MB block size, SONAS tries to write data in prescribed subblock sizes (which in this case would be 32 KB). However, a full write is considered 1 MB.
If the file system has a 256 KB block size, SONAS tries to write data in prescribed subblock sizes (which in this case would be 8 KB). However, a full write is considered 256 KB.
So, if you write data in a 256 KB write from the host to a RAID 6 volume that is made up of 8 data drives plus two parity drives (P and Q), the write is striped across the 8 data drives. The host write should ideally be divided into 8 drive writes. A 256 KB write divided by 8 drives yields 32 KB segments. In this case, splitting the 256 KB stripe into 32 KB segments is ideal.
For a 1 MB file system block, that same division equates to a 128 KB segment size because the full 1 MB write would be divided into 128 KB segments that align evenly across the 8 data drives.
If the segment size is 32 KB and the data write from the host is 1 MB, it takes 4 writes across all 8 drives (plus parity) to complete the write, because 8 drives x 32 KB segments x 4 = 1 MB.
Therefore, by aligning the segment size in the RAID group to the file system block size and data stripe, you can avoid more writes in I/O transactions and make your storage performance more efficient.
5.2.10 Allocation type
Consider the allocation type before file system creation because it cannot be changed after a file system is created. There are basically two options to choose from with SONAS GPFS file systems:
-j {cluster | scatter}
-j Specifies the block allocation map type. The block allocation map type cannot be changed after the file system is created. The variables are as follows:
 – The cluster option. Using the cluster allocation method might provide better disk performance for some disk subsystems in relatively small installations. The cluster allocation stores data across NSDs in sequential striping patterns.
 – The scatter option. Using the scatter allocation method provides more consistent file system performance by averaging out performance variations because of block location. (For many disk subsystems, the location of the data relative to the disk edge has a substantial effect on performance.) The scatter allocation pattern attempts to balance a scattered pattern of writes across NSDs to provide the maximum spread of data to all channels in a pseudo-random pattern.
In most cases, using a scatter allocation type can yield better performance across large distributions of NSDs with mixed workloads. Therefore, unless the data profiles are large files and sequential workloads, it is usually better to use a scatter allocation type, which is the GPFS default assignment.
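As a sketch that reuses the placeholders from Figure 5-3, the allocation type is fixed at creation time with the -j option. A mixed small-file workload and a large sequential workload might therefore be created differently, for example:
# mkfs fs-device-name <list of disks> -b 256K -j scatter
# mkfs fs-device-name <list of disks> -b 1M -j cluster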
Summary of file system block size and allocation type preferred practice
Clients with many small files waste a considerable amount of capacity when they use a large file system block size. Clients with heavy read and write patterns of small files often perform best with a 256 KB block and scatter allocation type.
Clients with many large files degrade performance if they use a small file system block size. Clients with large files tend to use sequential data access patterns, and therefore often benefit from a 1 MB block size and the scatter allocation type. However, a client with mostly large files that also has many small files and is space-sensitive might benefit (from a capacity usage perspective) from a 256 KB file system block size with the scatter allocation type. Also consider the following points:
Workloads that are sequential with file systems smaller than 60 TB might benefit from the cluster allocation type.
Only file patterns that are confirmed large and mostly sequential should use a 4 MB file system block size.
The 4 MB block size should not be used when NDMP is planned for backup.
Figure 5-7 shows considerations for a huge SONAS file system.
Figure 5-7 Preferred practice huge file system layout
5.2.11 Quota enablement set with file system creation
This section describes preferred practices for enabling quotas when the file system is created with the mkfs command. The following options are available:
-q, --quota { filesystem | fileset | disabled }
Enables or disables the ability to set a quota for the file system.
File systems must be in an unmounted state to enable or disable quotas with the -q option.
The file system and file set quota system helps you to control the allocation of files and data blocks in a file system or file set.
File system and file set quotas can be defined for:
Authenticated individual users of an IBM SONAS file system.
Authenticated groups of users of an IBM SONAS file system.
Individual file sets.
Authenticated individual users within a file set.
Authenticated groups of users within a file set.
The filesystem option, the default value, enables quotas to be set for users, groups, and each file set. A user and group quota applies to all data within the file system, regardless of the file set.
The fileset option enables user and group quotas to be set on a per file set basis. This option also enables file set level quotas, therefore restricting total space or files that are used by a file set.
The disabled option disables quotas for the file system. Changing this setting requires the file system to be unmounted.
Because quota information in the database is not updated in real time, you must submit the chkquota CLI command as needed to ensure that the database is updated with accurate quota information. This update is critical after you enable or disable quotas on a file system, or after you change the basis of quotas in a file system to enable or disable quotas per file set within the file system by using the chfs CLI command with the -q option. The file system must be in an unmounted state when the chfs CLI command is submitted with the -q option.
The default setting enables quotas on a file system basis.
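For example, a hedged sketch of changing an existing file system to per-file-set quotas and then refreshing the quota database might look like the following commands. The file system must be unmounted first (the unmount step is not shown), and any chkquota options beyond the device name are omitted:
# chfs gpfs0 -q fileset
# chkquota gpfs0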
Reaching quota limits
When you reach the soft quota limit, a grace period starts. You can write until the grace period expires, or until you reach the hard quota limit. When you reach a hard quota limit, you cannot store any additional data until you remove enough files to take you below your quota limits or your quota is raised. However, there are some exceptions:
By default, file set quota limits are enforced for the root user.
If you start a file create operation before a grace period ends, and you have not met the hard quota limit, this file can take you over the hard quota limit. The IBM SONAS system does not prevent this behavior.
If quota limits are changed, you must recheck the quotas to enforce the update immediately (use the chkquota command).
It is not advised to run the chkquota command during periods of high workload. However, it is advised that SONAS or Storwize V7000 Unified administrators monitor quotas, along with file system, file set, and storage pool capacities, on a daily and weekly basis as a standard component of monitoring cluster health.
5.2.12 DMAPI enablement
This section describes preferred practices for enabling DMAPI when the file system is created with the mkfs command. The following options are available:
--dmapi | --nodmapi
If you use the -b option to specify a block size that is different from the default size, be aware that the NDMP backup prefetch is designed to work on files that are less than or equal to 1 MB. NDMP backup prefetch does not work for a file system that has a block size greater than 1 MB.
<list of disks> is a set of the disk names that are displayed by the lsdisk command in the previous section. The names are separated by commas.
The --dmapi and --nodmapi keywords designate whether support for external storage pool management software (for example, IBM Tivoli Storage Manager) is enabled (--dmapi) or disabled (--nodmapi).
Specify either --dmapi or --nodmapi on file system creation. If you do not specify it on file system creation, the default is --dmapi enabled (which is required for supporting backup, external space management, and NDMP).
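For example, a file system that never needs Tivoli Storage Manager space management or NDMP backup could be created with DMAPI disabled. This is a sketch that reuses the placeholders from Figure 5-3:
# mkfs fs-device-name <list of disks> --nodmapi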
 
Important: If DMAPI is disabled, the client cannot recall data in space management with Tivoli Storage Manager and Hierarchical Storage Manager (HSM). Enabling and disabling DMAPI requires that the file system is unmounted.
5.2.13 File system logging
This section describes preferred practices for file system log placement when the file system is created with the mkfs command. The following options are available:
--logplacement { pinned | striped }
These options control whether the data blocks of the file system log file are striped across all metadata disks, like all other metadata files:
The pinned option indicates that the data blocks are not striped. Use pinned logs only for small file systems that are spread across few NSDs.
The striped option, the default, indicates that the data blocks of a log file are striped across all metadata disks. If not specified, the default is striped and, if this property is set to striped, it is not possible to turn it back to pinned. Striped is the preferred practice high-performance solution for log placement for most file system work loads with eight or more NSDs.
The IBM SONAS system by default stripes data blocks for file system log files across all metadata disks, like all other metadata files, for increased performance. When you create a file system, you can specify that the file system log file data blocks instead are pinned, and later change to striped, but when striped, you cannot change to pinned.
When you are creating the file system with the mkfs command, the --logplacement { pinned | striped } option decides how you want your logging done. Typically, striped logging runs faster than pinned, and striped is the default.
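For example, a small file system that is spread across only a few NSDs might be created with pinned logs. This is a sketch that reuses the placeholders from Figure 5-3; striped remains the preferred practice for most workloads:
# mkfs fs-device-name <list of disks> --logplacement pinned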
5.2.14 Snapshot enablement
This section describes preferred practices for enabling snapshot directory access when the file system is created with the mkfs command. The following options are available:
--snapdir | --nosnapdir
This selection enables or disables the access to the snapshots directories. The --snapdir option to the mkfs command is the default value. This setting creates a special directory entry named .snapshots in every directory of the file system and in independent file sets. The .snapshots directory entry enables the user to access snapshots that include that directory.
If the --nosnapdir option is used, access to all snapshots can be done only by using the .snapshots directory entry at the root of the file system or file set.
For example, if a file system is mounted at /ibm/gpfs0, the user needs to use the /ibm/gpfs0/.snapshots directory entry.
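For example, from a client workstation that mounts the export at a hypothetical mount point of /mnt/gpfs0, the available snapshots that include that directory could be listed with a command like the following one (the snapshot names follow the @GMT- format that is shown in "Example details" later in this section):
$ ls /mnt/gpfs0/.snapshots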
If, for any reason, the default .snapshots directory is changed, remember that the default backup and replication exclude file might also require changes to reflect the new .snapshots directory location. Otherwise, snapshots might not be excluded from backups. The amount of data that you back up to tape then reflects the number of snapshots, and can be 10 or 100 times the size of the data footprint, because each snapshot file is backed up without space efficiency. Therefore, the preferred practice is to use the default location for snapshot directories.
Automatically generated snapshots are managed by rules for creation and retention-based deletion. It is best to avoid using the same snapshot rules for all file system and independent file set snapshots. If the same rules are used, it increases the amount of metadata that must be scanned.
It also increases the work that is associated in snapshot deletion operations in that specific time frame. Therefore, during the snapshot deletion operation, metadata transactions and data I/O can be so intense that they affect normal user data access or other operations, such as backup, replication, or antivirus scanning.
When you are creating file set snapshot rules, try to break up the number of file sets that share the same rules at any given time, and stagger the snapshot times to avoid or limit contention.
Rule example
Here is an example of a rule that is defined at one customer site to simplify and stagger the snapshot management policies, and yet accomplish enough user-recoverable file protection.
All independent file sets are on an automatic snapshot schedule for semi-daily snapshots, weekly snapshots, and monthly snapshots. Snapshots are not run on the underlying file system directly; instead, a snapshot is taken of the root file set of the underlying file system.
The root file set snapshots of the underlying file system capture any data that is not managed in independent file sets (so that it captures dependent file set data).
In this example, the snapshot rule that is applied to each file set is assigned alphabetically in groups to ensure that not all file sets have snapshots taken or deleted at the same time of day. This staggers the schedules, and distributes the heavy metadata scan requirements that are associated with snapshots at varying times in the daily schedule.
A file set, whose name begins with the letters A - E, is assigned a semi-daily, weekly, and a monthly snapshot rule that ends in A - E (see Figure 5-8).
Figure 5-8 GUI for example snapshot rule distribution
The semi-daily snapshot takes a snapshot at roughly 5 am or 6 am and 5 pm or 6 pm daily, and these snapshots are automatically deleted after one week. The weekly snapshot takes a snapshot at roughly 9 pm on a weekend, and these snapshots are automatically deleted after five weeks. The monthly snapshot takes a snapshot at roughly 1 am on a weekend, and these snapshots are automatically deleted after six months.
This process ensures adequate protection of file set data (restorable by clients), and yet limits the number of active snapshots to roughly 25 for any file set, while it mixes the activation time and distributes the metadata scan burden of deletion schedules.
Do not carry the number and frequency beyond practical value.
Most clients keep daily snapshots for 10 days, weekly snapshots for 5 or 6 weeks, and monthly snapshots for 10 months. Otherwise, the number of snapshots becomes overwhelming and offers more pain than value.
You must follow a naming convention if you want to integrate snapshots into a Microsoft Windows environment. Therefore, do not change the naming prefix on snapshots if you want them to be viewable by Microsoft Windows clients in previous versions listings.
Example details
The following example shows the correct name format for a snapshot that can be viewed on the Microsoft Windows operating system under the previous version:
@GMT-2008.08.05-23.30.00
Use the chsnapassoc CLI command or the GUI to change a snapshot rule association between a single snapshot rule and a single file system or independent file set.
To optionally change the rule that is associated with the file system or independent file set, use the -o or --oldRuleName option to specify the rule that should no longer be associated, and use the -n or --newRuleName option to specify the rule that should become associated with the file system or independent file set.
To optionally limit the association to an independent file set within the specified device, use the -j or --filesetName option and specify the name of an existing independent file set within the specified file system. The following CLI command example removes the association of a rule named oldRuleName with an independent file set named filesetName within a file system that is named deviceName, and replaces the old rule in the association with a new rule named newRuleName:
# chsnapassoc deviceName -o oldRuleName -n newRuleName -j filesetName
 
Important: When a snapshot rule association is changed from an old rule to a new rule, all of the corresponding snapshot instances are immediately altered to reflect the changed snapshot rule association so that the existing snapshots continue to be managed, but according to the new rule.
If the retention attributes of the new rule are different from the previous rule, these alterations might cause more than the usual number of existing snapshots to be deleted.
If, instead, a snapshot rule association is removed and a new association is created, all of the snapshots corresponding to the old rule in the association are either deleted or marked as unmanaged.
5.2.15 inodes and metadata
The inodes and indirect blocks represent pointers to the actual blocks of data as shown in Figure 5-9.
Figure 5-9 An inode to data relationship diagram
For a journaled file system, all inodes and indirect blocks that are subject to a modification are stored into the file system log before they are modified.
In addition to the inodes and indirect blocks, which keep track of where the data is stored, the file system also keeps information about the available space by using a block allocation map.
The block allocation map, which is maintained by the file system manager (FSMgr), is a collection of bits that represent the availability of disk space within the disks of the file system. One unit in the allocation map represents a subblock, or 1/32 of the block size of the GPFS file system. The allocation map is broken into regions that fall on disk sector boundaries. However, this is deeper than you need to go to define the point and purpose of metadata.
The key point is that metadata is small in comparison to file system data, and it is read every time a job list must be created to support a task that involves all files.
The typical preferred practice is to provision capacity for metadata at approximately 5% of the file system usable capacity (with metadata-only replication enabled). If the metadata is placed on a different disk type from the data within the system pool (such as SSD), monitor that capacity regularly and validate that your file systems and independent file sets do not run out of space or pre-allocated inodes. When a file system or independent file set runs out of inodes, no more data can be written to that store.
If you choose not to do metadata replication, metadata typically does not use more than 2.5 - 3% of the file system usable data capacity. This usage depends on how many files and objects exist in the file system. One thousand large files take much less metadata capacity than 100 million small files.
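As a rough illustration (the capacity figures are assumptions chosen for the arithmetic, not measured values): for a file system with 100 TB of usable capacity, plan approximately 5 TB of metadata capacity with metadata replication enabled (100 TB x 5%), or approximately 2.5 - 3 TB without it (100 TB x 2.5 - 3%). If that metadata is placed on SSD in the system pool, size the SSD capacity to at least that figure, and revisit it as the file and object count grows.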
5.3 Removing a file system
Removing a file system from the active management node deletes all of the data on that file system. This is not a common task, and there are a few considerations, so use caution when you perform it. There is a sequence of events that should be followed before you remove a file system.
Do the following steps to remove a file system (a brief CLI sketch follows the list):
1. Ensure that the clients are not using shares from the file system.
2. Stop any replication of the file system.
3. Stop any backup of the file system.
4. Stop any antivirus scan scope that includes the file system.
5. Unmount shares to the file system from all clients.
6. Disable and remove all shares.
7. Remove any existing links to the file system.
8. Remove the file system.
9. Validate that the disks are no longer associated with the file system.
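The following minimal sketch illustrates the final part of this sequence for a hypothetical file system named gpfs0, after the shares, exports, and links are already removed. The unmountfs and lsfs commands are described later in this chapter; the rmfs and lsdisk command names and their single-argument forms are assumptions here, so verify the exact syntax in your SONAS CLI reference before you use them:
# unmountfs gpfs0
# rmfs gpfs0
# lsfs
# lsdisk
After the removal, the lsfs output should no longer list gpfs0, and the disks that previously belonged to it should show as free (not associated with any file system).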
5.4 File system performance and scaling
When performance requirements increase, determine whether the back-end devices are heavily used, or whether the front-end devices have maximized their throughput capacity.
If the front-end devices are at their capacity ceiling (system resource use is nearing maximum potential), adding interface nodes is a good first step. However, if the NSD performance use is high, it is important to consider growing the back end.
The sizing section of this book (1.4, “Sizing” on page 20) provides more information about how to increase back-end performance. In general, there are three ways to grow performance on the back end:
1. Increase the number of spindles behind each storage node pair.
2. Increase the tier pool options to add a faster drive technology, or to offload contention of some user data to a lower-performance tier of storage.
3. Add more storage node pairs with more spindles, and restripe the file system to disperse the I/O across all of the devices.
When the storage node pair processing power is at its maximum, you need to add a storage node pair with additional spindles behind it, then restripe the file system with the restripefs command. Doubling capacity in this way can typically double the back-end performance.
Modifying file system attributes
You can use the chfs CLI command to modify the file system attributes. This command enables you to add NSDs, remove NSDs, add inode allocations, and set file system replication rules for data or metadata when the NSDs are in two or more predefined failure groups.
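As an illustration only, adding two new NSDs to a hypothetical file system named gpfs0 might look like the following command. The NSD names are placeholders and the option name for adding disks is an assumption, so confirm the exact chfs syntax in your SONAS CLI reference:
# chfs gpfs0 --add array1_sas_nsd017,array1_sas_nsd018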
Listing the file system mount status
Use the lsmount CLI command to determine whether a file system mount status is mounted, not mounted, partially mounted, or internally mounted. Validating the file system mount status in the cluster is an important troubleshooting step.
Running lsmount -v shows the status of all nodes mounting all file systems. Check occasionally to make sure that there are no inconsistencies with file system mounts. When issues occur, this check should be one of the first steps, because mount inconsistencies are often easily corrected.
Mounting a file system
Before you mount the file system on one interface node or all interface nodes, ensure that all of the disks of the file system are available and functioning properly. When replication is configured, all of the disks in at least one failure group must be fully available.
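For example, assuming that the disks are available, the following command mounts a hypothetical file system named gpfs0 on all interface nodes. The mountfs command name and its single-argument form are assumptions that mirror the unmountfs command described next, so verify the syntax in your SONAS CLI reference:
# mountfs gpfs0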
Unmounting a file system
You can use the GUI or the unmountfs CLI command to unmount a file system on one interface node or on all interface nodes.
 
Tip: There must be nothing accessing the file system before it can be unmounted. Therefore, make certain that clients discontinue use of and unmount the file shares, then disable the shares. Then, you can unmount the file system with the unmountfs command from the management node. This is typically done only when removing a file system, or if support asks you to unmount it for maintenance reasons.
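A minimal sketch of that final step, run from the management node for a hypothetical file system named gpfs0 (the single-argument form shown here is assumed to unmount from all interface nodes; verify the option for unmounting from only specific nodes in your CLI reference):
# unmountfs gpfs0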
Restriping file system space
You can rebalance or restore the replication of all files in a file system by using the restripefs CLI command:
restripefs gpfs0 --balance
It is important that data that is added to the file system is correctly striped. Restriping a large file system requires many insert and delete operations, and might negatively affect system performance temporarily. Plan to perform this task when system demand is low (if possible).
When a disk is added to an existing file system, data is rebalanced across all of the disks in the file system slowly over time, in the background and at a low priority, as files are accessed. If you want to rebalance data across all of the disks in a file system immediately, you can run the restripefs CLI command on demand. First consider the potential effect on overall system performance, because the restriping operation performs I/O to read and rewrite data.
By default, the restriping operation is performed by using all of the nodes in the IBM SONAS system. You can lessen the overall effect on the system by using the -n option to specify a subset of nodes on which the restriping operation is performed.
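For example, to run a rebalance of a hypothetical file system named gpfs0 on only two nodes (the node names and the comma-separated list format are assumptions; substitute the node names reported by your own cluster and check the restripefs syntax in your CLI reference):
# restripefs gpfs0 --balance -n int001st001,int002st001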
If you are increasing the available storage of an existing file system because the current disks are nearing capacity, consider the following guidelines when you determine whether an immediate on-demand restriping operation is appropriate and needed.
Rebalancing applies only to user data.
The file system automatically rebalances data across all disks as files are added to, and removed from, the file system.
 
Important: If the number of disks that you are adding is small relative to the number of disks that already exist in the file system, rebalancing is not likely to provide significant performance improvements.
Forcing restriping might be necessary if all of the existing disks in a file system are already near capacity. However, if, for example, only 1 of 10 disks in a file system is full, the file system still delivers roughly 90% of its potential performance. If the current level of performance is sufficient, an immediate, on-demand restriping operation might not be required.
Figure 5-10 shows output from a restripefs operation.
# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sdb2 19G 7.3G 9.9G 43% /
/dev/sdb5 102G 621M 96G 1% /var
/dev/sdb6 9.1G 151M 8.4G 2% /var/ctdb/persistent
/dev/sdb1 9.1G 202M 8.4G 3% /persist
/dev/sda1 271G 602M 257G 1% /ftdc
tmpfs 16G 0 16G 0% /dev/shm
/dev/gpfs0 15T 1.9T 13T 13% /ibm/gpfs0
 
 
# restripefs gpfs0 --balance
Scanning file system metadata, phase 1 ...
1 % complete on Fri Jun 18 14:02:52 2010
2 % complete on Fri Jun 18 14:02:56 2010
3 % complete on Fri Jun 18 14:02:59 2010
4 % complete on Fri Jun 18 14:03:02 2010
5 % complete on Fri Jun 18 14:03:05 2010
.... skipping to the end ...
97 % complete on Fri Jun 18 14:07:34 2010
98 % complete on Fri Jun 18 14:07:38 2010
99 % complete on Fri Jun 18 14:07:41 2010
100 % complete on Fri Jun 18 14:07:43 2010
Scan completed successfully.
Scanning file system metadata, phase 2 ...
13 % complete on Fri Jun 18 14:07:46 2010
24 % complete on Fri Jun 18 14:07:50 2010
49 % complete on Fri Jun 18 14:07:53 2010
72 % complete on Fri Jun 18 14:07:58 2010
90 % complete on Fri Jun 18 14:08:01 2010
100 % complete on Fri Jun 18 14:08:02 2010
Scan completed successfully.
Scanning file system metadata, phase 3 ...
Scan completed successfully.
Scanning file system metadata, phase 4 ...
Scan completed successfully.
Scanning user file metadata ...
0.34 % complete on Fri Jun 18 14:08:28 2010 ( 2248217 inodes 6426 MB)
0.62 % complete on Fri Jun 18 14:08:48 2010 ( 2263973 inodes 11689 MB)
0.89 % complete on Fri Jun 18 14:09:08 2010 ( 2279559 inodes 16657 MB)
1.20 % complete on Fri Jun 18 14:09:28 2010 ( 2296383 inodes 22530 MB)
... skipping to the end ...
91.23 % complete on Fri Jun 18 16:31:18 2010 ( 49991322 inodes 1716246 MB)
91.32 % complete on Fri Jun 18 16:31:43 2010 ( 49991322 inodes 1717998 MB)
91.43 % complete on Fri Jun 18 16:32:08 2010 ( 49991322 inodes 1720000 MB)
91.54 % complete on Fri Jun 18 16:32:30 2010 ( 49991322 inodes 1722002 MB)
EFSSG0043I Restriping of filesystem gpfs0 completed successfully.
 
Figure 5-10 Sample output from a restripefs operation
When NSD devices are added to the SONAS file system, the capacity is available immediately. However, to take full advantage of a balanced capacity and workload configuration across all of the devices in the set, you must run the restripefs --balance command against the file system.
Running this command takes some performance away from the file system (up to about 12%) during redistribution. For huge file systems, it can take a long time to complete. In most cases, it must be done to maintain balance and to improve performance linearly.
Most clients choose to start this process on Friday evening and allow it to run over the weekend to complete. However, the operation does not disrupt services or create a need to bring any services down; it is a nondisruptive service. Again, adding NSDs to the file system provides immediate relief to capacity issues, and restriping later redistributes the data for optimal load balancing. This is the preferred practice for adding capacity.
Figure 5-10 on page 166 shows a small lab system taking 2 hours and 30 minutes to redistribute 2 terabytes (TB) of data across the old and the added devices.
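Putting these pieces together, a minimal end-to-end sketch of this preferred practice for a hypothetical file system named gpfs0 looks like the following commands. As noted earlier, the chfs option name for adding NSDs is an assumption, so verify it before use:
# chfs gpfs0 --add array1_sas_nsd017,array1_sas_nsd018
# restripefs gpfs0 --balance
The new capacity is usable as soon as the chfs command completes; the restripefs --balance run can then be scheduled for a low-demand window, such as a weekend.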
Viewing file system capacity
You can use the GUI to determine the available capacity of a file system. This task cannot be performed with a CLI command.
GUI usage
1. Log in to the GUI.
2. Click Monitoring → Capacity.
3. Ensure that the File System tab is selected.
4. Select the file system for which you want to view the system utilization. A chart that shows the system capacity as a percentage is displayed.
The total selected capacity displays the highest daily amount of storage for the selected file system. If the amount of storage is decreased, the reduction is not shown until the following day. See Figure 5-11.
Figure 5-11 Viewing capacity through the GUI
Perhaps a better way to look at capacity is to look at the file system statistics.
By expanding the tiers under each file system, you can see the capacity that is related to the storage pools behind the file system. This view is much more detailed (see Figure 5-12).
Figure 5-12 File system capacity view in the SONAS GUI
By using the GUI to review file set capacity, you also see the inode consumption and availability. Remember that every file system (root file set) and independent file set displays capacity and inode consumption in this view (see Figure 5-13).
Figure 5-13 File set Capacity View: GUI
CLI listing of file system attributes and capacity
This section describes how to use the verbose option of the lsfs command:
lsfs -v
This is the same example as shown previously, with the verbose option added. The following columns are added:
Min.frag.size
Inode size
Ind.blocksize
Max.Inodes
Locking type
ACL type
Version
Logf. size
Mountpoint
Type
Remote device
atime
mtime
logplacement
state
snapdir
Information about the lsfset -v command is shown in Figure 5-14.
Figure 5-14 Sample output from the lsfset command with -v
Summary of preferred practices for file system creation
Here are preferred practice guidelines for file system creation:
When naming file systems, keep the names short (fewer than 63 characters), simple, and descriptive for your purposes. Do not include spaces or special characters that can complicate access or expand in the shell.
Before provisioning storage, the RAID level, RAID segment size, and volume layout should be properly calculated to provide the best performance and capacity efficiency for the planned file system block size and the file data workload characteristics.
Before assigning NSDs to the file system, make sure that they are in sufficient quantity to maximize I/O resources of the storage node HBA ports, busses, channels, and back-end storage controller channels. Also ensure that the NSDs are evenly spread across storage node preferences and two failure groups (except for single XIV applications).
NSDs that are assigned to file systems (regardless of disk or pool type), should be assigned in groups of four per storage node (16 NSDs is the optimal minimum per storage node pair for most non-XIV type storage, and eight or 16 is optimal for XIV applications).
NSDs must be of a consistent size and multipath structure when used in the same SONAS GPFS storage pool for the same usage type.
The file system block size and allocation type are best planned against the average file size, the quantity or percentage of small files, and workload access patterns (random or sequential). This must be carefully considered before back-end volume provisioning and file system creation is applied.
For small file random workloads, ensure that the file system is created with a 256 KB block size and scatter allocation type, and ensure that the underlying block device creation is assembled with appropriate configuration. Follow the guidelines in Chapter 4, “Storage configuration” on page 95.
Using large block size definitions in file system creation when there are lots of small files wastes space. Consider this before creation.
For large file sequential workloads, consider a 1 MB block size and general scatter allocation type unless the file system will be small and access will be sequential. In that case, choose the cluster allocation type.
Using 4 MB limits backup capabilities, so only use the 4 MB block size when instructed by development.
Use the striped file system logging (also described as log placement). Remember that stripe is the default.
When in doubt, use the default DMAPI-enabled setting for file system creation for the best support of backup and special DMAPI-enabled applications.
When you set quota enablement, use -q fileset as the defined option, rather than the default file system scope. To capture dependent file sets, run snapshots against the root file set of a file system along with the independent file sets, not against the file system itself, because a file system snapshot captures all dependent and independent file set data, which is not the preferred method of capturing true file set granularity.
When you add storage for capacity and performance, it is best to add it while the existing capacity is less than 75% used. If capacity is over 80% when you decide to expand, force a restripe by running the restripefs command.
Establish a solid front-end and back-end performance baseline when your storage is set up and running well, so that you have something to compare against weekly or monthly performance snapshots as your services grow.
Monitor file system, file set capacity, and inode consumption regularly and add allocated inodes or capacity when systems reach management action thresholds.
If your file systems are built on gateway storage, it is important to monitor back-end storage with SONAS reports daily, and manage negative events aggressively.
Work with your storage and support team to prepare a GPFS file system management and diagnostics readiness program.