Planning
Planning can both minimize costly errors during installation and shorten the lead time to get a new system up and running for production. Thorough planning is imperative to the successful installation of IBM Machine Learning for z/OS, which consists of integrated systems and services on different platforms. This chapter can help you develop a high-level, actionable plan that includes essential tasks, from obtaining product installers and allocating system capacity to provisioning network ports.
This chapter includes the following topics:
2.1, "Product installers"
2.2, "Hardware and software requirements"
2.3, "System capacity"
2.4, "Installation options on z/OS"
2.5, "User IDs and permissions"
2.6, "Networks, ports, and firewall configuration"
2.7, "Firewall configuration"
2.1 Product installers
Machine Learning for z/OS comes with installers or installation files for both z/OS and Linux or Linux on IBM Z (hereafter, Linux on Z) systems. Depending on your business need, choose one of the following system environment combinations for installation:
Installation on z/OS and Linux
Installation on z/OS and Linux on Z
Your planning activity should start with obtaining all the required installation materials based on your decision.
Upon receiving your purchase order from IBM Shopz, verify that it includes the following materials:
SMP/E for z/OS image
Program Directory for IBM Machine Learning for z/OS
License Information for IBM Machine Learning for z/OS DVD
Accessing Machine Learning Services on Linux DVD
All available maintenance packages
Maintenance packages are version-specific and are posted as they become available. Make sure that you have all the updates for the version of Machine Learning for z/OS that you install by checking the IBM Support Customer access portal for IBM Machine Learning for z/OS.
The SMP/E image and the maintenance packages, if any, are only part of the installation materials. You need to download the remaining installers and scripts from the IBM Web Membership (IWM) site.
The Accessing Machine Learning Services on Linux DVD includes a “Memo to Users.” The memo contains the product access key and the full URL to the IBM Web Membership site, where you will see the following installation files:
IBM_Machine_Learning_Installer_v1.1.0.5_Linux_x86-64 (for installation on Linux)
IBM_Machine_Learning_Installer_v1.1.0.5_Linux_s390x (for installation on Linux on Z)
ITOA-Health-Tree-v1.1.0.5.tar (for installation of the ITOA Health Tree application on Linux)
ITOA-Health-Tree-v1.1.0.5-s390x.tar (for installation of the ITOA Health Tree application on Linux on Z)
iml_utilities-v1.1.0.5.tar (for SSL certificate generation on Linux or Linux on Z)
Download the installation files for installing Machine Learning for z/OS in the system environments that you decided on.
2.2 Hardware and software requirements
Machine Learning for z/OS uses both IBM proprietary and open source technologies and requires the installation of various hardware and software products in the z/OS and Linux or Linux on Z environments. Make sure that you procure all the prerequisite products for installation on the systems that you selected.
 
Important: Make sure that you install and configure the prerequisite products you select and acquire, with the exception of IBM Open Data Analytics for z/OS. The next chapter guides you through the installation and configuration of the Open Data Analytics for z/OS components.
2.2.1 Prerequisites for z/OS
The following hardware and software are required for installing Machine Learning for z/OS in the Z environment:
z14, IBM z13®, or IBM zEnterprise® EC12 system
z/OS 2.1 or later
Db2 10 for z/OS (with APAR PI13725 applied) or later
z/OS Integrated Cryptographic Service Facility (ICSF)
IBM CICS Transaction Server for z/OS 5.4.0 (or 5.3.0 with APAR PI63005 applied)
IBM Open Data Analytics for z/OS 1.1.0
z/OS Spark 2.1.1 (FMID HSPK120)
z/OS Anaconda (FMID HANA110)
z/OS Mainframe Data Service 1.1 (FMID HMDS120)
IBM Tivoli® Directory Server for z/OS LDAP
IBM 64-bit SDK for z/OS, Java Technology Edition, v8 (with Refresh 4 Fix Pack 10) or later
Gzip 1.6
CICS is required only if you want to install and run Machine Learning for z/OS scoring services in a CICS region. Also, be aware that Java 8 SR5 has a known issue with batch processing during start-up, such as in the start-master and start-slave processes. To avoid the problem, plan to use Java 8 SR5 with FP7 or later.
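Because the Java 8 SR5 start-up issue is tied to the service refresh and fix pack level, it is worth checking the level before you configure Spark. The following sketch parses an IBM Java build string (for example, pxa6480sr5fp7) as it typically appears in `java -version` output; the build-string format and the sample values are assumptions for illustration, not guaranteed product output.

```shell
# Sketch: flag IBM Java 8 SR5 builds that lack FP7 or later.
# The build-string format (e.g. "sr5fp7") is an assumed convention
# based on how IBM SDK for z/OS labels service refreshes.
needs_java_update() {
  build="$1"
  case "$build" in
    *sr5fp7*|*sr5fp8*|*sr5fp9*|*sr5fp1[0-9]*) return 1 ;;  # SR5 FP7 or later: not affected
    *sr5*) return 0 ;;                                     # SR5 below FP7: affected
    *) return 1 ;;                                         # other SR levels: not flagged here
  esac
}
```

On a live system, you could feed the function the build token extracted from `java -version 2>&1`.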
2.2.2 Prerequisites for Linux
The following hardware and software are required for installing Machine Learning for z/OS in the Linux environment:
Three x86 64-bit servers
Red Hat Enterprise Linux Server 7.2 or later
Open JDK 1.8.0 or later
2.2.3 Prerequisites for Linux on Z
The following hardware and software are required for installing Machine Learning for z/OS in the Linux on Z environment:
Three s390x 64-bit servers that run on an LPAR of a z14, z13, IBM z13s®, zEnterprise EC12, zEnterprise BC12, LinuxONE Emperor, or LinuxONE Rockhopper system
Red Hat Enterprise Linux Server 7.2 or later
Open JDK 1.8.0 or later
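A quick way to validate the common Linux and Linux on Z prerequisites on each server is to script the checks. The helper names below are illustrative; in practice you would feed them values read from /etc/os-release and from `uname -m` on each node.

```shell
# Sketch: verify that a node meets the minimum OS level (RHEL 7.2)
# and runs on a supported architecture before starting the installer.
rhel_ok() {      # usage: rhel_ok <major.minor>, e.g. rhel_ok 7.4
  major=${1%%.*}; minor=${1#*.}
  [ "$major" -gt 7 ] || { [ "$major" -eq 7 ] && [ "$minor" -ge 2 ]; }
}
arch_ok() {      # usage: arch_ok <arch>  (x86_64 for Linux, s390x for Linux on Z)
  case "$1" in x86_64|s390x) return 0 ;; *) return 1 ;; esac
}
# Example on a real node:
#   rhel_ok "$(. /etc/os-release; echo "$VERSION_ID")" && arch_ok "$(uname -m)"
```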
2.3 System capacity
The correct system capacity for the correct workload is essential to maximizing the value of Machine Learning for z/OS. A workload is typically defined by the number of concurrent model creation jobs run by the Jupyter Notebook or the Machine Learning for z/OS visual model builder, the size of the model training data set (in GB), and the number of model training data features. Each model creation job includes the tasks for data loading, data transformation, data visualization, feature extraction, feature transformation, model training, and model evaluation. Carefully plan adequate system capacity in hardware, processing power, and disk storage based on the anticipated needs of your enterprise workload.
2.3.1 Basic system capacity
The scoring and training services of Machine Learning for z/OS run on z/OS, and its management services, user interface, and administration dashboard run on Linux or Linux on Z. These services require a minimum level of system capacity. Although you can use the basic system capacity to run any reasonable workload, the rule of thumb is that the heavier the workload is, the more capacity you need to allocate.
If you choose the combination of z/OS and Linux for installation, ensure that the systems have the basic capacity listed in Table 2-1.
Table 2-1 Basic system capacity for installation on z/OS and Linux
| Hardware | Number of LPARs/servers | CPU (per LPAR/server) | Memory (GB, per LPAR/server) | DASD/disk space (GB, per LPAR/server) |
|---|---|---|---|---|
| IBM z Systems® | 1 LPAR | 4 zIIP processors, 1 general purpose processor | 100 | 50 |
| Linux system | 3 x86 64-bit servers | 8 cores | 48 | 250 (plus a minimum of 650 GB secondary storage for each server) |
If you choose the combination of z/OS and Linux on Z for installation, ensure that the systems have the basic capacity listed in Table 2-2.
Table 2-2 Basic system capacity for installation on z/OS and Linux on Z
| Hardware | Number of LPARs/servers | CPU (per LPAR/server) | Memory (GB, per LPAR/server) | DASD/disk space (GB, per LPAR/server) |
|---|---|---|---|---|
| z Systems | 1 LPAR | 4 zIIP processors, 1 general purpose processor | 100 | 50 |
| Linux on Z system | 3 s390x 64-bit servers | 3 IFL processors | 48 | 250 (plus a minimum of 650 GB secondary storage for each server) |
For best performance in either installation scenario, consider dedicating an LPAR to Machine Learning for z/OS. Also, allocate secondary storage to each Linux or Linux on Z server and configure it with two mount points in XFS format with the ftype option enabled. Make sure that one mount point is allocated a minimum of 300 GB for installer files and the other a minimum of 350 GB for data storage.
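The mount-point layout described above can be captured in a small preparation script. The following dry-run sketch only prints the commands it would run; the device names (/dev/sdb1, /dev/sdc1) and mount paths are hypothetical examples, and the -n ftype=1 option enables the XFS ftype support that the text requires.

```shell
# Dry run: print the commands that would create and mount the two
# XFS mount points each Linux or Linux on Z server needs.
INSTALLER_DEV=/dev/sdb1   # >= 300 GB for installer files (assumed device)
DATA_DEV=/dev/sdc1        # >= 350 GB for data storage (assumed device)
INSTALLER_MNT=/ibm/installer
DATA_MNT=/ibm/data

plan=""
for pair in "$INSTALLER_DEV $INSTALLER_MNT" "$DATA_DEV $DATA_MNT"; do
  set -- $pair                       # split "device mountpoint"
  plan="$plan
mkfs.xfs -n ftype=1 $1
mkdir -p $2
mount $1 $2"
done
echo "$plan"
```

Review the printed commands, then run them with root privileges against the real devices on each server.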
2.3.2 Capacity considerations for training services
Machine learning models are trained with data and algorithms. In general, model training is CPU intensive and can consume most of the CPU available on a given LPAR. The heavier the training workload is, the more CPU is needed. So, allocate enough processors based on your projected workload for the LPAR where Machine Learning for z/OS training services run.
The type of models and algorithms also affects the type of processors you need for model training on z/OS. For example, Spark and MLeap models are typically trained on zIIP processors, and Scikit-learn models are processed primarily on the general processors. Increase the number of processors to process the type of models you build and the type of algorithms you plan to use.
Last but not least, the size of the training data itself constitutes a significant factor in memory usage. Results of repeated tests indicate that memory usage tends to be two to three times the size of the training data, and that number increases further when training jobs are executed concurrently. Therefore, the preferred practice is to allocate adequate memory based on both the size of your training data and the number of concurrent training jobs.
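As a rough planning aid, the rule of thumb above can be turned into a simple estimate. The factor of 3 and the linear scaling per concurrent job are conservative assumptions for illustration, not an official sizing formula.

```shell
# Sketch: estimate training memory from the rule of thumb above
# (2x-3x the training data size, scaled by concurrent jobs).
# The 3x factor and per-job linear scaling are assumptions.
estimate_training_mem_gb() {   # usage: estimate_training_mem_gb <data_gb> <concurrent_jobs>
  data_gb=$1; jobs=$2
  echo $(( data_gb * 3 * jobs ))
}
# Example: 16 concurrent jobs on a 2 GB training data set
estimate_training_mem_gb 2 16    # prints 96 (GB)
```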
2.3.3 Capacity considerations for scoring services
Machine Learning for z/OS processes scoring requests on different processors depending on the type of models. For example, while scoring requests for Scikit-learn models are processed on the general processor, those for Spark, MLeap, and PMML models are handled on zIIP processors. So, take into account the type of models that you develop and allocate the appropriate type of processors for the LPAR where scoring services run.
If high availability is essential to your business, consider using a scoring service cluster. In such a cluster, multiple instances of a scoring service share the same URI, with each running on a different LPAR of a sysplex. The cluster uses a round-robin algorithm of the sysplex distributor (SD) to dispatch scoring requests across the LPARs. The cluster processes all the scoring requests as long as one of the LPARs is operational.
2.3.4 Capacity considerations for performance
System response time is a key performance indicator in machine learning operations. The response time of Machine Learning for z/OS services generally corresponds to the availability of system capacity, as evidenced by test results in the following example (see Table 2-3).
Table 2-3 System response time corresponds to availability of system capacity
System response time is shown in minutes for each number of concurrent model creation jobs (8, 16, 32, and 64).

| Processors | Size of data set (GB) | Number of data features | 8 jobs | 16 jobs | 32 jobs | 64 jobs |
|---|---|---|---|---|---|---|
| 4 zIIP, 1 GCP | 2 | 100 | 13 | 27 | 56 | 135 |
| 4 zIIP, 1 GCP | 4 | 200 | 21 | 46 | 94 | 238 |
| 8 zIIP, 1 GCP | 2 | 100 | 9 | 15 | 31 | 68 |
| 8 zIIP, 1 GCP | 4 | 200 | 14 | 21 | 44 | 101 |
In this example, if four zIIPs and one GCP are allocated to run the workload of 16 concurrent jobs, one training data set of 2 GB in size, and 100 data features, it can take the system up to 27 minutes to complete the jobs. If eight zIIPs are allocated for the same workload, system response time can be reduced to 15 minutes. Although the actual results of your system performance can vary, the example demonstrates the positive correlation between system capacity and system response time. In other words, adequate allocation of system capacity improves system response time when the same or similar workload is being processed. Consider increasing your system capacity to improve the response time and thus the overall performance of Machine Learning for z/OS services.
2.4 Installation options on z/OS
Machine Learning for z/OS offers flexibility in terms of where to install the training and scoring services on z/OS. Depending on your business need and the system capacity you plan to allocate, carefully assess the following installation options and choose one that satisfies your machine learning workload while achieving the best performance without exceeding system capacity:
Training and scoring services on the same LPAR
Training and scoring services on different LPARs
Training services on an LPAR and scoring services on an LPAR cluster
Machine Learning for z/OS uses z/OS Mainframe Data Service (MDS) as both a data connector and a data source. MDS must be on the same LPAR where the machine learning training and scoring services run. If you use MDS in your setup, make sure that MDS and the training and scoring services are installed on the same LPAR or sysplex.
2.4.1 Option 1: Training and scoring services on the same LPAR
This option suggests the installation of both training and scoring services on a single LPAR that is dedicated to Machine Learning for z/OS. The sysplex in Figure 2-1 on page 15 includes multiple LPARs, with one handling the machine learning workload exclusively and the others executing existing applications for production. At run time, data is ingested from one or more production systems to the dedicated LPAR for training and scoring services.
Figure 2-1 Installing training and scoring services on the same LPAR
There are several advantages to this option. The installation is straightforward with all the machine learning component systems and services going to the same location on z/OS. Also, the option does not impact the performance of the existing production systems in the sysplex. Most importantly, with careful workload balancing for training and scoring requests, the services can share and maximize the use of system resources for better performance.
The disadvantage of this option is the potentially negative impact on the performance of scoring services. Both data ingestion for training and scoring requests come from other production systems, and heavy network traffic between the LPARs might slow down the responses of those services. Consider this option if your machine learning workload is not heavy and if you want to keep the production systems for machine learning and other applications separate.
2.4.2 Option 2: Training and scoring services on different LPARs
This option suggests the installation of machine learning training and scoring services on separate LPARs with existing production systems. Figure 2-2 shows an example of this installation option.
Figure 2-2 Installing training and scoring services on different LPARs
In this example, the installation of training and scoring services spreads across three different LPARs in the sysplex. All of them coexist with other applications on their respective production systems where the data lives. The training services run on the same LPAR as the Db2, IMS, or VSAM system that holds the data for model training. The scoring services run on the same LPARs where scoring requests originate and can respond to those requests with minimal performance impact.
This installation option addresses the shortcomings in the first option. The biggest upside is that it uses and optimizes the use of existing system resources on each LPAR while eliminating the potential performance impact due to heavy network traffic. Consider the option particularly when fast elapsed time for both scoring and training services is essential to the operation of your production systems.
2.4.3 Option 3: Training services on an LPAR and scoring services on an LPAR cluster
This option is similar to the second one in terms of installing the training and scoring services on separate LPARs. The difference, which is significant, lies in the suggestion that the scoring services be installed on an LPAR cluster. Figure 2-3 on page 17 shows the layout of this installation option.
Figure 2-3 Installing training services on an LPAR and scoring services on an LPAR cluster
In this example, the training services run on a dedicated LPAR in a sysplex, and the scoring services are installed on multiple LPARs in another sysplex which is configured as a scoring service cluster. The sysplex distributor (SD) is used to balance and distribute scoring workloads among multiple instances of scoring services in the cluster. All scoring requests are processed as long as one LPAR in the cluster is up and running. This option delivers high availability and scalability of Machine Learning for z/OS services. Consider this option if your machine learning workload is significantly heavy and high availability and stability are top priorities of your business.
2.5 User IDs and permissions
The Linux or Linux on Z installer of Machine Learning for z/OS uses the default user of each node to install component systems and services but requires user-defined IDs with proper permissions for installation on z/OS. Dedicated user IDs are also required for Machine Learning for z/OS to access Db2 for z/OS and z/OS LDAP with the SDBM backend. Make sure that you identify or create all the required IDs and assign them sufficient privileges, as listed in Table 2-4, before you start the installation.
Table 2-4 User IDs and permissions required for installing Machine Learning for z/OS
Db2 for z/OS authorization ID (<db2_auth_id>)
Description: This authorization ID is used by the Machine Learning services to access Db2 for z/OS.
Required privileges: DBADM authority, which is granted when you run the ALNMLEN sample JCL job.

z/OS LDAP user ID (<zldap_userid>)
Description: This user ID is used by the Machine Learning services to access z/OS LDAP.
Required privileges: RACF SPECIAL authority for validating a new user that you want to add.

z/OS Spark, Jupyter kernel gateway, Apache Toree, and MLz operation handling service user ID (<spark_jupyter_toree_userid>)
Description: This user ID is used for installing and configuring z/OS Spark, the Jupyter kernel gateway, and Apache Toree. This ID is also used for creating, configuring, and starting the operation handling service on z/OS.
Required privileges:
 – Member of IBM RACF® user group <spark-GRP>
 – $SPARK_HOME and $SPARK_OPTS ($SPARK_OPTS="--master spark://<ip_address>:<port>") environment variables included in the user's profile ($HOME/.profile)
 – $IML_HOME environment variable included in the user's profile, which points to <install_dir_zos>
 – Inclusion of the following environment variables in the user's profile:
export ANACONDA_ROOT="<install_dir_anaconda>"
export PATH=$ANACONDA_ROOT/bin:$PATH
export PYTHONHOME=$ANACONDA_ROOT
export FFI_LIB=$PYTHONHOME/lib/ffi
export LIBPATH=$PYTHONHOME/lib:$LIBPATH
 – Permission to read and write to <install_dir_zos>/configuration and subdirectories
 – Permission to read and write to <install_dir_zos>/iml-library/tmp
 – Permission to read and write to <install_dir_zos>/imlpython and subdirectories
 – Permission to read and write to <install_dir_zos>/ophandling and subdirectories
 – Permission to write to <install_dir_zos>/iml-library/output
 – Permission to read <install_dir_zos>/iml-library
 – Permission to read <install_dir_zos>/iml-library/brunel
 – Permission to write to <install_dir_anaconda>

CICS region owner or user ID (<cics_region_userid>)
Description: This user ID is used to start and run the scoring service in a CICS region.
Required privileges: Permission to read and write to <install_dir_zos>/cics-scoring and subdirectories, and to <JVMPROFILEDIR>/ALNSCSER.jvmprofile.

Machine Learning scoring service user ID (<mlz_scoring_userid>)
Description: This user ID is used for installing and configuring the scoring service and for starting the service servers.
Required privileges:
 – Member of RACF user group <spark-GRP>
 – $SPARK_HOME and $SPARK_CONF_DIR environment variables included in the user's profile
 – $PYTHONHOME environment variable included in the user's profile
 – $JAVA_HOME/bin defined in the $PATH environment variable in the user's profile
 – READ access to the BPX.FILEATTR.APF and BPX.FILEATTR.PROGCTL facilities
 – Permission to write to <install_dir_zos>
For ease of installation and post-installation access control, consider using the same user ID for installing the Machine Learning for z/OS operation handling services, z/OS Spark, z/OS Anaconda, the Jupyter kernel gateway, and Apache Toree. If you prefer to use different IDs, consider applying the same naming convention, such as MLZ(TYPE), to create MLZSPARK, MLZLDAP, and MLZSCORS. The naming convention makes it easier to administer and monitor the activities of these IDs.
2.6 Networks, ports, and firewall configuration
Machine Learning for z/OS implements SSL/TLS protocols to secure network communications across component systems and uses Kubernetes to manage security policies in a cluster. The networks use dedicated ports, some of which are predefined. Make sure that you reserve the required ports for Machine Learning for z/OS and configure your network firewall accordingly.
2.6.1 Network requirements
The Linux or Linux on Z installer sets up a Kubernetes cluster. The cluster is configured to provide high availability to Machine Learning for z/OS services, including the primary web user interface and the administration dashboard. Make sure that you meet the following network requirements for this cluster:
All nodes in the cluster run in the same subnet, with each assigned a private static IP address.
Each node is associated with a gateway within the subnet, regardless of whether the gateway allows outbound network access.
The subnet itself is assigned a private static IP address that is to be used as the proxy server address. The IP address must not be in use during the installation.
The SELinux module on each node is set to “permissive” or “enforcing” (SELINUX=permissive or SELINUX=enforcing) in the /etc/selinux/config file. Restart the node after any change to the setting.
The cluster requires two unique IP ranges in CIDR format, one to be used by the Kubernetes service network and the other by the cluster overlay network.
 – Kubernetes service network: A Kubernetes service is an abstraction which defines a logical set of pods and a policy. It redirects the network traffic to each of the pods at the service's backend. Kubernetes manages the IP range and assigns an IP address to each service. You need to assign an IP range for the Kubernetes service network.
 – Cluster overlay network: A pod is the basic building block of Kubernetes, which encapsulates an application container. Kubernetes relies on an overlay network to manage how groups of pods are allowed to communicate with each other and other endpoints. You need to assign an IP range for the cluster overlay network.
Make sure that the IP ranges are represented in CIDR notation. CIDR specifies an IP address range by the combination of an IP address and its associated network mask. Take the range 192.168.0.0/16 as an example: 192.168.0.0 is the network IPv4 address, and the number 16 indicates that the first 16 bits are the network part of the address, leaving the remaining 16 bits for host addresses. With a subnet mask of 255.255.0.0, the range spans 192.168.0.0 through 192.168.255.255.
Carefully select the required IP ranges. The ranges must not overlap with each other. The IP addresses in the ranges must not conflict with those used by the Machine Learning for z/OS proxy server or your local networks.
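A small script can verify that two candidate ranges do not overlap before you commit to them. The following sketch implements the interval check with plain IPv4 arithmetic; it assumes well-formed a.b.c.d/len input and does no validation.

```shell
# Sketch: check whether two IPv4 CIDR ranges overlap.
ip2int() {     # convert a dotted-quad address to a 32-bit integer
  IFS=. read -r a b c d <<EOF
$1
EOF
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}
cidr_overlap() {   # usage: cidr_overlap <cidr1> <cidr2>; exit 0 if the ranges overlap
  net1=${1%/*}; bits1=${1#*/}
  net2=${2%/*}; bits2=${2#*/}
  start1=$(ip2int "$net1"); end1=$(( start1 + (1 << (32 - bits1)) - 1 ))
  start2=$(ip2int "$net2"); end2=$(( start2 + (1 << (32 - bits2)) - 1 ))
  [ "$start1" -le "$end2" ] && [ "$start2" -le "$end1" ]   # interval intersection test
}
```

For example, `cidr_overlap 192.168.0.0/16 10.0.0.0/16` fails (no overlap), so that pair from Table 2-5 is a valid selection.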
Table 2-5 shows an example for selecting an internal IP range.
Table 2-5 Example of internal IP ranges

| Scenario | Host network/IP | Cluster overlay network | Kubernetes service network |
|---|---|---|---|
| Host has a single IP | 172.16.x.x | 192.168.0.0/16 | 10.0.0.0/16 |
| Host IP conflicts with the overlay network default | 192.168.x.x | 172.16.0.0/16 | 10.0.0.0/16 |
| Host has more than one IP address | 192.168.x.x, 10.3.x.x | 172.16.0.0/16 | 172.17.0.0/16 |
2.6.2 Ports
Machine Learning for z/OS requires dedicated ports for network communication across component systems and services. Some ports are predefined, and others can be user defined. Make sure that you configure the required ports and open them in your firewall, as listed in Table 2-6.
Table 2-6 Ports for systems and services on z/OS and Linux or Linux on Z
| System or service | Port number | Outbound | Inbound | Note |
|---|---|---|---|---|
| Db2 for z/OS | User defined | Linux or Linux on Z system | Db2 subsystem | The assignment of this port depends on your Db2 configuration. |
| LDAP | User defined (default: 636) | Linux or Linux on Z system | z/OS system | |
| z/OS Spark master | User defined (default: 7077) | Linux or Linux on Z system | z/OS system | |
| z/OS Spark master REST API | User defined (default: 6066) | Linux or Linux on Z system | z/OS system | |
| Operation handling service | User defined (default: 10080) | Linux system | z/OS system | |
| Scoring service | User defined | Linux or Linux on Z system | Liberty Profile for z/OS system | The assignment of this port depends on the configuration of the Liberty Profile server and the scoring service. |
| Jupyter kernel gateway | One user-defined port (default: 8889) | Linux or Linux on Z system | Apache Toree kernel | |
| Apache Toree kernel | User defined (a range of consecutive port numbers) | None | z/OS system | Each Toree kernel must be assigned 5 consecutive port numbers, and all port numbers in the range must be consecutive. For example, if you use eight Toree kernels in your setup, you must prepare a total of 40 ports starting from the first port number. |
| Repository service | 12501 | Linux system, Liberty Profile for z/OS system | Linux system | |
| Deployment service | 14150 | Linux system, Liberty Profile for z/OS system, Python run time for z/OS | Linux system | |
| Batch scoring service | 12200 | Linux system, z/OS Spark system | Linux system | |
| RabbitMQ service | 5671, 5672 | Linux system | Linux system | |
| Kubernetes ETCD | 2379 | Linux system | Linux system | |
| Feedback service | 14350 | Linux system | Linux system | |
| Ingestion service | 13100 | Linux system | Linux system | |
| Pipeline service | 13300 | Linux system | Linux system | |
| Machine Learning for z/OS UI | 443 | Your network | Linux system | |
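The Apache Toree port requirement (five consecutive ports per kernel) can be planned with a small helper that computes the full range to reserve. The starting port below is an example value, not a product default.

```shell
# Sketch: compute the consecutive port range to reserve for Apache
# Toree kernels (5 ports per kernel). The starting port is an
# example; choose one that is free in your environment.
toree_port_range() {   # usage: toree_port_range <first_port> <num_kernels>
  first=$1; kernels=$2
  total=$(( kernels * 5 ))
  last=$(( first + total - 1 ))
  echo "$first-$last ($total ports)"
}
# Example: eight kernels starting at port 10000
toree_port_range 10000 8    # prints 10000-10039 (40 ports)
```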
2.7 Firewall configuration
Instead of a traditional server firewall, Kubernetes uses iptables rules for cluster communication. So, disable the firewall on the cluster nodes. If an extra firewall must be in place, set it up around the cluster and open the ports in your local network that need to interact with the cluster, such as port 443 for web access.
Ensure that every node in the cluster has a single local host entry in the /etc/hosts file that corresponds to the 127.0.0.1 address. Do not allow any daemon or script process or any cron job to modify the hosts file, IP tables, routing rules, or firewall settings during or after the installation.
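The single-loopback-entry requirement can be verified with a short check. The function takes the hosts file path as an argument so it can be tested against a copy before being pointed at /etc/hosts on each node.

```shell
# Sketch: verify that a hosts file contains exactly one entry for
# the 127.0.0.1 address, as required for each cluster node.
single_loopback_entry() {   # usage: single_loopback_entry <hosts_file>
  count=$(grep -c '^127\.0\.0\.1[[:space:]]' "$1" || true)
  [ "$count" -eq 1 ]
}
# Example on a real node:
#   single_loopback_entry /etc/hosts || echo "fix /etc/hosts before installing"
```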