Planning
Planning can both minimize costly errors during installation and shorten the lead time to get a new system up and running for production. Thorough planning is imperative to the successful installation of IBM Machine Learning for z/OS, which consists of integrated systems and services on different platforms. This chapter can help you develop a high-level, actionable plan that includes essential tasks, from obtaining product installers and allocating system capacity to provisioning network ports.
This chapter includes the following topics:
2.1, "Product installers"
2.2, "Hardware and software requirements"
2.3, "System capacity"
2.4, "Installation options on z/OS"
2.5, "User IDs and permissions"
2.6, "Networks, ports, and firewall configuration"
2.7, "Firewall configuration"
2.1 Product installers
Machine Learning for z/OS comes with installers or installation files for both z/OS and Linux or Linux on IBM Z (hereafter, Linux on Z) systems. Depending on your business need, choose one of the following system environment combinations for installation:
Installation on z/OS and Linux
Installation on z/OS and Linux on Z
Your planning activity should start with obtaining all the required installation materials based on your decision.
Upon receiving your purchase order from IBM Shopz, verify that it includes the following materials:
SMP/E for z/OS image
Program Directory for IBM Machine Learning for z/OS
License Information for IBM Machine Learning for z/OS DVD
Accessing Machine Learning Services on Linux DVD
All available maintenance packages
Maintenance packages are version-specific and are posted as they become available. Make sure that you have all the updates for the version of Machine Learning for z/OS that you install by checking the IBM Support Customer access portal for IBM Machine Learning for z/OS.
The SMP/E image and the maintenance packages, if any, are only part of the installation materials. You need to download the remaining installers and scripts from the IBM Web Membership (IWM) site.
The Accessing Machine Learning Services on Linux DVD includes a “Memo to Users.” The memo contains the product access key and the full URL to the IBM Web Membership site, where you will see the following installation files:
IBM_Machine_Learning_Installer_v1.1.0.5_Linux_x86-64 (for installation on Linux)
IBM_Machine_Learning_Installer_v1.1.0.5_Linux_s390x (for installation on Linux on Z)
ITOA-Health-Tree-v1.1.0.5.tar (for installation of the ITOA Health Tree application on Linux)
ITOA-Health-Tree-v1.1.0.5-s390x.tar (for installation of the ITOA Health Tree application on Linux on Z)
iml_utilities-v1.1.0.5.tar (for SSL certificate generation on Linux or Linux on Z)
Download the installation files for installing Machine Learning for z/OS in the system environments that you decided on.
2.2 Hardware and software requirements
Machine Learning for z/OS uses both IBM proprietary and open source technologies and requires the installation of various hardware and software products in the z/OS and Linux or Linux on Z environments. Make sure that you procure all the prerequisite products for installation on the systems that you selected.
 
Important: Make sure that you install and configure the prerequisite products you select and acquire, with the exception of IBM Open Data Analytics for z/OS. The next chapter guides you through the installation and configuration of the Open Data Analytics for z/OS components.
2.2.1 Prerequisites for z/OS
The following hardware and software are required for installing Machine Learning for z/OS in the Z environment:
z14, IBM z13®, or IBM zEnterprise® EC12 system
z/OS 2.1 or later
Db2 10 for z/OS (with APAR PI13725 applied) or later
z/OS Integrated Cryptographic Service Facility (ICSF)
IBM CICS Transaction Server for z/OS 5.4.0 (or 5.3.0 with APAR PI63005 applied)
IBM Open Data Analytics for z/OS 1.1.0
z/OS Spark 2.1.1 (FMID HSPK120)
z/OS Anaconda (FMID HANA110)
z/OS Mainframe Data Service 1.1 (FMID HMDS120)
IBM Tivoli® Directory Server for z/OS LDAP
IBM 64-bit SDK for z/OS, Java Technology Edition, v8 (with Refresh 4 Fix Pack 10) or later
Gzip 1.6
CICS is required only if you want to install and run Machine Learning for z/OS scoring services in a CICS region. Also, be aware that Java 8 SR5 has a known issue with batch processing during start-up, such as in the start-master and start-slave processes. To avoid the problem, plan to use Java 8 SR5 with FP7 or later.
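Because the Java 8 SR5 start-up issue is tied to the service refresh and fix pack level, it is worth checking the level before you configure Spark. The following sketch parses an IBM Java build string (for example, pxa6480sr5fp7) as it typically appears in `java -version` output; the build-string format and the sample values are assumptions for illustration, not guaranteed product output.

```shell
# Sketch: flag IBM Java 8 SR5 builds that lack FP7 or later.
# The build-string format (e.g. "sr5fp7") is an assumed convention
# based on how IBM SDK for z/OS labels service refreshes.
needs_java_update() {
  build="$1"
  case "$build" in
    *sr5fp7*|*sr5fp8*|*sr5fp9*|*sr5fp1[0-9]*) return 1 ;;  # SR5 FP7 or later: not affected
    *sr5*) return 0 ;;                                     # SR5 below FP7: affected
    *) return 1 ;;                                         # other SR levels: not flagged here
  esac
}
```

On a live system, you could feed the function the build token extracted from `java -version 2>&1`.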
2.2.2 Prerequisites for Linux
The following hardware and software are required for installing Machine Learning for z/OS in the Linux environment:
Three x86 64-bit servers
Red Hat Enterprise Linux Server 7.2 or later
Open JDK 1.8.0 or later
2.2.3 Prerequisites for Linux on Z
The following hardware and software are required for installing Machine Learning for z/OS in the Linux on Z environment:
Three s390x 64-bit servers that run on an LPAR of a z14, z13, IBM z13s®, zEnterprise EC12, zEnterprise BC12, LinuxONE Emperor, or LinuxONE Rockhopper system
Red Hat Enterprise Linux Server 7.2 or later
Open JDK 1.8.0 or later
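A quick way to validate the common Linux and Linux on Z prerequisites on each server is to script the checks. The helper names below are illustrative; in practice you would feed them values read from /etc/os-release and from `uname -m` on each node.

```shell
# Sketch: verify that a node meets the minimum OS level (RHEL 7.2)
# and runs on a supported architecture before starting the installer.
rhel_ok() {      # usage: rhel_ok <major.minor>, e.g. rhel_ok 7.4
  major=${1%%.*}; minor=${1#*.}
  [ "$major" -gt 7 ] || { [ "$major" -eq 7 ] && [ "$minor" -ge 2 ]; }
}
arch_ok() {      # usage: arch_ok <arch>  (x86_64 for Linux, s390x for Linux on Z)
  case "$1" in x86_64|s390x) return 0 ;; *) return 1 ;; esac
}
# Example on a real node:
#   rhel_ok "$(. /etc/os-release; echo "$VERSION_ID")" && arch_ok "$(uname -m)"
```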
2.3 System capacity
The correct system capacity for the correct workload is essential to maximizing the value of Machine Learning for z/OS. A workload is typically defined by the number of concurrent model creation jobs run by the Jupyter Notebook or the Machine Learning for z/OS visual model builder, the size of the model training data set (in GB), and the number of model training data features. Each model creation job includes the tasks for data loading, data transformation, data visualization, feature extraction, feature transformation, model training, and model evaluation. Carefully plan adequate system capacity in hardware, processing power, and disk storage based on the anticipated needs of your enterprise workload.
2.3.1 Basic system capacity
The scoring and training services of Machine Learning for z/OS run on z/OS, and its management services, user interface, and administration dashboard run on Linux or Linux on Z. These services require a minimum level of system capacity. Although you can use the basic system capacity to run any reasonable workload, the rule of thumb is that the heavier the workload is, the more capacity you need to allocate.
If you choose the combination of z/OS and Linux for installation, ensure that the systems have the basic capacity listed in Table 2-1.
Table 2-1 Basic system capacity for installation on z/OS and Linux
| Hardware | Number of LPARs/servers | CPU (per LPAR/server) | Memory (GB, per LPAR/server) | DASD/disk space (GB, per LPAR/server) |
|---|---|---|---|---|
| IBM z Systems® | 1 LPAR | 4 zIIP processors, 1 general purpose processor | 100 | 50 |
| Linux system | 3 x86 64-bit servers | 8 cores | 48 | 250 (plus a minimum of 650 GB secondary storage for each server) |
If you choose the combination of z/OS and Linux on Z for installation, ensure that the systems have the basic capacity listed in Table 2-2.
Table 2-2 Basic system capacity for installation on z/OS and Linux on Z
| Hardware | Number of LPARs/servers | CPU (per LPAR/server) | Memory (GB, per LPAR/server) | DASD/disk space (GB, per LPAR/server) |
|---|---|---|---|---|
| z Systems | 1 LPAR | 4 zIIP processors, 1 general purpose processor | 100 | 50 |
| Linux on Z system | 3 s390x 64-bit servers | 3 IFL processors | 48 | 250 (plus a minimum of 650 GB secondary storage for each server) |
For best performance in either installation scenario, consider dedicating an LPAR to Machine Learning for z/OS. Also, allocate secondary storage to each Linux or Linux on Z server and configure it with two mount points in XFS format with the ftype option enabled. Make sure that one mount point is allocated a minimum of 300 GB for installer files and the other a minimum of 350 GB for data storage.
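The mount-point layout described above can be captured in a small preparation script. The following dry-run sketch only prints the commands it would run; the device names (/dev/sdb1, /dev/sdc1) and mount paths are hypothetical examples, and the -n ftype=1 option enables the XFS ftype support that the text requires.

```shell
# Dry run: print the commands that would create and mount the two
# XFS mount points each Linux or Linux on Z server needs.
INSTALLER_DEV=/dev/sdb1   # >= 300 GB for installer files (assumed device)
DATA_DEV=/dev/sdc1        # >= 350 GB for data storage (assumed device)
INSTALLER_MNT=/ibm/installer
DATA_MNT=/ibm/data

plan=""
for pair in "$INSTALLER_DEV $INSTALLER_MNT" "$DATA_DEV $DATA_MNT"; do
  set -- $pair                       # split "device mountpoint"
  plan="$plan
mkfs.xfs -n ftype=1 $1
mkdir -p $2
mount $1 $2"
done
echo "$plan"
```

Review the printed commands, then run them with root privileges against the real devices on each server.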
2.3.2 Capacity considerations for training services
Machine learning models are trained with data and algorithms. In general, model training is CPU intensive and can consume most of the CPU available on a given LPAR. The heavier the training workload is, the more CPU is needed. So, allocate enough processors based on your projected workload for the LPAR where Machine Learning for z/OS training services run.
The type of models and algorithms also affects the type of processors you need for model training on z/OS. For example, Spark and MLeap models are typically trained on zIIP processors, and Scikit-learn models are processed primarily on the general processors. Increase the number of processors to process the type of models you build and the type of algorithms you plan to use.
Last but not least, the size of the training data itself constitutes a significant factor in memory usage. Results of repeated tests indicate that memory usage tends to be two to three times the size of the training data, and that number increases further when training jobs are executed concurrently. Therefore, the preferred practice is to allocate adequate memory based on both the size of your training data and the number of concurrent training jobs.
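As a rough planning aid, the rule of thumb above can be turned into a simple estimate. The factor of 3 and the linear scaling per concurrent job are conservative assumptions for illustration, not an official sizing formula.

```shell
# Sketch: estimate training memory from the rule of thumb above
# (2x-3x the training data size, scaled by concurrent jobs).
# The 3x factor and per-job linear scaling are assumptions.
estimate_training_mem_gb() {   # usage: estimate_training_mem_gb <data_gb> <concurrent_jobs>
  data_gb=$1; jobs=$2
  echo $(( data_gb * 3 * jobs ))
}
# Example: 16 concurrent jobs on a 2 GB training data set
estimate_training_mem_gb 2 16    # prints 96 (GB)
```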
2.3.3 Capacity considerations for scoring services
Machine Learning for z/OS processes scoring requests on different processors depending on the type of models. For example, while scoring requests for Scikit-learn models are processed on the general processor, those for Spark, MLeap, and PMML models are handled on zIIP processors. So, take into account the type of models that you develop and allocate the appropriate type of processors for the LPAR where scoring services run.
If high availability is essential to your business, consider using a scoring service cluster. In such a cluster, multiple instances of a scoring service share the same URI, with each running on a different LPAR of a sysplex. The cluster uses a round-robin algorithm of the sysplex distributor (SD) to dispatch scoring requests across the LPARs. The cluster processes all the scoring requests as long as one of the LPARs is operational.
2.3.4 Capacity considerations for performance
System response time is a key performance indicator in machine learning operations. The response time of Machine Learning for z/OS services generally corresponds to the availability of system capacity, as evidenced by test results in the following example (see Table 2-3).
Table 2-3 System response time corresponds to availability of system capacity
System response time is shown in minutes for each number of concurrent model creation jobs (8, 16, 32, and 64).

| Processors | Size of data set (GB) | Number of data features | 8 jobs | 16 jobs | 32 jobs | 64 jobs |
|---|---|---|---|---|---|---|
| 4 zIIP, 1 GCP | 2 | 100 | 13 | 27 | 56 | 135 |
| 4 zIIP, 1 GCP | 4 | 200 | 21 | 46 | 94 | 238 |
| 8 zIIP, 1 GCP | 2 | 100 | 9 | 15 | 31 | 68 |
| 8 zIIP, 1 GCP | 4 | 200 | 14 | 21 | 44 | 101 |
In this example, if four zIIPs and one GCP are allocated to run the workload of 16 concurrent jobs, one training data set of 2 GB in size, and 100 data features, it can take the system up to 27 minutes to complete the jobs. If eight zIIPs are allocated for the same workload, system response time can be reduced to 15 minutes. Although the actual results of your system performance can vary, the example demonstrates the positive correlation between system capacity and system response time. In other words, adequate allocation of system capacity improves system response time when the same or similar workload is being processed. Consider increasing your system capacity to improve the response time and thus the overall performance of Machine Learning for z/OS services.
2.4 Installation options on z/OS
Machine Learning for z/OS offers flexibility in terms of where to install the training and scoring services on z/OS. Depending on your business need and the system capacity you plan to allocate, carefully assess the following installation options and choose one that satisfies your machine learning workload while achieving the best performance without exceeding system capacity:
Training and scoring services on the same LPAR
Training and scoring services on different LPARs
Training services on an LPAR and scoring services on an LPAR cluster
Machine Learning for z/OS uses z/OS Mainframe Data Service (MDS) as both a data connector and a data source. MDS must be on the same LPAR where the machine learning training and scoring services run. If you use MDS in your setup, make sure that MDS and the training and scoring services are installed on the same LPAR or sysplex.
2.4.1 Option 1: Training and scoring services on the same LPAR
This option suggests the installation of both training and scoring services on a single LPAR that is dedicated to Machine Learning for z/OS. The sysplex in Figure 2-1 on page 15 includes multiple LPARs, with one handling the machine learning workload exclusively and the others executing existing applications for production. At run time, data is ingested from one or more production systems to the dedicated LPAR for training and scoring services.
Figure 2-1 Installing training and scoring services on the same LPAR
There are several advantages to this option. The installation is straightforward with all the machine learning component systems and services going to the same location on z/OS. Also, the option does not impact the performance of the existing production systems in the sysplex. Most importantly, with careful workload balancing for training and scoring requests, the services can share and maximize the use of system resources for better performance.
The disadvantage of this option is the potentially negative impact on the performance of scoring services. Both data ingestion for training and scoring requests come from other production systems, and heavy network traffic between the LPARs might slow down the responses of those services. Consider this option if your machine learning workload is not heavy and if you want to keep the production systems for machine learning and other applications separate.
2.4.2 Option 2: Training and scoring services on different LPARs
This option suggests the installation of machine learning training and scoring services on separate LPARs with existing production systems. Figure 2-2 shows an example of this installation option.
Figure 2-2 Installing training and scoring services on different LPARs
In this example, the installation of training and scoring services spreads across three different LPARs in the sysplex. All of them coexist with other applications on their respective production systems where the data lives. The training services run on the same LPAR as the Db2, IMS, or VSAM system that holds the data for model training. The scoring services run on the same LPARs where scoring requests originate and can respond to those requests with minimal performance impact.
This installation option addresses the shortcomings in the first option. The biggest upside is that it uses and optimizes the use of existing system resources on each LPAR while eliminating the potential performance impact due to heavy network traffic. Consider the option particularly when fast elapsed time for both scoring and training services is essential to the operation of your production systems.
2.4.3 Option 3: Training services on an LPAR and scoring services on an LPAR cluster
This option is similar to the second one in terms of installing the training and scoring services on separate LPARs. The difference, which is significant, lies in the suggestion that the scoring services be installed on an LPAR cluster. Figure 2-3 on page 17 shows the layout of this installation option.
Figure 2-3 Installing training services on an LPAR and scoring services on an LPAR cluster
In this example, the training services run on a dedicated LPAR in a sysplex, and the scoring services are installed on multiple LPARs in another sysplex which is configured as a scoring service cluster. The sysplex distributor (SD) is used to balance and distribute scoring workloads among multiple instances of scoring services in the cluster. All scoring requests are processed as long as one LPAR in the cluster is up and running. This option delivers high availability and scalability of Machine Learning for z/OS services. Consider this option if your machine learning workload is significantly heavy and high availability and stability are top priorities of your business.
2.5 User IDs and permissions
The Linux or Linux on Z installer of Machine Learning for z/OS uses the default user of each node to install component systems and services but requires user-defined IDs with proper permissions for installation on z/OS. Dedicated user IDs are also required for Machine Learning for z/OS to access Db2 for z/OS and z/OS LDAP with the SDBM backend. Make sure that you identify or create all the required IDs and assign them sufficient privileges, as listed in Table 2-4, before you start the installation.
Table 2-4 User IDs and permissions required for installing Machine Learning for z/OS
Db2 for z/OS authorization ID (<db2_auth_id>)
Description: This authorization ID is used by the Machine Learning services to access Db2 for z/OS.
Required privileges: DBADM authority, which is granted when you run the ALNMLEN sample JCL job.

z/OS LDAP user ID (<zldap_userid>)
Description: This user ID is used by the Machine Learning services to access z/OS LDAP.
Required privileges: RACF SPECIAL authority for validating a new user that you want to add.

z/OS Spark, Jupyter kernel gateway, Apache Toree, and MLz operation handling service user ID (<spark_jupyter_toree_userid>)
Description: This user ID is used for installing and configuring z/OS Spark, the Jupyter kernel gateway, and Apache Toree. This ID is also used for creating, configuring, and starting the operation handling service on z/OS.
Required privileges:
 – Member of IBM RACF® user group <spark-GRP>
 – $SPARK_HOME and $SPARK_OPTS ($SPARK_OPTS="--master spark://<ip_address>:<port>") environment variables included in the user's profile ($HOME/.profile)
 – $IML_HOME environment variable included in the user's profile, which points to <install_dir_zos>
 – Inclusion of the following environment variables in the user's profile:
export ANACONDA_ROOT="<install_dir_anaconda>"
export PATH=$ANACONDA_ROOT/bin:$PATH
export PYTHONHOME=$ANACONDA_ROOT
export FFI_LIB=$PYTHONHOME/lib/ffi
export LIBPATH=$PYTHONHOME/lib:$LIBPATH
 – Permission to read and write to <install_dir_zos>/configuration and subdirectories
 – Permission to read and write to <install_dir_zos>/iml-library/tmp
 – Permission to read and write to <install_dir_zos>/imlpython and subdirectories
 – Permission to read and write to <install_dir_zos>/ophandling and subdirectories
 – Permission to write to <install_dir_zos>/iml-library/output
 – Permission to read <install_dir_zos>/iml-library
 – Permission to read <install_dir_zos>/iml-library/brunel
 – Permission to write to <install_dir_anaconda>

CICS region owner or user ID (<cics_region_userid>)
Description: This user ID is used to start and run the scoring service in a CICS region.
Required privileges: Permission to read and write to <install_dir_zos>/cics-scoring and subdirectories, and to <JVMPROFILEDIR>/ALNSCSER.jvmprofile.

Machine Learning scoring service user ID (<mlz_scoring_userid>)
Description: This user ID is used for installing and configuring the scoring service and for starting the service servers.
Required privileges:
 – Member of RACF user group <spark-GRP>
 – $SPARK_HOME and $SPARK_CONF_DIR environment variables included in the user's profile
 – $PYTHONHOME environment variable included in the user's profile
 – $JAVA_HOME/bin defined in the $PATH environment variable in the user's profile
 – READ access to the BPX.FILEATTR.APF and BPX.FILEATTR.PROGCTL facilities
 – Permission to write to <install_dir_zos>
For ease of installation and post-installation access control, consider using the same user ID for installing the Machine Learning for z/OS operation handling services, z/OS Spark, z/OS Anaconda, the Jupyter kernel gateway, and Apache Toree. If you prefer to use different IDs, consider applying the same naming convention, such as MLZ(TYPE), to create MLZSPARK, MLZLDAP, and MLZSCORS. The naming convention makes it easier to administer and monitor the activities of these IDs.
2.6 Networks, ports, and firewall configuration
Machine Learning for z/OS implements SSL/TLS protocols to secure network communications across component systems and uses Kubernetes to manage security policies in a cluster. The networks use dedicated ports, some of which are predefined. Make sure that you reserve the required ports for Machine Learning for z/OS and configure your network firewall accordingly.
2.6.1 Network requirements
The Linux or Linux on Z installer sets up a Kubernetes cluster. The cluster is configured to provide high availability to Machine Learning for z/OS services, including the primary web user interface and the administration dashboard. Make sure that you meet the following network requirements for this cluster:
All nodes in the cluster run in the same subnet, with each assigned a private static IP address.
Each node is associated with a gateway within the subnet, regardless of whether the gateway allows outbound network access.
The subnet itself is assigned a private static IP address that is to be used as the proxy server address. The IP address must not be in use during the installation.
The SELinux module on each node is set to “permissive” or “enforcing” (SELINUX=permissive or SELINUX=enforcing) in the /etc/selinux/config file. Restart the node after any change to the setting.
The cluster requires two unique IP ranges in CIDR format, one to be used by the Kubernetes service network and the other by the cluster overlay network.
 – Kubernetes service network: A Kubernetes service is an abstraction which defines a logical set of pods and a policy. It redirects the network traffic to each of the pods at the service's backend. Kubernetes manages the IP range and assigns an IP address to each service. You need to assign an IP range for the Kubernetes service network.
 – Cluster overlay network: A pod is the basic building block of Kubernetes, which encapsulates an application container. Kubernetes relies on an overlay network to manage how groups of pods are allowed to communicate with each other and other endpoints. You need to assign an IP range for the cluster overlay network.
Make sure that the IP ranges are represented in CIDR notation. CIDR specifies an IP address range by the combination of an IP address and its associated network mask. Take the range 192.168.0.0/16 as an example: 192.168.0.0 is the network IPv4 address, and the number 16 indicates that the first 16 bits are the network part of the address, leaving the remaining 16 bits for host addresses. With a subnet mask of 255.255.0.0, the range spans 192.168.0.0 through 192.168.255.255.
Carefully select the required IP ranges. The ranges must not overlap with each other. The IP addresses in the ranges must not conflict with those used by the Machine Learning for z/OS proxy server or your local networks.
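A small script can verify that two candidate ranges do not overlap before you commit to them. The following sketch implements the interval check with plain IPv4 arithmetic; it assumes well-formed a.b.c.d/len input and does no validation.

```shell
# Sketch: check whether two IPv4 CIDR ranges overlap.
ip2int() {     # convert a dotted-quad address to a 32-bit integer
  IFS=. read -r a b c d <<EOF
$1
EOF
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}
cidr_overlap() {   # usage: cidr_overlap <cidr1> <cidr2>; exit 0 if the ranges overlap
  net1=${1%/*}; bits1=${1#*/}
  net2=${2%/*}; bits2=${2#*/}
  start1=$(ip2int "$net1"); end1=$(( start1 + (1 << (32 - bits1)) - 1 ))
  start2=$(ip2int "$net2"); end2=$(( start2 + (1 << (32 - bits2)) - 1 ))
  [ "$start1" -le "$end2" ] && [ "$start2" -le "$end1" ]   # interval intersection test
}
```

For example, `cidr_overlap 192.168.0.0/16 10.0.0.0/16` fails (no overlap), so that pair from Table 2-5 is a valid selection.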
Table 2-5 shows an example for selecting an internal IP range.
Table 2-5 Example of internal IP ranges

| Scenario | Host network/IP | Cluster overlay network | Kubernetes service network |
|---|---|---|---|
| Host has a single IP | 172.16.x.x | 192.168.0.0/16 | 10.0.0.0/16 |
| Host IP conflicts with the overlay network default | 192.168.x.x | 172.16.0.0/16 | 10.0.0.0/16 |
| Host has more than one IP address | 192.168.x.x, 10.3.x.x | 172.16.0.0/16 | 172.17.0.0/16 |
2.6.2 Ports
Machine Learning for z/OS requires dedicated ports for network communication across component systems and services. Some ports are predefined, and others can be user defined. Make sure that you configure the required ports and open them in your firewall, as listed in Table 2-6.
Table 2-6 Ports for systems and services on z/OS and Linux or Linux on Z
| System or service | Port number | Outbound | Inbound | Note |
|---|---|---|---|---|
| Db2 for z/OS | User defined | Linux or Linux on Z system | Db2 subsystem | The assignment of this port depends on your Db2 configuration. |
| LDAP | User defined (default: 636) | Linux or Linux on Z system | z/OS system | |
| z/OS Spark master | User defined (default: 7077) | Linux or Linux on Z system | z/OS system | |
| z/OS Spark master REST API | User defined (default: 6066) | Linux or Linux on Z system | z/OS system | |
| Operation handling service | User defined (default: 10080) | Linux system | z/OS system | |
| Scoring service | User defined | Linux or Linux on Z system | Liberty Profile for z/OS system | The assignment of this port depends on the configuration of the Liberty Profile server and the scoring service. |
| Jupyter kernel gateway | One user-defined port (default: 8889) | Linux or Linux on Z system | Apache Toree kernel | |
| Apache Toree kernel | User defined (a range of consecutive port numbers) | None | z/OS system | Each Toree kernel must be assigned 5 consecutive port numbers, and all port numbers in the range must be consecutive. For example, if you use eight Toree kernels in your setup, you must prepare a total of 40 ports starting from the first port number. |
| Repository service | 12501 | Linux system, Liberty Profile for z/OS system | Linux system | |
| Deployment service | 14150 | Linux system, Liberty Profile for z/OS system, Python run time for z/OS | Linux system | |
| Batch scoring service | 12200 | Linux system, z/OS Spark system | Linux system | |
| RabbitMQ service | 5671, 5672 | Linux system | Linux system | |
| Kubernetes ETCD | 2379 | Linux system | Linux system | |
| Feedback service | 14350 | Linux system | Linux system | |
| Ingestion service | 13100 | Linux system | Linux system | |
| Pipeline service | 13300 | Linux system | Linux system | |
| Machine Learning for z/OS UI | 443 | Your network | Linux system | |
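The Apache Toree port requirement (five consecutive ports per kernel) can be planned with a small helper that computes the full range to reserve. The starting port below is an example value, not a product default.

```shell
# Sketch: compute the consecutive port range to reserve for Apache
# Toree kernels (5 ports per kernel). The starting port is an
# example; choose one that is free in your environment.
toree_port_range() {   # usage: toree_port_range <first_port> <num_kernels>
  first=$1; kernels=$2
  total=$(( kernels * 5 ))
  last=$(( first + total - 1 ))
  echo "$first-$last ($total ports)"
}
# Example: eight kernels starting at port 10000
toree_port_range 10000 8    # prints 10000-10039 (40 ports)
```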
2.7 Firewall configuration
Instead of a traditional server firewall, Kubernetes uses iptables rules for cluster communication. So, disable the firewall on the cluster nodes. If an extra firewall must be in place, set it up around the cluster and open the ports in your local network that need to interact with the cluster, such as port 443 for web access.
Ensure that every node in the cluster has a single local host entry in the /etc/hosts file that corresponds to the 127.0.0.1 address. Do not allow any daemon or script process or any cron job to modify the hosts file, IP tables, routing rules, or firewall settings during or after the installation.
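The single-loopback-entry requirement can be verified with a short check. The function takes the hosts file path as an argument so it can be tested against a copy before being pointed at /etc/hosts on each node.

```shell
# Sketch: verify that a hosts file contains exactly one entry for
# the 127.0.0.1 address, as required for each cluster node.
single_loopback_entry() {   # usage: single_loopback_entry <hosts_file>
  count=$(grep -c '^127\.0\.0\.1[[:space:]]' "$1" || true)
  [ "$count" -eq 1 ]
}
# Example on a real node:
#   single_loopback_entry /etc/hosts || echo "fix /etc/hosts before installing"
```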