Data transformation
This chapter explains in-database transformation (IDT), part of the real-time analytics solution using SPSS Modeler and IBM DB2 Analytics Accelerator for z/OS capabilities.
This chapter contains the following topics:
 
 
Note: IBM DB2 Analytics Accelerator for z/OS, DB2 Appliance, and the term Accelerator are used interchangeably throughout this book.
4.1 Introduction
Data scientists need to work with good data to ensure their predictive models are accurate. However, data used to create models is often not readily consumable by data scientists. The data might contain missing or invalid entries, thus requiring some form of transformation to prepare the data before a model can be built. Data aggregation or generic transformation might be required before a data mining algorithm can be used. Accordingly, a data scientist will apply an appropriate method to evaluate the data and transform the data as required for the later steps in model creation and in model scoring.
After all the source data is loaded into the Accelerator, the data is denormalized, cleansed, and prepared for modeling by using in-database transformation.
Figure 4-1 depicts the schematic of data load and data transformation processes. The DB2 Analytics Accelerator Loader for z/OS tool is discussed in Chapter 3, “Data integration using IBM DB2 Analytics Accelerator Loader for z/OS” on page 41.
Figure 4-1 does not show DataStage, which can also be used to perform extract, transform and load (ETL) and extract, load, and transform (ELT) from external data sources and use in-database transformation on DB2 Analytics Accelerator.
Figure 4-1 Data load and data transformation processes
This chapter explains how the data transformation phase of the analytics lifecycle can take advantage of in-database transformation on IBM DB2 Analytics Accelerator for z/OS to transform data from one or more DB2 for z/OS tables (and from external data, through one or more intermediate tables) into an accelerator-only table (AOT) that is optimized for analytical processing.
4.1.1 Accelerator-only table (AOT)
The data preparation phase of the analytics lifecycle requires the presence of an analytics work environment, in which the data scientists can work with data and perform training and validation of predictive analytics models.
On platforms other than z/OS, the data preparation phase usually involves extracting data from OLTP data sources (often on z/OS), and then loading and transforming (ELT or ETL) the data into the work environment. But in a z/OS environment, performing in-database transformation (IDT) in the database accelerator (if the data source is on DB2 for z/OS) is possible. IBM DB2 Analytics Accelerator for z/OS allows data scientists to create the work environment on the DB2 appliance through AOTs. This is done transparently, so the data scientists do not need to be aware of the details of this work environment.
AOTs are tables that do not originate from DB2 base tables. Their data exists only on the Accelerator and not in DB2 for z/OS.
Figure 4-1 on page 84 shows the following information:
Data from other sources, including VSAM and IMS on the z/OS platform, and social media, Oracle, SQL Server, and other databases on other platforms, can be extracted and loaded into AOTs by using the DB2 Analytics Accelerator Loader for z/OS tool.
The data load and data transformation steps in our proposed solution: data from external sources can be loaded into AOTs with the help of the DB2 Analytics Accelerator Loader for z/OS tool. Those AOTs can then be joined with replicas of DB2 for z/OS tables that coexist in the Accelerator (also called accelerated tables). The joined result set can then go through data preparation and data transformation on the Accelerator, eventually resulting in a single AOT that can be used as the input to the analytics modeling algorithms.
AOT versus accelerated replica tables
When a new AOT is created, an entry is added to the DB2 for z/OS catalog table SYSIBM.SYSTABLES to distinguish an AOT from regular tables and views by setting the TYPE column value to “D” (definition-only). The term proxy table on DB2 for z/OS is used to signify that only the definition exists on DB2 for z/OS and the corresponding data exists only on the Accelerator.
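For example, you can list the AOT definitions that exist in your DB2 for z/OS catalog with a simple query like the following sketch (adjust the predicates to your naming conventions):

```sql
-- List definition-only (accelerator-only) tables in the DB2 catalog
SELECT CREATOR, NAME
FROM SYSIBM.SYSTABLES
WHERE TYPE = 'D';
```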
An AOT is created when the CREATE TABLE statement is issued in DB2 with the IN ACCELERATOR <name> clause, where <name> is the name of the Accelerator. If the IN ACCELERATOR clause is not specified, the table is created only in DB2 for z/OS. If you need to accelerate a table that exists only in DB2, you must add it to the accelerator and then load it with the data from the base table. This accelerated replica table on DB2 Analytics Accelerator can be kept in sync with the base DB2 for z/OS table by using change data capture (CDC) based replication, which is part of the DB2 Analytics Accelerator solution. However, an AOT has no matching data on DB2 for z/OS, so you cannot use CDC-based replication on AOTs.
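As a minimal sketch, an AOT definition might look like the following statement, where the accelerator name MYACCEL and the table and column names are placeholders for this example:

```sql
-- The table definition is stored in the DB2 catalog;
-- the data exists only on the Accelerator
CREATE TABLE CARDUSR.PAYHIST_WORK
  (CUST_ID  INTEGER,
   PAY_AMT  DECIMAL(11,2))
  IN ACCELERATOR MYACCEL;
```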
For additional information, see Chapter 2 of Accelerating Data Transformation with IBM DB2 Analytics Accelerator for z/OS, SG24-8314.
The AOTs are used by SPSS Modeler in the data transformation phase in performing the following tasks:
In-database caching, which can be used to improve the performance in certain data transformation scenarios
Storing the transformed data (temporarily), which can then be used as an input to build the predictive model
 
Note: The temporary AOTs created during the data transformation phase are automatically deleted as soon as the predictive model is built successfully.
4.1.2 Enabling in-database processing on SPSS Modeler client
Completing the user preference setup on the SPSS Modeler client to enable in-database analytics takes about 30 seconds. By changing the EnableIDAA parameter to true in the user.prefs (PREFS) file, you enable the in-database transformation and in-database modeling features of your SPSS Modeler.
A sample user preferences listing with the EnableIDAA=true setting on the user.prefs PREFS file (located in the users home directory in the directory path shown) is in 2.2.3, “SPSS Modeler client” on page 25.
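The relevant fragment of the user.prefs file is a single line (other entries in your file vary by installation and are omitted here):

```
EnableIDAA=true
```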
This setting allows the SPSS Modeler to execute all qualifying nodes directly on IBM DB2 Analytics Accelerator for z/OS. How to check whether the nodes in your SPSS stream qualify for in-database execution is described in 4.5.1, “SQL Preview option to verify in-database processing” on page 100.
4.2 SQL pushback in SPSS Modeler
Whenever an IBM SPSS Modeler stream reads data from the DB2 Accelerator for processing, it can improve the efficiency of the operation by pushing back the SQL statements to execute in the Accelerator database.
The SQL pushback feature of SPSS Modeler generates SQL statements that can be pushed back to (that is, executed in) the DB2 for z/OS Accelerator database.
The DB2 Accelerator can then perform data sampling, cleansing, filtering, aggregating, joining, and so on. This is commonly referred to as in-database transformation (IDT) or in this case, accelerated in-database transformation.
4.2.1 How SQL generation works
The initial fragments of a stream leading from the database source nodes are the main targets for SQL generation. When a node is encountered that cannot be compiled to SQL, the data is extracted from the database and subsequent processing is performed by IBM SPSS Modeler server.
During stream preparation and prior to execution, the SQL generation process happens as follows:
1. The server reorders streams to move downstream nodes into the “SQL zone” where it can be proven safe to do so. (This feature can be disabled on the server.)
2. Working from the source nodes toward the terminal nodes, SQL expressions are constructed incrementally. This phase stops when a node is encountered that cannot be converted to SQL or when the terminal node is converted to SQL. At the end of this phase, each node is labeled with an SQL statement if the node and its predecessors have an SQL equivalent.
3. Working from the nodes with the most complicated SQL equivalents back toward the source nodes, the SQL is checked for validity. If the database does not return an error code on the prepare operation, the SQL is executed. Otherwise, the SPSS Modeler might try different SQL. If the SPSS Modeler cannot generate SQL that the database can process, it falls back to enumerating the source tables (or intermediate results from accelerator-only tables, in case upstream nodes were already processed correctly) and then performs the operation that it failed to express in SQL directly in the SPSS Modeler server.
4. Nodes for which all operations have generated SQL are highlighted in purple on the stream canvas. For more information, read 4.5.1, “SQL Preview option to verify in-database processing” on page 100.
5. Based on the results, you might want to further reorganize your stream where appropriate to take full advantage of database execution. For instance, you might have to enable node caching in some cases to take advantage of in-database transformation.
SQL generation starts from the source node and moves downstream; if any node in the middle of the stream fails to generate SQL, SQL generation stops for that node and all nodes downstream of it. Processing for those downstream nodes continues on the SPSS Modeler server.
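As a simplified illustration of the result, a stream fragment of Database source, Select, and Aggregate nodes might be collapsed into a single pushed-back statement similar to the following sketch (the column names and predicate are hypothetical; the SQL that SPSS Modeler actually generates differs in detail):

```sql
-- The Select node becomes the WHERE clause; the Aggregate node
-- becomes GROUP BY with aggregate functions
SELECT CUST_ID,
       SUM(PAY_AMT) AS PAY_AMT_SUM,
       COUNT(*)     AS RECORD_COUNT
FROM CARDUSR.PAYHISTORY
WHERE PAY_AMT > 0
GROUP BY CUST_ID;
```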
4.2.2 Where improvements can occur with IDT using Accelerator
Products like DataStage and Cognos can use in-database processing using DB2 Analytics Accelerator to accelerate data transformation and report visualization queries respectively. In addition, SQL optimization in SPSS improves performance in a number of data operations by taking advantage of in-database transformation in the following operations:
Joins (merge by key): Join operations are performed within the database, which avoids transferring the input tables to the SPSS Modeler server.
Aggregation: The aggregate nodes use aggregation to produce their results. Summarized data uses considerably less bandwidth than the original data.
Selection: Choosing records based on certain criteria reduces the quantity of records.
Sorting: Sorting records is a resource-intensive activity that is performed more efficiently in a database.
Field derivation: New fields are generated more efficiently in a database.
Field projection: IBM SPSS Modeler server extracts only fields that are required for subsequent processing from the database, which minimizes bandwidth and memory requirements. The same is also true for superfluous fields in flat files: although the server must read the superfluous fields, it does not allocate any storage for them.
Scoring: SQL can be generated from decision trees, linear regression models and so on.
In general, because IDT minimizes data movement, IDT streams outperform non-IDT (that is, traditional) streams. IDT pushes the transformation operations to the database and, on top of that, takes advantage of the massively parallel architecture of the Accelerator. However, if most of your stream does not qualify for IDT, the transformation performance might not be as good as expected.
4.3 Nodes supporting SQL generation for DB2 Accelerator
The nodes are displayed at the bottom of the stream canvas on the SPSS Modeler. Each palette tab contains a collection of related nodes used for the phases of stream operations. This section describes all the nodes that might qualify for in-database execution on DB2 for z/OS with DB2 Analytics Accelerator. The nodes are tabulated for each palette tab.
4.3.1 Source palette tab
The nodes under this palette bring data into IBM SPSS Modeler. Table 4-1 lists the only source palette node that may qualify for in-database execution on DB2 for z/OS with DB2 Analytics Accelerator for z/OS.
Table 4-1 Source palette node that supports in-database transformation
Node
Node description
 
Database node: This node is for specifying tables and views to be used for further analysis. It is the only node on the sources palette that can enable SQL pushback into DB2 Analytics Accelerator for in-database processing.
If you want to use any other data source, consider using other external tools, such as DataStage or DB2 Analytics Accelerator Loader, to load that source data into an AOT, and then use this Database node to perform your analysis. You can also add generic SQL to this node.
Database source node: SQL query as input
With the SPSS Modeler Database source node, one possibility is to add external database tables to the stream, which are later processed by using a graphical, flow-oriented programming style. You can also add an SQL SELECT statement directly, as shown in Figure 4-2 on page 89.
The steps are as follows:
1. Double-click Database Source Node.
2. Click the Data tab.
3. Select SQL Query as the mode.
4. Type (or paste) the SQL SELECT statement into the box below the Data source area. You can then act on that query by using the buttons shown in Figure 4-2 on page 89, such as Save Query, Apply, or OK.
Entering SQL directly can be handy if you are more comfortable expressing a transformation in SQL than by chaining multiple SPSS Modeler nodes. Also, if you have a third-party tool that can generate SQL, pasting such generated SQL can be a useful integration point with SPSS Modeler.
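For example, you might paste a statement like the following into the SQL Query box (the table is from our sample scenario; the column names and date predicate are hypothetical):

```sql
SELECT CUST_ID, PAY_DATE, PAY_AMT
FROM CARDUSR.PAYHISTORY
WHERE PAY_DATE >= '2015-01-01'
ORDER BY CUST_ID, PAY_DATE;
```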
Figure 4-2 Database source node with SQL Query input
4.3.2 Record Ops palette tab
The nodes under this palette perform operations on data records, such as selecting, merging, sampling, and appending. Table 4-2 lists the Record Ops palette nodes that may qualify for in-database execution on DB2 for z/OS with DB2 Analytics Accelerator.
Table 4-2 Record Ops nodes that support in-database transformation
Node
Node description
 
Select node: This node supports generation only if SQL generation for the select expression itself is supported.
 
Sample node: This node is used to sample the input rows randomly. It supports SQL generation in DB2 Analytics Accelerator by using the RAND() function in the WHERE clause of the generated SQL.
 
Sort node: This node is used to sort the input rows. It supports SQL generation in DB2 Analytics Accelerator by using the ORDER BY clause on the generated SQL.
 
Distinct node: This node is used to remove duplicate input rows. It supports SQL generation in DB2 Analytics Accelerator by using the DISTINCT clause on the SELECT statement. A Distinct node with the (default) mode Create a composite record for each group selected does not support SQL optimization. Note: If you have a Sort node followed by the Distinct node, then the Distinct node cannot be accelerated.
 
Aggregate node: SQL generation support for aggregation depends on the data storage type.
 
RFM Aggregate node: SQL optimization is only enabled when 'Read from Bin Values tab if available' is used. It does not support SQL generation if the date of the second or third most recent transaction is saved, or if only recent transactions are included. However, including recent transactions does work if the datetime_date(YEAR,MONTH,DAY) function is pushed back.
 
Merge node: Supports SQL generation for merge by condition. It generates a JOIN construct. Non-matching input fields can be renamed by means of a Filter node, or the Filter tab of a source node.
 
Append node: Supports SQL generation if inputs are unsorted. Note: SQL optimization is possible only when your inputs have the same number of columns.
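For instance, the generated SQL for a Sample node that draws roughly a 10% random sample has approximately this shape (a sketch; the predicate that SPSS Modeler actually generates differs):

```sql
-- RAND() returns a value between 0 and 1 for each row
SELECT *
FROM CARDUSR.PAYHISTORY
WHERE RAND() < 0.1;
```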
4.3.3 Field Ops palette tab
The nodes under this palette perform operations on data fields, such as filtering, deriving new fields, and determining the measurement level for given fields. Table 4-3 lists all Field Ops palette nodes that may qualify for in-database execution on DB2 for z/OS with DB2 Analytics Accelerator. All these nodes can generate SQL, if some restrictions are met and the SQL is accepted by DB2.
Table 4-3 Field Ops palette nodes that support in-database transformation
Node
Node description
Auto Data Prep node: Before you can build a model, you need to specify which fields you want to use as targets and as inputs. This node prepares fields for analysis: it creates a new field containing the adjustments or transformations rather than replacing the existing values and properties of the old field. The old field is not used in further analysis; its role is set to None. This node automatically creates accelerator-only tables to accomplish the accepted data preparation changes. For more details, read “Auto Data Prep Node” on page 98.
 
Filter node: Allows fields to be renamed or removed.
 
Derive node: Allows new fields to be generated based on existing fields. Supports SQL generation only if SQL generated for the derive expression is supported.
 
Filler node: Allows values in existing fields to be replaced by new values. For instance, if NULL is replaced with average.
 
Reclassify node: Recategorizes set values. Restrictions might apply.
Binning node: Use this node to automatically create new nominal fields based on the values of one or more existing continuous (numeric range) fields. For example, you can transform a continuous income field into a new categorical field (the naive Bayes algorithm requires categorical inputs) containing income groups of equal width, or groups expressed as deviations from the mean. Sensitive personal information, such as salaries, can then be reported in ranges rather than as actual figures to protect data privacy.
 
RFM Analysis node: The Recency, Frequency, Monetary (RFM) Analysis node enables you to determine quantitatively which customers are likely to be the best ones by examining how recently they last purchased from you (recency), how often they purchased (frequency), and how much they spent over all transactions (monetary).
Partition node: Used to generate a partition field that splits the data into separate subsets or samples for the training, testing, and validation stages of model building. By using one sample to generate the model and a separate sample to test it, you can get a good indication of how well the model will generalize to larger data sets that are similar to the current data. Partitioning must be enabled on the Model Options tab.
Set to Flag node: Used to derive flag fields based on the categorical values defined for one or more nominal fields. For example, your data set might contain a nominal field, BP (blood pressure), with the values High, Normal, and Low. For easier data manipulation, you might create a flag field for high blood pressure, which indicates whether or not the patient has high blood pressure.
Restructure node: Although its function is similar to that of the Set to Flag node, it offers more flexibility. It allows you to create fields of any type (including numeric flags), using the values from another field. You can then perform aggregation or other manipulations with other nodes downstream. The Set to Flag node lets you aggregate fields in one step, which might be convenient if you are creating flag fields.
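As an illustration of the Binning node pushback, an equal-width income binning can be expressed as a derived CASE expression in the generated SQL (a sketch with hypothetical table, field, and boundary values):

```sql
SELECT T.*,
       CASE
         WHEN INCOME < 30000 THEN 'LOW'
         WHEN INCOME < 70000 THEN 'MEDIUM'
         ELSE 'HIGH'
       END AS INCOME_BIN
FROM CARDUSR.CUSTOMER T;
```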
4.3.4 Graphs palette tab
The nodes under this palette graphically display data before and after modeling. Graphs include plots, histograms, web nodes, and evaluation charts. Table 4-4 lists all the Graphs palette nodes that may qualify for in-database execution on DB2 for z/OS with DB2 Analytics Accelerator.
Table 4-4 Graphs palette nodes that support in-database transformation
Node
Node description
 
Graphboard node: SQL generation is supported for the following graph types: Area, 3-D Area, Bar, 3-D Bar, Bar of Counts, Heat map, Pie, 3-D Pie, Pie of Counts. For Histograms, SQL generation is supported for categorical data only. SQL generation is not supported for Animation in Graphboard.
 
Distribution node: Shows the distribution of symbolic values in a data set. This node is frequently used before manipulation nodes to explore the data and correct any imbalances. The Distribution node is unusual in that it produces both a graph and a table to analyze your data.
 
Web node: Shows the strength of relationships between values of two or more symbolic fields. The graph displays connections using varying types of lines to indicate connection strength.
 
Evaluation node: Offers an easy way to evaluate and compare predictive models to choose the best model for your application. Evaluation charts show how models perform in predicting particular outcomes.
4.3.5 Database Modeling (Nuggets) palette tab
By definition, all the nodes in the Database Modeling palette qualify for in-database processing. The modeling nuggets on the other modeling palettes might not qualify for offloading to the Accelerator. Each Database Modeling node listed in Table 4-5 invokes a set of IBM Netezza Analytics stored procedures, which can run in DB2 Analytics Accelerator to build the requested model. This can be considered an extension to SQL pushback in the sense that the stored procedures are pushed down to the DB2 Analytics Accelerator. Read more about these nodes in “IBM Netezza Analytics stored procedures” on page 190.
Table 4-5 Database Modeling palette nodes that support in-database modeling
Node
Node description
K-Means is a clustering algorithm to partition a set of entities into a number of clusters.
Naive Bayes is a supervised learning classification algorithm applying Bayes theorem.
Decision Tree is a supervised learning algorithm to create a decision tree model. Decision trees usually have the advantage that the rules they find are easy for humans to understand.
Regression Tree is similar to Decision Tree, except that the target is not a discrete value; instead, it is a continuous real number.
TwoStep Cluster node provides a form of cluster analysis. It can be used to cluster the data set into distinct groups when you do not know what those groups are at the beginning.
4.3.6 Output palette tab
The nodes under this palette produce a variety of output for data, charts, and model results that can be viewed in IBM SPSS Modeler.
4.3.7 Export palette tab
The nodes under this palette produce a variety of output that can be viewed in external applications, such as IBM Cognos for z/OS, IBM SPSS Data Collection or Excel. Table 4-6 lists all the Export palette nodes that may qualify for in-database execution on DB2 for z/OS with DB2 Analytics Accelerator.
With a database export node, you can export data to either a persistent table in DB2 for z/OS or a persistent table on the Accelerator. If the data source that is used has an associated accelerator, a table on that accelerator is extended or created. If the data source does not have an associated accelerator, the data is persisted in DB2 for z/OS. For example, this is useful if some of the transformation done for predictive model creation is also required for predictive model scoring, and the scoring should be done in-database, in DB2 for z/OS, with the same qualities of service as everything else in DB2 for z/OS transaction processing (IBM GDPS®, WLM, DB2 backup, and so on).
Table 4-6 Export palette nodes that support in-database transformation
Node
Node description
 
Database node: Used to specify tables and views to be used for further analysis. This can be used to either insert new rows into an existing AOT or create a new AOT and insert rows.
 
IBM Cognos BI Export node: Used to export data to Cognos BI Server.
4.4 In-database analytics processing effort by components
Assuming that the accelerated tables are loaded on DB2 Analytics Accelerator, Table 4-7 gives an overview of the technical components involved and what processing effort they incur in general.
Table 4-7 In-database analytics processing effort by component tasks
Component
In-database tasks
In-database processing requirement
SPSS Modeler client
GUI to generate SPSS stream.
Minimal processing that occurs only on client workstation.
SPSS Modeler server
1. Convert SPSS stream to SQL.
2. Monitor SQL execution.
Low server CPU resource consumption (if everything is processed in-database).
DB2 for z/OS (with Analytics Accelerator)
1. Check security.
2. Maintain metadata (catalog).
3. Query rewrite.
4. Create wrapper stored procedures.
Minimal, mainly for security checking and some simple forwarding of slightly rewritten SQL.
Netezza-based DB2 Analytics Accelerator for z/OS
1. Do the actual data transformation.
2. Perform the actual queries.
3. Create the analytical models.
High. The massively parallel architecture with FPGAs and so on is optimized for exactly this work, with practically zero MIPS usage on the z/OS side.
The process interaction between the components, listed in Table 4-7, and others involved in in-database transformation is also shown in Figure 2-23 on page 37.
4.4.1 SPSS Modeler client (running on Windows)
The SPSS Modeler client is the GUI used by the data scientist to create SPSS streams. From the Modeler client, you connect to the Modeler server in order to perform SQL Preview and other activities.
4.4.2 SPSS Modeler server
Regardless of whether you deploy the SPSS Modeler server on Linux on z Systems or in a Windows environment, if the streams are executed 100% in-database, then all the Modeler server does is transform the stream into optimized SQL and supervise the execution, slightly changing the generated SQL if the underlying DBMS returns an error condition.
ODBC driver
SPSS Modeler uses the locally installed ODBC driver to connect to the z/OS DB2 database subsystem over TCP/IP network.
4.4.3 DB2 for z/OS with Analytics Accelerator
The DB2 optimizer understands whether a given SQL statement can be pushed down to the Accelerator, and does a SQL statement rewrite in case acceleration is possible. For the sample stream in this book, everything can run in the Accelerator. When executing a remote procedure like IBM Netezza Analytics stored procedures, DB2 Analytics Accelerator guarantees in conjunction with DB2 that the IBM Netezza Analytics procedure on the Accelerator accesses only data the caller is authorized to access in DB2. The privileges are verified with DB2 z/OS.
4.4.4 Netezza based DB2 Analytics Accelerator for z/OS
In short, the Accelerator does all the real work. The DBMS inside the Accelerator must do all the real transformation on the accelerated and accelerator-only tables. For model creation, the installed IBM Netezza Analytics libraries are used (massively parallel).
4.5 SPSS data transformation stream
This section describes the design of the SPSS data transformation streams that can be used in a real-time analytics solution.
Figure 4-3 shows the data transformation stream used on the stream canvas pertaining to our sample scenario.
Figure 4-3 Sample data transformation stream
The first node in the data transformation stream is the source Database node, which points to the CARDUSR.PAYHISTORY table stored in DB2 Analytics Accelerator (an accelerated DB2 for z/OS table).
Double-click the Database node icon, then select the cataloged DSN source from the Data source drop-down list. It is listed as your TSOUSERID@DSNALIAS, shown in Figure 4-4 as name@Data Source.
Figure 4-4 Source database node edit panel
From the database node panel shown in Figure 4-4, you can click Preview to view a sample of the data contained in the CARDUSR.PAYHISTORY table in DB2 Analytics Accelerator. The sample data is displayed in a new window (with 10 rows) as shown in Figure 4-5.
Figure 4-5 Data set Node Preview result set with 10 rows from the DB2 Analytics Accelerator table
Auto Data Prep Node
The second node in the stream, shown in Figure 4-3 on page 96, is the Auto Data Prep (ADP) node, which can help cleanse the data. Several options are offered on the Objectives tab, as shown in Figure 4-6.
Figure 4-6 Objectives tab of the Auto Data Prep Node
The Auto Data Prep node must be initialized before it can be used for further SQL generation. The two ways to initialize the ADP node are as follows:
Open the node, click Analyze Data, select the Analysis tab, and make sure the summaries were generated.
Run the stream twice, which forces the ADP node to be initialized.
The Auto Data Prep operation adds the suffix _transformed to each affected field name, as shown in Figure 4-7. This suffix might cause the length of the field name to be extended beyond the 30-byte limit allowed in IBM DB2 11 for z/OS.
Figure 4-7 Preview window showing the transformed field names from the Auto Data Prep node
Because of the extended length, another data cleansing step is needed to shorten the names of the transformed fields. To rename the transformed fields, use a Filter node and then edit the field names:
1. Double-click the Filter node icon (shown as the third node in Figure 4-3 on page 96).
2. The Filter window opens (Figure 4-8). Double-click the field names in the right column to edit them.
Figure 4-8 Filter node edit panel to rename the transformed fields
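In the generated SQL, the renames that the Filter node applies become column aliases, along the lines of this sketch (the field names and intermediate table shown are hypothetical):

```sql
-- Shortened aliases keep the names within the 30-byte limit
SELECT PAYMENT_STATUS_transformed AS PAY_STATUS_T,
       PAYMENT_AMOUNT_transformed AS PAY_AMT_T
FROM CARDUSR.PAYHISTORY_ADP;
```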
The last node is from the Export palette; in our case it is the Database node for the CLEANSED_PAYHISTORY table, which is also stored in DB2 Analytics Accelerator. Double-click the Export node icon, and then select the source from the Data source drop-down list, as shown in Figure 4-4 on page 97. The CLEANSED_PAYHISTORY table contains the output of the data transformation stream.
4.5.1 SQL Preview option to verify in-database processing
To verify whether your stream will run in-database or not without executing the stream, you can use the SQL Preview icon (purple hexagon) on your SPSS Modeler client main window, as shown in Figure 4-9.
Figure 4-9 SQL Preview icon to check in-database processing prior to stream execution
This hexagon-shaped icon is enabled automatically when you select a terminal node on your stream canvas; otherwise, it is disabled. A terminal node can be any node from the Output, Graphs, Database Modeling, or Export palette. In the sample stream shown in Figure 4-9, the Database node (CLEANSED_PAYHISTORY table) is taken from the Export palette. Because this table is an AOT, the entire stream qualifies for in-database transformation (Figure 4-10).
4.5.2 Why are my nodes purple
The nodes that qualify for in-database execution change their color to purple during the execution of the stream (or during the SQL Preview operation, as discussed in 4.5.1, “SQL Preview option to verify in-database processing” on page 100). The nodes in Figure 4-3 on page 96 change their color to purple, as shown in Figure 4-10, when the underlying processes can run or are executed in-database (that is, in DB2 Analytics Accelerator in this case).
Figure 4-10 Stream execution in DB2 Analytics Accelerator highlighting in-database nodes
For high-level information about the underlying processes (or SQL operations) that are performed in-database for the sample stream, see 4.6, “Stream messages” on page 101.
4.5.3 Enable cache at node level
For streams run in a database, data can be cached midstream to a temporary table in the Accelerator (AOT). When combined with SQL optimization, this can result in significant gains in performance. For example, the output from a stream that merges multiple tables to create a data mining view can be cached and reused as needed. By automatically generating SQL for all downstream nodes, performance can be further improved.
To take advantage of database caching, both SQL optimization and database caching must be enabled. Note that server optimization settings override those on the client.
For more information about setting optimization options for streams, select Help  Help Topics in the IBM SPSS Modeler Tool bar. A web browser opens to the local IBM SPSS Modeler Help page (no Internet connection is necessary). From there, go to the Table of Contents on the left side and select User’s Guide  Building Streams  Building Data Streams  Working with Streams  Setting Options for Streams.
With database caching enabled, right-click any non-terminal node to cache data at that point, and the cache will be created automatically directly in the database the next time the stream is run. If database caching or SQL optimization is not enabled, the cache will be written to the file system instead, which is not desirable with in-database analytics.
In some situations, caching a node can enable SQL generation for complex streams. For example, DB2 for z/OS does not support more than 500 tables in a single query, but in a proof of concept we successfully ran client streams with several thousand tables by using in-database caching.
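The workaround above can be pictured as breaking one very wide query into stages, each materialized as an intermediate AOT so that no single query exceeds the table limit. The following is a hedged sketch of that idea; the staging table and source table names are illustrative assumptions (only the SBSDEVV4 accelerator name comes from this chapter's examples).

```sql
-- Stage 1: materialize part of the join as an intermediate AOT,
-- reducing the number of tables referenced by any single query.
-- STAGE1_JOIN, CARDUSR.TABLE_001, and CARDUSR.TABLE_002 are hypothetical.
CREATE TABLE "STAGE1_JOIN" (
  "CUST_ID" BIGINT,
  "ATTR_A"  VARCHAR(8)
) IN ACCELERATOR "SBSDEVV4" CCSID UNICODE;

INSERT INTO "STAGE1_JOIN" ("CUST_ID", "ATTR_A")
SELECT T0."CUST_ID", T1."ATTR_A"
FROM CARDUSR.TABLE_001 T0
JOIN CARDUSR.TABLE_002 T1 ON T1."CUST_ID" = T0."CUST_ID";
-- Stage 2 then joins STAGE1_JOIN with the next group of tables,
-- and so on, so no single query references more than 500 tables.
```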
Regardless of whether CDC replication is used on the accelerated tables (source data) in a given stream, subsequent data changes to the underlying upstream tables are not propagated to the cached AOTs.
4.6 Stream messages
The stream messages give you a clear idea of what happens when in-database analytics is in effect.
Stream messages can be viewed from the SPSS Modeler client through the Tools menu by navigating to Stream Properties, as shown in Figure 4-11.
Figure 4-11 Tools menu navigation to view stream execution messages
The Messages section of the Messages tab lists the steps processed behind the scenes when a stream is executed. For illustration, the stream shown in Figure 4-10 on page 100 produces the message summary shown in Figure 4-12.
Figure 4-12 Sample stream messages tab from SPSS client
The letter “i” to the left of each message indicates that the message is informational. The message steps are as follows:
1. Message one shows the IP address of the SPSS server that you are connected to and the automatically generated session identifier.
2. Message two indicates that stream execution started.
3. Messages three through six indicate that the SQL Modeler, connected to the database (in our case, DB2 for z/OS with DB2 Analytics Accelerator), performs these SQL operations:
a. Execute dropping (DROP) the output table, which is an AOT, to ensure that the structure and data of the output table at the end of stream execution are those produced by this stream.
b. Execute creating (CREATE) a table.
c. Preview inserting (INSERT).
d. Execute inserting (INSERT).
4. Message four indicates the creation of the output table as an AOT using the DDL shown in Example 4-1, which you can view in the bottom part of the Messages tab when you click the summary message.
Example 4-1 CREATE AOT statement
CREATE TABLE "CLEANSED_PAYHISTORY" (
  "UNIQUE_ID_transformed" DOUBLE,
  "MDM_CHILDREN_CT_transformed" DOUBLE,
  "ACAUREQ_AUREQ_ENV_M_CMONNM_t" VARCHAR(8),
  "A_AUREQ_ENVT_POI_SYSNM_t" VARCHAR(4),
  "A_AUREQ_ENV_C_CARDPDCTP_t" VARCHAR(4),
  "A_AUREQ_ENV_C_CARDBRND_t" VARCHAR(4),
  "A_AUREQ_TX_MRCHNTCTGYCD_t" VARCHAR(8),
  "A_AUREQ_TX_DT_ACCTTP_t" VARCHAR(4),
  "MDM_CLIENT_IMP_TP_CD_t" VARCHAR(4),
  "A_AUREQ_ENV_P_CRDHLDRV_t" VARCHAR(4),
  "ACAUREQ_AUREQ_ENV_P_ONL_t" VARCHAR(4),
  "A_AUREQ_ENV_CRAUTHNMTD_t" VARCHAR(4),
  "A_AUREQ_ENV_CRAUTHNNTTY_t" VARCHAR(4)
) IN ACCELERATOR "SBSDEVV4" CCSID UNICODE
5. Message five indicates that the SQL preview is done for the consolidated transformed data to be inserted into the output AOT. The SQL statement for this step, as generated by SPSS Modeler, is shown in Example 4-2.
Example 4-2 Sample SQL statement that creates and populates an AOT
INSERT INTO "CLEANSED_PAYHISTORY" ("UNIQUE_ID_transformed","MDM_CHILDREN_CT_transformed","ACAUREQ_AUREQ_ENV_M_CMONNM_t","A_AUREQ_ENVT_POI_SYSNM_t","A_AUREQ_ENV_C_CARDPDCTP_t","A_AUREQ_ENV_C_CARDBRND_t","A_AUREQ_TX_MRCHNTCTGYCD_t","A_AUREQ_TX_DT_ACCTTP_t","MDM_CLIENT_IMP_TP_CD_t","A_AUREQ_ENV_P_CRDHLDRV_t","ACAUREQ_AUREQ_ENV_P_ONL_t","A_AUREQ_ENV_CRAUTHNMTD_t","A_AUREQ_ENV_CRAUTHNNTTY_t")
SELECT ((DOUBLE(1.9030898725523110e-19) * DOUBLE((CAST(T0."UNIQUE_ID" AS FLOAT) - -6.9156586044167104e+16))) + 0.0000000000000000e+00) AS "UNIQUE_ID_transformed",((DOUBLE(7.3055302837382196e-08) * DOUBLE(({fn CONVERT(T0."MDM_CHILDREN_CT",SQL_FLOAT)} - 1.6773860575300001e+07))) + 0.0000000000000000e+00) AS "MDM_CHILDREN_CT_transformed",(CASE WHEN (T0."ACAUREQ_AUREQ_ENV_M_CMONNM" = 'Pins and Balls') THEN '00' WHEN (T0."ACAUREQ_AUREQ_ENV_M_CMONNM" = 'One More Drink') THEN '01' WHEN (T0."ACAUREQ_AUREQ_ENV_M_CMONNM" = 'Don''t be alone') THEN '02' WHEN (T0."ACAUREQ_AUREQ_ENV_M_CMONNM" = 'Fair Rides') THEN '03' WHEN (T0."ACAUREQ_AUREQ_ENV_M_CMONNM" = 'Motor Parts') THEN '04' WHEN (T0."ACAUREQ_AUREQ_ENV_M_CMONNM" = 'FourSeasons') THEN '05' WHEN (T0."ACAUREQ_AUREQ_ENV_M_CMONNM" = 'Allegro') THEN '06' WHEN (T0."ACAUREQ_AUREQ_ENV_M_CMONNM" = 'Donna Affascinante') THEN '07' WHEN (T0."ACAUREQ_AUREQ_ENV_M_CMONNM" = 'Donna Elegante') THEN '08' WHEN (T0."ACAUREQ_AUREQ_ENV_M_CMONNM" = 'Shoes') THEN '09' WHEN (T0."ACAUREQ_AUREQ_ENV_M_CMONNM" = 'Uomo Elegante') THEN '10' WHEN (T0."ACAUREQ_AUREQ_ENV_M_CMONNM" = 'Rinascimento') THEN '11' WHEN (T0."ACAUREQ_AUREQ_ENV_M_CMONNM" = 'M and W') THEN '12' WHEN (T0."ACAUREQ_AUREQ_ENV_M_CMONNM" = 'Family Clothing') THEN '13' WHEN (T0."ACAUREQ_AUREQ_ENV_M_CMONNM" = 'LittleAdult') THEN '14' WHEN (T0."ACAUREQ_AUREQ_ENV_M_CMONNM" = 'OminodiFerro') THEN '15' WHEN (T0."ACAUREQ_AUREQ_ENV_M_CMONNM" = 'SmartKids') THEN '16' WHEN (T0."ACAUREQ_AUREQ_ENV_M_CMONNM" = 'Restaurant') THEN '17' WHEN (T0."ACAUREQ_AUREQ_ENV_M_CMONNM" = 'Eat and Go') THEN '18' WHEN (T0."ACAUREQ_AUREQ_ENV_M_CMONNM" = 'ToBeOrNotTobe') THEN '19' WHEN (T0."ACAUREQ_AUREQ_ENV_M_CMONNM" = 'Books & Paper') THEN '20' WHEN (T0."ACAUREQ_AUREQ_ENV_M_CMONNM" = 'Musicae') THEN '21' WHEN (T0."ACAUREQ_AUREQ_ENV_M_CMONNM" = 'LP and CD') THEN '22' WHEN (T0."ACAUREQ_AUREQ_ENV_M_CMONNM" = 'House of Drapery') THEN '23' WHEN (T0."ACAUREQ_AUREQ_ENV_M_CMONNM" = 'All for your House') THEN '24' WHEN 
(T0."ACAUREQ_AUREQ_ENV_M_CMONNM" = 'Home of Crystal') THEN '25' WHEN (T0."ACAUREQ_AUREQ_ENV_M_CMONNM" = 'Hardware and Equipment') THEN '26' WHEN (T0."ACAUREQ_AUREQ_ENV_M_CMONNM" = 'My Beatiful House') THEN '27' WHEN (T0."ACAUREQ_AUREQ_ENV_M_CMONNM" = 'Plumbing and Heating') THEN '28' WHEN (T0."ACAUREQ_AUREQ_ENV_M_CMONNM" = 'Electric Parts Equ') THEN '29' WHEN (T0."ACAUREQ_AUREQ_ENV_M_CMONNM" = 'Masonry and Plaster') THEN '30' WHEN (T0."ACAUREQ_AUREQ_ENV_M_CMONNM" = 'AllFood') THEN '31' WHEN (T0."ACAUREQ_AUREQ_ENV_M_CMONNM" = 'Incrocio') THEN '32' WHEN (T0."ACAUREQ_AUREQ_ENV_M_CMONNM" = 'TuttoePiu') THEN '33' WHEN (T0."ACAUREQ_AUREQ_ENV_M_CMONNM" = 'Movies') THEN '34' ELSE NULL END) AS "ACAUREQ_AUREQ_ENV_M_CMONNM_t",(CASE WHEN (T0."ACAUREQ_AUREQ_ENVT_POI_SYSNM" = 'POI13') THEN '0' WHEN (T0."ACAUREQ_AUREQ_ENVT_POI_SYSNM" = 'POI11') THEN '1' WHEN (T0."ACAUREQ_AUREQ_ENVT_POI_SYSNM" = 'POI14') THEN '2' WHEN (T0."ACAUREQ_AUREQ_ENVT_POI_SYSNM" = 'POI12') THEN '3' ELSE NULL END) AS "A_AUREQ_ENVT_POI_SYSNM_t",(CASE WHEN (T0."ACAUREQ_AUREQ_ENV_C_CARDPDCTP" = '0006') THEN '0' WHEN (T0."ACAUREQ_AUREQ_ENV_C_CARDPDCTP" = '0005') THEN '1' WHEN (T0."ACAUREQ_AUREQ_ENV_C_CARDPDCTP" = '0004') THEN '2' WHEN (T0."ACAUREQ_AUREQ_ENV_C_CARDPDCTP" = '0003') THEN '3' ELSE NULL END) AS "A_AUREQ_ENV_C_CARDPDCTP_t",(CASE WHEN (T0."ACAUREQ_AUREQ_ENV_C_CARDBRND" = 'Amex Platinum') THEN '0' WHEN (T0."ACAUREQ_AUREQ_ENV_C_CARDBRND" = 'Visa Gold') THEN '1' WHEN (T0."ACAUREQ_AUREQ_ENV_C_CARDBRND" = 'Credit Card') THEN '2' WHEN (T0."ACAUREQ_AUREQ_ENV_C_CARDBRND" = 'Debit Card') THEN '3' ELSE NULL END) AS "A_AUREQ_ENV_C_CARDBRND_t",(CASE WHEN (T0."ACAUREQ_AUREQ_TX_MRCHNTCTGYCD" = '7933') THEN '00' WHEN (T0."ACAUREQ_AUREQ_TX_MRCHNTCTGYCD" = '5813') THEN '01' WHEN (T0."ACAUREQ_AUREQ_TX_MRCHNTCTGYCD" = '7273') THEN '02' WHEN (T0."ACAUREQ_AUREQ_TX_MRCHNTCTGYCD" = '7996') THEN '03' WHEN (T0."ACAUREQ_AUREQ_TX_MRCHNTCTGYCD" = '5031') THEN '04' WHEN (T0."ACAUREQ_AUREQ_TX_MRCHNTCTGYCD" = '5631') THEN '05' WHEN 
(T0."ACAUREQ_AUREQ_TX_MRCHNTCTGYCD" = '5621') THEN '06' WHEN (T0."ACAUREQ_AUREQ_TX_MRCHNTCTGYCD" = '5611') THEN '07' WHEN (T0."ACAUREQ_AUREQ_TX_MRCHNTCTGYCD" = '5661') THEN '08' WHEN (T0."ACAUREQ_AUREQ_TX_MRCHNTCTGYCD" = '5651') THEN '09' WHEN (T0."ACAUREQ_AUREQ_TX_MRCHNTCTGYCD" = '5812') THEN '10' WHEN (T0."ACAUREQ_AUREQ_TX_MRCHNTCTGYCD" = '5814') THEN '11' WHEN (T0."ACAUREQ_AUREQ_TX_MRCHNTCTGYCD" = '7922') THEN '12' WHEN (T0."ACAUREQ_AUREQ_TX_MRCHNTCTGYCD" = '5311') THEN '13' WHEN (T0."ACAUREQ_AUREQ_TX_MRCHNTCTGYCD" = '5192') THEN '14' WHEN (T0."ACAUREQ_AUREQ_TX_MRCHNTCTGYCD" = '5733') THEN '15' WHEN (T0."ACAUREQ_AUREQ_TX_MRCHNTCTGYCD" = '5735') THEN '16' WHEN (T0."ACAUREQ_AUREQ_TX_MRCHNTCTGYCD" = '5641') THEN '17' WHEN (T0."ACAUREQ_AUREQ_TX_MRCHNTCTGYCD" = '5714') THEN '18' WHEN (T0."ACAUREQ_AUREQ_TX_MRCHNTCTGYCD" = '5722') THEN '19' WHEN (T0."ACAUREQ_AUREQ_TX_MRCHNTCTGYCD" = '5950') THEN '20' WHEN (T0."ACAUREQ_AUREQ_TX_MRCHNTCTGYCD" = '5072') THEN '21' WHEN (T0."ACAUREQ_AUREQ_TX_MRCHNTCTGYCD" = '5719') THEN '22' WHEN (T0."ACAUREQ_AUREQ_TX_MRCHNTCTGYCD" = '1711') THEN '23' WHEN (T0."ACAUREQ_AUREQ_TX_MRCHNTCTGYCD" = '5065') THEN '24' WHEN (T0."ACAUREQ_AUREQ_TX_MRCHNTCTGYCD" = '1740') THEN '25' WHEN (T0."ACAUREQ_AUREQ_TX_MRCHNTCTGYCD" = '7832') THEN '26' WHEN (T0."ACAUREQ_AUREQ_TX_MRCHNTCTGYCD" = '5711') THEN '27' ELSE NULL END) AS "A_AUREQ_TX_MRCHNTCTGYCD_t",(CASE WHEN (T0."ACAUREQ_AUREQ_TX_DT_ACCTTP" = 'CRDT') THEN '0' WHEN (T0."ACAUREQ_AUREQ_TX_DT_ACCTTP" = 'CHCK') THEN '1' ELSE NULL END) AS "A_AUREQ_TX_DT_ACCTTP_t",(CASE WHEN (T0."MDM_CLIENT_IMP_TP_CD" = '1') THEN '0' WHEN (T0."MDM_CLIENT_IMP_TP_CD" = '2') THEN '1' WHEN (T0."MDM_CLIENT_IMP_TP_CD" = '3') THEN '2' WHEN (T0."MDM_CLIENT_IMP_TP_CD" = '4') THEN '3' ELSE NULL END) AS "MDM_CLIENT_IMP_TP_CD_t",(CASE WHEN ((CASE WHEN (T0."ACAUREQ_AUREQ_ENV_P_CRDHLDRV" IS NULL) THEN 'MNSG' ELSE T0."ACAUREQ_AUREQ_ENV_P_CRDHLDRV" END) = 'MNVR') THEN '0' WHEN ((CASE WHEN (T0."ACAUREQ_AUREQ_ENV_P_CRDHLDRV" IS NULL) THEN 
'MNSG' ELSE T0."ACAUREQ_AUREQ_ENV_P_CRDHLDRV" END) = 'MNSG') THEN '1' ELSE NULL END) AS "A_AUREQ_ENV_P_CRDHLDRV_t",(CASE WHEN ((CASE WHEN (T0."ACAUREQ_AUREQ_ENV_P_ONL" IS NULL) THEN 'OFLN' ELSE T0."ACAUREQ_AUREQ_ENV_P_ONL" END) = 'SMON') THEN '0' WHEN ((CASE WHEN (T0."ACAUREQ_AUREQ_ENV_P_ONL" IS NULL) THEN 'OFLN' ELSE T0."ACAUREQ_AUREQ_ENV_P_ONL" END) = 'OFLN') THEN '1' ELSE NULL END) AS "ACAUREQ_AUREQ_ENV_P_ONL_t",(CASE WHEN ((CASE WHEN (T0."ACAUREQ_AUREQ_ENV_CRAUTHNMTD" IS NULL) THEN 'BYPS' ELSE T0."ACAUREQ_AUREQ_ENV_CRAUTHNMTD" END) = 'MERC') THEN '0' WHEN ((CASE WHEN (T0."ACAUREQ_AUREQ_ENV_CRAUTHNMTD" IS NULL) THEN 'BYPS' ELSE T0."ACAUREQ_AUREQ_ENV_CRAUTHNMTD" END) = 'BYPS') THEN '1' ELSE NULL END) AS "A_AUREQ_ENV_CRAUTHNMTD_t",(CASE WHEN ((CASE WHEN (T0."ACAUREQ_AUREQ_ENV_CRAUTHNNTTY" IS NULL) THEN 'AGNT' ELSE T0."ACAUREQ_AUREQ_ENV_CRAUTHNNTTY" END) = 'MERC') THEN '0' WHEN ((CASE WHEN (T0."ACAUREQ_AUREQ_ENV_CRAUTHNNTTY" IS NULL) THEN 'AGNT' ELSE T0."ACAUREQ_AUREQ_ENV_CRAUTHNNTTY" END) = 'AGNT') THEN '1' ELSE NULL END) AS "A_AUREQ_ENV_CRAUTHNNTTY_t"
FROM CARDUSR.PAYHISTORY T0
6. Message six indicates that the SQL shown in the preview step is executed to create and populate the output AOT.
 
Note: Enabling cache on any of the data transformation nodes in your stream essentially results in the creation and population of an AOT. For more information about caching, see 4.5.3, “Enable cache at node level” on page 100.
7. Message seven shows run-time details, such as the total elapsed time and the CPU time consumed on the SPSS Server to execute the stream.
4.7 Data transformation using DataStage
If you currently perform your predictive analytics modeling on DB2 for z/OS operational data from a non-z/OS distributed platform, and the process that builds your external tables already exists as DataStage jobs, you can use the Accelerator to eliminate this data sprawl.
To use in-database transformation as efficiently as possible in DataStage, make the target of the transformation a DB2 table (which can be optimized as an accelerator-only table) so that the transformation logic (SQL) is pushed into the accelerator.
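The resulting ELT pattern looks much like the stream-generated SQL shown earlier: a target table defined as an AOT, populated by an INSERT whose SELECT carries the transformation logic and therefore runs inside the accelerator. The sketch below assumes a hypothetical target table TRANSFORMED_TARGET and a hypothetical CUST_ID column; CARDUSR.PAYHISTORY, MDM_CLIENT_IMP_TP_CD, and SBSDEVV4 come from this chapter's examples.

```sql
-- Illustrative ELT sketch: the DataStage job's target is an AOT,
-- so the transformation SQL executes in the accelerator.
CREATE TABLE "TRANSFORMED_TARGET" (
  "CUST_ID"    BIGINT,
  "IS_TYPE_1"  DOUBLE
) IN ACCELERATOR "SBSDEVV4" CCSID UNICODE;

-- The SELECT carries the transformation (here, a simple recode),
-- which runs in-database rather than on the ETL server.
INSERT INTO "TRANSFORMED_TARGET" ("CUST_ID", "IS_TYPE_1")
SELECT "CUST_ID",
       CASE WHEN "MDM_CLIENT_IMP_TP_CD" = '1' THEN 1.0 ELSE 0.0 END
FROM CARDUSR.PAYHISTORY;
```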