Using R with the IBM DB2 Analytics Accelerator
This chapter introduces the use of the R programming language with the IBM DB2 Analytics Accelerator. It discusses the installation of necessary components and the various techniques for using analytical functionality.
This chapter contains the following topics:
8.1 Introduction to the R programming language
The R programming language first appeared in 1993. It was created at The University of Auckland in Auckland, New Zealand, and is based on the S programming language, which was developed at Bell Laboratories. R is widely used among statisticians and data scientists because of its large variety of built-in statistical functions. R can be extended easily by using the Comprehensive R Archive Network (CRAN), which contains packages for many standard tasks. R functions operate on data frames, the standard data representation within R, which is similar to a table within a database. In a data frame, data is organized in named columns and rows, where the combination of a specific column and row denotes a cell.
Because most operational data resides in databases, such as IBM DB2 for z/OS, and database tables have a format similar to R data frames, combining the powerful language features of R with the business data in the database makes sense. Working with the data can be done in various ways:
Data can be extracted from the operational system, transformed, and loaded into an R-compatible format (for example, in an extract, transform, and load (ETL) program).
R can directly access the database data through various interfaces such as these:
 – RJDBC: This is an R package that implements JDBC.
 – RODBC: This is an R package that implements ODBC.
When R directly accesses the database data through the interfaces, projections, aggregates, joins, and other operations can be directly performed by the database (for example, IBM DB2 for z/OS). If the IBM DB2 Analytics Accelerator is attached to the system, this process is faster and introduces more analytical capabilities.
8.2 Prerequisites
To use R with the IBM DB2 Analytics Accelerator, you must obtain either the source code or a precompiled binary for your operating system. On UNIX or Linux operating systems, R is usually available through the integrated package manager. Windows users can download an installer from the CRAN project website:
After you install the R base system, continue with the installation of RJDBC or RODBC:
 
Note: Using the CRAN modules might require that you compile the contents of the module. The compile process requires a recent compiler on the system and common build utilities. If building a module fails, try to install the required development headers. OS X and Windows users can use prebuilt binary modules provided by CRAN if they exist; Linux and BSD users can find prebuilt R modules within the system repositories.
Connecting to the IBM DB2 Analytics Accelerator requires a valid Time Sharing Option (TSO) user ID and password combination for an IBM DB2 for z/OS system. The DB2 subsystem must be paired with an accelerator before queries can be offloaded to the Accelerator.
For more details about setting up and pairing the Accelerator, see the IBM DB2 Analytics Accelerator for z/OS documentation in the IBM Knowledge Center:
8.3 Using RJDBC with the IBM DB2 Analytics Accelerator
RJDBC is an easy-to-use implementation of R's standardized database interface (DBI) that uses Java Database Connectivity (JDBC) as a back end. Users who are familiar with other DBI database drivers for R, such as RSQLite, RPostgreSQL, or RMySQL, profit from this compatibility. As the name RJDBC suggests, the classes use JDBC drivers and therefore require a Java Runtime Environment1 to be present.
To install RJDBC, open an R shell and type the following line:
install.packages("RJDBC", dependencies=TRUE)
You might be required to adjust the Java path after installation if the auto-detection mechanism is not working correctly. Use the R built-in javareconf command with a correctly set JAVA_HOME environment variable. The following example uses a Linux shell with an OpenJDK:
$ export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/
$ R CMD javareconf
If you are using a different operating system or Java version, you must adjust the path accordingly. The output of the javareconf utility will indicate the Java setup and update the Java configuration variables.
After that, you can use the RJDBC library within R to connect to your IBM DB2 for z/OS database. Example 8-1 shows how to connect to the IBM DB2 for z/OS database using RJDBC and query a basic table. The path to the db2jcc4.jar file, the database host, port, and name, and the user name and password must all be adjusted to your environment.
 
Note: The db2jcc_license_cisuz.jar file might also have to be in the same directory as the db2jcc4.jar file or in the default Java class path. Alternatively, you can force-load the JAR file by using the following commands before initializing the RJDBC object:
library(rJava)
.jaddClassPath("/opt/ibm/jdbc/db2jcc_license_cisuz.jar")
Example 8-1 Connecting to the IBM DB2 for z/OS database
library(RJDBC)
drv <- JDBC("com.ibm.db2.jcc.DB2Driver","/opt/ibm/jdbc/db2jcc4.jar")
conn <- dbConnect( drv,
"jdbc:db2://myhost:1011/mydb",
user="dbadm",
password="dbadm")
The next step is to enforce query acceleration for the currently open connection. Do this by sending the following query to the connection:
dbSendQuery(conn, "SET CURRENT QUERY ACCELERATION = ALL")
After the query is sent, all further queries sent to DB2 for z/OS are accelerated. If a query cannot run on the accelerator, it fails. A preferable approach might be to try acceleration first and fall back to DB2 for z/OS in case of failures by using this command:
dbSendQuery(conn, "SET CURRENT QUERY ACCELERATION = ENABLE WITH FAILBACK")
After setting the query acceleration, the DBI implemented by RJDBC can be used normally. No additional database-specific commands are needed to use RJDBC with the IBM DB2 Analytics Accelerator.
Example 8-2 demonstrates a case that uses the processing capabilities of the Accelerator to increase the speed of query processing in combination with RJDBC. The data used here is provided by the United States Environmental Protection Agency and contains metrics about cars such as fuel consumption, the manufacturers, and other parameters. A copy of the data is at the following US Department of Energy web page:
Example 8-2 Using the accelerator in combination with RJDBC
library(RJDBC)
drv <- JDBC("com.ibm.db2.jcc.DB2Driver","/opt/ibm/jdbc/db2jcc4.jar")
conn <- dbConnect( drv,
"jdbc:db2://myhost:1011/mydb",
user="dbadm",
password="dbadm")
dbSendQuery(conn, "SET CURRENT QUERY ACCELERATION = ENABLE WITH FAILBACK")
rs <- dbSendQuery(conn, "SELECT * FROM fueldata WHERE VClass='Two Seaters' AND make IN ('BMW','Ferrari','Subaru','Toyota')")
df.cars <- fetch(rs,-1)
boxplot(barrels08~make,df.cars, main="Fuel consumption by make",ylab="Fuel consumption in barrels")
Figure 8-1 shows the results of that query.
Figure 8-1 Fuel consumption by make
8.3.1 RJDBC functions
In addition to the functions introduced in the examples, other relevant RJDBC functions can automate interaction with RJDBC and the IBM DB2 Analytics Accelerator.
For the JDBCConnection object
The following methods are for the class JDBCConnection in the RJDBC package:
dbCommit(JDBCConnection)
Commits the current unit of work.
dbDisconnect(JDBCConnection)
Disconnects properly from the database with which the driver is connected. If the connection is not closed using this function call, it is terminated when the R application terminates.
dbGetTables(JDBCConnection)
Returns a data frame with all information about the tables on the database. The output can be used to automate tasks and locate specific tables.
dbSendUpdate(JDBCConnection, statement)
This function is similar to dbSendQuery(JDBCConnection, statement) but does not return a result set. It is designed for Data Definition Language (DDL) operations or Data Manipulation Language (DML) operations that do not return a result set.
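As an illustration of these methods, the following sketch creates a small table with dbSendUpdate, commits the work, and disconnects. It assumes an open connection conn as created in Example 8-1; the schema and table names are illustrative.

```r
library(RJDBC)
# Assumes an open RJDBC connection "conn" as created in Example 8-1.
# Schema and table names below are illustrative.
dbSendUpdate(conn, "CREATE TABLE MYSCHEMA.WORKTAB (ID INTEGER, NOTE CHAR(20))")
dbSendUpdate(conn, "INSERT INTO MYSCHEMA.WORKTAB VALUES (1, 'first row')")
dbCommit(conn)      # make the DDL and the insert durable
dbDisconnect(conn)  # release the connection cleanly
```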
For the JDBCResult object
The following methods are for the class JDBCResult in the RJDBC package:
dbClearResult(JDBCResult)
Clears or releases the result set. It frees the memory used to store result values but does not release the memory of data frames that inherit from a fetched JDBCResult.
dbColumnInfo(JDBCResult)
Returns a data frame with information about the columns in the result including column names and data types.
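A short sketch of how dbColumnInfo and dbClearResult fit into a typical query cycle, assuming an open connection conn and an illustrative table name:

```r
library(RJDBC)
# Assumes an open RJDBC connection "conn"; the table name is illustrative.
rs <- dbSendQuery(conn, "SELECT * FROM MYSCHEMA.WORKTAB")
print(dbColumnInfo(rs))  # column names and data types of the result set
df <- fetch(rs, -1)      # materialize all rows as a data frame
dbClearResult(rs)        # release the result set; df remains usable
```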
8.4 Using RODBC with the IBM DB2 Analytics Accelerator
RODBC allows you to connect to databases using R and ODBC database connectivity. To use RODBC with the IBM DB2 Analytics Accelerator, you must first install the RODBC package from CRAN. If the package is available through your system's package manager, the module supplied with the package manager is the preferred installation method. To use the version from CRAN, enter the following command in an R shell:
install.packages("RODBC", dependencies=TRUE)
You might need to install the appropriate build tools and header files to run this command: at minimum, a compiler and the development headers needed to compile the ODBC drivers.
Two methods are available to connect to IBM DB2 on z/OS using the RODBC driver:
Example 8-3 shows setting the source connection in the ODBC driver manager.
Example 8-3 Using the ODBC driver manager predefined profile
library(RODBC)
conn <- odbcConnect(dsn="PROFILENAME",uid="USERNAME",pwd="PASSWORD")
Example 8-4 shows setting the source connection directly by specifying the connection string.
Example 8-4 Using the ODBC DSN
library(RODBC)
# initialize variables
driver.name <- "{IBM DB2 ODBC DRIVER}"
db.name <- "SAMPLEDB"
host.name <- "sampledb2zos.example.ibm.com"
host.port <- "10011"
user.name <- "myuser"
user.pwd <- "supersecret"
# generate connection text
con.text <- paste( "DRIVER=",driver.name,
";Database=",db.name,
";Hostname=",host.name,
";Port=",host.port,
";PROTOCOL=TCPIP",
";UID=", user.name,
";PWD=",user.pwd,sep="")
conn <- odbcDriverConnect(con.text)
After successfully connecting to the specified database, be sure that query acceleration is enforced. To do this, use the function shown in Example 8-5, which takes the connection, the query text, and additional fields as parameters. Setting the parameter errors to FALSE causes sqlQuery() to return -1 on failure instead of stopping, so the result can be checked programmatically.
Example 8-5 Enforce query acceleration
res <- sqlQuery(conn,"SET CURRENT QUERY ACCELERATION = ALL", errors=FALSE)
if (res == -1){
cat ("An error has occurred. ")
msg <- odbcGetErrMsg(conn)
print (msg)
} else {
cat ("Query acceleration successfully forced. ")
}
To query the data, use the same function used in Example 8-5 to enforce the query acceleration. Example 8-6 uses the same data as the RJDBC example and aims to determine which car manufacturer produces cars with low fuel consumption.
Example 8-6 Query the data and enforce query acceleration
library(RODBC)
# initialize variables
driver.name <- "{IBM DB2 ODBC DRIVER}"
db.name <- "SAMPLEDB"
host.name <- "sampledb2zos.example.ibm.com"
host.port <- "10011"
user.name <- "myuser"
user.pwd <- "supersecret"
# generate connection text
con.text <- paste( "DRIVER=",driver.name,
";Database=",db.name,
";Hostname=",host.name,
";Port=",host.port,
";PROTOCOL=TCPIP",
";UID=", user.name,
";PWD=",user.pwd,sep="")
conn <- odbcDriverConnect(con.text)
res <- sqlQuery(conn,"SET CURRENT QUERY ACCELERATION = ALL", errors=FALSE)
if (res == -1){
cat ("An error has occurred. ")
msg <- odbcGetErrMsg(conn)
print (msg)
} else {
cat ("Query acceleration successfully forced. ")
}
res <- sqlQuery(conn, "SELECT * FROM fueldata WHERE VClass='Two Seaters' AND make IN ('BMW','Ferrari','Subaru','Toyota')")
boxplot(barrels08~make,res, main="Fuel consumption by make",ylab="Fuel consumption in barrels")
The output (Figure 8-2) is the same as in the RJDBC example (Figure 8-1 on page 153).
Figure 8-2 Fuel consumption by make
8.4.1 RODBC functions
Two groups of functions are provided in RODBC:
RODBC low-level functions
These functions are mainly internal odbc* commands that implement low-level access to ODBC functions with similar names:
odbcDataSources()
Lists all known ODBC data sources or drivers that are currently present on the system. Use this function to verify whether your IBM DB2 for z/OS driver is properly installed and accessible by RODBC.
odbcClose(channel)
Closes the specified channel or connection and frees its resources. Close connections when no further work with the database will be performed, and before the program terminates. Connections are forcefully terminated when the R program terminates.
odbcCloseAll()
Closes all currently open ODBC connections. It has the same behavior as odbcClose(channel) but iterates over all open connections.
odbcSetAutoCommit(channel, autoCommit=TRUE)
Specifies whether each SQL statement is automatically committed as an individual transaction. When large amounts of data are ingested, there is a tremendous performance difference with autoCommit=FALSE. You must then manually send a COMMIT to the channel using the low-level odbcQuery() method or the high-level sqlQuery() method. A COMMIT after every 1000 - 10000 rows usually works best.
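The batched-commit pattern can be sketched as follows. This assumes an open RODBC channel conn and a data frame big.df with an id column; the table name is illustrative.

```r
library(RODBC)
# Assumes an open RODBC channel "conn" and a data frame "big.df" with an
# "id" column; the table name is illustrative.
odbcSetAutoCommit(conn, autoCommit=FALSE)
for (i in seq_len(nrow(big.df))) {
  sqlQuery(conn, sprintf("INSERT INTO MYSCHEMA.T (ID) VALUES (%d)", big.df$id[i]))
  if (i %% 5000 == 0) sqlQuery(conn, "COMMIT")  # commit in batches
}
sqlQuery(conn, "COMMIT")  # commit the remaining rows
odbcSetAutoCommit(conn, autoCommit=TRUE)
```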
RODBC high-level functions
These sql* functions operate at a higher level to read, save, copy, and manipulate data between data frames and SQL tables:
sqlSave(channel, dataframe, tablename, ...)
Saves the specified data frame to the table specified in tablename on the given channel or connection. If the table does not exist, it is created. Consider the parameters of the function carefully because they control the error behavior and performance. This function cannot be used to create an AOT, because the AOT creation syntax is not supported by RODBC. Instead, create the AOT manually and then use sqlSave to ingest the data.
sqlUpdate(channel, dataframe, tablename, ...)
Updates the rows in the specified tablename with the data in the data frame. This function works best when a data frame is fetched using sqlQuery(), altered in R, and then must be saved again.
sqlFetch(channel, tablename, ...)
Fetches the contents of the specified table into a data frame. Additional arguments can be used to limit the rows and columns that are fetched.
sqlTables(channel, ...)
Returns a data frame with information about schema and table names. This function can be used to automate certain processes, such as scanning for data in all available tables.
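The high-level functions combine into a short round trip: list tables, fetch data, derive a column in R, and save the result. The sketch below assumes an open RODBC channel conn; schema and table names are illustrative, and qualified-name handling can vary by driver.

```r
library(RODBC)
# Assumes an open RODBC channel "conn"; schema and table names are
# illustrative, and qualified-name handling can vary by driver.
tabs <- sqlTables(conn, schema="MYSCHEMA")            # tables in one schema
df   <- sqlFetch(conn, "MYSCHEMA.FUELDATA", max=1000) # fetch first 1000 rows
df$norm <- df$barrels08 / max(df$barrels08)           # derive a column in R
sqlSave(conn, df, tablename="FUELNORM", rownames=FALSE)
```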
8.5 Using accelerator-only tables (AOTs) with R
An accelerator-only table (AOT) is a table that exists only on the IBM DB2 Analytics Accelerator and not in IBM DB2 for z/OS. The feature places a proxy table in DB2 for z/OS that points at the corresponding table on the Accelerator. AOTs can be used starting with IBM DB2 Analytics Accelerator Version 4.1 PTF5 and DB2 PUTLEVEL 1504.
Use an AOT for the following reasons:
An AOT is the required output table type for IBM Netezza Analytics Server stored procedures (see also 8.6, “Using Netezza Analytics stored procedures” on page 159). When you execute such stored procedures, results are written and made available through an AOT. This way prevents the copying of redundant data to the DB2 system and saves time and cost. Results are immediately available in the AOT for other programs without having the need to copy data.
An AOT is excellent for storing temporary data, such as additional external data (for example, relational data needed for your algorithms). Creation is fast, and no disk space is taken on the original DB2 system. After processing, the AOT can be dropped easily. AOTs can also help speed up transformations that are done in an ETL process.
An AOT can be used to quickly store intermediate results, for example when performing long-running jobs. If the job aborts for whatever reason, it can safely be resumed from the stored intermediate results.
An AOT can be used to store results and thereby make them available to other applications, tools, and utilities for further processing.
To create an AOT, the CREATE TABLE statement must be extended by using the following keyword:
IN ACCELERATOR <ACCELERATORNAME>
Example 8-7 shows an SQL statement that creates an AOT in the Accelerator named ACCEL1 and the database named DBTEMP.
Example 8-7 Create an AOT
CREATE TABLE TEMP1 (
C11 CHAR(2), ...)
IN ACCELERATOR ACCEL1
IN DATABASE DBTEMP;
After an AOT is created, it can be used within RJDBC or RODBC like any other table. If you need to programmatically determine the accelerators that are available to the subsystem you are currently connected to, you can query SYSACCEL.SYSACCELERATORS for details. The ACCELERATORNAME column contains valid names for the IN ACCELERATOR <ACCELERATORNAME> keyword to create an AOT within that accelerator.
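Combining that catalog query with AOT creation can be sketched as follows, assuming the open connection conn from Example 8-1; the AOT definition itself is illustrative.

```r
library(RJDBC)
# Assumes an open RJDBC connection "conn" as in Example 8-1; the AOT
# definition below is illustrative.
rs <- dbSendQuery(conn, "SELECT ACCELERATORNAME FROM SYSACCEL.SYSACCELERATORS")
accels <- fetch(rs, -1)
ddl <- sprintf(
  "CREATE TABLE MYSCHEMA.TMP (ID INTEGER) IN ACCELERATOR %s IN DATABASE DBTEMP",
  accels$ACCELERATORNAME[1])  # use the first paired accelerator
dbSendUpdate(conn, ddl)
```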
Example 8-8 creates an AOT based on the TPC-DS model to store processing results so that they are available to other applications.
Example 8-8 Creating an AOT to store processing results
library(RJDBC)
drv <- JDBC("com.ibm.db2.jcc.DB2Driver","/opt/ibm/jdbc/db2jcc4.jar")
conn <- dbConnect( drv,
"jdbc:db2://myhost:1011/mydb",
user="dbadm",
password="dbadm")
 
res <- dbSendQuery(conn, "CREATE TABLE TPCDS.RESULTS(
D_DATE_SK INTEGER,
CA_STATE CHAR(2),
AVG_INCOME_BOUND DOUBLE)
IN ACCELERATOR ACCEL1
IN DATABASE DBTEMP")
 
res <- dbSendQuery(conn, "INSERT INTO TPCDS.RESULTS
SELECT
a11.D_DATE_SK D_DATE_SK,
a12.CA_STATE CA_STATE,
AVG(a13.IB_UPPER_BOUND) AVG_INCOME_BOUND
FROM
DATE_DIM a11,
CUSTOMER_ADDRESS a12,
INCOME_BAND a13
GROUP BY
a11.D_DATE_SK,
a12.CA_STATE")
AOTs are subject to constraints, as described at the following web page:
The most notable constraints are described in the following list:
INSERT from SELECT statements (such as INSERT INTO <AOT> SELECT ... FROM ...) are supported only if the FROM clause references only tables that are currently accelerated on the accelerator. In an environment with multiple accelerators, all referenced tables, including the AOT, must be present on the accelerator that is running the query. For instance, if you use the following statement, the AOT named MYAOT and also TABLE1 and TABLE2 must all be on the accelerator.
INSERT INTO MYAOT (SELECT * FROM TABLE1, TABLE2 WHERE TABLE1.ID = TABLE2.ID)
You cannot run UPDATE statements on columns that define the distribution key. If you created an AOT with a non-random distribution key, you cannot update the column that serves as the distribution key. The technical reason for this limitation is that an update of a distribution key requires the relocation of the updated row, which is not supported. As a workaround, you can split the UPDATE operation into a transaction with a DELETE and an INSERT statement. Example 8-9 shows the SQL used to transform an UPDATE operation.
Example 8-9 Transform UPDATE
# TABLE DEFINITION:
CREATE TABLE UDEMO (
ID INT,
NAME CHAR(20),
COUNTRY CHAR(20))
IN ACCELERATOR ACCEL1
IN DATABASE DBTEMP;
# INSERT BASE DATA
INSERT INTO UDEMO (ID,NAME,COUNTRY) VALUES (1,'SEBASTIAN','GERMANY');
# UPDATE STATEMENT THAT IS UNSUPPORTED
# UPDATE UDEMO SET ID=2 WHERE NAME='SEBASTIAN'
 
# TRANSFORMED STATEMENTS
DELETE FROM UDEMO WHERE ID=1;
INSERT INTO UDEMO (ID,NAME,COUNTRY) VALUES (2,'SEBASTIAN','GERMANY');
COMMIT;
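The transformed statements can also be issued from R as a single unit of work. A minimal sketch using RJDBC, assuming the connection conn from Example 8-1 and the UDEMO table above:

```r
library(RJDBC)
# Assumes an open RJDBC connection "conn" and the UDEMO AOT created above.
dbSendUpdate(conn, "DELETE FROM UDEMO WHERE ID=1")
dbSendUpdate(conn,
  "INSERT INTO UDEMO (ID,NAME,COUNTRY) VALUES (2,'SEBASTIAN','GERMANY')")
dbCommit(conn)  # commit the delete-and-insert pair as one unit of work
```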
8.6 Using Netezza Analytics stored procedures
The IBM Netezza Analytics stored procedures are a set of analytical algorithms to perform research on data. They enable data mining capabilities for various business applications and predictive analytics based on different analytical models. The stored procedures are directly executed on the Accelerator and thereby reduce data movement. In addition, the algorithms are optimized for speed because they directly work on the Accelerator where the data resides.
To use IBM Netezza Analytics with the IBM DB2 Analytics Accelerator, at least version 5.1 GA of the Accelerator software is required. In addition, the Analytics extension is required for this feature to work.
See the following resources:
For Analytics Server documentation, see the following web page:
For detailed information about the stored procedures, see Appendix B, “Job to install wrapper stored procedures” on page 179.
IBM Netezza In-Database Analytics Reference Guide (release 3.0.1):
IBM Netezza Analytics stored procedures are wrapped inside IBM DB2 for z/OS stored procedures. Calling IBM Netezza Analytics stored procedures through the wrapper stored procedures differs slightly from the way it is described in the reference guide.
8.6.1 Basic concept
The analytical algorithms introduced in this chapter share a similar concept. The start is usually a training data set to create or train a model. After the model is generated or trained, the model can be used to classify data, for instance to put a specific set of data into a specific group or cluster based on certain characteristics. Models can be further enhanced by training them with additional training data.
Choosing the correct algorithm for the desired use-case is important. Although different algorithms can produce similar models, be sure you understand how the algorithms are supposed to work. Also, the quality of the resulting model heavily depends on the quality of the training data set. For example, if you want a classification model that can distinguish between ten categories, the training data set should contain all of the categories with significant data in it. If only two categories are present, an accurate model cannot be computed. In addition, the quality of the model might vary depending on the parameters supplied to the modeling algorithm. Be sure to understand the documentation about the algorithm to fully understand the influence of each parameter.
Calling IBM Netezza Analytics stored procedures from the R programming language is simple. You can use RJDBC or RODBC to call them using plain SQL queries. Note that the location of the stored procedures might be different on your DB2 subsystem. For the JCL installation job for the stored procedures, see Appendix B, “Job to install wrapper stored procedures” on page 179.
The following generic syntax is for calling stored procedures:
CALL <schema>.<procedure>(<accelerator>,<parameterString>,<returnValue>,<message>)
In the stored procedure call, <accelerator> specifies the accelerator on which the procedure is executed. The Analytics Server packages must be installed on the accelerator; otherwise, the stored procedure fails. In addition, all tables referenced in <parameterString>2 must be present on the specified accelerator. The <returnValue> depends on the stored procedure executed; alternatively, a question mark (?) can be used as a placeholder.
Example 8-10 shows how to call a stored procedure using RODBC.
Example 8-10 Using RODBC to call a stored procedure
res <- sqlQuery(conn,"CALL INZA.KMEANS ('ACCEL01', 'model=customer_model,
intable=TPCH.CUSTOMER, outtable=TPCH.CUSTOUT,
id=C_CUSTKEY, target=C_NATIONKEY, transform=S,
distance=euclidean, k=3, maxiter=5,
randseed=12345, idbased=false',
?,'')", errors=FALSE)
Example 8-11 shows how to call a stored procedure using RJDBC.
Example 8-11 Using RJDBC to call a stored procedure
res <- dbSendQuery(conn, "CALL INZA.KMEANS ('ACCEL01', 'model=customer_model,
intable=TPCH.CUSTOMER, outtable=TPCH.CUSTOUT,
id=C_CUSTKEY, target=C_NATIONKEY, transform=S,
distance=euclidean, k=3, maxiter=5,
randseed=12345, idbased=false',
?,'')")
8.6.2 Helper stored procedures
In contrast to the analytical modeling stored procedures, the following IBM Netezza Analytics stored procedures help to prepare data and manage existing models on the accelerator:
SPLIT_DATA
Can be used to split data into two sets, for example, into a training data set and a test data set. The R equivalent for the stored procedure is the sample(x, size, …) function.
MODEL_EXISTS
Can be used to determine whether a model with the name given in the parameter string exists on the accelerator. Using this function is advised in multiple accelerator environments.
LIST_MODELS
Lists all models available on the accelerator specified.
DROP_MODEL
Drops a model from an accelerator. This process cannot be undone, so be careful.
PMML_MODEL
Returns the PMML representation of the model to the output table specified. The output type can be either standard PMML format or an SPSS-specific format. The contents of the output table are compatible with the pmml package available in CRAN.
 
 
 
Note: Importing PMML models is not supported. Therefore, models modified with R utilities cannot be used with the IBM Netezza Analytics Server.
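As an illustration only: a SPLIT_DATA call through RJDBC might look like the following sketch. The accelerator, table, and parameter names are assumptions that follow the pattern of Example 8-11 and must be verified against the reference guide.

```r
library(RJDBC)
# Assumes an open RJDBC connection "conn"; all names and parameters are
# illustrative and should be checked against the reference guide.
res <- dbSendQuery(conn, "CALL INZA.SPLIT_DATA ('ACCEL01',
    'intable=TPCH.FUELDATA, id=ID, fraction=0.7,
     traintable=TPCH.FUELTRAIN, testtable=TPCH.FUELTEST',
    ?,'')")
```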
8.6.3 Decision trees
IBM Netezza Analytics stored procedures for decision trees (classification trees) are used for modeling decisions and their possible consequences. A typical decision tree contains the probability of each outcome or class in each leaf, and branches according to specific criteria:
DECTREE
Is used to build a decision tree based on the parameter string supplied. The most important parameters in the parameter string are the input columns, the ID column, and the stop criteria. In contrast to GROW_DECTREE, this stored procedure grows a decision tree and then prunes it.
The "party" package from CRAN with the ctree(...) function, or the rpart package with the rpart(..., method="class") function, provides similar functionality.
GROW_DECTREE
Grows a decision tree according to the parameter string supplied. If you want to prune the decision tree immediately afterward, use DECTREE instead.
PRUNE_DECTREE
Prunes a decision tree with the name specified in the parameter string. Note that pruning a decision tree will replace the model.
PREDICT_DECTREE
Applies the decision tree model specified in the parameter string to the input table specified. The resulting prediction will be saved in the specified outtable. When outtableprob is set in the parameter string, the probability predictions for each possible output class will be stored in the output table specified.
 
Hint: Data that is written to the table specified for outtableprob is useful when a simple decision is not sufficient. For instance in a scenario where customer demand is analyzed knowing the two or three most likely needs can be helpful.
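A build-then-predict pair can be sketched as follows. The model, table, and parameter names are illustrative and follow the pattern of the K-means examples; check them against the reference guide before use.

```r
library(RJDBC)
# Assumes an open RJDBC connection "conn"; model, table, and parameter
# names are illustrative and should be checked against the reference guide.
res <- dbSendQuery(conn, "CALL INZA.DECTREE ('ACCEL01',
    'model=fuel_tree, intable=TPCH.FUELTRAIN, id=ID, target=VCLASS',
    ?,'')")
res <- dbSendQuery(conn, "CALL INZA.PREDICT_DECTREE ('ACCEL01',
    'model=fuel_tree, intable=TPCH.FUELTEST, outtable=TPCH.FUELPRED, id=ID',
    ?,'')")
```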
8.6.4 Regression trees
The following IBM Netezza Analytics stored procedures are for regression trees, which are similar to decision trees but the predicted outcome is a real number instead of a class to which the data belongs:
REGTREE
Builds a regression tree. Its use is similar to the DECTREE stored procedure: it creates the regression tree and prunes it directly after creation. The procedure provides functionality similar to the rpart(..., method="anova") function from the rpart CRAN package or the standard functionality of the "tree" package.
GROW_REGTREE
Grows a regression tree based on the parameters in the parameter string.
PRUNE_REGTREE
Prunes a regression tree based on the parameters in the parameter string. When building a new model, be sure to perform REGTREE directly because it builds and prunes the resulting tree in one step.
PREDICT_REGTREE
Applies the regression tree model specified in the parameter string to the input table specified. The resulting prediction will be saved in the specified outtable.
8.6.5 K-means clustering
The following IBM Netezza Analytics stored procedures are for K-means clustering. The algorithm uses a predefined number of clusters:
KMEANS
Generates a K-means clustering model that clusters the data of the input table into a number (n) of clusters. The parameter string determines the number of clusters, the input table, and the input columns. The stored procedure can replace the built-in K-means functionality of R.
 
Note: If K-means must be performed on large data frames that are not part of tables present on the accelerator, a faster way is to load the data into the accelerator through accelerator-only tables and use this stored procedure for K-Means clustering.
PREDICT_KMEANS
Applies a previously created K-means clustering model to an input table and returns the results through an output table specified in the parameter string.
8.6.6 TwoStep clustering
The following TwoStep clustering stored procedures automatically estimate the number of clusters when no number is specified. In contrast to K-means, the data is processed in two passes:
TWOSTEP
Generates a TwoStep clustering model that distributes and compresses the input data into a set of subclusters using the data distance. Then it reduces the tree to the specified number of clusters using a hierarchical clustering method to merge the subclusters into larger clusters.
The behavior of the algorithm is controlled using the parameter string. The mclust package available in CRAN provides similar clustering capabilities.
PREDICT_TWOSTEP
Applies the previously created TwoStep clustering model to the data to predict to which cluster the data belongs. Input and output tables are specified using the parameter string.
8.6.7 Naive Bayes
The following list describes the Naive Bayes modeling stored procedures. The Naive Bayes algorithm is a classification algorithm based on a vector of feature values:
NAIVEBAYES
Generates a Naive Bayes model based on the input parameters in the parameter string. Because of the nature of the algorithm, the input string can become long because many parameters must be supplied. The klaR and e1071 modules from CRAN provide similar capabilities in R.
PREDICT_NAIVEBAYES
Applies the Naive Bayes model, given in the parameter string, to the specified input table to generate classification predictions. Similar to the DECTREE prediction, the outtableprob parameter can be used to get probabilities for all classes.
8.7 Performance considerations
To make the R program as fast as possible, consider this information when programming:
Make sure that SET CURRENT QUERY ACCELERATION = ALL is the first SQL statement executed in the RJDBC or RODBC session. This forces all queries in the session to run on the Accelerator rather than in the IBM DB2 for z/OS database.
If you need to perform projections, joins, aggregates, or other operations that can be expressed in SQL, perform them in the database rather than in the R program. Transferring large data sets from the database to the computer that runs the R program can be time-intensive, and the computation is usually much slower than in the database. This is especially true when acceleration is forced, because the Accelerator infrastructure is highly optimized to perform those operations.
Whenever an analytical or statistical function is available in the IBM Netezza Analytics Server, use the corresponding stored procedure rather than the R function to execute it. The analytical and statistical functions of IBM Netezza Analytics are executed directly on the accelerator in parallel. This is usually faster than fetching the data from the database and executing the R functions on the data.
Using RODBC instead of RJDBC is usually faster. Fetching result sets with RJDBC can be improved by calling fetch(JDBCResult, rows) with the rows parameter set to -1. With that value, the driver fetches chunks of up to 512K rows at a time.
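The pushdown guideline above can be sketched as follows, assuming the conn and fueldata objects from the earlier RJDBC examples:

```r
library(RJDBC)
# Assumes an open, acceleration-enabled RJDBC connection "conn" and the
# fueldata table from the earlier examples.

# Slow: fetch every row, then aggregate in R.
df  <- fetch(dbSendQuery(conn, "SELECT make, barrels08 FROM fueldata"), -1)
agg <- aggregate(barrels08 ~ make, df, mean)

# Faster: let the accelerated database aggregate; only summary rows travel.
rs  <- dbSendQuery(conn,
  "SELECT make, AVG(barrels08) AS avg_barrels FROM fueldata GROUP BY make")
agg <- fetch(rs, -1)
```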

1 The JDK is required in order to build the module from CRAN if not installed by the systems package manager.
2 For syntax and semantics of the parameter string, see the IBM Netezza In-Database Analytics Reference Guide: https://ibm.biz/BdrFpq