Exporting models using PMML

Let's get started with PMML and the way models are exported using it.

What is PMML?

PMML (Predictive Model Markup Language) is a standard for sharing predictive models across software. The standard has been developed and improved by the Data Mining Group since 1997. With PMML, a user can build a model in one software package and use another package for prediction. Exporting and/or importing models using PMML is currently supported by a wide range of solutions including (but not restricted to) R, RapidMiner, SAS Enterprise Miner, SPSS Modeler, and Weka.

PMML supports numerous algorithms. The following table lists the algorithms explored in this book whose models can be exported using the pmml package in R, which covers most of them. For each algorithm, the table also gives the function that generates the model, the package containing that function, and the chapter of this book where we discussed it:

ALGORITHM                                         FUNCTION       PACKAGE        CHAPTER
K-means clustering                                kmeans         (stats)        4
Hierarchical clustering                           hclust         (stats)        5
Association rule mining                           apriori        arules         7
Linear regression                                 lm             (stats)        9
Naïve Bayes classification and regression         naiveBayes     e1071          10
Classification and regression trees               rpart          rpart          11
Random forest for classification and regression   randomForest   randomForest   11
Logistic regression                               glm            (stats)        13
Classification with Support Vector Machines       svm            e1071          13

The PMML package also supports other models that we have not discussed here. The list can be found in the package documentation at http://cran.r-project.org/web/packages/pmml/pmml.pdf.

A brief description of the structure of PMML objects

PMML documents are written in XML. As an example, we show how PMML translates a simple linear regression model, preceded by the R output. As always, we start by installing and loading the required package:

install.packages("pmml")
library(pmml)
model = lm(Sepal.Length ~ Sepal.Width, data = iris)
model

The model output is as follows:

Call:
lm(formula = Sepal.Length ~ Sepal.Width, data = iris)

Coefficients:
(Intercept)  Sepal.Width  
     6.5262      -0.2234  

We now generate the PMML code for the model:

pmml(model)

The output is displayed below in sections, each preceded by a short comment. Information about the format of the document comes first:

<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.dmg.org/PMML-4_2 http://www.dmg.org/v4-2/pmml-4-2.xsd">

The header, featuring details about the user, the software package, the date, and the algorithm that generated the model is then included:

<Header copyright="Copyright (c) 2015 mayore" description="Linear Regression Model">
<Extension name="user" value="mayore" extender="Rattle/PMML"/>
<Application name="Rattle/PMML" version="1.4"/>
<Timestamp>2015-02-18 13:36:46</Timestamp>
</Header>

Next is the data dictionary that describes the attributes included in the analysis:

<DataDictionary numberOfFields="2">
<DataField name="Sepal.Length" optype="continuous" dataType="double"/>
<DataField name="Sepal.Width" optype="continuous" dataType="double"/>
</DataDictionary>

Then comes information about the model, including the algorithm used, the role of the variables included in the analysis, and the generated output:

<RegressionModel modelName="Linear_Regression_Model" functionName="regression" algorithmName="least squares">
<MiningSchema>
<MiningField name="Sepal.Length" usageType="predicted"/>
<MiningField name="Sepal.Width" usageType="active"/>
</MiningSchema>
<Output>
<OutputField name="Predicted_Sepal.Length" feature="predictedValue"/>
</Output>
<RegressionTable intercept="6.52622255089448">
<NumericPredictor name="Sepal.Width" exponent="1" coefficient="-0.2233610611299"/>
</RegressionTable>
</RegressionModel>
</PMML>

The PMML code for other algorithms can include more or less information, but the structure always remains similar to the one presented here.
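To make the role of the RegressionTable element concrete, here is a minimal sketch of how a consuming application might score new observations from the values stored in the PMML. The function name score_from_pmml and the new observations are ours, chosen purely for illustration; the intercept, coefficient, and exponent are copied from the output above. We compare the result with R's own predict():

```r
# Values copied from the <RegressionTable> element of the PMML output
intercept   <- 6.52622255089448
coefficient <- -0.2233610611299
exponent    <- 1

# Hypothetical helper: applies intercept + coefficient * x^exponent
score_from_pmml <- function(sepal_width) {
  intercept + coefficient * sepal_width^exponent
}

# Refit the same model in R and score a few made-up observations both ways
model <- lm(Sepal.Length ~ Sepal.Width, data = iris)
new_data <- data.frame(Sepal.Width = c(2.5, 3.0, 3.5))

manual <- score_from_pmml(new_data$Sepal.Width)
from_r <- unname(predict(model, new_data))

# TRUE when both sets of predictions agree up to rounding of the coefficients
isTRUE(all.equal(manual, from_r, tolerance = 1e-8))
```

This is the whole point of the format: any tool that can read the intercept and the coefficient from the XML can reproduce the predictions without R.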

Examples of predictive model exportation

In this section, we present some examples of model exportation using PMML. As we saw with the linear regression example, the process is quite simple: it usually consists of passing the model as an argument to the pmml() function. The examples here are deliberately basic. If you want to know more about PMML, we suggest reading the book PMML in Action: Unleashing the Power of Open Standards for Data Mining and Predictive Analytics, Alex Guazzelli, Wen-Ching Lin, Tridivesh Jena, CreateSpace Independent Publishing.
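The general pattern can be wrapped in a small helper. This is only a sketch: export_pmml is a hypothetical name of ours, not a function of the pmml package, and we assume that pmml() has a method for the class of the model passed in. The saveXML() function comes from the XML package. The example call is guarded so that it only runs if both packages are installed:

```r
# Hypothetical convenience wrapper (not part of the pmml package):
# converts a fitted model to PMML and writes it to a file.
export_pmml <- function(model, file, ...) {
  doc <- pmml::pmml(model, ...)    # extra arguments are passed on to pmml()
  XML::saveXML(doc, file = file)   # write the XML document to disk
  invisible(file)
}

# Example use with a linear regression model, written to a temporary folder
if (requireNamespace("pmml", quietly = TRUE) &&
    requireNamespace("XML", quietly = TRUE)) {
  out_file <- export_pmml(lm(Sepal.Length ~ Sepal.Width, data = iris),
                          file.path(tempdir(), "iris_lm.pmml"))
  file.exists(out_file)
}
```

Naming the file argument explicitly avoids any ambiguity about which argument is the output path.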

Exporting k-means objects

Here, we start by creating a k-means model:

iris.kmeans = kmeans(iris[1:4],3)

We then export the model to PMML:

pmml_kmeans = pmml(iris.kmeans)

Next, save it as an XML file:

saveXML(pmml_kmeans, file = "iris_kmeans.PMML")

Opening the document (see figure) allows us to check whether the model has been appropriately exported.

A snapshot of the content of the file with the PMML code
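A centroid-based clustering model is applied by assigning each new observation to its nearest cluster center. The sketch below reproduces that idea in base R, assuming squared Euclidean distance. The helper assign_cluster is hypothetical, the seed is an arbitrary choice that makes the run reproducible, and the new observation is made up:

```r
set.seed(42)  # arbitrary seed (our assumption), since kmeans() starts at random
iris.kmeans <- kmeans(iris[1:4], 3)
centers <- iris.kmeans$centers  # one row per cluster, one column per attribute

# Hypothetical helper: index of the centroid closest to one observation,
# using squared Euclidean distance
assign_cluster <- function(obs, centers) {
  distances <- apply(centers, 1, function(ctr) sum((obs - ctr)^2))
  unname(which.min(distances))
}

# Score a made-up flower
new_obs <- c(Sepal.Length = 5.1, Sepal.Width = 3.5,
             Petal.Length = 1.4, Petal.Width = 0.2)
assign_cluster(new_obs, centers)

# Reassign the training data and compare with the clusters found by kmeans()
reassigned <- apply(as.matrix(iris[1:4]), 1, assign_cluster, centers = centers)
mean(reassigned == iris.kmeans$cluster)
```

The last line gives the proportion of training observations for which the nearest-centroid rule agrees with the cluster stored in the kmeans object.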

Hierarchical clustering

Exporting a hierarchical clustering model using PMML is a bit more complex. As an example, we first generate a data frame for hierarchical clustering:

DF = cbind(c(rep(1,4),rep(2,4),rep(3,4),rep(4,4)), rep(c(1,2,3,4),4),rep(c(rep(1,2),rep(2,2),rep(3,2),rep(4,2)),2))

We then create our hclust object using the default parameters:

DF.hclust = hclust(dist(DF))

We now want to export it to a PMML object. Note that the export converts the model to a k-means representation, which requires the cluster centroids. We therefore need to decide on the number of clusters. We discussed ways to do this in Chapter 5, Agglomerative Clustering Using hclust().

Here, we will simply plot the dendrogram and decide on this basis:

plot(DF.hclust)

Dendrogram for our dataset

In the figure, we can see that a two-cluster or a four-cluster model would describe the data well. With four clusters, each cluster would contain only a few data points, so we prefer the two-cluster solution.

We now cut the tree using two clusters:

Cut = cutree(DF.hclust, k = 2)

Then, we extract centroids:

centroids = aggregate(DF, list(Cut), mean)

We can now export to PMML and save:

pmml_hclust = pmml(DF.hclust, centers = centroids)
saveXML(pmml_hclust, file = "DF_hclust.PMML")
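Before exporting, it can help to look at what aggregate() returned. The base R sketch below rebuilds the objects from this section and inspects the centroids. Note that the result includes a grouping column, named Group.1 by default, next to the three column means. Whether pmml() expects that column is something we have not verified, so the last lines, which drop it, are shown only as a possible alternative under that assumption:

```r
# Rebuild the data, the tree, and the two-cluster cut from this section
DF <- cbind(c(rep(1, 4), rep(2, 4), rep(3, 4), rep(4, 4)),
            rep(c(1, 2, 3, 4), 4),
            rep(c(rep(1, 2), rep(2, 2), rep(3, 2), rep(4, 2)), 2))
DF.hclust <- hclust(dist(DF))
Cut <- cutree(DF.hclust, k = 2)

# Number of observations in each cluster
table(Cut)

# One row per cluster: the cluster label (Group.1) and the column means
centroids <- aggregate(DF, list(Cut), mean)
centroids

# Assumed alternative: a plain numeric matrix holding only the means
centroid_matrix <- as.matrix(centroids[, -1])
rownames(centroid_matrix) <- centroids$Group.1
centroid_matrix
```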

Exporting association rules (apriori objects)

Exporting association rules using PMML is easier. We will use the Adult dataset, which already contains the AdultUCI data in the form of transactions:

library(arules)
data(Adult,AdultUCI)
names(AdultUCI)

As a reminder, the attributes are as follows:

[1]  age  workclass  fnlwgt  education  education-num
[6]  marital-status  occupation  relationship  race  sex
[11]  capital-gain  capital-loss  hours-per-week  native-country  income

We generate the rules as follows:

Adult.Apriori = apriori(Adult)

Save the PMML code:

saveXML(pmml(Adult.Apriori), "Adult_Apriori.PMML")

Exporting Naïve Bayes objects

We first load the e1071 package, which contains the naiveBayes() function:

library(e1071)

We then build the classifier:

iris.NaiveBayes = naiveBayes(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris)

Finally, we save to a file containing the PMML code:

saveXML(pmml(iris.NaiveBayes,dataset = iris, predictedField = "Species"), "iris_NaiveBayes.pmml")

Exporting decision trees (rpart objects)

We start by loading the rpart package and creating the classifier:

library(rpart)
iris.rpart = rpart(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris)

The tree (not displayed here) is obtained by typing:

iris.rpart

We can now export it to PMML and save it to an XML file:

saveXML(pmml(iris.rpart), file = "iris_rpart.pmml")
# typing pmml(iris.rpart) would display the pmml object

Exporting random forest objects

We start by loading the randomForest package:

library(randomForest)

We then grow the forest (we use default parameters here):

iris.RandomForest = randomForest(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris)

Finally, we export to PMML and save the file:

saveXML(pmml(iris.RandomForest), file = "iris_randomForest.pmml")

Exporting logistic regression objects

We first generate some data:

set.seed(1234)
y = c(rep(0,50),rep(1,50))
x = rnorm(100)
x[51:100] = x[51:100] + 0.2

Then, we build the model (attribute y is the class):

glm.model = glm(y ~ x, family = "binomial")

Next, save it to a file containing the PMML code:

saveXML(pmml(glm.model), "glm_model.PMML")
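For a logistic regression, a consumer of the exported model would combine the coefficients into a linear predictor and apply the inverse logit to obtain a probability. As a minimal sketch in base R, with the variable names b and manual being ours, the following rebuilds the model above and checks that applying the logistic function by hand reproduces R's predicted probabilities:

```r
# Same simulated data and model as above
set.seed(1234)
y <- c(rep(0, 50), rep(1, 50))
x <- rnorm(100)
x[51:100] <- x[51:100] + 0.2
glm.model <- glm(y ~ x, family = "binomial")

# Coefficients of the kind that the PMML export stores
b <- coef(glm.model)

# Inverse logit of the linear predictor: P(y = 1 | x)
manual <- 1 / (1 + exp(-(b[["(Intercept)"]] + b[["x"]] * x)))

# Probabilities computed by R itself
from_r <- unname(predict(glm.model, type = "response"))

# TRUE when the hand-computed probabilities match R's
isTRUE(all.equal(manual, from_r))
```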

Exporting support vector machine objects

We use the iris dataset again (Species is the class):

iris.svm = svm(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris)

We now export to PMML and save:

saveXML(pmml(iris.svm), "iris_svm.PMML")